When AI Writes the Email, Who Is Really Replying?

What “Email in the Era of LLMs” reveals about the future of organisational responsiveness

The most important change in email may not be that AI can write it faster.

It may be that AI is beginning to decide what a “good” email sounds like.

That distinction matters. For years, the inbox problem has been understood as a problem of volume, delay, and neglect. A customer, prospect, partner, supplier, applicant, tenant, patient, journalist, or citizen reaches out to an organisation. The message waits. It is routed badly, answered vaguely, buried under internal correspondence, or never answered at all.

AI has been sold as an answer to that problem. It can draft faster. It can summarise threads. It can suggest replies. It can make workers spend less time on email. It can turn the blank page into a polished response.

But the question for ReplyResearch is not simply whether AI can help organisations send more emails.

The question is whether AI helps organisations become more responsive.

A new research paper, “Email in the Era of LLMs,” by Dang Nguyen, Harvey Yiyun Fu, Peter West, Chenhao Tan, and Ari Holtzman, gives us a useful way into that question.

The paper is available here:

https://arxiv.org/abs/2603.20231

An HTML version is available here:

https://arxiv.org/html/2603.20231v1

The study examines how large language models read, write, and judge email in socially complex workplace scenarios. The researchers introduce an “HR Simulator,” a game in which players act as a Human Resources officer and write emails to handle difficult workplace situations. They then analyse more than 600 human-written and LLM-written emails, including human-only emails, LLM-only emails, and human emails rewritten with LLM assistance.

The findings are fascinating.

LLM-written emails tend to be more formal and more empathetic. Human emails are more varied. LLM rewrites pull human emails toward a high-formality, high-empathy style. Larger LLMs become more similar to one another in how they judge email quality. And in some scenarios, human-plus-LLM collaboration outperforms both human-only and LLM-only communication.

This is not a paper about customer service queues, CRM systems, or unanswered sales leads. But it may be one of the more important papers for understanding the future of the organisational inbox.

Because if LLMs increasingly write, rewrite, judge, and optimise professional email, then they are not merely helping people reply.

They are changing what a reply is supposed to be.

The experiment: email as social strategy

Email looks simple from the outside. Someone writes. Someone replies.

Inside organisations, it is rarely that simple.

A good email often has to do more than communicate information. It has to preserve relationships, manage status, avoid unnecessary conflict, signal concern, protect the sender, reassure the recipient, document a position, and move a situation forward without creating new problems.

In many workplace contexts, the wording is not decorative. It is operational.

That is why “Email in the Era of LLMs” is interesting. The paper does not test whether AI can produce grammatically correct messages. That problem is largely solved. Instead, it looks at email as a socially loaded act.

The researchers’ HR Simulator places the writer in scenarios where the reply must navigate delicate workplace dynamics. A successful email may need to be direct but not blunt, warm but not weak, tactful but not evasive, firm but not cold.

That makes the paper relevant to the wider world of organisational responsiveness. Many inbound enquiries require exactly this kind of judgment.

A complaint needs acknowledgement without premature admission. A sales enquiry needs enthusiasm without overpromising. A rejected applicant deserves clarity without cruelty. A worried customer needs reassurance without empty language. A supplier needs a decision. A journalist needs a response that is both useful and careful. A tenant, patient, or citizen may need the organisation to recognise urgency beneath a plain message.

The hard part of responsiveness is often not the fact of reply.

It is the quality of recognition inside the reply.

The headline finding: AI emails become polished, formal, and empathetic

One of the paper’s central findings is that LLM emails are more formal and empathetic than human emails, while human emails are more diverse.

That is both encouraging and worrying.

It is encouraging because many bad organisational replies fail at exactly this level. They are abrupt, incomplete, careless, defensive, or tonally wrong. A tool that nudges replies toward clearer empathy and formality could improve many routine interactions.

Anyone who has received a brusque complaint response, a confusing HR message, or a cold administrative email can see the appeal. If AI helps workers pause, soften the tone, explain the next step, and acknowledge the recipient’s position, that is a real communication gain.

But there is also a risk.

A more formal and empathetic email is not necessarily a more responsive email.

Responsiveness is not just tone. It is action, ownership, relevance, and resolution. An email can sound beautifully considerate while still avoiding the question. It can say “we understand your concern” without showing that anyone has understood the concern. It can be polished, careful, and emotionally fluent while leaving the sender exactly where they started.

This is one of the coming dangers of AI-mediated communication: organisations may get better at sounding responsive before they get better at being responsive.

For ReplyResearch, that distinction is central.

The inbox black hole is not only a place where messages disappear. It is also a place where messages are answered in ways that do not move anything forward.

AI may reduce the first kind of failure while expanding the second.

The new generic: empathetic sameness

The paper finds that human emails are more varied than LLM emails. LLMs tend to move communication toward a narrower tonal zone: more formal, more empathetic, more polished.

In many organisations, this will be welcomed. Consistency is attractive. Brand voice teams like consistency. Legal teams often like caution. Customer service leaders like templates. Managers like messages that do not sound reckless.

But human variation has value.

Sometimes a good reply needs to be warm and informal. Sometimes it needs to be blunt. Sometimes it needs to be brief. Sometimes a message should sound like a person making a judgment, not a system applying a tone preset. Sometimes a recipient needs clarity more than empathy. Sometimes the most respectful thing an organisation can do is stop cushioning the answer and say what will happen next.

If AI pulls too much email toward the same high-formality, high-empathy style, organisational communication may become smoother but less precise.

That matters for unsolicited enquiries.

The person who reaches out to an organisation is often trying to discover whether there is a human being, or at least a responsible system, on the other side. A reply that sounds professionally empathetic but says little can deepen the sense of distance. The sender receives language that resembles care, but not the practical attention they needed.

That is the danger of empathetic sameness.

It may make the organisation look better while making the recipient feel more processed.

Human plus AI may be the strongest model

The most useful finding in the paper is not that LLMs can write good emails. It is that human-plus-LLM communication can outperform both human-only and LLM-only approaches in some scenarios.

That finding deserves attention because it points toward a better model of AI responsiveness.

The best future is probably not one where AI writes every reply. Nor is it one where humans ignore AI and keep drowning in inbox volume. The most promising model is collaborative: humans supply context, judgment, responsibility, and moral awareness; AI helps with wording, structure, tone, and alternatives.

For organisations, this distinction matters.

AI-only communication may be fast, but it can lack situational accountability. Human-only communication may be rich, but it is vulnerable to overload, delay, inconsistency, fatigue, and emotional shortcuts. Human-plus-AI communication has the potential to combine speed with judgment, but only if the human remains genuinely responsible for the reply.

That is not guaranteed.

There is a weak version of human-plus-AI: the worker clicks “make this sound better,” scans the result quickly, and sends it. In that model, AI becomes a politeness filter. It improves the surface while leaving the deeper response untouched.

There is a stronger version: the worker uses AI to test possible replies, identify missing context, check tone, clarify next steps, and consider how the recipient might read the message. In that model, AI becomes a thinking aid rather than a substitute for responsibility.

The ReplyResearch question is which version organisations will actually build into their workflows.

LLMs may become judges of email quality

One of the more unsettling parts of the paper is not just that LLMs write emails. It is that LLMs judge them.

The researchers use LLMs-as-judges to evaluate communication, and they find evidence that larger models become more homogeneous in their email quality judgments. In other words, as models scale, they may increasingly converge on shared ideas of what a good email looks like.

This has obvious research value. But it also points toward a likely workplace future.

If AI systems draft emails, score emails, rewrite emails, prioritise emails, evaluate sentiment, and recommend next actions, then organisations may start optimising communication for machine-readable quality.

That could be useful. It could reduce hostile, confusing, or careless messages. It could help organisations maintain minimum standards. It could make some internal and external communication more consistent.

But it could also create a new form of communicative conformity.

If the same kinds of models write the emails and judge the emails, then the organisation may begin to reward the emails that models like. Over time, this could produce a narrow official style: tactful, formal, empathetic, careful, and possibly evasive.

For customer and stakeholder communication, that is a serious issue.

A model may judge an email highly because it is balanced, polite, and emotionally appropriate. But the recipient may judge it poorly because it does not answer the question.

The machine may reward tone.

The person who wrote in may need resolution.

The problem of tact

The paper discusses “emergent tact”: stronger models appear to prefer more subtle and tactful communication, while weaker models may prefer more direct approaches.

This is one of the most interesting findings for organisational life.

Tact is not a trivial feature of email. It is often essential. A workplace email that is technically correct but socially clumsy can create unnecessary conflict. In customer communication, tact can preserve trust. In complaint handling, it can prevent escalation. In sensitive situations, it can protect dignity.

But tact can also become avoidance.

Anyone who has dealt with organisations knows the experience. The email is courteous. The wording is careful. The tone is sympathetic. But the message does not answer the question, make a decision, accept responsibility, or provide a path forward.

Tact becomes fog.

This is where AI-generated email may create a new responsiveness shortfall. Organisations may become better at tactful non-answers. They may produce replies that are less offensive, less risky, and more polished, while still leaving the sender stuck.

That is not a small problem. In some contexts, bluntness is bad. In others, directness is the service.

A customer asking whether a refund has been approved needs an answer. A tenant reporting a safety issue needs action. A journalist asking for comment needs a clear position. A prospective buyer asking for pricing needs useful information. A patient trying to contact a provider needs routing, not soothing prose.

Tact must serve the reply.

It cannot replace it.

What this means for the inbox problem

The inbox problem has never been just about unread messages. It is about mishandled intent.

A message arrives carrying a purpose: to buy, complain, clarify, report, apply, renew, cancel, escalate, partner, warn, or ask for help. A good organisation recognises that intent and acts on it. A bad one ignores it, misroutes it, delays it, or answers with language that does not move the situation forward.

LLMs may change every part of that process.

They may help identify intent. They may help draft replies. They may summarise long histories. They may rewrite rough messages into more polished ones. They may help employees avoid unnecessary emotional harm. They may reduce the burden of difficult correspondence.

But they may also create new failure modes.

The message may be answered quickly but generically. The tone may be empathetic but noncommittal. The email may be optimised for sounding good rather than resolving the sender’s need. The organisation may standardise on model-approved language that smooths over difference, urgency, and discomfort.

The result could be an inbox that looks healthier internally while still failing externally.

Fewer unread messages.

More polished replies.

Same unresolved people.

The cultural shift: email written for machines

There is another implication of the paper that deserves attention.

If LLMs increasingly write emails, and LLMs increasingly read and judge emails, then email becomes a machine-mediated social environment.

That changes the incentives for everyone.

Workers may write emails knowing an AI will polish them. Managers may read AI summaries rather than original messages. Customer service systems may classify sentiment and urgency before a human sees the text. Recipients may use AI to interpret the tone of replies. Senders may use AI to make complaints sound more compelling, enquiries more professional, or follow-ups more forceful.

At that point, an email is no longer simply a message from one person to another.

It is a negotiated artefact passing through multiple machine interpretations.

This is especially important for unsolicited enquiries. If everyone uses AI to make their message more polished, the organisation may receive more messages that sound credible, urgent, or emotionally calibrated. At the same time, the organisation may use AI to triage and respond. The entire interaction becomes a contest of machine-shaped signals.

That may improve communication.

It may also make genuine urgency harder to detect.

When every message sounds polished, what happens to the rough, awkward, badly written, but important enquiry?

When every reply sounds empathetic, how does the recipient know whether anyone actually cared?

The missing metric: did the reply work?

The research paper evaluates email quality through simulated workplace scenarios and LLM-based judging. That is useful for studying communication norms. But ReplyResearch would add another test.

Did the reply work for the person who reached out?

In real organisational communication, email quality is not only a matter of tone. A reply has a job to do. It should reduce uncertainty, answer the question, route the issue, preserve trust, and move the interaction toward resolution.

An AI-generated or AI-polished email should be judged against those outcomes.

Did the recipient understand what happens next?

Did the email answer the specific question?

Did it acknowledge the actual issue, rather than the generic category?

Did it make ownership clear?

Did it avoid unnecessary delay?

Did it reduce the need for follow-up?

Did it preserve enough humanity for the recipient to feel recognised?

These are the metrics that matter if the goal is responsiveness rather than merely communication quality.

The future of AI email should not be measured only by whether the message sounds better.

It should be measured by whether the exchange works better.
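
As a rough illustration, the questions above could be recorded as a simple yes/no checklist per reply. This is only a sketch: the field names, the unweighted scoring, and the example values are assumptions made for illustration, not a method taken from the paper or from any existing tool.

```python
from dataclasses import dataclass, fields


@dataclass
class ReplyOutcome:
    """Yes/no answers to the outcome questions for one reply.

    Field names are hypothetical labels for the questions in the text.
    """
    next_step_understood: bool
    question_answered: bool
    actual_issue_acknowledged: bool
    ownership_clear: bool
    unnecessary_delay_avoided: bool
    follow_up_reduced: bool
    recipient_felt_recognised: bool


def outcome_score(outcome):
    """Return the unweighted fraction of questions answered 'yes'."""
    answers = [getattr(outcome, f.name) for f in fields(outcome)]
    return sum(answers) / len(answers)


def failed_checks(outcome):
    """List the questions the reply did not satisfy."""
    return [f.name for f in fields(outcome) if not getattr(outcome, f.name)]


# Illustrative values: a polite, prompt reply that never answers the question.
polished_non_answer = ReplyOutcome(
    next_step_understood=False,
    question_answered=False,
    actual_issue_acknowledged=False,
    ownership_clear=False,
    unnecessary_delay_avoided=True,
    follow_up_reduced=False,
    recipient_felt_recognised=True,
)

print(round(outcome_score(polished_non_answer), 2))
print(failed_checks(polished_non_answer))
```

Treating every question as equally important is a simplification. An organisation might reasonably decide that answering the question and making ownership clear should count for more than the rest.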

Practical lessons for organisations

The paper points toward several practical lessons for any organisation using LLMs in email.

First, do not confuse better tone with better responsiveness. AI may make replies more formal and empathetic, but organisations still need to check whether the reply answers the question and moves the issue forward.

Second, keep humans responsible for judgment. The strongest model may be human-plus-AI, but only if the human adds real context, ownership, and accountability rather than simply approving polished text.

Third, audit for sameness. If every reply begins to sound alike, the organisation may be losing useful variation. Some situations need warmth. Some need directness. Some need urgency. Some need apology. Some need a decision.

Fourth, test replies from the recipient’s perspective. A reply that scores well internally may still fail the person who sent the enquiry. Organisations should examine whether AI-assisted replies reduce follow-ups, complaints, confusion, and escalation.

Fifth, beware of tactful non-answers. Tact is valuable, but it should not become a way to avoid commitment. A good reply can be careful and clear at the same time.

Sixth, measure intent resolution. The central question should not be, “Was this email well written?” It should be, “Did this email recognise and resolve the sender’s intent?”

Seventh, protect human texture. Not every organisational email needs to sound like a corporate empathy template. Sometimes credibility comes from specificity, directness, and the sense that a real person has paid attention.
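
The third lesson, auditing for sameness, can be approximated very crudely in code. The sketch below assumes that word overlap (Jaccard similarity) between pairs of replies is an acceptable first proxy for "sounding alike". That is a swapped-in simplification, not the measure used in the paper, and the function names are hypothetical.

```python
import re
from itertools import combinations


def word_set(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard(a, b):
    """Share of distinct words that two texts have in common (0.0 to 1.0)."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)


def mean_pairwise_similarity(replies):
    """Average Jaccard similarity over every pair of replies."""
    pairs = list(combinations(replies, 2))
    if not pairs:
        return 0.0  # fewer than two replies: nothing to compare
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Illustrative replies, invented for this example.
templated = [
    "We understand your concern and appreciate your patience.",
    "We appreciate your patience and understand your concern.",
]
specific = [
    "Refund approved today.",
    "Call us tomorrow.",
]

print(mean_pairwise_similarity(templated))
print(mean_pairwise_similarity(specific))
```

A score that climbs over time across a sample of outgoing replies would be a prompt to read those replies, not a verdict. Word overlap cannot tell a sensible standard phrase from an evasive one.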

The ReplyResearch view

“Email in the Era of LLMs” is not a conventional customer-service study. That is why it matters.

It shows that LLMs are not merely entering the inbox as productivity tools. They are entering as writers, editors, judges, and norm-setters. They can make email more formal, more empathetic, and sometimes more effective. They can also make communication more homogeneous, more tactful, and potentially more detached from the messy specificity of the person who reached out.

For ReplyResearch, the lesson is clear.

The future of the inbox will not be decided only by whether organisations use AI to reply faster. It will be decided by whether AI helps organisations recognise intent, take responsibility, and respond in ways that actually move people closer to resolution.

A polished email is not enough.

A tactful email is not enough.

An empathetic email is not enough.

The real question is whether the reply does the work that made the sender reach out in the first place.

AI may improve the language of organisational responsiveness.

Now organisations have to prove it improves the substance.

The Footnote Zone applies four diagnostic tools developed by Nok Nok, a specialist in online responsiveness tool design, to test whether AI-assisted email systems improve genuine organisational responsiveness rather than simply producing faster, smoother, or more empathetic replies.

  • Email Finder: When organisations hide contact options, rely on obstructive forms, or allow published mailboxes to become unclear or inconsistent, AI-written replies cannot solve the deeper access problem. Email Finder scans an organisation’s website for published email addresses and reports structural deficiencies, discrepancies, and contact-route gaps that prevent people from reaching the right inbox in the first place.
  • Reply Radar: When AI promises faster communication but human queues remain understaffed, delayed, or unevenly monitored, the real question is whether messages are actually answered. Reply Radar deploys targeted test emails and quantitatively measures reply rates, response latency, and follow-through, showing whether an organisation’s responsiveness is improving in practice rather than only in appearance.
  • Compliance Sniffer: When AI-generated emails create polished non-answers, empty empathy, hallucination loops, or degraded message quality, tone can mask a failure to resolve the sender’s intent. Compliance Sniffer analyses incoming responses against objective quality and compliance benchmarks, identifying whether replies answer the question, provide clear next steps, avoid unsafe or misleading claims, and meet minimum standards of useful communication.
  • Mystery Shopper: When users encounter systemic UX breakdowns, aggressive gateway filters, defensive forms, confusing routing, or machine-shaped journeys that make genuine urgency harder to detect, responsiveness must be assessed end to end. Mystery Shopper executes a comprehensive responsiveness UX audit, testing the full path from initial contact discovery through message submission, filtering, routing, reply quality, and practical resolution.
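
To make the measurements named above concrete, here is a minimal sketch of how a reply rate and a median response latency could be computed from a log of test emails. It does not describe how Reply Radar or any other Nok Nok tool works: the data layout, the function name, and the timestamps are all assumptions made for illustration.

```python
from datetime import datetime
from statistics import median


def reply_stats(probes):
    """Summarise a list of (sent_at, replied_at) pairs.

    replied_at is None when no reply was ever received.
    Returns (reply_rate, median_latency_hours); the median is None
    when nothing was answered.
    """
    latencies = [
        (replied - sent).total_seconds() / 3600
        for sent, replied in probes
        if replied is not None
    ]
    rate = len(latencies) / len(probes) if probes else 0.0
    return rate, (median(latencies) if latencies else None)


# Invented example log: four test emails, one never answered.
probes = [
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 5, 13, 0)),  # 4 hours
    (datetime(2026, 1, 5, 9, 0), datetime(2026, 1, 7, 9, 0)),   # 48 hours
    (datetime(2026, 1, 5, 9, 0), None),                         # no reply
    (datetime(2026, 1, 6, 9, 0), datetime(2026, 1, 6, 19, 0)),  # 10 hours
]

rate, median_hours = reply_stats(probes)
print(rate, median_hours)
```

The median is used here rather than the mean so that one very slow reply does not dominate the figure. Unanswered messages are counted in the reply rate and left out of the latency, which is why the two numbers need to be read together.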

Sources and relevant reading for When AI Writes the Email, Who Is Really Replying?

  • “Email in the Era of LLMs” – Dang Nguyen, Harvey Yiyun Fu, Peter West, Chenhao Tan, and Ari Holtzman, arXiv, 6 March 2026; latest version 6 April 2026
    https://arxiv.org/abs/2603.20231
    This is the central research paper discussed in the article. It directly supports the article’s claims about LLMs writing, rewriting, and judging email; the movement of AI-written email toward greater formality and empathy; and the possibility that human-plus-LLM communication may outperform either human-only or LLM-only approaches.
  • “Email in the Era of LLMs” – OpenReview, published 28 April 2026
    https://openreview.net/forum?id=YCDwwjW73F
    This provides an additional public research record for the same paper, including the authors’ framing of human-LLM co-writing, social reasoning, contextual benchmarks, and communication games. It is useful as a supporting source for the article’s emphasis on AI not merely as a drafting aid, but as an emerging participant in workplace communication norms.
  • “AI generates well-liked but templatic empathic responses” – Emma S. Gueorguieva, Hongli Zhan, Jina Suh, Javier Hernandez, Tatiana Lau, Junyi Jessy Li, and Desmond C. Ong, arXiv, 9 April 2026; revised 8 June 2026
    https://arxiv.org/abs/2604.08479
    This source strongly supports the article’s argument about “empathetic sameness.” The paper finds that LLMs often produce empathic responses using repeatable templates, which relates directly to the concern that organisations may become better at sounding caring while becoming less specific, less varied, or less genuinely responsive.
  • “A Literature Review of Personalized Large Language Models for Email Generation and Automation” – Rodrigo Novelo, Rodrigo Rocha Silva, and Jorge Bernardino, Future Internet, 2025
    https://www.mdpi.com/1999-5903/17/12/536
    This review is relevant to the article’s broader claim that LLMs are entering email as infrastructure, not just as convenience tools. It covers email automation, personalized response generation, context-aware messaging, privacy, security, and ethical challenges, all of which connect to the article’s concern that automated email systems must be judged by trustworthiness and responsiveness, not merely fluency.
  • “AI Hallucinations Are Quietly Undermining Customer Experience: Here’s How to Stay Ahead” – Shruti Tiwari, CMSWire, 17 July 2025
    https://www.cmswire.com/customer-experience/preventing-ai-hallucinations-in-customer-service-what-cx-leaders-must-know/
    This article supports the section of the piece concerned with AI-generated replies that sound plausible but fail the recipient. Its discussion of hallucinations in customer service is relevant to the risk that AI-assisted organisational communication may generate confident, polished, or legally risky answers that do not accurately resolve the issue.
  • “AI-Generated ‘Workslop’ Is Destroying Productivity” – Kate Niederhoffer, Gabriella Rosen Kellerman, Angela Lee, Alex Liebscher, Kristina Rapuano, and Jeffrey T. Hancock, Harvard Business Review, 22 September 2025; updated 25 September 2025
    https://hbr.org/2025/09/ai-generated-workslop-is-destroying-productivity
    This source is relevant to the article’s distinction between polished output and useful output. The “workslop” concept helps frame the danger that AI-generated emails may look professional and complete while lacking the substance, context, or decision-making needed to move a situation forward.
  • “AI ‘workslop’ sabotages productivity, study finds” – Axios, 24 September 2025
    https://www.axios.com/2025/09/24/ai-workslop-workplace-efficiency-study
    This article provides a concise journalistic account of the same workplace phenomenon: AI-generated content that appears polished but creates extra work for recipients. It is useful background for the article’s point that AI may improve the surface of communication while worsening the practical burden placed on the person receiving the message.
  • “How to manage the AI-to-human handoff” – Freshworks, 2026
    https://www.freshworks.com/theworks/ai-assisted-service/ai-human-handoff/
    This source relates to the article’s argument that the future of responsiveness depends on how organisations combine AI speed with human judgment. Its focus on AI-to-human handoff is relevant to the article’s distinction between weak human-plus-AI workflows, where people merely approve polished text, and stronger workflows, where humans retain ownership of judgment, escalation, and resolution.
  • “Chatbot Frustration is Real: Hidden Costs and Best Practices” – California Management Review, April 2026
    https://cmr.berkeley.edu/2026/04/chatbot-frustration-is-real-hidden-costs-and-best-practices/
    This source supports the article’s wider concern with machine-mediated customer journeys. It is relevant to the discussion of users being processed through automated systems that may reduce organisational costs while creating frustration, delay, or a failure to recognise what the person actually needs.
  • “Looking ahead at AI and work in 2026” – MIT Sloan, January 2026
    https://mitsloan.mit.edu/ideas-made-to-matter/looking-ahead-ai-and-work-2026
    This source provides broader context for the article’s workplace argument. Its focus on AI and work in 2026 helps situate AI-written email within a larger organisational shift in which companies are trying to understand where AI genuinely improves performance and where human judgment, accuracy, and accountability remain essential.

Peter Friedman