Why four AI models gave four different translations of the same Korean sentence

The Korean politeness paradox

Korean has a layered honorific system that has no direct equivalent in European languages. A single source sentence can require a formal deferential form, a polite standard form, a familiar form, or a plain informal form depending on the relationship between speaker and listener. None of these choices is wrong in the abstract. All of them are wrong in the wrong context.

This makes Korean-to-English translation a revealing stress test for large language models. When the source text is ambiguous about social register, different AI models make different assumptions. And because those assumptions are embedded in the model’s training data and reinforcement signals rather than in any explicit rule, the divergence is not random noise. It is structured disagreement.

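Consider the Korean sentence: “제가 도와드리겠습니다” (je-ga do-wa-deu-ri-get-seum-ni-da). In a business email, this means the speaker is offering assistance in a formal, deferential register. The Korean is grammatically unambiguous: the humble auxiliary 드리다 and the formal ending -습니다 fix the register. The English rendering is not, because English has no grammatical slot for that register and has to express it through word choice and tone.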

Four leading AI models, each tested independently on the same input, returned four distinct English outputs:

  • GPT-4o: “I will assist you.”
  • Claude Sonnet: “I’d be happy to help you.”
  • Gemini 1.5 Pro: “Please allow me to help you.”
  • DeepL: “I will be of assistance.”

None of these outputs is incorrect English. But they are not interchangeable. “I will assist you” carries the register of a formal service interaction. “I’d be happy to help you” reads as conversational warmth. “Please allow me to help you” sounds like a literal rendering of the deferential phrasing. “I will be of assistance” edges toward bureaucratic formality.

In a customer-facing SaaS email, that difference is brand voice. In legal correspondence, it is tone that affects interpretation. In a corporate HR communication, it is register that signals intent.

This kind of model variance across tasks is well documented in multimodal AI research, but it is underappreciated in translation, where most users assume that a fluent output is a correct output.

Why the four outputs diverge

The divergence is not a bug. It is a product of how each model was trained and fine-tuned.

GPT-4o was fine-tuned with a strong emphasis on professional, direct English prose. It tends to flatten honorific nuance into clear, transactional language. Claude was trained with reinforcement learning from human feedback that rewarded a natural, warm register, which pushes it toward conversational phrasing even in formal source contexts. Gemini’s outputs for highly deferential source text tend to mirror the structure of the original more closely, producing translations that read as near-literal renderings of the politeness mechanism. DeepL, which has been specifically tuned for business translation quality in select language pairs, defaults to formal but stilted constructions when rendering Korean deferential register in English.

According to independent subtitle translation benchmarks published in 2026, Japanese (which shares structural politeness encoding with Korean) was the language for which every model tested showed measurable divergence, with researchers noting that post-editing was necessary regardless of which model was used. Korean presents a similar structural challenge, and the performance gap between models is not closing quickly.

Research published on arXiv (2410.06338) found that LLMs can produce unstable output structures even at zero temperature, the setting typically used to eliminate sampling randomness. That instability is amplified when the source language encodes register grammatically and the target language resolves it through tone. Korean to English is exactly this case.
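One way to check this kind of instability on your own content is to send the same sentence to the same model several times and count the distinct outputs. The sketch below is a minimal illustration, not the method used in the cited paper. The `translate` callable stands in for whatever model client a team actually uses, and `fake_translate` is a hypothetical stub so the example runs without any API.

```python
from collections import Counter
from itertools import cycle
from typing import Callable


def stability_report(translate: Callable[[str], str], source: str, runs: int = 5) -> Counter:
    """Call the same translator `runs` times and count each distinct output."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    return Counter(translate(source).strip() for _ in range(runs))


def is_stable(report: Counter) -> bool:
    """True only if every run returned exactly the same string."""
    return len(report) == 1


# Hypothetical stand-in for a real model call: alternates between two renderings.
_fake_outputs = cycle(["I will assist you.", "I will help you."])


def fake_translate(source: str) -> str:
    return next(_fake_outputs)


report = stability_report(fake_translate, "제가 도와드리겠습니다", runs=4)
print(report)             # each distinct output with the number of runs that produced it
print(is_stable(report))  # False here, because the stub returns two different renderings
```

More than one key in the report means the model is not giving a repeatable answer for that sentence, which is worth knowing before comparing it against other models.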

This connects to a broader pattern Textify has covered in its analysis of emergent AI behavior: as large language models grow more capable, they do not converge on a single correct answer for ambiguous tasks. They develop increasingly distinct interpretive tendencies. For translation, that means teams choosing between models are not choosing between better and worse quality. They are choosing between different interpretive frameworks, and they may not know which one their business context requires.

What this means for teams using AI in multilingual workflows

Most teams building AI-assisted multilingual workflows make one of two choices: they pick the model they trust most and use it consistently, or they manually compare outputs from two or three models before sending.

Each approach has a structural problem. The first creates systematic interpretive bias: every translation reflects the stylistic tendencies of a single model, which may not match the register your audience expects. The second creates a new operational cost: the manual comparison loop that teams adopt to manage AI uncertainty becomes a verification backlog that grows in proportion to translation volume.

According to Lokalise’s 2026 research on the best LLMs for translation, no single model wins across all language pairs and task types. The right model depends on what you are translating, how much context you can provide, and how strictly you need to protect formatting, terminology, and brand voice. That conclusion is accurate, but it leaves teams to work out which model to use for which task, and that takes either expertise or trial and error.

For register-sensitive language pairs like Korean to English, the manual comparison problem is especially acute. A team without Korean language expertise has no reliable way to evaluate four outputs that are all grammatically correct. They can choose the one that sounds most natural to an English speaker, but that is an aesthetic judgment, not a translation accuracy judgment.

How running multiple models against each other changes the reliability equation

The disagreement between GPT-4o, Claude, Gemini, and DeepL on the Korean sentence above is not a problem to eliminate. It is information. When four models agree on an output, that agreement is evidence of correctness that no single model’s confidence score can provide. When they disagree significantly, the disagreement flags a genuine ambiguity in the source text that requires human attention.

This is the logic behind consensus-based translation architectures. Instead of choosing a model and trusting it, the approach runs multiple models simultaneously and selects the output that the majority agree on. The disagreement cases, rather than being hidden, become visible: they surface ambiguities that warrant review rather than being silently passed through as fluent but potentially wrong outputs.
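The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation of any particular product. It treats two outputs as agreeing only when they match after lowercasing and stripping punctuation, which is a deliberately crude stand-in for real semantic comparison. The names `normalize`, `consensus`, and `min_share` are hypothetical.

```python
import string
from collections import Counter

# Assumption: punctuation and case differences should not split the vote.
_PUNCT = str.maketrans("", "", string.punctuation + "’‘“”")


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, collapse whitespace."""
    return " ".join(text.lower().translate(_PUNCT).split())


def consensus(outputs: dict[str, str], min_share: float = 0.5) -> tuple[str, float, bool]:
    """Return (most common output, its vote share, needs_review).

    needs_review is True when no output is backed by more than
    `min_share` of the models, i.e. there is no majority.
    """
    if not outputs:
        raise ValueError("no model outputs to compare")
    votes = Counter(normalize(t) for t in outputs.values())
    top_norm, top_count = votes.most_common(1)[0]
    share = top_count / len(outputs)
    # Report the first original string that matches the winning normalized form.
    winner = next(t for t in outputs.values() if normalize(t) == top_norm)
    return winner, share, share <= min_share


# The four outputs quoted earlier in this article.
korean_example = {
    "GPT-4o": "I will assist you.",
    "Claude Sonnet": "I'd be happy to help you.",
    "Gemini 1.5 Pro": "Please allow me to help you.",
    "DeepL": "I will be of assistance.",
}
winner, share, needs_review = consensus(korean_example)
print(share, needs_review)  # 0.25 True: no majority, so the sentence goes to a human
```

With four distinct outputs, no candidate clears the majority threshold, so the sketch flags the sentence for review instead of silently picking one. A production system would need a softer notion of agreement than exact string matching.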

MachineTranslation.com, an AI translator, applies this approach through its SMART mechanism, which compares the outputs of 22 AI models and selects the translation that most of them agree on. Internal benchmarks show this reduces critical translation errors to under 2%, compared to error rates of 10% to 18% reported for individual top-tier models on complex source content. For a language pair like Korean, where register divergence between models is structural rather than occasional, the consensus approach surfaces the disagreement rather than arbitrarily resolving it in favor of one model’s interpretive framework.

Ofer Tirosh, CEO of Tomedes, which developed the platform, describes the principle directly: “The question teams should be asking is not which AI model is most accurate in the abstract. It is: when the models disagree on a Korean honorific, a Portuguese diminutive, or a German compound that has no English equivalent, which disagreement structure tells you the most about where the translation risk actually sits?”

Building a more defensible multilingual AI workflow

For teams operating in Korean, Japanese, or other register-sensitive languages, the practical guidance is not to find the perfect model. It is to build a workflow that makes model disagreement visible rather than invisible.

Three principles follow from the analysis above:

  • Test your specific language pair, not a benchmark ranking. Model performance rankings change with language pair, content type, and domain. A model that tops the leaderboard for Spanish marketing copy may perform poorly on Korean business correspondence.

  • Treat fluency as a necessary but insufficient signal. All four translations of the Korean sentence above were fluent. None of that fluency was evidence of register accuracy. Fluency evaluations done by non-speakers of the source language cannot detect the disagreements that matter most.

  • Make the verification cost visible before you optimize for speed. Internal data from teams that track translation workflow time shows that non-linguists using single-model AI spend 46% of their AI translation time manually comparing or correcting outputs. That is not a translation speed gain. It is a verification backlog that erases the speed advantage.

The Korean sentence that opened this piece has a correct translation. It depends on context the model does not have unless you provide it. The four divergent outputs are each a hypothesis about what that context might be. The team that can see all four hypotheses and understand why they diverge is in a better position than the team that received one output and trusted it.
