
DeepSeek V4 Document Translation Review: vs V3.2, GPT-5.4, Claude 4.7 & Gemini 3 Pro

BelinDoc Team · 2026/04/24

Just-released DeepSeek V4, benchmarked head-to-head. 6 flagship models × 5 real document scenarios (academic, legal, technical, literary, manga), with dual LLM-judge blind scoring. Where DeepSeek V4 wins — and where it doesn't.

Intro: Can DeepSeek V4 really translate your documents?

DeepSeek V4 hit the front page of every tech forum the day it dropped — solid benchmark scores, prices barely moved. But benchmark numbers and real document translation are two different things. The question we keep getting from our users: "Is V4 worth switching to? On real PDFs, contracts, and academic papers — how much better is V4 than V3? And how does it stack up against flagships like GPT-5.4, Claude 4.7, and Gemini 3 Pro?"

So we got DeepSeek V4 API access on day one (both deepseek-v4-pro and deepseek-v4-flash) and ran a rigorous head-to-head DeepSeek V4 translation review:

  • 6 models, same arena: DeepSeek V4 Pro, V4 Flash, V3.2, GPT-5.4, Claude Opus 4.7, Gemini 3 Pro Preview
  • 5 real document scenarios: academic papers, legal contracts, technical docs with code, literary prose, manga dialogue
  • Dual LLM-judge blind scoring: GPT-5.4 and Claude Opus 4.7, each scoring with independently shuffled labels
  • 5 scoring dimensions: faithfulness, fluency, terminology, style, formatting preservation (1–5 scale)

Below is the full scoreboard, our methodology, every source text alongside all 6 candidate translations, and latency/cost data.


TL;DR (for readers in a hurry)

| Rank | Model | Overall | Faithful | Fluency | Term | Style | Fmt | Avg latency |
|---|---|---|---|---|---|---|---|---|
| 🥇 1 | GPT-5.4 | 4.68 | 4.7 | 4.7 | 4.6 | 4.5 | 4.9 | 4.5 s |
| 🥈 2 | Claude Opus 4.7 | 4.62 | 4.2 | 4.8 | 4.4 | 4.7 | 5.0 | n/a |
| 🥉 3 | Gemini 3 Pro Preview | 4.56 | 4.4 | 4.7 | 4.5 | 4.4 | 4.8 | 14.2 s |
| 4 | DeepSeek V4 Pro | 4.38 | 4.4 | 4.4 | 4.4 | 4.3 | 4.4 | 17.1 s |
| 5 | DeepSeek V4 Flash | 4.38 | 4.2 | 4.3 | 4.4 | 4.0 | 5.0 | 4.7 s |
| 6 | DeepSeek V3.2 | 4.26 | 4.3 | 4.1 | 4.3 | 4.0 | 4.6 | 4.6 s |

Three-sentence summary:

  1. DeepSeek V4 is a real upgrade over V3.2, but a modest one (+0.12 on a 5-point scale). It still trails GPT-5.4 and Claude 4.7.
  2. V4 Pro and V4 Flash tied overall. Pro gets reasoning-driven semantic depth, but Flash is 4× faster and much cheaper — Flash is enough for most users.
  3. DeepSeek still trails on translation into English (literary ZH → EN and manga JA → EN). The flip side: on Chinese technical documentation, even DeepSeek V3.2 beat every flagship.

1. Methodology: how we kept it fair

1.1 The 6 models

| Model ID | Type | Endpoint |
|---|---|---|
| deepseek-v4-pro | New flagship (reasoning) | DeepSeek official API |
| deepseek-v4-flash | New lightweight (shallow reasoning) | DeepSeek official API |
| deepseek-v3.2 | Previous generation | Proxy API |
| gpt-5.4 | OpenAI current flagship | Proxy API |
| claude-opus-4-7 | Anthropic flagship | In-conversation |
| gemini-3-pro-preview-r | Google latest flagship preview | Proxy API |

1.2 The 5 scenarios (short snippets, each probing one weakness)

| Scenario | Direction | Challenge |
|---|---|---|
| Academic paper abstract | EN → ZH | Technical jargon, passive voice, formal register |
| Legal contract clause | EN → ZH | Long sentence, precision, legalese |
| Technical doc with inline code | EN → ZH | Preserve inline code, identifiers, numbers |
| Literary prose (Lu Xun, My Old Home) | ZH → EN | Rhythm, imagery, classical-flavored tone |
| Manga dialogue (shōnen) | JA → EN | Colloquial voice, Japanese sentence-end particles |

1.3 A unified, minimal prompt

To eliminate prompt-engineering bias, every model got the exact same minimal instruction:

System:
You are a professional document translator. Translate the following text
from {SRC} to {TGT}. Preserve all inline code snippets (text inside backticks),
identifiers, numbers, mathematical notation, and paragraph breaks exactly
as they appear in the source. Output only the translation text, with no
explanations, no notes, and no additional commentary.

User: {source text}

All models: temperature=0.3, max_tokens=4096. Identical.
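In code, the per-model request amounts to the sketch below. The payload shape assumes an OpenAI-compatible chat-completions body; the helper name and field layout are our illustration, not any provider's documented API.

```javascript
// Build the identical request we sent every model (section 1.3).
// Assumption: all six endpoints accept an OpenAI-style messages payload.
const SYSTEM_PROMPT = (src, tgt) =>
  `You are a professional document translator. Translate the following text ` +
  `from ${src} to ${tgt}. Preserve all inline code snippets (text inside backticks), ` +
  `identifiers, numbers, mathematical notation, and paragraph breaks exactly ` +
  `as they appear in the source. Output only the translation text, with no ` +
  `explanations, no notes, and no additional commentary.`;

function buildRequest(model, src, tgt, sourceText) {
  return {
    model,
    temperature: 0.3,   // identical sampling settings for every model
    max_tokens: 4096,
    messages: [
      { role: "system", content: SYSTEM_PROMPT(src, tgt) },
      { role: "user", content: sourceText },
    ],
  };
}
```

The only thing that varies between runs is the `model` string; everything else is byte-identical, which is the point of the minimal-prompt design.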

1.4 Dual blind judges

A single model would favor translations from its own family, so we used two judges:

  • Judge 1: GPT-5.4 (temperature=0, JSON output)
  • Judge 2: Claude Opus 4.7

For each scenario, the 6 candidates were shuffled with two different random seeds — each judge saw a different label order (A / B / C / D / E / F) and didn't know which model produced which output. Final score = average of both judges across 5 dimensions.
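The blind-scoring bookkeeping above is a few lines of code. A minimal sketch, assuming a small seeded PRNG for reproducible shuffles (the PRNG choice, seed values, and helper names are ours, not the original harness):

```javascript
// mulberry32: a tiny deterministic PRNG, so each judge's shuffle is
// reproducible from its seed.
function mulberry32(seed) {
  return function () {
    seed = (seed + 0x6d2b79f5) | 0;
    let t = Math.imul(seed ^ (seed >>> 15), 1 | seed);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Fisher-Yates shuffle with a per-judge seed: each judge sees the six
// candidates under a different A-F label order.
function shuffledLabels(models, seed) {
  const rand = mulberry32(seed);
  const order = [...models];
  for (let i = order.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [order[i], order[j]] = [order[j], order[i]];
  }
  return order; // index 0 => label "A", 1 => "B", ...
}

// Final score = mean over both judges across all five dimensions.
function finalScore(judge1Dims, judge2Dims) {
  const all = [...judge1Dims, ...judge2Dims];
  return all.reduce((a, b) => a + b, 0) / all.length;
}
```

Because the two judges use different seeds, agreement between them can't come from label position, only from the translations themselves.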


2. The 5 scenarios, one by one

Scenario 1: Academic paper abstract (EN → ZH)

Source: We propose a novel sparse attention mechanism that reduces memory complexity from O(n^2) to O(n log n) while preserving the expressiveness of full self-attention. Empirical results on GLUE benchmarks show our method retains 98.7% of the baseline performance at 42% of the memory cost, with negligible impact on downstream task accuracy.

| Model | Translation |
|---|---|
| DeepSeek V4 Pro | 我们提出了一种新颖的稀疏注意力机制,将内存复杂度从 O(n^2) 降低至 O(n log n),同时保留了全自注意力的表达能力。在 GLUE 基准上的实证结果表明,我们的方法以 42% 的内存开销保持了 98.7% 的基线性能,且对下游任务准确率的影响可忽略不计。 |
| DeepSeek V4 Flash | 我们提出了一种新颖的稀疏注意力机制,将内存复杂度从 O(n^2) 降低至 O(n log n),同时保持完整自注意力的表达能力。在 GLUE 基准上的实验结果表明,我们的方法在仅消耗 42% 内存成本的情况下,保留了基线模型 98.7% 的性能,且对下游任务准确率的影响可忽略不计。 |
| DeepSeek V3.2 | 我们提出了一种新颖的稀疏注意力机制,将内存复杂度从O(n²)降低至O(n log n),同时保持了完整自注意力的表达能力。在GLUE基准测试上的实证结果表明,该方法仅以42%的内存成本即保留了基线模型98.7%的性能,对下游任务准确率的影响微乎其微。 |
| GPT-5.4 | 我们提出了一种新颖的稀疏注意力机制,在保留完整自注意力表达能力的同时,将内存复杂度从 O(n^2) 降低到 O(n log n)。在 GLUE 基准测试上的实证结果表明,我们的方法以 42% 的内存成本保留了基线性能的 98.7%,对下游任务准确率的影响可以忽略不计。 |
| Claude Opus 4.7 | 我们提出一种新颖的稀疏注意力机制,将内存复杂度从 O(n^2) 降低至 O(n log n),同时保留了完整自注意力的表达能力。在 GLUE 基准测试上的实证结果表明,我们的方法在仅占基线 42% 内存开销的情况下保留了 98.7% 的性能,对下游任务精度的影响可以忽略不计。 |
| Gemini 3 Pro Preview | 我们提出了一种新颖的稀疏注意力机制,将内存复杂度从 O(n^2) 降低至 O(n log n),同时保留了完整自注意力的表达能力。在 GLUE 基准测试上的实验结果表明,我们的方法在仅需 42% 内存开销的情况下保留了 98.7% 的基线性能,且对下游任务准确率的影响可忽略不计。 |

🏆 Three-way tie: DeepSeek V4 Pro / GPT-5.4 / Gemini 3 Pro (all 4.9/5)

  • Academic writing is where DeepSeek V4 Pro shines. It ties with GPT-5.4 here.
  • The only clear deduction went to V3.2 — it silently changed O(n^2) into Unicode superscript O(n²), a minor formatting slip. This is also the most visible improvement V4 brings over V3.2: V4 now leaves math notation exactly as written.
  • Claude's "精度" instead of "准确率" is an acceptable synonym; not penalized.

Scenario 2: Legal contract clause (EN → ZH)

Source: The Licensor hereby grants to the Licensee a non-exclusive, non-transferable, revocable license to use the Software solely for internal business purposes, provided that the Licensee shall not, directly or indirectly, reverse-engineer, decompile, or disassemble the Software in whole or in part, except to the extent such restriction is expressly prohibited by applicable law.

| Model | Translation |
|---|---|
| DeepSeek V4 Pro | 许可方特此授予被许可方一项非排他性、不可转让、可撤销的许可,允许其仅为内部业务目的使用本软件,但被许可方不得直接或间接对本软件的全部或部分进行逆向工程、反编译或反汇编,除非适用法律明确禁止此类限制。 |
| DeepSeek V4 Flash | 许可方特此授予被许可方一项非独占、不可转让、可撤销的许可,允许其仅为内部业务目的使用本软件,但被许可方不得直接或间接地对本软件的全部或部分进行反向工程、反编译或反汇编,除非适用法律明确禁止此类限制。 |
| DeepSeek V3.2 | 许可方在此授予被许可方一项非排他性、不可转让、可撤销的许可,允许被许可方仅出于内部业务目的使用本软件,但被许可方不得直接或间接对本软件进行全部或部分的反向工程、反编译或反汇编,除非适用法律明确禁止此类限制。 |
| GPT-5.4 | 许可方特此授予被许可方一项非排他性、不可转让、可撤销的许可,仅可将本软件用于内部业务目的,前提是被许可方不得直接或间接对本软件全部或部分进行逆向工程、反编译或反汇编,但适用法律明确禁止此类限制的范围除外 |
| Claude Opus 4.7 | 许可方特此授予被许可方一项非独占性、不可转让、可撤销的许可,允许其仅出于内部业务目的使用本软件;但被许可方不得直接或间接地对本软件的全部或任何部分进行反向工程、反编译或反汇编,除非适用法律明确禁止作出此类限制。 |
| Gemini 3 Pro Preview | 许可方特此授予被许可方一项非排他、不可转让、可撤销的许可,允许其仅为内部业务目的使用软件,前提是被许可方不得直接或间接地对软件的全部或部分进行逆向工程、反编译或反汇编,除非适用法律明确禁止此类限制。 |

🏆 Winner: GPT-5.4 (perfect 5.0/5)

  • Legal translation lives or dies by preserving the conditional structure. The source has "provided that ... except to the extent ..." — two nested conditions.
  • Only GPT-5.4 rendered "provided that" as "前提是" and "except to the extent" as "但...除外", fully preserving the conditional grant + exception clause structure.
  • Every other model collapsed "provided that" into a plain "但", losing the conditional-grant force — harmless for casual reading, but a contract lawyer would flinch.
  • DeepSeek V4 Pro and Flash used valid legal terms (both "非排他性" and "非独占" are legally acceptable), but lost points on structural preservation.

Scenario 3: Technical doc with inline code (EN → ZH)

Source: Set cache.ttl to a value greater than 0 (in seconds) to enable persistent caching. When cache.ttl=-1, the cache never expires. Note: enabling cache.strict_mode bypasses the TTL and forces revalidation on every request. See docs/caching.md for advanced patterns.

| Model | Translation |
|---|---|
| DeepSeek V4 Pro | 将 cache.ttl 设置为大于 0 的值(单位:秒)以启用持久缓存。当 cache.ttl=-1 时,缓存永不过期。注意:启用 cache.strict_mode 将绕过 TTL 并强制每次请求进行重新验证。请参阅 docs/caching.md 了解高级模式。 |
| DeepSeek V4 Flash | 将 cache.ttl 设置为大于 0 的值(以秒为单位)以启用持久缓存。当 cache.ttl=-1 时,缓存永不过期。注意:启用 cache.strict_mode 会绕过 TTL 并强制在每个请求时重新验证。高级模式请参见 docs/caching.md |
| DeepSeek V3.2 | 将 cache.ttl 设置为大于 0 的值(单位为秒)以启用持久化缓存。当 cache.ttl=-1 时,缓存永不过期。注意:启用 cache.strict_mode 将绕过 TTL 设置,强制每次请求时重新验证缓存。高级用法请参阅 docs/caching.md |
| GPT-5.4 | 将 cache.ttl 设置为大于 0 的值(以秒为单位)以启用持久缓存。当 cache.ttl=-1 时,缓存永不过期。注意:启用 cache.strict_mode 会绕过 TTL,并在每次请求时强制重新验证。有关高级模式,请参阅 docs/caching.md |
| Claude Opus 4.7 | 将 cache.ttl 设置为大于 0 的值(单位:秒)可启用持久化缓存。当 cache.ttl=-1 时,缓存永不过期。注意:启用 cache.strict_mode 会绕过 TTL,并在每次请求时强制重新校验。进阶用法请参见 docs/caching.md |
| Gemini 3 Pro Preview | 将 cache.ttl 设置为大于 0 的值(以秒为单位)以启用持久化缓存。当 cache.ttl=-1 时,缓存永不过期。注意:启用 cache.strict_mode 会绕过 TTL 并在每次请求时强制重新验证。有关高级模式,请参阅 docs/caching.md |

🏆 Winner: DeepSeek V3.2 (4.8/5)

The biggest surprise of the review: the previous-generation V3.2 actually beat every flagship on technical docs. Why:

  • All 6 models preserved inline code perfectly (cache.ttl etc.) — no differentiator there.
  • V3.2's Chinese fluency and technical vocabulary feel the most like documentation a Chinese engineer would actually write.
  • V4 Pro and V4 Flash each lost a point on fluency — V4 Pro's "强制每次请求进行重新验证" is a bit stiff; Flash's "在每个请求时" reads like translationese (the native form is "每次请求").
  • Our read: DeepSeek's Chinese technical-writing skill peaked back at V3.2. V4's improvements land elsewhere (reasoning, coding), and the reasoning chain actually adds slight friction in this scenario.

Scenario 4: Literary prose — Lu Xun's My Old Home (ZH → EN)

Source: 我冒了严寒,回到相隔二千余里,别了二十余年的故乡去。时候既然是深冬;渐近故乡时,天气又阴晦了,冷风吹进船舱中,呜呜的响,从篷隙向外一望,苍黄的天底下,远近横着几个萧索的荒村,没有一些活气。

| Model | Translation |
|---|---|
| DeepSeek V4 Pro | Braving the bitter cold, I traveled back to my hometown, which lay over two thousand li away and from which I had been separated for more than twenty years. It was deep winter. As I approached my hometown, the weather turned gloomy again. A cold wind blew into the cabin of the boat, moaning. Through a chink in the awning, I looked out and saw a few desolate, lifeless villages lying scattered far and near under a pale yellow sky. |
| DeepSeek V4 Flash | Braving the bitter cold, I traveled back to my hometown, a place I had left over twenty years ago, more than a thousand li away. It was already deep winter; as we drew near my hometown, the weather turned gloomy, and the cold wind whistled through the cabin. Peering out from under the awning, I could see a bleak, yellowish sky, with several desolate villages scattered horizontally here and there, devoid of any signs of life. |
| DeepSeek V3.2 | Braving the bitter cold, I traveled more than seven hundred miles back to the old home I had left over twenty years ago. It was late winter. As we drew near my former home the day became overcast and a cold wind blew into the cabin of our boat, while all one could see through the chinks in our bamboo awning were a few desolate villages, void of any sign of life, scattered far and near under the somber yellow sky. |
| GPT-5.4 | Braving the severe cold, I returned to my hometown, more than two thousand li away and from which I had been separated for more than twenty years. It was already deep winter; and as I gradually drew near my hometown, the weather grew overcast again. The cold wind blew into the cabin, making a whistling sound. Looking out through a crack in the awning, beneath the dusky yellow sky, there lay here and there a few desolate villages, barren and lifeless. |
| Claude Opus 4.7 | Braving the bitter cold, I returned to the home I had not seen in over twenty years, more than two thousand li away. It was already deep winter; as I drew near, the sky grew overcast, and a cold wind moaned through the cabin. Peering out through a gap in the awning, I saw, beneath the sallow sky, a few desolate villages scattered near and far, without a trace of life. |
| Gemini 3 Pro Preview | Braving the bitter cold, I returned to my hometown, over two thousand li away, which I had left more than twenty years ago. It was late winter; as I drew closer to my hometown, the weather turned gloomy. Cold winds blew into the cabin, howling. Looking out through the chinks in the awning, under a pale yellow sky, scattered far and near lay a few desolate, barren villages, devoid of any sign of life. |

🏆 Winner: Claude Opus 4.7 (4.7/5)

  • Literary translation is about voice, image, and rhythm. Claude's "sallow sky" for 苍黄 is the single most carefully chosen word of the round — it captures the sickly, jaundiced quality the Chinese carries. "moaned through the cabin" beats "whistling" / "howling" at conveying the desolation of 呜呜.
  • DeepSeek V4 Pro did well too (4.6) — "moaning", "chink in the awning" are literary choices, showing V4 does gain real ground on literary prose over V3.2.
  • DeepSeek V4 Flash crashed hardest (3.7, last place):
    • Translated "二千余里" as "more than a thousand li" — a factual error.
    • "scattered horizontally" is a stiff, word-for-word rendering of 横着 that reads unnaturally in English.
  • DeepSeek V3.2 scored 4.4, and appears to be reproducing the canonical Yang Xianyi & Gladys Yang translation ("seven hundred miles... void of any sign of life... bamboo awning" are essentially quoted). Readers can decide whether that's "memorization" or "translation".

Scenario 5: Manga dialogue (JA → EN)

Source: 「くそっ…まだだ、まだ終わっちゃいねぇ!」 「お前はもう限界だろう。素直に引き下がれ。」 「うるせぇ!守りたいもんがある限り、俺は何度でも立ち上がるんだよ!」

DeepSeek V4 Pro:
「Damn it... Not yet, it's not over yet!」
「You're already at your limit. Just back down gracefully.」
「Shut up! As long as I have something to protect, I'll get back up as many times as it takes!」

DeepSeek V4 Flash:
"Dammit... Not yet, it's not over yet!"
"You're at your limit. Just back down obediently."
"Shut up! As long as there's something I want to protect, I'll get up as many times as it takes!"

DeepSeek V3.2:
"Damn it... not yet, it's not over yet!"
"You've already reached your limit. Just back down honestly."
"Shut up! As long as there's something I want to protect, I'll get back up as many times as it takes!"

GPT-5.4:
"Damn it... not yet, it's not over yet!"
"You've already reached your limit. Back down quietly."
"Shut up! As long as there's something I want to protect, I'll get back up as many times as it takes!"

Claude Opus 4.7:
"Damn it... not yet—it's not over yet!"
"You're already past your limit. Just back down, nice and quiet."
"Shut up! As long as there's something I want to protect, I'll stand back up as many times as it takes!"

Gemini 3 Pro Preview:
"Damn it... not yet, it's not over yet!"
"You're already at your limit. Just back down."
"Shut up! As long as I have something to protect, I'll stand up as many times as it takes!"

🏆 Tied winners: GPT-5.4 / Gemini 3 Pro (both 4.5/5)

This scenario surfaced a very interesting DeepSeek V4 Pro quirk:

🚨 V4 Pro kept the Japanese corner brackets 「」 verbatim in the English output. This is an obvious formatting error: when translating into English, they should become standard English quotation marks. V4 Pro was probably "too diligent" during its reasoning step, interpreting "preserve formatting" as "preserve quote characters". The formatting score dropped to 2/5, dragging its overall to 3.1, last place for this scenario.

A real bug worth flagging to the DeepSeek team: reasoning models over-conservatively preserving source formatting, including punctuation that shouldn't survive translation.

  • Every other model handled quotes correctly.
  • On sentence-end particles, Claude's "nice and quiet" best captures the "just back down quietly" feel of 素直に引き下がれ. V4 Flash and V3.2 went with the literal "obediently/honestly", which reads like translationese.
  • "うるせぇ!" → "Shut up!" across the board — fine.
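Until a model-side fix ships, this class of error is cheap to catch in post-processing. A one-function sketch (a hypothetical guard of our own, not part of any DeepSeek or BelinDoc API):

```javascript
// Replace any Japanese corner brackets that survive JA->EN translation
// with plain English double quotes. Opening and closing brackets both
// map to '"', which is correct for balanced pairs.
function normalizeQuotes(text) {
  return text.replace(/[「」]/g, '"');
}
```

Running every JA → EN output line through a guard like this would have turned V4 Pro's 2/5 formatting score into a non-issue.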

3. Latency, tokens, and cost

| Model | Avg latency | Avg output tokens | Reasoning tokens | Note |
|---|---|---|---|---|
| DeepSeek V4 Flash | 4.7 s | 247 | 174 | Shallow reasoning, best value in V4 family |
| DeepSeek V3.2 | 4.6 s | 730 | 0 | No reasoning, steady veteran |
| GPT-5.4 | 4.5 s | 850 | hidden | Reasoning hidden, most balanced |
| Gemini 3 Pro Preview | 14.2 s | 844 | 767 | Heavy reasoning, slow but solid |
| DeepSeek V4 Pro | 17.1 s | 562 | 488 | Heavy reasoning, slowest in the test |
| Claude Opus 4.7 | n/a | n/a | n/a | Not routed through API; figures from public specs |

Highlights:

  • V4 Pro is about 4× slower than V4 Flash, yet no higher in quality (4.38 vs 4.38 overall). For most translation work, Flash is plenty; Pro only earns its keep on long-context, deep-reasoning tasks.
  • Gemini 3 Pro Preview has the heaviest reasoning tax (767 reasoning tokens on average), but it does pay off — 3rd place overall.
  • GPT-5.4 is the sweet spot for latency/quality: 4.5 s response, no reasoning tokens exposed, ranked 1st overall.

⚠️ A Bun benchmarking note: we initially ran our script with Bun's fetch, and DeepSeek V4 kept showing 170–250 ms latencies, absurdly fast. Switching to Node's fetch restored the expected 9–35 s range. We suspect Bun mis-measures performance.now() on certain streaming responses. All latency numbers here are from Node.


4. Which model should you actually pick?

Based on our 6-model test, here is the pick-by-use-case guide:

⚖️ Legal contracts, agreements

Go with GPT-5.4: the only model that reliably preserves nested conditional structure. One wrong conditional turns a valid clause invalid.

🎓 Academic papers, technical reports

Three-way tie: GPT-5.4 / Gemini 3 Pro / DeepSeek V4 Pro. If you're cost-sensitive and translating into Chinese, DeepSeek V4 Pro gives the best price-to-quality in this category.

💻 Chinese technical docs, API manuals, Markdown

DeepSeek V3.2 or V4 Flash are plenty. Chinese technical writing has been DeepSeek's strength since V3.2 — V4 Pro actually feels a little stiffer because the reasoning chain overcomplicates simple doc prose. Rare case where an older model is the right choice.

📖 Literary translation, novels, essays

Claude Opus 4.7 is the pick. Best lexical taste and rhythm. DeepSeek V4 Pro is 2nd — a historical high for DeepSeek on literary content. Skip DeepSeek V4 Flash: the literal "more than a thousand li" factual error is disqualifying.

🎌 Manga, light novels, anime-adjacent content

GPT-5.4 or Gemini 3 Pro. DeepSeek V4 Pro has a clear "corner bracket bug" for JP → EN — don't use it for manga localization until DeepSeek ships a fix.


5. Verdict: is DeepSeek V4 worth switching to?

✅ Yes, if…

  • Your core use case is academic or legal translation into Chinese — V4 Pro is only 0.3 behind GPT-5.4, at a fraction of the price.
  • You're budget- or latency-sensitive — V4 Flash matched V4 Pro's overall score at 4.7 s latency, the stealth winner of this review.
  • You run long-context reasoning tasks — V4 Pro's reasoning chain is a genuine step up from V3.2.

⚠️ Hold off, if…

  • Your core use case is manga / light novels — wait for DeepSeek to fix the corner-bracket preservation bug.
  • You do high-end literary translation — Claude and V4 Pro both work, but Claude still has better lexical taste.
  • You value "most reliable" over "cheapest" — GPT-5.4 is #1 overall with the best latency/quality balance.

Try it on your own documents with BelinDoc

This review used 5 short snippets. Your documents are probably longer and more complex — contracts with numbered clauses, papers with formulas and figures, manga with margin notes. Short-sample conclusions don't always map 1:1 to real files.

The best way to decide: upload your own document and compare.

👉 Upload your PDF / EPUB / Word and start translating

BelinDoc lets you switch translation models on the fly, keeps your original layout intact, and lets you compare multiple models on the same file from a single upload.

