DeepSeek has released a preview version of DeepSeek V4, its new flagship model family.
The release includes two main models:
- DeepSeek-V4-Pro
- DeepSeek-V4-Flash
This is not a minor update to V3. DeepSeek is presenting V4 as a major architectural step forward, especially in long-context efficiency, reasoning modes, and agent-style workloads.
Two models, two scales
The DeepSeek V4 preview comes in two sizes.
DeepSeek-V4-Pro
- 1.6 trillion total parameters
- 49 billion activated parameters
- 1 million token context window
DeepSeek-V4-Flash
- 284 billion total parameters
- 13 billion activated parameters
- 1 million token context window
Both are Mixture-of-Experts models, which means only part of the model is activated for each token during inference. That is important because it lets DeepSeek push total parameter count much higher without paying the full dense-model cost on every step.
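The routing idea behind that trade-off can be sketched in a few lines. Everything below is an illustrative toy (expert count, dimensions, ReLU experts), not V4's actual design:

```python
import numpy as np

def moe_forward(x, gate_w, experts, k=2):
    """Route one token through only k of n experts.

    x: (d,) token hidden state; gate_w: (d, n_experts) router weights;
    experts: list of (w_in, w_out) weight pairs. All sizes illustrative.
    """
    logits = x @ gate_w
    topk = np.argsort(logits)[-k:]                      # pick the k best-scoring experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                            # softmax over the selected experts
    out = np.zeros_like(x)
    for w, idx in zip(weights, topk):
        w_in, w_out = experts[idx]
        out += w * (np.maximum(x @ w_in, 0.0) @ w_out)  # tiny ReLU expert MLP
    return out

rng = np.random.default_rng(0)
d, n_experts = 16, 8
experts = [(rng.normal(size=(d, 4 * d)), rng.normal(size=(4 * d, d)))
           for _ in range(n_experts)]
y = moe_forward(rng.normal(size=d), rng.normal(size=(d, n_experts)), experts)
print(y.shape)  # (16,)
```

Only 2 of the 8 expert MLPs run per token, which is exactly how V4-Pro can hold 1.6T parameters while activating only 49B.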
The biggest technical change: a new long-context architecture
The most important architecture change in V4 is DeepSeek's new hybrid attention design.
DeepSeek says V4 combines:
- Compressed Sparse Attention (CSA)
- Heavily Compressed Attention (HCA)
The purpose is to make very long contexts far more practical.
According to DeepSeek's model card, at the 1 million token context length, DeepSeek-V4-Pro needs only:
- 27% of the single-token inference FLOPs
- 10% of the KV cache
compared with DeepSeek-V3.2.
That is one of the most important technical claims in the release. DeepSeek is not just increasing context size. It is claiming a large improvement in the efficiency of using that context.
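To see why the 10% figure matters at this scale, here is the standard KV-cache size formula applied to a hypothetical dense configuration. The layer and head counts below are placeholders, not DeepSeek's published architecture:

```python
# Standard KV-cache sizing, with *hypothetical* layer/head numbers, to show
# what a 10x reduction buys at a 1M-token context.
layers, kv_heads, head_dim = 61, 8, 128   # hypothetical config, not DeepSeek's
bytes_per_value = 1                        # assume an FP8 cache, 1 byte/element
seq_len = 1_000_000

# K and V each store layers * kv_heads * head_dim values per token.
baseline = 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len
v4_claim = 0.10 * baseline                 # "10% of the KV cache"

print(f"baseline: {baseline / 2**30:.1f} GiB, at 10%: {v4_claim / 2**30:.1f} GiB")
```

Even with these made-up numbers, the cache drops from hundreds of GiB to something a single node can plausibly hold, which is the practical difference between a 1M-token context on paper and one you can serve.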
Other architecture and training changes
DeepSeek also highlights two other major technical upgrades.
1. Manifold-Constrained Hyper-Connections
DeepSeek says it uses Manifold-Constrained Hyper-Connections (mHC) to strengthen standard residual connections.
In practical terms, the company says this improves the stability of signal propagation across layers while preserving model expressivity. That suggests V4 is trying to improve both training stability and deep reasoning behavior at scale.
2. Muon optimizer
DeepSeek says V4 uses the Muon optimizer.
The claim here is faster convergence and more stable training. DeepSeek also says the V4 models were pre-trained on more than 32 trillion tokens before going through a multi-stage post-training pipeline.
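Muon's publicly described update rule replaces the element-wise step of AdamW with an approximately orthogonalized matrix step: momentum is accumulated per weight matrix, then pushed toward an orthogonal matrix with a Newton-Schulz iteration. DeepSeek has not published its exact variant; the sketch below follows the public reference formulation, with coefficients taken from that reference code:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize a matrix via a quintic Newton-Schulz
    iteration, the core of the publicly described Muon update."""
    a, b, c = 3.4445, -4.7750, 2.0315     # coefficients from the reference code
    x = g / (np.linalg.norm(g) + 1e-7)    # scale so the iteration converges
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

def muon_step(w, grad, momentum, lr=0.02, beta=0.95):
    """One Muon update: momentum accumulation, then an orthogonalized step."""
    momentum = beta * momentum + grad
    w = w - lr * newton_schulz_orthogonalize(momentum)
    return w, momentum

rng = np.random.default_rng(1)
w = rng.normal(size=(8, 8))
w2, m2 = muon_step(w, rng.normal(size=(8, 8)), np.zeros_like(w))
```

Because every singular direction of the update gets roughly equal magnitude, the step size is less sensitive to gradient scale, which is one plausible reading of the "faster convergence, more stable training" claim.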
Post-training and expert consolidation
DeepSeek's post-training description is unusually detailed.
The company says the pipeline has two stages:
- independent cultivation of domain-specific experts through supervised fine-tuning and reinforcement learning with GRPO
- unified model consolidation via on-policy distillation
The idea is that different domain experts are trained first, then merged into one stronger general model. That helps explain why DeepSeek is emphasizing both specialist capability and unified agent performance.
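GRPO's defining trick, as described in DeepSeek's earlier published work, is that each sampled response is scored against the mean and standard deviation of its own group of samples, so no separate value network is needed. A minimal sketch of that group-relative advantage:

```python
import numpy as np

def grpo_advantages(group_rewards):
    """Group-relative advantages as in GRPO: normalize each sampled
    response's reward against its own group's mean and std."""
    r = np.asarray(group_rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four samples for one prompt: two scored correct (1.0), two wrong (0.0).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)  # correct answers get ~+1.0, wrong ones ~-1.0
```

These advantages then weight a standard policy-gradient update, which is what makes the scheme cheap enough to run per-domain before the distillation stage merges the experts.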
Reasoning modes
DeepSeek V4 supports three reasoning effort modes for both Pro and Flash:
- Non-think
- Think High
- Think Max
DeepSeek describes them this way:
Non-think
Fast, intuitive responses for routine tasks.
Think High
Slower but more deliberate reasoning for complex problem-solving and planning.
Think Max
The highest reasoning effort mode, designed to push the model to its limit.
This is an important part of the release because it shows DeepSeek is productizing reasoning depth rather than treating reasoning as one fixed default behavior.
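If the modes are exposed as a request parameter, selecting one might look like the payload below. To be clear, the `reasoning_effort` field name and its values are hypothetical illustrations; DeepSeek's release describes the three modes but the API surface is not specified here:

```python
import json

# Hypothetical request payload. "reasoning_effort" and its string values are
# illustrative, not a documented DeepSeek API field.
request = {
    "model": "deepseek-v4-pro",
    "reasoning_effort": "think-high",   # one of: non-think, think-high, think-max
    "messages": [{"role": "user", "content": "Plan a 10-step refactor."}],
}
print(json.dumps(request, indent=2))
```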
Benchmark profile: Pro model
DeepSeek's strongest benchmark story is around DeepSeek-V4-Pro Max, which appears to denote the Pro model running in its Think Max reasoning mode.
Published results include:
- MMLU-Pro: 87.5
- SimpleQA-Verified: 57.9
- Chinese-SimpleQA: 84.4
- GPQA Diamond: 90.1
- HLE: 37.7
- LiveCodeBench: 93.5
- Codeforces: 3206
- IMOAnswerBench: 89.8
- MRCR 1M: 83.5
- CorpusQA 1M: 62.0
- Terminal Bench 2.0: 67.9
- SWE Verified: 80.6
- SWE Pro: 55.4
- SWE Multilingual: 76.2
- BrowseComp: 83.4
- MCPAtlas Public: 73.6
- Toolathlon: 51.8
DeepSeek's own comparison table positions V4-Pro Max against frontier proprietary models from Anthropic, OpenAI, Google, Moonshot, and Zhipu. The results suggest V4-Pro is strongest in coding, math-heavy reasoning, million-token context tasks, and several agent benchmarks, while still trailing the strongest closed models on some top-end knowledge tasks.
Benchmark profile: base models
DeepSeek also published base-model results.
DeepSeek-V4-Flash-Base
- MMLU: 88.7
- MMLU-Pro: 68.3
- HumanEval: 69.5
- LongBench-V2: 44.7
DeepSeek-V4-Pro-Base
- MMLU: 90.1
- MMLU-Pro: 73.5
- HumanEval: 76.8
- LongBench-V2: 51.5
These numbers suggest the Pro model is not only larger, but meaningfully stronger in knowledge-heavy and long-context tasks.
Flash versus Pro
The relationship between the two models is straightforward.
DeepSeek-V4-Pro is the high-end flagship. DeepSeek-V4-Flash is the smaller and cheaper model.
What is notable is that DeepSeek says V4-Flash-Max, the Flash model at its highest reasoning effort, can approach Pro-level reasoning performance when given a larger thinking budget. That suggests Flash is not meant to be only a lightweight chat model; it is still part of the serious reasoning line, just at a smaller scale.
Precision and deployment format
DeepSeek lists the deployed formats as:
- FP8 Mixed for base models
- FP4 + FP8 Mixed for instruct models
DeepSeek notes that the MoE expert parameters use FP4, while most other parameters use FP8.
That matters because it shows DeepSeek is aggressively optimizing deployment efficiency, not only training a large model and stopping there.
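A back-of-envelope calculation shows the payoff. The expert/non-expert split below is a hypothetical assumption (DeepSeek only says experts are FP4 and most other parameters are FP8), but the order of magnitude is instructive:

```python
# Rough weight-memory estimate for DeepSeek-V4-Pro under the stated
# FP4 + FP8 mixed format. The 95% expert share is an assumed split,
# not a published figure.
total_params = 1.6e12
expert_frac = 0.95                                    # hypothetical share in MoE experts
expert_bytes = expert_frac * total_params * 0.5       # FP4 = 0.5 bytes/param
other_bytes = (1 - expert_frac) * total_params * 1.0  # FP8 = 1 byte/param
total_gib = (expert_bytes + other_bytes) / 2**30
print(f"~{total_gib:.0f} GiB of weights")
```

Under these assumptions a 1.6T-parameter model fits in well under 1 TiB of weight memory, versus roughly 3 TiB at FP16, which is the difference between a modest multi-GPU node and a small cluster.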
Local deployment notes
DeepSeek says both V4 models can be run locally and points users to its inference instructions.
The company also recommends:
- temperature = 1.0
- top_p = 1.0
for local deployment, and says Think Max should be run with at least 384K context available.
That is a useful signal about intended usage: V4 is clearly designed for very large working contexts and longer reasoning traces.
License
DeepSeek says the V4 repository and weights are under the MIT License.
That is significant because it makes the preview release much easier for the open-source ecosystem to test and integrate.
Bottom line
DeepSeek V4 is now real, but it is arriving first as a preview release.
The key technical facts are clear:
- DeepSeek-V4-Pro: 1.6T total parameters, 49B activated
- DeepSeek-V4-Flash: 284B total parameters, 13B activated
- 1 million token context window
- new CSA + HCA hybrid attention
- mHC residual upgrade
- Muon optimizer
- 32T+ pretraining tokens
- three reasoning modes: Non-think, Think High, Think Max
The biggest story is not only model size. It is that DeepSeek is trying to make million-token reasoning and agent work far more efficient than in V3.2. That is the main technical advance to watch.