
April 24, 2026 · TokenDock Team

DeepSeek V4 Preview Released With 1.6T Pro Model, 284B Flash Model, and 1M Context

DeepSeek V4 preview launches with Pro and Flash variants, a 1M-token context window, hybrid attention, and strong coding and agent benchmarks.


DeepSeek has released a preview version of DeepSeek V4, its new flagship model family.

The release includes two main models:

  • DeepSeek-V4-Pro
  • DeepSeek-V4-Flash

This is not a minor update to V3. DeepSeek is presenting V4 as a major architectural step forward, especially in long-context efficiency, reasoning modes, and agent-style workloads.

Two models, two scales

The DeepSeek V4 preview comes in two sizes.

DeepSeek-V4-Pro

  • 1.6 trillion total parameters
  • 49 billion activated parameters
  • 1 million token context window

DeepSeek-V4-Flash

  • 284 billion total parameters
  • 13 billion activated parameters
  • 1 million token context window

Both are Mixture-of-Experts models, which means only part of the model is activated for each token during inference. That is important because it lets DeepSeek push total parameter count much higher without paying the full dense-model cost on every step.
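The scale of that saving is visible directly in the published numbers: only a few percent of each model's weights participate in any single token. A quick back-of-envelope check:

```python
# Activated-parameter fraction per token, using the published V4 numbers.
models = {
    "DeepSeek-V4-Pro":   (1600e9, 49e9),   # (total, activated) parameters
    "DeepSeek-V4-Flash": (284e9, 13e9),
}

for name, (total, active) in models.items():
    print(f"{name}: {active / total:.1%} of weights active per token")
# DeepSeek-V4-Pro: 3.1% of weights active per token
# DeepSeek-V4-Flash: 4.6% of weights active per token
```

So despite the 1.6T headline figure, each forward pass through Pro touches roughly the compute of a ~49B dense model.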

The biggest technical change: a new long-context architecture

The most important architecture change in V4 is DeepSeek's new hybrid attention design.

DeepSeek says V4 combines:

  • Compressed Sparse Attention (CSA)
  • Heavily Compressed Attention (HCA)

The purpose is to make very long contexts far more practical.

According to DeepSeek's model card, at the 1 million token context length, DeepSeek-V4-Pro needs only:

  • 27% of the single-token inference FLOPs
  • 10% of the KV cache

compared with DeepSeek-V3.2.

That is one of the most important technical claims in the release. DeepSeek is not just increasing context size. It is claiming a large improvement in the efficiency of using that context.
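To put the KV-cache claim in perspective, here is a rough sketch of what a conventional dense-attention cache costs at 1 million tokens. The layer count, head count, and head dimension below are hypothetical placeholders, not DeepSeek's actual configuration; only the 10% ratio comes from the model card.

```python
def kv_cache_gib(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # K and V each store n_layers * n_kv_heads * head_dim values per token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

# Hypothetical dense-attention config (NOT V4's real dimensions).
baseline = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128,
                        seq_len=1_000_000)

# Applying the claimed 10%-of-KV-cache figure for V4-Pro at 1M tokens:
print(f"dense baseline ≈ {baseline:.0f} GiB, at 10% ≈ {baseline * 0.1:.0f} GiB")
```

Whatever the real dimensions are, a 10x cache reduction at this scale is the difference between a multi-GPU serving problem and a single-node one.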

Other architecture and training changes

DeepSeek also highlights two other major technical upgrades.

1. Manifold-Constrained Hyper-Connections

DeepSeek says it uses Manifold-Constrained Hyper-Connections (mHC) to strengthen standard residual connections.

In practical terms, the company says this improves the stability of signal propagation across layers while preserving model expressivity. That suggests V4 is trying to improve both training stability and deep reasoning behavior at scale.

2. Muon optimizer

DeepSeek says V4 uses the Muon optimizer.

The claim here is faster convergence and more stable training. DeepSeek also says the V4 models were pre-trained on more than 32 trillion tokens before going through a multi-stage post-training pipeline.

Post-training and expert consolidation

DeepSeek's post-training description is unusually detailed.

The company says the pipeline has two stages:

  1. independent cultivation of domain-specific experts through supervised fine-tuning and reinforcement learning with GRPO
  2. unified model consolidation via on-policy distillation

The idea is that different domain experts are trained first, then merged into one stronger general model. That helps explain why DeepSeek is emphasizing both specialist capability and unified agent performance.
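GRPO is the detail worth unpacking here. Its core move is to score each sampled response against the group of responses generated for the same prompt, normalizing rewards by the group mean and standard deviation instead of training a separate value model. A minimal sketch of that advantage computation (illustrative rewards, not DeepSeek's actual pipeline):

```python
def grpo_advantages(rewards, eps=1e-8):
    # Group-relative advantage: (reward - group mean) / group std.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Four sampled answers to one prompt, scored by a reward function.
print(grpo_advantages([1.0, 0.0, 0.5, 0.5]))
# Above-average answers get positive advantage, below-average negative.
```

Because the baseline is just the group average, the method stays cheap even at the scale of a 1.6T-parameter policy.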

Reasoning modes

DeepSeek V4 supports three reasoning effort modes for both Pro and Flash:

  • Non-think
  • Think High
  • Think Max

DeepSeek describes them this way:

Non-think

Fast, intuitive responses for routine tasks.

Think High

Slower but more deliberate reasoning for complex problem-solving and planning.

Think Max

The highest reasoning effort mode, designed to push the model to its limit.

This is an important part of the release because it shows DeepSeek is productizing reasoning depth rather than treating reasoning as one fixed default behavior.
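If the preview is served through a chat-style API, mode selection would presumably be a per-request parameter. The sketch below is purely illustrative: the field name reasoning_effort and its values are assumptions inferred from the mode names, not a documented DeepSeek interface.

```python
import json

# Hypothetical mode identifiers derived from the announced mode names.
VALID_MODES = {"non-think", "think-high", "think-max"}

def build_request(prompt, mode="non-think"):
    # Hypothetical request body for an OpenAI-compatible chat endpoint.
    if mode not in VALID_MODES:
        raise ValueError(f"unknown reasoning mode: {mode}")
    return {
        "model": "deepseek-v4-pro",
        "messages": [{"role": "user", "content": prompt}],
        "reasoning_effort": mode,   # assumed field name, not confirmed
    }

print(json.dumps(build_request("Refactor this module.", mode="think-max")))
```

The practical point stands regardless of the exact field name: reasoning depth becomes something callers budget for per request, not a fixed property of the model.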

Benchmark profile: Pro model

DeepSeek's strongest benchmark story is around DeepSeek-V4-Pro Max, which appears to be the Pro model evaluated in its Think Max reasoning mode.

Published results include:

  • MMLU-Pro: 87.5
  • SimpleQA-Verified: 57.9
  • Chinese-SimpleQA: 84.4
  • GPQA Diamond: 90.1
  • HLE: 37.7
  • LiveCodeBench: 93.5
  • Codeforces: 3206
  • IMOAnswerBench: 89.8
  • MRCR 1M: 83.5
  • CorpusQA 1M: 62.0
  • Terminal Bench 2.0: 67.9
  • SWE Verified: 80.6
  • SWE Pro: 55.4
  • SWE Multilingual: 76.2
  • BrowseComp: 83.4
  • MCPAtlas Public: 73.6
  • Toolathlon: 51.8

DeepSeek's own comparison table positions V4-Pro Max against frontier proprietary models from Anthropic, OpenAI, Google, Moonshot, and Zhipu. The results suggest V4-Pro is strongest in coding, math-heavy reasoning, million-token context tasks, and several agent benchmarks, while still trailing the strongest closed models on some top-end knowledge tasks.

Benchmark profile: base models

DeepSeek also published base-model results.

DeepSeek-V4-Flash-Base

  • MMLU: 88.7
  • MMLU-Pro: 68.3
  • HumanEval: 69.5
  • LongBench-V2: 44.7

DeepSeek-V4-Pro-Base

  • MMLU: 90.1
  • MMLU-Pro: 73.5
  • HumanEval: 76.8
  • LongBench-V2: 51.5

These numbers suggest the Pro model is not only larger, but meaningfully stronger in knowledge-heavy and long-context tasks.

Flash versus Pro

The relationship between the two models is straightforward.

DeepSeek-V4-Pro is the high-end flagship. DeepSeek-V4-Flash is the smaller and cheaper model.

What is notable is that DeepSeek says V4-Flash-Max can approach Pro-level reasoning performance when given a larger thinking budget. That suggests Flash is not meant to be a lightweight chat model only. It is still part of the serious reasoning line, just at a smaller scale.

Precision and deployment format

DeepSeek lists the deployed formats as:

  • FP8 Mixed for base models
  • FP4 + FP8 Mixed for instruct models

The note says the MoE expert parameters use FP4, while most other parameters use FP8.

That matters because it shows DeepSeek is aggressively optimizing deployment efficiency, not only training a large model and stopping there.
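A rough sense of why the FP4/FP8 split matters for the Pro instruct model, assuming, purely for illustration, that 90% of the 1.6T parameters sit in the MoE experts (DeepSeek does not publish the exact split):

```python
TOTAL_PARAMS = 1.6e12
EXPERT_SHARE = 0.90          # hypothetical; the real expert share is unpublished
FP4_BYTES, FP8_BYTES = 0.5, 1.0

# Experts stored in FP4, everything else in FP8, per the model card note.
weights_tb = (TOTAL_PARAMS * EXPERT_SHARE * FP4_BYTES
              + TOTAL_PARAMS * (1 - EXPERT_SHARE) * FP8_BYTES) / 1e12
all_fp8_tb = TOTAL_PARAMS * FP8_BYTES / 1e12

print(f"FP4+FP8 mixed ≈ {weights_tb:.2f} TB vs all-FP8 ≈ {all_fp8_tb:.2f} TB")
```

Under that assumption, quantizing the experts to FP4 cuts weight storage by nearly half versus an all-FP8 deployment, which is exactly where a MoE model's memory lives.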

Local deployment notes

DeepSeek says both V4 models can be run locally and points users to its inference instructions.

The company also recommends:

  • temperature = 1.0
  • top_p = 1.0

for local deployment, and says Think Max should be run with at least 384K context available.

That is a useful signal about intended usage: V4 is clearly designed for very large working contexts and longer reasoning traces.
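The recommended top_p = 1.0 is worth a note: it effectively disables nucleus truncation, so generation reduces to plain temperature-1.0 sampling over the full distribution. A toy illustration of why (the probabilities here are arbitrary):

```python
def nucleus(probs, top_p):
    # Keep the smallest set of highest-probability tokens whose mass >= top_p.
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return sorted(kept)

probs = [0.5, 0.3, 0.15, 0.05]
print(nucleus(probs, top_p=1.0))  # every token stays in the candidate set
print(nucleus(probs, top_p=0.7))  # a lower top_p would truncate the tail
```

In other words, DeepSeek is telling users to let the model's raw distribution speak for itself rather than clipping it.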

License

DeepSeek says the V4 repository and weights are under the MIT License.

That is significant because it makes the preview release much easier for the open-source ecosystem to test and integrate.

Bottom line

DeepSeek V4 is now real, but it is arriving first as a preview release.

The key technical facts are clear:

  • DeepSeek-V4-Pro: 1.6T total parameters, 49B activated
  • DeepSeek-V4-Flash: 284B total parameters, 13B activated
  • 1 million token context window
  • new CSA + HCA hybrid attention
  • mHC residual upgrade
  • Muon optimizer
  • 32T+ pretraining tokens
  • three reasoning modes: Non-think, Think High, Think Max

The biggest story is not only model size. It is that DeepSeek is trying to make million-token reasoning and agent work far more efficient than in V3.2. That is the main technical advance to watch.
