DeepSeek V4 Pro vs Flash: What Launched, What Changed, and the Huawei Chip
- Aisha Washington
- 7 hours ago
- 10 min read
DeepSeek V4 is out. After three delays stretching from February to late April 2026, the Chinese AI lab published its new model on April 24 with open-source weights on HuggingFace and ModelScope, live API endpoints, and a technical report.
The release covers two distinct variants: DeepSeek-V4-Pro, the flagship model targeting top-tier closed-source performance, and DeepSeek-V4-Flash, a faster and more economical alternative for latency-sensitive or high-volume workloads. Both support a 1 million token context window and offer a thinking mode alongside a standard non-thinking mode.
One detail that preceded the official launch and that DeepSeek did not address in its announcement: Reuters and The Information reported in early April that V4 was trained on Huawei's Ascend chips rather than Nvidia. Nvidia's CEO Jensen Huang called that prospect, if true, "a horrible outcome" for America. That thread runs alongside the technical story and is worth understanding separately.
What DeepSeek Shipped: V4-Pro and V4-Flash
DeepSeek's official announcement describes V4-Pro as the primary performance model and positions V4-Flash as the cost-efficient counterpart.
DeepSeek-V4-Pro (the flagship):
Agent capability: reaches the best current open-source level on Agentic Coding benchmarks
Internal evaluation: the team reports that V4-Pro performs better than Sonnet 4.5 and approaches Opus 4.6 in non-thinking mode, with a remaining gap to Opus 4.6 in thinking mode
World knowledge: leads all open-source models, trailing only Gemini-Pro-3.1 among closed-source models
Math, STEM, and competitive coding: surpasses all publicly evaluated open-source models
DeepSeek-V4-Flash (the fast and affordable version):
Smaller parameter count and lower activation per token, resulting in faster inference and lower API cost
Reasoning performance approaches V4-Pro on most tasks
On simple agent tasks, matches V4-Pro; on high-difficulty tasks, a gap remains
Both models support thinking mode (with a reasoning_effort parameter accepting high or max) and non-thinking mode. For complex agent workflows, DeepSeek recommends thinking mode at max intensity.
API model names: deepseek-v4-pro and deepseek-v4-flash. The old model names deepseek-chat and deepseek-reasoner currently map to V4-Flash non-thinking and V4-Flash thinking mode respectively, and will be deprecated on July 24, 2026.
DeepSeek has explicitly optimized V4 for integration with major agent frameworks including Claude Code, OpenClaw, OpenCode, and CodeBuddy. Document and code generation tasks are noted as areas of meaningful improvement.
How to Access DeepSeek V4 Right Now
Chat interface: Log in at chat.deepseek.com or the official DeepSeek app and select the latest V4 model. Both V4-Pro and V4-Flash are available immediately.
API: The base URL is unchanged. Update the model parameter to deepseek-v4-pro or deepseek-v4-flash to access the new models. Both support OpenAI's ChatCompletions interface and Anthropic's interface, with a maximum context length of 1 million tokens. To enable thinking mode, add reasoning_effort: "high" or "max" to the request. DeepSeek recommends max for complex agent tasks.
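As a concrete sketch, a request to the new models might look like the payload below. The model name and the reasoning_effort field come from DeepSeek's announcement, but the exact payload shape is an assumption based on the OpenAI-compatible ChatCompletions format the article says both models support:

```python
import json

# Hypothetical ChatCompletions-style request body. Field names for
# "model" and "reasoning_effort" follow DeepSeek's announcement; the
# rest of the schema is assumed from the OpenAI-compatible format.
payload = {
    "model": "deepseek-v4-pro",
    "reasoning_effort": "max",  # enables thinking mode; "high" is the lighter setting
    "messages": [
        {"role": "user", "content": "Review this function for concurrency bugs."},
    ],
}

# POST this to DeepSeek's unchanged base URL with your API key.
print(json.dumps(payload, indent=2))
```

Omitting reasoning_effort entirely would request the standard non-thinking mode.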
Migration note: If your application currently uses deepseek-chat or deepseek-reasoner, it is already pointing to V4-Flash non-thinking and V4-Flash thinking mode respectively. These names will stop working on July 24, 2026. Switch to explicit model names now to avoid a forced migration.
Open-source weights: Both models are available for local deployment on HuggingFace at deepseek-ai/DeepSeek-V4-Pro and on ModelScope. The technical report is also published on HuggingFace alongside the weights.
The Architecture Behind the 1M Context Window
Standard transformer models struggle with long contexts because attention computation scales quadratically with sequence length: doubling the tokens roughly quadruples the attention compute and memory. The practical ceiling for most production models has historically sat well below 1 million tokens.
DeepSeek V4 introduces a new attention mechanism that compresses along the token dimension, combined with DSA (DeepSeek Sparse Attention), a sparse attention design that selectively processes relevant token relationships rather than the full cross-product. The result: a 1 million token context becomes the standard configuration for all official DeepSeek services, not a capability tier sold at a premium.
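DeepSeek has not published DSA's selection mechanism in detail, but the general idea of sparse attention, each query attending to a small subset of keys instead of all of them, can be shown in a toy NumPy sketch. The top-k selection rule here is purely illustrative, not DeepSeek's actual design:

```python
import numpy as np

def sparse_attention(q, k, v, top_k=4):
    """Toy sparse attention: each query attends only to its top_k
    highest-scoring keys; the rest are masked out before softmax.
    Illustrative only; DSA's real selection rule is not public."""
    scores = q @ k.T / np.sqrt(q.shape[-1])            # (n_q, n_k) full scores
    idx = np.argpartition(scores, -top_k, axis=-1)[:, -top_k:]
    mask = np.full_like(scores, -np.inf)               # drop everything...
    np.put_along_axis(mask, idx, 0.0, axis=-1)         # ...except top_k per row
    masked = scores + mask
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over survivors
    return weights @ v

rng = np.random.default_rng(0)
n, d = 16, 8
q, k, v = rng.normal(size=(n, d)), rng.normal(size=(n, d)), rng.normal(size=(n, d))
out = sparse_attention(q, k, v)
```

The payoff is that a production implementation never materializes the full score matrix at all: with top_k fixed, per-query cost stops growing with sequence length.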
DeepSeek describes the architecture as achieving global leadership in long-context capability while substantially reducing the compute and memory requirements compared to traditional full-attention approaches. The practical implication: entire codebases, long research documents, or extended conversation histories can fit in a single context without chunking or retrieval workarounds.
FindSkill.ai's technical analysis noted that the MoE design activates only a fraction of total parameters per token inference, keeping per-request compute roughly equivalent to a 37 billion parameter dense model even as total model capacity scales into the trillions. The combination of sparse attention and sparse parameter activation is what allows V4 to offer frontier-level capability at pricing that is expected to remain dramatically lower than equivalent closed-source offerings.
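The arithmetic behind that claim is simple to sketch. The expert counts and sizes below are illustrative, chosen only to reproduce the roughly 37 billion active figure cited above; DeepSeek's actual expert configuration is not stated in the article:

```python
# Hypothetical MoE activation arithmetic. Numbers are illustrative,
# not DeepSeek's published configuration.
total_experts = 256
experts_per_token = 8        # top-k routing activates only these
params_per_expert = 4.2e9
shared_params = 3.4e9        # attention, embeddings, always-on layers

total = shared_params + total_experts * params_per_expert     # model capacity
active = shared_params + experts_per_token * params_per_expert  # per-token compute

print(f"total capacity:   {total / 1e12:.2f}T params")
print(f"active per token: {active / 1e9:.1f}B params")
```

With these numbers, a model holding over a trillion parameters does per-token work equivalent to a ~37B dense model, which is the whole economic argument for MoE at this scale.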
The Huawei Chip Question
DeepSeek's official launch announcement does not mention hardware. The Huawei story comes from Reuters and The Information, who reported in early April that V4 was designed to run on Huawei's Ascend 950PR chips rather than Nvidia GPUs.
That claim, if accurate, matters for a specific reason. US export controls have restricted Nvidia's most advanced chips from reaching Chinese buyers since 2022. The policy assumption has been that this would slow Chinese AI development by limiting access to the hardware these models require. A frontier-level model trained on Chinese silicon challenges that assumption directly.
According to The Next Web, Huang stated that if AI models become "optimised in a very different way than the American tech stack," China "will become superior," and characterized that scenario as "a horrible outcome" for America. The South China Morning Post reported that Huang's concern centers on CUDA as much as hardware: the software ecosystem that developers have built on for fifteen years. A Chinese lab that no longer depends on CUDA is no longer constrained by the CUDA supply chain.
TechWire Asia's analysis noted that the broader pattern of Chinese AI labs moving toward Huawei Ascend infrastructure is already underway, and V4 represents the furthest any lab has gone in deploying that infrastructure for a model at this capability level. Whether V4 proves the concept at scale or exposes the limits of Huawei's current offerings will be answered by the performance data that emerges in the weeks ahead.
Competitive Context: Where V4 Fits
DeepSeek's own internal benchmark framing is notable for its specificity. The team compared V4-Pro directly against Claude Sonnet 4.5 and Opus 4.6 rather than citing abstract leaderboard positions, which suggests they ran controlled evaluations against these specific models.
The picture that emerges: V4-Pro beats Sonnet 4.5 on internal agent coding evaluation. It approaches Opus 4.6 non-thinking mode. It trails Opus 4.6 thinking mode. Against the open-source field, it claims the leading position on math, STEM, and competitive coding. On world knowledge, it leads all open-source models and sits just behind Gemini-Pro-3.1.
V4-Flash, the smaller variant, delivers near-V4-Pro reasoning at substantially lower cost and latency. For the majority of production use cases where a few percentage points of benchmark difference matter less than throughput and price, V4-Flash is likely the default choice.
The open-source release compounds the cost story. Weights are available under an open license on HuggingFace and ModelScope. Developers who can run inference locally, or on their own cloud infrastructure, avoid API pricing entirely and gain full control over how the model is deployed. For teams that have built workflows on top of DeepSeek's API, one practical note: the old model names deepseek-chat and deepseek-reasoner now point to V4-Flash by default and will be fully deprecated on July 24, 2026. Updating to the explicit deepseek-v4-pro or deepseek-v4-flash model names now avoids an unplanned migration later.
DeepSeek V4 vs V3: What Actually Changed
For teams already using DeepSeek V3 in production, the practical question is not "what is V4" but "what specifically changed and does it warrant a migration."
Context window: V3 offered strong long-context performance, but 1M tokens was not the standard default. V4 makes 1M tokens the baseline for all official services, including the API. That changes what you can pass in a single call without chunking.
Architecture: V4 introduces DSA (DeepSeek Sparse Attention), a new attention mechanism that compresses along the token dimension before computing attention, replacing the Multi-head Latent Attention (MLA) design V3 relied on. The practical result is better performance at long contexts with lower compute per token, which is what enables the 1M token default at a price point that remains competitive.
Model structure: V3 was a single model. V4 ships as two: V4-Pro for maximum capability and V4-Flash for speed and cost efficiency. Teams using V3 for high-volume workloads may find V4-Flash is the closer equivalent, while teams using V3 for complex reasoning or agent tasks should evaluate V4-Pro.
Agent capability: The gap is the most significant upgrade. V4-Pro's internal evaluation results, which the team compares against Sonnet 4.5 and Opus 4.6, represent a substantial improvement over V3's agent performance. If you have been using V3 for agentic coding workflows and accepted its limitations, V4-Pro is worth testing directly.
API migration: The base URL is the same. The old model names (deepseek-chat, deepseek-reasoner) now point to V4-Flash and expire July 24, 2026. A V3-to-V4 migration is currently transparent for most applications, but the explicit model name switch is the one action item.
What This Means for Developers and Knowledge Workers
V4's open-weight release and 1M token context window are the two features that change practical developer workflows most immediately.
The context window means long documents that previously required splitting, chunking, or retrieval-augmented pipelines can now be passed in a single prompt. A codebase with hundreds of files, a full research document, a long project log: all of it fits. The model can reason across the entire input rather than a selected window.
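A minimal sketch of what "pass the whole codebase" looks like in practice follows. The four-characters-per-token estimate is a rough heuristic, not DeepSeek's tokenizer, and production code should count tokens with the model's real tokenizer before trusting the budget:

```python
from pathlib import Path

CONTEXT_BUDGET = 1_000_000   # V4's context window, in tokens
CHARS_PER_TOKEN = 4          # rough heuristic; use the real tokenizer in production

def pack_texts(files: dict[str, str], budget: int = CONTEXT_BUDGET) -> str:
    """Concatenate named sources into one prompt, stopping at the token budget."""
    parts, used = [], 0
    for name, text in sorted(files.items()):
        tokens = len(text) // CHARS_PER_TOKEN + 1
        if used + tokens > budget:
            break  # budget exhausted; remaining files are dropped, not truncated
        parts.append(f"# FILE: {name}\n{text}")
        used += tokens
    return "\n\n".join(parts)

def pack_codebase(root: str, pattern: str = "*.py") -> str:
    """Read every matching file under root and pack it into a single prompt."""
    files = {str(p): p.read_text(errors="ignore") for p in Path(root).rglob(pattern)}
    return pack_texts(files)
```

At a 1M token budget this comfortably holds a few megabytes of source, which is why chunking pipelines become optional rather than mandatory.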
The open-source license means the model can run locally for developers who handle data that cannot go to a third-party server: confidential codebases, client work under NDA, sensitive research. Quantized versions are expected to run on consumer-grade hardware, bringing a frontier-class model within reach of individual developers working in air-gapped or privacy-constrained environments.
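A back-of-envelope calculation shows why quantization is the deciding factor for local deployment. The ~1T parameter total is an illustrative figure echoing the MoE analysis above, and runtime overheads like the KV cache are ignored:

```python
# Weight memory at different precisions. Parameter count is illustrative;
# KV cache and activation memory are not included.
def weight_memory_gib(params: float, bits_per_param: int) -> float:
    return params * bits_per_param / 8 / 1024**3

TOTAL_PARAMS = 1.0e12
for label, bits in [("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{label}: ~{weight_memory_gib(TOTAL_PARAMS, bits):,.0f} GiB of weights")
```

Each halving of precision halves the weight footprint, which is why INT4 builds, and the smaller Flash variant in particular, are what bring local deployment within realistic reach.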
Local deployment resolves one constraint but surfaces another. A locally deployed DeepSeek V4 can process whatever you put in the prompt. It cannot access what is not in the prompt: the meeting from last week, the research accumulated over six months of browsing, the context that makes your work specific. Every session starts from zero.
This is where remio's local-first knowledge base addresses the gap that frontier models leave open. remio passively captures your working context: websites browsed, meetings recorded and transcribed locally, local files indexed. When a DeepSeek V4 session starts, the relevant context can be retrieved from remio and passed directly to the model, without sending either the context or the model output to a cloud service.
A workflow that pairs locally deployed DeepSeek V4 with remio keeps everything on your hardware. remio builds the context. The model processes it. Neither step leaves your machine.
Consider a researcher who has spent six months accumulating papers, annotations, and meeting notes about a specific domain. A new DeepSeek V4 session has no knowledge of that history. But remio has indexed it locally. The researcher queries remio for the relevant context, passes it into a V4-Pro session with a 1M token window, and receives analysis that reflects six months of accumulated knowledge rather than a blank-slate response. The 1M context window is what makes this practical at scale: the entire accumulated record, not a summary or a selection, can go into the prompt.
What Comes Next
Independent benchmark verification is the immediate next step. DeepSeek's internal evaluations are a meaningful signal but are not equivalent to independently verified results published under standard conditions. The community will run V4-Pro and V4-Flash against the same benchmarks used for Claude Opus 4.7, GPT-5, and Gemini, and the results will either confirm or revise the positioning. The framing DeepSeek chose, comparing directly against Sonnet 4.5 and Opus 4.6 rather than abstract leaderboards, is an invitation to test those claims. That testing is already underway.
Developers migrating existing workflows also have a near-term deadline to manage. The old API model names deepseek-chat and deepseek-reasoner expire on July 24, 2026. Until then they route to V4-Flash, which means existing applications have picked up the new model without any changes, but the switch to explicit model names is worth making before the deprecation date to avoid unexpected behavior.
On the hardware question: if the Huawei chip training story is confirmed through independent reporting or a technical disclosure from DeepSeek, it becomes the more significant development in the longer run. Not because of what it says about V4 specifically, but because it establishes proof of concept for a frontier model development path that does not depend on Nvidia. Other Chinese labs will be watching closely, and the answer matters for the trajectory of US export controls as a policy tool.
DeepSeek has indicated that 1M context will be the baseline for all official services going forward. That sets a new expectation for every lab competing in the same space, and raises the floor for what "production-ready" long-context capability means in 2026. For the open-source community specifically, the combination of frontier-level performance, an open license, and a 1M context window is a package that was not available six months ago. The rate of adoption will be fast.
Frequently Asked Questions
What is the difference between DeepSeek V4-Pro and V4-Flash?
V4-Pro is the flagship model optimized for maximum capability: complex reasoning, agentic coding, and tasks where quality matters more than speed. V4-Flash is smaller and faster with lower API cost, matching V4-Pro on simple tasks and approaching it on most reasoning tasks. For high-volume or latency-sensitive applications, V4-Flash is the default choice. For complex agent workflows, use V4-Pro with thinking mode enabled.
Is DeepSeek V4 free to use?
The chat interface at chat.deepseek.com is free. API usage is billed per token. DeepSeek has not published final pricing for V4-Pro and V4-Flash at launch, but the model is expected to remain dramatically cheaper than equivalent closed-source offerings like Claude Opus or GPT-5. Open-source weights are available at no cost for self-hosted deployment.
Does deepseek-chat still work after the V4 launch?
Yes, for now. The model names deepseek-chat and deepseek-reasoner currently route to V4-Flash non-thinking mode and V4-Flash thinking mode respectively. They will be deprecated on July 24, 2026. Switch to deepseek-v4-pro or deepseek-v4-flash to be explicit and avoid an unplanned migration.
Can I run DeepSeek V4 locally?
Yes. Open-source weights for both V4-Pro and V4-Flash are available on HuggingFace and ModelScope. Quantized versions at INT4 or INT8 precision are expected to run on consumer-grade hardware. The exact hardware requirements depend on quantization level and model variant, but running the full model requires a multi-GPU setup.
How does DeepSeek V4 compare to Claude and GPT-5?
Based on DeepSeek's internal evaluations, V4-Pro beats Claude Sonnet 4.5, approaches Claude Opus 4.6 in non-thinking mode, and trails Claude Opus 4.6 in thinking mode. On math, STEM, and competitive coding it surpasses all open-source models and sits near the top of the closed-source field. Independent benchmark verification is pending. The cost difference is significant: V4 is expected to be 10-50x cheaper per token than Opus-class closed-source models.
DeepSeek V4 is live, open-source, and competitive with the leading models on most tasks the team tested. The hardware story remains unverified from the source but has already shaped how the launch is being read by the industry. What the independent benchmarks show in the coming weeks will determine whether the internal numbers hold, and whether the model earns the position DeepSeek's team claims for it.
The practical takeaway for developers right now: update API integrations from deprecated model names to explicit V4 identifiers, evaluate V4-Flash for high-throughput workloads before reaching for V4-Pro, and pay attention to the 1M context window as a feature that changes what a single API call can accomplish. Models at this context length are new enough that the workflows that fully exploit them are still being designed.
remio helps developers and knowledge workers build the context layer that frontier models need to produce outputs grounded in real work. Whether the model runs in the cloud or on your own hardware, the context problem is the same.