In brief
- DeepSeek released its new V4-Pro model with 1.6 trillion parameters.
- It costs $1.74/$3.48 per million input/output tokens, roughly 1/20th the price of Claude Opus 4.7 and 98% less than GPT-5.5 Pro.
- DeepSeek trained V4 partly on Huawei Ascend chips, circumventing U.S. export restrictions, and says that once 950 new supernodes come online later in 2026, the Pro model's already-low price will drop further.
DeepSeek is back, and it showed up a few hours after OpenAI dropped GPT-5.5. Coincidence? Maybe. But if you're a Chinese AI lab that the U.S. government has been trying to slow down with chip export bans for the past three years, your sense of timing gets pretty sharp.
The Hangzhou-based lab released preview versions of DeepSeek-V4-Pro and DeepSeek-V4-Flash today, both open-weight, both with one-million-token context windows, large enough to hold an entire codebase in a single request. Both are also priced well below anything comparable in the West, and both are free for anyone capable of running them locally.
DeepSeek's last major disruption—R1 in January 2025—wiped $600 billion from Nvidia's market cap in a single day as investors questioned whether American companies really needed such huge investments to produce results that a small Chinese lab achieved at a fraction of the cost. V4 is a different kind of move: quieter, more technical, and more focused on efficiency for anyone actually building with AI.
Two models, very different jobs
Of the two new models, DeepSeek's V4-Pro is the big one, with 1.6 trillion total parameters. To put that in perspective, parameters are the internal "settings" or "brain cells" a model uses to store knowledge and recognize patterns; the more parameters a model has, the more complex information it can theoretically hold. That makes V4-Pro the biggest open-source model released to date. The size may sound ridiculous until you learn it only activates 49 billion of those parameters per inference pass.
This is the Mixture-of-Experts trick DeepSeek has refined since V3: The full model sits there, but only the relevant slice of it wakes up for any given request. More knowledge, same compute bill.
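Conceptually, that routing step can be sketched in a few lines. This is an illustrative toy, not DeepSeek's actual architecture: a small gating network scores every expert, and only the top few actually run, so compute scales with the number of active experts rather than the total.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Toy Mixture-of-Experts layer: route input x to only top_k experts.

    experts: list of (W, b) weight pairs; gate_w: gating matrix.
    Only the top_k winners run, so the compute bill tracks top_k,
    not len(experts).
    """
    scores = x @ gate_w                         # one relevance score per expert
    top = np.argsort(scores)[-top_k:]           # indices of the winning experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()  # softmax over winners
    out = np.zeros_like(x)
    for w, i in zip(weights, top):
        W, b = experts[i]
        out += w * (x @ W + b)                  # weighted sum of active experts only
    return out, top

rng = np.random.default_rng(0)
d, n_experts = 8, 16
experts = [(rng.normal(size=(d, d)) * 0.1, np.zeros(d)) for _ in range(n_experts)]
gate_w = rng.normal(size=(d, n_experts))
y, active = moe_forward(rng.normal(size=d), experts, gate_w, top_k=2)
print(len(active), "of", n_experts, "experts ran")  # 2 of 16 experts ran
```

Scale the same idea up and you get V4-Pro's ratio: 49 billion active parameters out of 1.6 trillion total.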
“DeepSeek-V4-Pro-Max, the maximum reasoning effort mode of DeepSeek-V4-Pro, significantly advances the knowledge capabilities of open-source models, firmly establishing itself as the best open-source model available today,” DeepSeek wrote in the model’s official card on Hugging Face. “It achieves top-tier performance in coding benchmarks and significantly bridges the gap with leading closed-source models on reasoning and agentic tasks.”
V4-Flash is the practical one: 284 billion total parameters, 13 billion active. It’s designed to be faster, cheaper, and according to DeepSeek's own benchmarks, “achieves comparable reasoning performance to the Pro version when given a larger thinking budget.”
Both support one million tokens of context. That's roughly 750,000 words, about the entire “Lord of the Rings” trilogy plus change. And that’s a standard feature, not a premium tier.
DeepSeek’s (not so) secret sauce: Making attention not terrible at scale
Here's the technical part for nerds and anyone curious about the magic powering the model. DeepSeek doesn’t hide its secrets: the full paper is available for free on GitHub.
Standard AI attention—the mechanism that lets a model understand relationships between words—has a brutal scaling problem. Every time you double the context length, the compute cost roughly quadruples. So running a model on a million tokens isn't just twice as expensive as 500,000 tokens. It's four times as expensive. This is why long context has historically been a checkbox labs add and then silently throttle behind rate limits.
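The arithmetic behind that scaling is easy to check. A back-of-the-envelope sketch (the head dimension `d` is a stand-in constant; only the growth rate matters):

```python
def attention_flops(n_tokens, d=128):
    # Dense self-attention: every token scores against every other token,
    # so the cost grows with the square of the sequence length.
    return n_tokens * n_tokens * d

half = attention_flops(500_000)
full = attention_flops(1_000_000)
print(full / half)  # doubling the context quadruples the cost: 4.0
```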
DeepSeek invented two new attention types to get around this. The first, Compressed Sparse Attention, works in two steps. It first compresses groups of tokens—say, every 4 tokens—into a single entry. Then, instead of attending to all of those compressed entries, it uses a "Lightning Indexer" to pick only the most relevant results for any given query. Your model goes from attending to a million tokens to attending to a much smaller set of the most important chunks, kind of like a librarian who doesn't read every book but knows exactly which shelf to check.
The second, Heavily Compressed Attention, is more aggressive. It collapses every 128 tokens into a single entry—no sparse selection, just brutal compression. You lose fine-grained detail, but you get an extremely cheap global view. The two attention types run in alternating layers, so the model gets both the detail and the overview.
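Here's a toy sketch of the two ideas side by side. The block sizes (4 and 128) come from the article; the mean-pooling and the dot-product scoring are stand-ins for whatever DeepSeek actually does, with the "Lightning Indexer" reduced to a simple top-k relevance pick:

```python
import numpy as np

def compress(keys, block):
    """Collapse each run of `block` consecutive token keys into one mean entry."""
    n = (len(keys) // block) * block
    return keys[:n].reshape(-1, block, keys.shape[-1]).mean(axis=1)

def sparse_select(query, entries, top_k):
    """Indexer-style step: keep only the top_k entries most relevant to the query."""
    scores = entries @ query
    return entries[np.argsort(scores)[-top_k:]]

rng = np.random.default_rng(1)
keys = rng.normal(size=(8192, 64))   # pretend this is a long context
query = rng.normal(size=64)

# Compressed-Sparse-Attention-style: compress 4 tokens per entry, keep the best 256.
csa = sparse_select(query, compress(keys, 4), top_k=256)
# Heavily-Compressed-Attention-style: 128 tokens per entry, no selection at all.
hca = compress(keys, 128)
print(csa.shape[0], hca.shape[0])  # 256 and 64 entries instead of 8,192 tokens
```

The detail layer attends to a few hundred carefully chosen entries; the global layer attends to a few dozen coarse ones. Neither ever touches the full sequence.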
The result, from the technical paper: At one million tokens, V4-Pro uses 27% of the compute its predecessor (V3.2) needed. KV cache—the memory the model needs to track context—drops to just 10% of V3.2. V4-Flash pushes that further: 10% of compute, 7% of memory.
All of this lets DeepSeek offer a much cheaper price per token than its competitors while delivering comparable results. To put that in dollar terms: GPT-5.5 launched yesterday at $5 input and $30 output per million tokens, with GPT-5.5 Pro priced at $30 per million input tokens and $180 per million output tokens.
DeepSeek V4-Pro is $1.74 input and $3.48 output. V4-Flash is $0.14 input and $0.28 output. Cline CEO Saoud Rizwan pointed out that if Uber had used DeepSeek instead of Claude, its 2026 AI budget—reportedly enough for four months of usage—would have lasted seven years.
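To make the gap concrete, here's the arithmetic on a hypothetical workload. The token volumes below are made up for illustration; the per-million prices are the published rates above:

```python
def monthly_cost(in_tok_m, out_tok_m, in_price, out_price):
    """Dollar cost for a workload measured in millions of tokens."""
    return in_tok_m * in_price + out_tok_m * out_price

# Hypothetical heavy pipeline: 10B input tokens, 2B output tokens per month.
workload = (10_000, 2_000)
gpt55_pro = monthly_cost(*workload, 30.0, 180.0)   # GPT-5.5 Pro rates
v4_pro    = monthly_cost(*workload, 1.74, 3.48)    # DeepSeek V4-Pro rates
print(round(gpt55_pro / v4_pro, 1))  # same workload costs ~27x more on GPT-5.5 Pro
```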
The benchmarks
DeepSeek does something unusual in its technical report: It publishes the gaps. Most model releases cherry-pick the benchmarks where they win. DeepSeek ran the full comparison against GPT-5.4 and Gemini-3.1-Pro, found that V4-Pro's reasoning lags behind those models by about three to six months, and printed it anyway.
Where V4-Pro-Max actually wins: Codeforces, a competitive programming benchmark rated like human chess. V4-Pro scored 3,206, placing it around 23rd among actual human contest participants. On Apex Shortlist, a curated set of hard math and STEM problems, it hit a 90.2% pass rate versus Opus 4.6's 85.9% and GPT-5.4's 78.1%. On SWE-Verified, which measures whether a model can resolve real GitHub issues pulled from actual open-source repositories, it scored 80.6%, matching Claude Opus 4.6.
Where it trails: the multitask benchmark MMLU-Pro (Gemini-3.1-Pro at 91.0% vs. V4-Pro at 87.5%), the expert-knowledge benchmark GPQA Diamond (Gemini 94.3% vs. V4-Pro 90.1%), and Humanity's Last Exam, a graduate-level benchmark where Gemini-3.1-Pro's 44.4% still beats V4-Pro's 37.7%.
On long context specifically, V4-Pro leads open-source models and beats Gemini-3.1-Pro on the CorpusQA benchmark (a test simulating real document analysis at one million tokens), but loses to Claude Opus 4.6 on MRCR—a test measuring how well a model retrieves specific needles buried deep in a very long haystack.
Built to run agents, not just answer questions
The agentic stuff is where this release gets interesting for developers actually shipping products.
V4-Pro can run in Claude Code, OpenCode, and other AI coding tools. According to DeepSeek's internal survey of 85 developers who used V4-Pro as their primary coding agent, 52% said it was ready to be their default model, 39% leaned toward yes, and fewer than 9% said no. DeepSeek employees said it outperforms Claude Sonnet and approaches Claude Opus 4.5 on agentic coding tasks.
Artificial Analysis, which runs independent evaluations of AI models on real-world tasks, ranked V4-Pro first among all open-weight models on GDPval-AA—a benchmark testing economically valuable knowledge work across finance, legal, and research tasks, scored via Elo. V4-Pro-Max scored 1,554 Elo, ahead of GLM-5.1 (1,535) and MiniMax's M2.7 (1,514). For reference, Claude Opus 4.6 scores 1,619 on the same benchmark—still ahead, but the gap is closing.
DeepSeek’s V4 also introduces something called “interleaved thinking.” In previous models, if you were running an agent that made multiple tool calls—say, it searched the web, then ran some code, then searched again—the model's reasoning context got flushed between rounds. Each new step, the model had to rebuild its mental model from scratch. V4 retains the full chain of thought across tool calls, so a 20-step agent workflow doesn't suffer from amnesia halfway through. This matters more than it sounds for anyone running complex automated pipelines.
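A toy loop makes the difference concrete. The string-based "context" here is purely illustrative; real models track this state in their KV cache rather than as text:

```python
def run_agent(steps, interleaved):
    """Toy agent loop contrasting the two context strategies.

    With interleaved=False (older models), reasoning is flushed before each
    new tool round; with interleaved=True (V4-style), it accumulates.
    """
    context = []
    for step, (thought, tool_result) in enumerate(steps):
        if not interleaved:
            # Older behavior: keep only tool outputs, drop prior reasoning.
            context = [entry for entry in context if entry.startswith("tool:")]
        context.append(f"think[{step}]: {thought}")
        context.append(f"tool: {tool_result}")
    return context

steps = [("plan the search", "web results"),
         ("refine the query", "code output"),
         ("summarize findings", "final data")]
old = run_agent(steps, interleaved=False)
new = run_agent(steps, interleaved=True)
# Old style keeps only the latest reasoning step; V4-style keeps all three.
print(sum(e.startswith("think") for e in old),
      sum(e.startswith("think") for e in new))  # 1 3
```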
DeepSeek and the U.S.-China AI war
The U.S. has been restricting high-end Nvidia chip exports to China since 2022. The stated goal was to slow Chinese AI development, but the chip ban didn't stop DeepSeek; instead, it pushed the lab to invent a more efficient architecture and build out a domestic hardware supply chain.
DeepSeek didn't release V4 in a vacuum; the AI space has been flush with activity lately. Anthropic shipped Claude Opus 4.7 on April 16, a model Decrypt tested and found strong on coding and reasoning, with notably high token usage. The day before that, Anthropic revealed it was sitting on Claude Mythos, a cybersecurity model it says it can't release publicly because it's too good at autonomous network attacks.
Xiaomi dropped MiMo V2.5 Pro on April 22, going full multimodal—image, audio, video. Costs $1 input and $3 output per million tokens. It matches Opus 4.6 on most coding benchmarks. Three months ago, nobody was talking about Xiaomi as a frontier AI company. Now it's shipping competitive models faster than most Western labs.
OpenAI's GPT-5.5 landed yesterday, with output pricing reaching $180 per million tokens in the Pro version. It beats V4-Pro on Terminal Bench 2.0 (82.7% vs. 70.0%), which tests complex command-line agent workflows. But it costs considerably more than V4-Pro for equivalent tasks. That same day, Tencent released Hy3, another state-of-the-art model focused on efficiency.
What this means for you
So with so many new models available, the question developers are actually asking is: when is the premium worth it?
For enterprise, the math may have changed. A model that leads open-source benchmarks at $1.74 per million input tokens means large-scale document processing, legal review, or code generation pipelines that were expensive six months ago are now much cheaper. The one-million-token context means you can feed entire codebases or regulatory filings in a single request instead of chunking them across multiple calls.
Besides, its open-source nature means it can not only be run for free on local hardware, but also be customized and fine-tuned to a company’s own needs and use cases.
For developers and solo builders, V4-Flash is the one to watch. At $0.14 input and $0.28 output, it's cheaper than models that were considered budget options a year ago—and it handles most tasks the Pro version handles. DeepSeek's existing deepseek-chat and deepseek-reasoner endpoints already route to V4-Flash in non-thinking and thinking modes respectively, so if you're on the API, you're already using it.
The models are text-only for now. DeepSeek said it's working on multimodal capabilities, which means other big labs from Xiaomi to OpenAI still have that edge. Both models are MIT licensed and available on Hugging Face today. The old deepseek-chat and deepseek-reasoner endpoints retire on July 24, 2026.