Chinese AI darling DeepSeek is back with a new open weights large language model that promises performance to rival the best proprietary American LLMs. Perhaps more importantly, it claims to dramatically cut inference costs, and it extends support to Huawei's Ascend family of AI accelerators.
Unveiled on Friday, DeepSeek V4 comes in two new flavors, available for download from popular model repos like Hugging Face or via the company's API and web service. The first is a smaller 284 billion parameter Flash mixture-of-experts (MoE) model with 13 billion active parameters, while the larger of the two is a 1.6 trillion parameter model, 49 billion of which are in use at any given moment.
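Those active-parameter figures are what drive per-token compute in an MoE model: only a small fraction of the network fires for any given token. A quick back-of-envelope check on the announced sizes (the parameter counts come from the announcement; the observation that per-token compute scales with active rather than total parameters is a standard rule of thumb, not DeepSeek's claim):

```python
# Per-token compute in an MoE model scales with the *active*
# parameter count, not the total stored parameter count.
def active_fraction(total_billions, active_billions):
    return active_billions / total_billions

flash = active_fraction(284, 13)    # V4 Flash: 13B of 284B active
pro = active_fraction(1600, 49)     # V4-Pro: 49B of 1.6T active

print(f"Flash activates {flash:.1%} of its weights per token")
print(f"Pro activates {pro:.1%} of its weights per token")
```

In other words, despite being nearly six times larger overall, V4-Pro activates a smaller share of itself per token than the Flash model does.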
V4-Pro was trained on 33 trillion tokens and, if DeepSeek is to be believed, beats out every open weights LLM while rivaling the West's best proprietary models across its benchmark suite.
Here's how DeepSeek says its V4 model stacks up against the competition. - Click to enlarge
Of course, these claims should be taken with a grain of salt. While DeepSeek has a strong track record with the V3 and R1 model families that made the Chinese dev a household name, just because a model performs well in canned benchmarks doesn't mean it'll hold up in real-world applications.
We would expect DeepSeek V4-Pro to be much better than the company's prior efforts. The new model is nearly a trillion parameters larger and uses more active parameters during inference. But as was the case with DeepSeek V3, which showed that large frontier models could be trained using less compute than previously thought, benchmarks don't tell the full story.
Under the hood, DeepSeek V4 introduces several novel architectural changes that, according to developers, should make the model much less expensive to serve.
The first is rather simple. This time around, DeepSeek is releasing a second smaller Flash model, which requires less infrastructure to run and will deliver a more interactive user experience at a lower cost. Smaller models are simply cheaper to serve.
This in itself isn't a new strategy, but it's one that DeepSeek is only now embracing, at least as far as its in house models are concerned.
The bigger and more meaningful change is in how DeepSeek calculates attention. A model's attention mechanism determines how much weight each token gives to the others, and the key-value pairs it computes from the prompt are cached so they can be reused while generating output tokens.
In a paper published alongside the new models, DeepSeek researchers describe a hybrid attention mechanism that combines two techniques, Compressed Sparse Attention and Heavy Compressed Attention, to reduce the amount of compute required during inference and the memory consumed by the KV caches used to track model state.
The latter element is key to DeepSeek V4's efficiency, as these caches can be quite large. Inference providers also tend to offload these to system memory or flash to avoid cold start penalties. More heavily compressed KV caches mean less memory and storage is required for large-scale inference deployments.
Combined, these technologies mean the model can support a million token context window while using 9.5x-13.7x less memory than DeepSeek V3.2.
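To put those savings in context, here's a generic KV-cache sizing sketch. The layer count, head dimension, and compressed latent size below are illustrative placeholders, not DeepSeek's actual figures, and the baseline here is plain multi-head attention at FP16 rather than V3.2:

```python
def kv_cache_bytes(tokens, layers, kv_dim, bytes_per_elem):
    # Plain attention stores one key vector and one value vector
    # per token per layer, hence the factor of 2.
    return tokens * layers * 2 * kv_dim * bytes_per_elem

# Uncompressed multi-head attention at FP16 (illustrative dims).
plain = kv_cache_bytes(1_000_000, 60, 8192, 2)

# A compressed-latent cache stores one small latent vector per
# token per layer instead of separate K and V vectors, so there
# is no factor of 2. The 576-wide latent is an illustrative size.
compressed = 1_000_000 * 60 * 576 * 2

print(f"plain:      {plain / 2**30:.0f} GiB")
print(f"compressed: {compressed / 2**30:.0f} GiB")
print(f"reduction:  {plain / compressed:.1f}x")
```

At a million-token context, even heavily compressed caches run to tens of gigabytes per sequence, which is why cache size, not just compute, dominates long-context serving costs.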
To further reduce the model's memory footprint, DeepSeek is continuing its tradition of using lower precision datatypes. DeepSeek V3 was among the first open weights models trained at FP8.
Now, both V4 models are using a mixture of FP8 and FP4 precision. Specifically, the model devs used quantization-aware training for the MoE expert weights.
As we've previously discussed, FP4 effectively halves the memory required to store model weights compared to FP8, making it a significant saving, if you can stomach the loss of precision.
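The arithmetic behind that saving is simple. Using the announced 1.6 trillion parameter figure for V4-Pro, and pretending for illustration that every weight is stored at a single uniform precision (in reality only the MoE expert weights are at FP4):

```python
def weight_gib(params, bits_per_param):
    # Bytes = params * bits / 8; divide by 2^30 for GiB.
    return params * bits_per_param / 8 / 2**30

params = 1.6e12  # announced V4-Pro parameter count
print(f"FP8: {weight_gib(params, 8):,.0f} GiB")
print(f"FP4: {weight_gib(params, 4):,.0f} GiB")
```

Halving the bits per weight halves both the memory needed to hold the model and the bandwidth consumed streaming weights from memory during decode.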
DeepSeek's architectural improvements aren't limited to inference either. In V4, the model devs adopted the Muon optimizer, which is designed to speed up convergence and improve training stability.
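Muon's distinguishing step is orthogonalizing each weight matrix's momentum update before applying it, typically via a few Newton-Schulz iterations. Here's a minimal NumPy sketch of that orthogonalization step, using the quintic coefficients from the publicly available Muon implementation; the surrounding momentum and learning-rate machinery is omitted, and we make no claim this matches DeepSeek's exact variant:

```python
import numpy as np

def newton_schulz_orthogonalize(g, steps=5):
    """Approximately orthogonalize matrix g (push all singular
    values toward 1) via quintic Newton-Schulz iterations, as in
    the public Muon optimizer. Convergence is deliberately loose."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + 1e-7)  # normalize overall scale
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T  # keep the Gram matrix below as small as possible
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * s @ s) @ x
    return x.T if transposed else x

# A random matrix stands in for the momentum buffer here.
rng = np.random.default_rng(0)
update = newton_schulz_orthogonalize(rng.standard_normal((64, 128)))
print(np.linalg.svd(update, compute_uv=False)[:3].round(2))
```

The intuition is that an orthogonalized update spreads learning signal evenly across all directions of a weight matrix instead of letting a few dominant directions soak it up.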
A homegrown model for homegrown hardware
Perhaps the most interesting, least detailed element of the new models relates to the hardware they're running on. While DeepSeek V3 was heavily optimized for Hopper GPUs, V4 has been validated to run on both Nvidia and Huawei accelerators.
The DeepSeek V4 paper only mentions the chips in passing, noting that the company validated its "fine-grained EP [Expert Parallel] scheme on both Nvidia GPUs and Ascend NPU platforms."
To be clear, this does not mean the model was trained entirely on Huawei hardware, only that DeepSeek has validated the Chinese telecommunications giant's AI accelerators to serve the model.
It is possible DeepSeek used a combination of Nvidia GPUs for pre-training and Huawei accelerators for reinforcement learning. The latter is an inference-adjacent post-training step used to teach models new skills, behaviors, and chain-of-thought reasoning. However, the paper doesn't directly address this.
Inference generally has a lower barrier to entry for new chipmakers. However, at one point, DeepSeek was trying to train its models on Huawei's chips as well. This effort was reportedly derailed by dodgy chips, glacial interconnects, and an immature software stack that ultimately drove DeepSeek back into Nvidia's embrace.
Finally, the use of 4-bit precision data types in V4 could lead some to assume DeepSeek got its hands on Nvidia's Blackwell accelerators, which the AI arms dealer isn't allowed to sell in China, but this isn't strictly necessary.
Hopper GPUs don't support FP4 hardware acceleration but can work with the data type in a weights-only fashion. This approach doesn't benefit floating point performance, but reduces the memory footprint and bandwidth required during both training and inference, making it a worthwhile trade-off in many use cases.
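The weights-only pattern is simple to sketch: store the weights in 4-bit form, then dequantize to a wider type just before the matmul, so the arithmetic itself never needs FP4 hardware. The toy below uses symmetric per-row int4 scaling for simplicity; a real FP4 (e2m1) encoding, and packing two 4-bit values per byte, are left out:

```python
import numpy as np

def quantize_rows_int4(w):
    """Toy weights-only quantization: symmetric per-row int4.
    Values land in [-8, 7]; each row gets its own scale factor."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def matmul_dequant(x, q, scale):
    # Dequantize on the fly; the matmul runs at full precision,
    # which is why no FP4 hardware support is required.
    return x @ (q.astype(np.float32) * scale)

rng = np.random.default_rng(1)
w = rng.standard_normal((256, 256)).astype(np.float32)
q, s = quantize_rows_int4(w)
x = rng.standard_normal((4, 256)).astype(np.float32)
err = np.abs(matmul_dequant(x, q, s) - x @ w).mean()
print(f"mean abs error vs full precision: {err:.3f}")
```

The storage shrinks fourfold versus FP16 while the activations and accumulation stay at full precision, which is exactly the trade-off described above: less memory and bandwidth, no floating point speedup.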
Priced to sell
DeepSeek V4 is currently in preview with both base and instruct-tuned versions of the model available for download or via its API.
The company is unsurprisingly offering API access to the smaller model at a reduced rate of $0.14 per million input tokens (uncached) and $0.28 per million output tokens.
The larger Pro model is much more expensive at $1.74 per million input tokens and $3.48 per million output tokens, but that's still a fraction of what Western AI vendors are charging for access to their top models. For reference, OpenAI charges $5 per million input tokens and $30 per million output tokens for GPT-5.5. ®
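At those list prices, a quick cost comparison for a hypothetical workload makes the gap concrete. The token counts below are arbitrary, and cached-input discounts are ignored:

```python
# Cost of a hypothetical workload: 10M input + 2M output tokens.
# Prices are per million tokens, uncached input, as listed above.
def cost(input_millions, output_millions, in_price, out_price):
    return input_millions * in_price + output_millions * out_price

workload = (10, 2)  # millions of input / output tokens
print(f"V4 Flash: ${cost(*workload, 0.14, 0.28):.2f}")
print(f"V4-Pro:   ${cost(*workload, 1.74, 3.48):.2f}")
print(f"GPT-5.5:  ${cost(*workload, 5.00, 30.00):.2f}")
```

On this sketch, running the workload through V4-Pro costs roughly a quarter of what the same traffic would cost on GPT-5.5, and the Flash model is cheaper still by an order of magnitude.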