AI Impact Daily

DeepSeek-V4: Efficient Million-Token MoE LLM

Key Questions

What is DeepSeek-V4?

DeepSeek-V4 is an open Mixture-of-Experts (MoE) LLM with 1.6T total parameters and 284B active parameters, designed to handle 1-million-token contexts efficiently. It features a hybrid CSA/HCA attention stack, reducing FLOPs by 27% and KV cache size by 10% compared to V3.
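The gap between total and active parameters comes from sparse expert routing: each token is sent to only a few experts, so only a fraction of the weights run per forward pass. A minimal sketch of top-k routing, using an illustrative expert count and top-k rather than DeepSeek-V4's actual configuration:

```python
import numpy as np

# Hedged sketch of top-k expert routing in an MoE layer. The expert
# count and top_k are hypothetical, not DeepSeek-V4's real config; the
# point is that only the routed ("active") experts run per token.

rng = np.random.default_rng(0)

n_experts = 8   # hypothetical total experts in the layer
top_k = 2       # hypothetical experts activated per token
d_model = 16
n_tokens = 4

tokens = rng.normal(size=(n_tokens, d_model))
router = rng.normal(size=(d_model, n_experts))

logits = tokens @ router                          # (n_tokens, n_experts)
top_idx = np.argsort(logits, axis=-1)[:, -top_k:]  # chosen experts per token

# Softmax over only the selected experts' logits -> routing weights
chosen = np.take_along_axis(logits, top_idx, axis=-1)
weights = np.exp(chosen - chosen.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)

active_fraction = top_k / n_experts
print(f"experts per token: {top_k}/{n_experts} "
      f"({active_fraction:.0%} of expert parameters active)")
```

At V4's reported scale the same idea means roughly 284B of the 1.6T parameters are exercised per token.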

What attention mechanism does DeepSeek-V4 use?

It employs a hybrid CSA/HCA attention stack, enabling efficient processing of 1M-token contexts. Reviews highlight why million-token contexts demand efficiency in the attention mechanism itself, not merely a larger context window.
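The pressure is easy to see in a back-of-envelope KV-cache calculation: cache size grows linearly with sequence length, so at 1M tokens even a modest per-token footprint becomes enormous. The configuration numbers below are hypothetical, not DeepSeek-V4's actual architecture:

```python
# Back-of-envelope KV-cache sizing showing why 1M-token contexts need
# attention/KV optimizations. All config numbers here are hypothetical.

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, dtype_bytes=2):
    """Bytes to cache keys and values for one sequence.
    The leading factor of 2 covers K and V; dtype_bytes=2 assumes fp16."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * dtype_bytes

# Hypothetical dense-baseline config, fp16 cache:
baseline = kv_cache_bytes(n_layers=60, n_kv_heads=8, head_dim=128,
                          seq_len=1_000_000)
print(f"KV cache at 1M tokens: {baseline / 1e9:.1f} GB")   # ≈ 245.8 GB

# A 10% KV-cache reduction (the figure claimed for V4 vs. V3) is large
# in absolute terms at this scale:
print(f"10% reduction saves {0.10 * baseline / 1e9:.1f} GB")
```

Hundreds of gigabytes per sequence is why hybrid or sparse attention, rather than just a bigger window, is the enabling ingredient.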

What optimizations are in DeepSeek-V4?

Key optimizations include mHC residuals, the Muon optimizer, and tuned Newton-Schulz iteration coefficients for spectrum normalization. A vLLM implementation is available for inference.
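Muon-style optimizers use Newton-Schulz iterations to approximately orthogonalize the update matrix, i.e. push all its singular values toward 1. A minimal sketch with the textbook (1.5, -0.5) coefficients; the tuned coefficients the article attributes to V4 would differ, and are not reproduced here:

```python
import numpy as np

# Sketch of the classic Newton-Schulz iteration for orthogonalizing a
# matrix, the primitive behind Muon-style "spectrum normalization".
# Coefficients (1.5, -0.5) are the textbook choice; DeepSeek-V4's
# tuned coefficients are not public in this summary and may differ.

def newton_schulz_orthogonalize(g, steps=25):
    # Scale so the spectral norm is <= 1 (inside the convergence
    # region); each step applies x -> 1.5*x - 0.5*(x x^T) x, which
    # drives every singular value toward 1.
    x = g / np.linalg.norm(g, 2)
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 4))
q = newton_schulz_orthogonalize(g)
# q is now approximately orthogonal: q @ q.T ≈ I
```

The appeal in an optimizer is that the iteration uses only matrix multiplies, so it runs efficiently on accelerators, unlike an explicit SVD.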

How does DeepSeek-V4 perform?

It achieves near state-of-the-art results in reasoning, agents, and coding tasks. Community notes emphasize its efficiency gains over V3.

What is the community reaction to DeepSeek-V4?

It generated buzz on Hacker News (21 points) and drew tweets from influencers such as Omar Sar and EMostaque. Reviews and discussions are building, with calls for reproductions and for benchmarks against models like Gemma4 and GLM.

In brief: a 1.6T-total / 284B-active open MoE with hybrid CSA/HCA attention for 1M-token context (27% fewer FLOPs, 10% smaller KV cache vs. V3); mHC residuals, Muon optimizer, and Newton-Schulz coefficients for spectrum normalization; vLLM implementation; near-SOTA on reasoning, agents, and coding. Hacker News buzz (21 points); community tweets and reviews building (Omar Sar, EMostaque). Urgent asks: reproductions and benchmarks vs. Gemma4, GLM, HISA, and TurboQuant.

Sources (5)
Updated Apr 25, 2026