Recently, DeepSeek, a subsidiary of the Chinese company High-Flyer (幻方量化), introduced two notable models: DeepSeek-V3 and DeepSeek-R1. These models, characterized by their low cost and high efficiency, have had a profound impact on the global tech industry.
On January 27, tech stocks saw a sharp decline, with the Nasdaq dropping 3.1% and the S&P 500 falling 1.5%. Notably, AI chip supplier Nvidia (NVDA) plummeted nearly 17%, wiping out approximately $600 billion in market value in a single day. Concerns began to surface and amplify in the market: Could DeepSeek’s low-cost models undermine the dominance of U.S. tech giants in the AI race? And do they signal a weakening demand for computing power from suppliers like Nvidia?
Today, we will be exploring DeepSeek’s technology, cost structure, and its potential impact on the AI industry!
DeepSeek’s Core Technologies
Mixture-of-Experts (MoE)
DeepSeek-V3 adopts a Mixture-of-Experts (MoE) architecture, which selectively activates different expert modules depending on the input. Although the full model contains approximately 671 billion parameters, only about 37 billion (roughly 5.5%) are activated per token. This design significantly reduces computing resource consumption while maintaining high performance.
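To make the routing idea concrete, here is a minimal top-k MoE layer in PyTorch. The sizes, layer shapes, and gating scheme are invented for illustration and are far simpler than DeepSeek's actual design; the point is only to show how a router activates a small fraction of the total parameters for each token.

```python
import torch
import torch.nn as nn

class TinyMoELayer(nn.Module):
    """Illustrative Mixture-of-Experts layer: each token is routed to top-k experts."""

    def __init__(self, dim=64, n_experts=8, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(n_experts)]
        )
        self.router = nn.Linear(dim, n_experts)  # gating network scores each expert
        self.top_k = top_k

    def forward(self, x):                        # x: (tokens, dim)
        scores = self.router(x)                  # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = weights.softmax(dim=-1)        # normalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e         # tokens sent to expert e in this slot
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

layer = TinyMoELayer()
tokens = torch.randn(5, 64)
print(layer(tokens).shape)  # torch.Size([5, 64]); only 2 of 8 experts ran per token
```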
Dynamic Knowledge Awakening
According to available data, DeepSeek-V3 achieved an accuracy rate of 87.1% on the Massive Multitask Language Understanding (MMLU) benchmark, significantly outperforming its predecessor, DeepSeek-V2, which scored 78.4%. This puts it close to proprietary models like GPT-4o (around 87.2%) and Claude-3.5-Sonnet (88.3%).
Its innovative "Dynamic Knowledge Awakening" mechanism dynamically adjusts the model’s attention distribution, automatically invoking different expert knowledge modules based on input content and context, thereby improving perception and accuracy. For example, when processing AIME 2024 math competition problems, the model prioritizes mathematical logic expert modules to ensure rigorous reasoning.
Long-Text Processing Capability
By combining hierarchical attention and context compression techniques, DeepSeek-V3 achieved a 92.7% accuracy rate in extracting key information from 100,000-character texts in the LongBench v2 test—14% higher than GPT-4o. This breakthrough is attributed to its innovative memory unit partitioning system, which physically isolates different types of contextual information (e.g., factual data, logical chains, task instructions) to prevent information interference.
Hierarchical Attention: This mechanism enables the model to capture both global and local information at different levels. It first segments text into larger chunks to identify key sections and then applies deeper attention calculations within those chunks to extract core information. This hierarchical approach helps the model understand both overarching context and finer details.
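A rough sketch of the two-stage idea in PyTorch follows; the chunk size, mean-pooled chunk scoring, and single query vector are invented stand-ins, not DeepSeek's actual mechanism.

```python
import torch
import torch.nn.functional as F

def hierarchical_attention(query, tokens, chunk_size=128, top_chunks=4):
    """Two-stage attention sketch: rank coarse chunks, then attend within winners.

    query:  (dim,)         e.g. an embedded question
    tokens: (seq_len, dim) the embedded long document
    """
    chunks = tokens.split(chunk_size)                         # coarse segmentation
    summaries = torch.stack([c.mean(dim=0) for c in chunks])  # one vector per chunk
    chunk_scores = summaries @ query                          # stage 1: global relevance
    keep = chunk_scores.topk(min(top_chunks, len(chunks))).indices
    selected = torch.cat([chunks[i] for i in keep])           # stage 2: fine-grained
    weights = F.softmax(selected @ query / selected.shape[-1] ** 0.5, dim=0)
    return weights @ selected                                 # attended representation

doc = torch.randn(1000, 64)
q = torch.randn(64)
print(hierarchical_attention(q, doc).shape)  # torch.Size([64])
```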
Context Compression: Directly analyzing long texts consumes significant resources. Context compression reduces this burden by condensing previous contextual information into a simplified representation that retains key content while discarding unnecessary details. This allows the model to efficiently utilize compressed memory when processing new information, reducing computational load while maintaining long-term contextual understanding.
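And a toy version of context compression, using naive average pooling as a stand-in for a learned compressor; the slot count is arbitrary.

```python
import torch

def compress_context(history, n_slots=8):
    """Condense a long history (seq_len, dim) into n_slots summary vectors.

    Average-pooling equal segments is a crude stand-in for a trained
    compressor, but the resource savings are the same: attention over the
    past now costs O(n_slots) instead of O(seq_len).
    """
    segments = history.chunk(n_slots)
    return torch.stack([s.mean(dim=0) for s in segments])

history = torch.randn(4096, 64)   # 4,096 past tokens
memory = compress_context(history)
print(memory.shape)               # torch.Size([8, 64]): 512x fewer vectors to attend over
```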
Chinese Language Proficiency
According to existing data, DeepSeek-V3 has demonstrated strong performance in factual knowledge testing in Chinese. In the C-SimpleQA benchmark, it achieved an accuracy rate of 89.3%, outperforming Qwen2.5-72B by 8%. This achievement is largely attributed to its semantic grid technology, which enhances the model’s understanding of Chinese idioms, dialects, and technical terminology, bringing it to a level comparable to native experts.
Semantic Grid: This technology integrates semantic analysis with grid computing, aiming to enhance efficiency through semantic tagging and ontology-based associations. By improving resource discovery and interoperability, it enables more precise search, integration, and utilization of information, thereby optimizing computing and data processing efficiency.

On the systems side, DeepSeek has also optimized its models for Nvidia's Hopper-architecture GPUs, leveraging NVLink and RDMA scheduling to achieve 160 GB/s intra-node and 50 GB/s inter-node bandwidth on H800 clusters.
DeepSeek Models
Based on the above core technology framework, DeepSeek has recently launched three representative models, showcasing its AI innovation capabilities:
DeepSeek-V3
DeepSeek-V3 is a general-purpose model built on a Mixture-of-Experts (MoE) architecture with a total of 671 billion parameters, of which only 37 billion are activated per token, keeping inference computationally efficient. Additionally, the model introduces Multi-head Latent Attention (MLA), which employs low-rank joint compression to shrink the key-value (KV) cache, further improving inference efficiency.
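A heavily simplified sketch of the low-rank caching idea behind MLA (DeepSeek's published design additionally decouples rotary position embeddings and uses per-head up-projections): only a small latent vector is cached per token, and keys and values are reconstructed from it on demand.

```python
import torch
import torch.nn as nn

class LatentKVSketch(nn.Module):
    """Cache one small latent per token; expand it to K and V when needed."""

    def __init__(self, dim=512, latent_dim=64):
        super().__init__()
        self.down = nn.Linear(dim, latent_dim)  # joint low-rank compression
        self.up_k = nn.Linear(latent_dim, dim)  # reconstruct keys
        self.up_v = nn.Linear(latent_dim, dim)  # reconstruct values

    def forward(self, hidden):                  # hidden: (seq, dim)
        latent = self.down(hidden)              # (seq, latent_dim) is all that is cached
        return self.up_k(latent), self.up_v(latent)

m = LatentKVSketch()
k, v = m(torch.randn(10, 512))
print(k.shape, v.shape)  # cache stores 64 floats per token instead of 2 * 512: a 16x saving
```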
DeepSeek-R1-Zero
DeepSeek-R1-Zero is the first foundational reasoning model trained entirely through Reinforcement Learning (RL) without relying on Supervised Fine-Tuning (SFT) labeled data. The model exhibits self-verification and reflective capabilities, highlighting the potential of RL-driven reasoning AI. However, due to the absence of SFT, R1-Zero’s outputs may suffer from poor readability and incoherent language.
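As a toy illustration of how a training signal can exist without human-labeled reasoning, the sketch below scores completions with purely rule-based checks (answer correctness plus output format), in the spirit of the rewards DeepSeek describes; the tag names and weights are hypothetical.

```python
import re

def reasoning_reward(completion: str, gold_answer: str) -> float:
    """Toy rule-based reward: no human labels on the reasoning itself.

    Only verifiable signals are rewarded: did the model wrap its reasoning
    in the expected tags, and does its final answer match a known-correct
    one? Tag names and weights here are illustrative.
    """
    reward = 0.0
    if re.search(r"<think>.*</think>", completion, re.DOTALL):
        reward += 0.2  # format reward
    match = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if match and match.group(1).strip() == gold_answer:
        reward += 1.0  # accuracy reward
    return reward

out = "<think>2 + 2 = 4, because ...</think><answer>4</answer>"
print(reasoning_reward(out, "4"))  # 1.2
```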
DeepSeek-R1
To address R1-Zero’s shortcomings, DeepSeek-R1 incorporates limited labeled data and a multi-stage RL process. Initially, a small set of high-quality reasoning chain data is used for fine-tuning, followed by a two-stage RL process to improve output readability and coherence. DeepSeek-R1 has demonstrated strong performance in various tests, with reasoning capabilities comparable to OpenAI’s o1 model.
These models showcase DeepSeek’s ability to enhance AI model performance while reducing costs, profoundly influencing the AI industry landscape.
Source: DeepSeek-R1 GitHub page
Technical Cost Analysis
Due to U.S. export restrictions on advanced H100 chips, DeepSeek says its V3 model was trained solely on older H800 chips (comparable raw compute, but reduced interconnect bandwidth). The reported total training cost was only $5.576 million, more than ten times lower than the figures attributed to Meta's Llama 3.1 and OpenAI's GPT-4 (see the table below).
| Brand | Model | Cost |
|---|---|---|
| DeepSeek | DeepSeek-V3 | $5.576M |
| Meta | Llama 3.1-405B | $92.52M |
| OpenAI | GPT-4 | $70.875M |
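For context, DeepSeek's technical report derives this headline figure from rented GPU time rather than hardware purchases: roughly 2.788 million H800 GPU-hours priced at an assumed $2 per GPU-hour. The arithmetic is easy to check:

```python
gpu_hours = 2_788_000     # total H800 GPU-hours reported for V3 training
usd_per_gpu_hour = 2.00   # rental rate assumed in the report
print(f"${gpu_hours * usd_per_gpu_hour / 1e6:.3f}M")  # $5.576M
```

The report also notes that this figure covers the final training run only, excluding earlier research and ablation experiments.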
However, DeepSeek-V3's response patterns bear a striking resemblance to ChatGPT's, raising speculation that DeepSeek might have employed knowledge distillation during pre-training, using OpenAI models as the teacher to enhance its own capabilities.
Knowledge Distillation: A machine learning technique that transfers knowledge from a large model (teacher) to a smaller model (student). The student model learns to approximate the teacher’s performance while being more efficient and lightweight.
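For illustration, here is the standard distillation loss in PyTorch, assuming access to the teacher's logits. Notably, closed-model APIs generally expose only generated text, so any distillation against them would have to supervise on outputs rather than logits.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soften both distributions, then push the student toward the teacher.

    KL divergence on temperature-scaled probabilities; the T^2 factor keeps
    gradient magnitudes comparable across temperatures.
    """
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * t * t

student = torch.randn(4, 100, requires_grad=True)  # (batch, vocab) logits
teacher = torch.randn(4, 100)
print(distillation_loss(student, teacher))
```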
David Sacks, the White House adviser on AI and crypto, has suggested that DeepSeek may have replicated ChatGPT's technology. Additionally, insiders claim OpenAI possesses evidence that some Chinese companies have attempted to clone its models via distillation, potentially violating OpenAI's terms of service.
Impact Analysis
AI Hardware Market
DeepSeek’s emergence has brought renewed attention to distillation techniques and the feasibility of achieving high performance with smaller models. It has also prompted a reassessment of AI hardware demand.
In the short term, DeepSeek’s use of lower-cost hardware like H800 chips and its innovations in reducing computing power requirements could weaken demand for high-end GPUs like Nvidia’s H100. This was a key factor behind Nvidia’s stock price drop following the announcement.
However, some experts remain optimistic, arguing that technological advances that lower AI model costs and improve efficiency will drive more applications and more demand, ultimately boosting AI hardware needs. This dynamic is consistent with the Jevons Paradox.
Jevons Paradox: A concept in economics where technological improvements that increase the efficiency of a resource's use lead to greater total consumption of that resource, because lower costs stimulate demand. The British economist William Stanley Jevons observed in The Coal Question that although James Watt's improvements made the steam engine far more efficient, coal consumption in the United Kingdom rose significantly as a result.
Market Response
AI Giants’ Reactions
OpenAI
On January 31, OpenAI launched its o3-mini model, which many interpreted as a response to DeepSeek-R1. Some OpenAI insiders accused DeepSeek of intellectual property infringement. However, CEO Sam Altman stated that the company has "no plans" to sue DeepSeek or other Chinese AI startups, emphasizing that OpenAI’s focus remains on building superior products and maintaining leadership through innovation.
Anthropic
Anthropic CEO Dario Amodei remarked: "DeepSeek’s models perform similarly to U.S. models from 7-10 months ago but at a lower cost. Cost reductions were expected; China just demonstrated them first." While he dismissed DeepSeek as a direct threat, he advocated for stricter chip export controls to solidify U.S. leadership in AI.