DeepSeek

From Systems Analysis Wiki

DeepSeek is a Chinese artificial intelligence research company that develops large language models (LLMs) and multimodal systems. The firm gained widespread recognition for openly distributing its model weights and for the models' high cost-effectiveness, which triggered a repricing of the AI market in late 2024 and early 2025.[1]

History

DeepSeek was founded by Liang Wenfeng, an entrepreneur and co-founder of the hedge fund High-Flyer. In the spring of 2023, High-Flyer spun off its AI research division, which became the company DeepSeek AI in May of that year. By 2025, the staff had grown to roughly 160 employees.[2] From its inception, the company declared a commitment to openness, publishing model weights ("open-weight") under permissive licenses and focusing on fundamental AGI research.

Unlike most startups, DeepSeek is funded from High-Flyer's R&D budget, which, according to the founder, allows it to focus on long-term goals rather than immediate monetization.[3]

The company caused a significant stir in the technology and financial communities in January 2025 after releasing the DeepSeek-R1 model. The announcement that training a model comparable to GPT-4 had cost less than $6 million (against estimates of over $100 million for GPT-4) triggered a sharp drop in the stock prices of tech giants and forced the industry to rethink the "more compute = better model" paradigm.[4]

Architectural Features

Mixture-of-Experts (DeepSeekMoE)
Most of DeepSeek's flagship models use a Mixture-of-Experts (MoE) architecture. Unlike "dense" models, where all parameters are activated to process a request, MoE models engage only a small subset of specialized subnetworks ("experts") for each token. DeepSeek developed its own MoE implementation with "shared" experts, fine-grained segmentation, and load balancing without auxiliary losses, which allows activating only a fraction of the hundreds of billions of parameters and drastically reducing computational costs.[5]
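The routing idea described above can be sketched in a few lines of NumPy. This is a toy illustration, not DeepSeek's implementation: the expert count, dimensions, and the plain softmax gate are arbitrary choices, and the real DeepSeekMoE design adds fine-grained expert segmentation and an auxiliary-loss-free balancing bias on top of this.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8            # hidden size (toy value)
N_ROUTED = 16    # routed experts, selected per token
N_SHARED = 2     # shared experts, always active
TOP_K = 4        # routed experts activated per token

# Each "expert" is reduced to a single weight matrix here; a real
# expert is a small feed-forward network.
routed = [rng.normal(size=(D, D)) for _ in range(N_ROUTED)]
shared = [rng.normal(size=(D, D)) for _ in range(N_SHARED)]
gate_w = rng.normal(size=(D, N_ROUTED))   # router projection

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Process one token: shared experts always run, plus top-k routed."""
    scores = x @ gate_w                    # router logits, one per expert
    top = np.argsort(scores)[-TOP_K:]      # indices of the k best experts
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()               # softmax over selected experts only
    out = sum(x @ w for w in shared)       # shared path, always computed
    out += sum(w * (x @ routed[i]) for i, w in zip(top, weights))
    return out

x = rng.normal(size=D)
y = moe_forward(x)
# Only N_SHARED + TOP_K = 6 of the 18 expert matrices touch this token.
```

Scaled up, the same pattern is what lets a model with hundreds of billions of total parameters activate only a few tens of billions per token.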
Multi-Head Latent Attention (MLA)
A method for compressing the KV cache into a latent vector, saving up to 93% of memory and enabling context windows of up to 128,000 tokens. This technology is key for efficiently processing long texts.[6]
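A minimal sketch of the latent-compression idea, with toy dimensions: instead of caching full keys and values (2·d floats per token), only a small latent vector is cached, and K/V are re-expanded from it at attention time. Real MLA additionally routes rotary position embeddings through a decoupled key path, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 64        # model hidden size (toy value)
D_LATENT = 8  # compressed latent size, much smaller than D

W_down = rng.normal(size=(D, D_LATENT))  # compress hidden state -> latent
W_uk = rng.normal(size=(D_LATENT, D))    # latent -> key
W_uv = rng.normal(size=(D_LATENT, D))    # latent -> value

kv_cache = []  # stores only D_LATENT floats per token, not 2*D

def cache_token(h: np.ndarray) -> None:
    kv_cache.append(h @ W_down)

def attend(q: np.ndarray) -> np.ndarray:
    lat = np.stack(kv_cache)               # (tokens, D_LATENT)
    K, V = lat @ W_uk, lat @ W_uv          # reconstruct K/V on the fly
    a = np.exp(q @ K.T / np.sqrt(D))
    a /= a.sum()                           # attention weights over cached tokens
    return a @ V

cache_token(rng.normal(size=D))
cache_token(rng.normal(size=D))
ctx = attend(rng.normal(size=D))
```

With these toy numbers the cache holds 8 floats per token instead of 128, a 93.75% reduction, which mirrors the ~93% figure reported for MLA.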
FP8 Training and Multi-Token Prediction
The V3 family of models utilizes mixed-precision FP8 (8-bit floating-point numbers) and simultaneous prediction of multiple tokens, which accelerates both the training and inference processes.[7]
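The gain from FP8 comes from storing 8-bit values plus a scale factor instead of 16- or 32-bit floats. The sketch below simulates E4M3-style rounding (3-bit mantissa, maximum normal value 448) in NumPy as an illustration only; it ignores the format's exponent range and the fine-grained block-wise scaling that DeepSeek-V3 actually uses.

```python
import numpy as np

def quantize_e4m3(x: np.ndarray):
    """Toy per-tensor FP8(E4M3)-style quantization: scale values into the
    representable range, then round each mantissa to 3 bits."""
    amax = np.abs(x).max()
    scale = 448.0 / amax                       # 448 = max normal E4M3 value
    xs = x * scale
    # Round to the nearest multiple of 2^(exponent - 3), i.e. keep a
    # 3-bit mantissa; per-element relative error is at most 2^-4.
    exp = np.floor(np.log2(np.abs(xs) + 1e-30))
    step = 2.0 ** (exp - 3)
    xq = np.round(xs / step) * step
    return xq, scale

def dequantize(xq: np.ndarray, scale: float) -> np.ndarray:
    return xq / scale

x = np.random.default_rng(2).normal(size=1000).astype(np.float32)
xq, s = quantize_e4m3(x)
err = np.abs(dequantize(xq, s) - x).max()      # bounded by max|x| / 16
```

The point of the sketch is the trade-off: halving the bytes per value while keeping the relative error within a few percent, which is why matrix multiplications tolerate FP8 well during both training and inference.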

Model Family

  • DeepSeek LLM — base models with 7 and 67 billion parameters (2023), the first bilingual (EN/ZH) release, which surpassed LLaMA-2 70B on several tasks.[8]
  • DeepSeek-Coder (2023) — a line of models for programming (1.3–33 billion) and its successor Coder-V2 (16 billion / 236 billion MoE, 128K context, 338 coding languages).[9]
  • DeepSeek-V2 (May 2024) — a 236 billion (21 billion active) MoE LLM with MLA; trained on 8.1 trillion tokens.[10]
  • DeepSeek-V3 (December 2024) — 671 billion (37 billion active); training took ≈2.8 million GPU-hours on Nvidia H800s at a cost of ≈$5.5 million.[11]
  • DeepSeek-R1 (January 2025) — a line of models for logical reasoning; the R1-0528 version approached OpenAI o3 on AIME 2025 and LiveCodeBench.[12]
  • DeepSeek-VL / VL2 — multimodal VL models (up to 4.5 billion active) with dynamic mosaic processing of 1024×1024 images.[13]
  • DeepSeek-Math 7B — a specialized model with 51.7% accuracy on the MATH benchmark; close to GPT-4.[14]
  • DeepSeek-Prover-V2 — a 671 billion MoE model for theorem proving in Lean 4; 63.5% on miniF2F.
  • Distilled R1 models — open-weight versions from 1.5 to 70 billion parameters based on Llama and Qwen.[15]

Chronology of Key Releases

Date           Release and Key Features
Nov 2, 2023    DeepSeek-Coder v1: First open-weight models for code.
Nov 29, 2023   DeepSeek LLM 7B/67B: Bilingual model trained on 2 trillion tokens.
Jan 11, 2024   DeepSeek-MoE 16B: Debut of the MoE architecture.
Feb 6, 2024    DeepSeek-Math 7B: Specialized model for mathematics (51.7% on MATH).
May 6, 2024    DeepSeek-V2 236B: Introduction of MLA on top of the MoE architecture.
Jun 17, 2024   DeepSeek-Coder-V2: 128K context, support for 338 programming languages.
Dec 13, 2024   DeepSeek-VL2: MoE-based multimodal model.
Dec 27, 2024   DeepSeek-V3 671B: Flagship model trained for less than $6 million.
Jan 20, 2025   DeepSeek-R1 / R1-Zero: Reasoning models trained with reinforcement learning.
Jan 27, 2025   Janus-Pro: Image generation model reported to surpass DALL-E 3 on text-to-image benchmarks.

Performance and Benchmarks

  • DeepSeek-V3 surpassed Llama 3.1 and Qwen 2.5 and approached the performance of GPT-4 on MMLU and GPQA-Diamond.[16]
  • DeepSeek-Coder-V2 scored 72.9% on Arena-Hard, achieving parity with GPT-4o and trailing only Claude-3.5-Sonnet among the models compared.[17]
  • DeepSeek-Math 7B achieved 51.7% on MATH, close to Gemini-Ultra at 10 times smaller size.[18]
  • R1-Zero improved the AIME 2024 pass@1 score from 15.6% to 71% solely through RL training.[19]

Economics and API

DeepSeek offers a public API for its V3 and R1 models at prices ranging from $0.07 to $0.14 per million input tokens on a cache hit and from $1.10 to $2.19 per million output tokens—up to tens of times cheaper than GPT-4o rates.[20]
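At these rates, per-request costs can be estimated directly. The sketch below uses the lower-bound prices quoted above; the helper name and the example token counts are illustrative, and actual DeepSeek pricing varies by model and cache status.

```python
# Prices in USD per million tokens, taken from the range quoted above
# (lower bounds: cache-hit input, cheapest output tier).
PRICE_IN_CACHE_HIT = 0.07   # $/1M input tokens on a cache hit
PRICE_OUT = 1.10            # $/1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one API call at the lower-bound rates."""
    return (input_tokens * PRICE_IN_CACHE_HIT
            + output_tokens * PRICE_OUT) / 1_000_000

# e.g. a 4,000-token prompt answered with 1,000 tokens
cost = request_cost(4_000, 1_000)   # about $0.00138
```

Even at the upper end of the quoted range, a million such requests would cost on the order of thousands of dollars, which is the basis for the "tens of times cheaper than GPT-4o" comparison.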

Licensing and Open Source

Most models are distributed under the MIT or Apache 2.0 license, which permits commercial use. The company publishes model weights on Hugging Face and GitHub but keeps the full datasets and training pipelines proprietary ("open weight, but not full open source").

Industry Impact

  • The launch of R1 caused a one-day drop in the stock prices of NVIDIA, Microsoft, and other companies amid news of a "GPT-4 class model for $6 million".[21]
  • The demonstration of successful training on Nvidia H800 chips under export restrictions spurred debate about the effectiveness of US sanctions and accelerated the development of Chinese AI accelerators (e.g., Huawei Ascend 910B).

Criticism and Limitations

  • Security: In HarmBench testing, the R1 model failed to block any of the harmful prompts, i.e. a 100% jailbreak success rate.
  • Political Censorship: The chat versions filter topics "sensitive" to the Chinese government (e.g., the 1989 Tiananmen Square events, the status of Taiwan).
  • Data Storage: The storage of user data on servers in China limits API usage by Western corporations subject to GDPR and similar legal regimes.[22]

Literature

  • Dai, D. et al. (2024). DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv:2401.06066.
  • Ding, Y. et al. (2024). LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. arXiv:2402.13753.
  • Fedus, W.; Zoph, B.; Shazeer, N. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. arXiv:2101.03961.
  • He, L. et al. (2025). Scaling Instruction-Tuned LLMs to Million-Token Contexts via Hierarchical Synthetic Data Generation. arXiv:2504.12637.
  • Jegham, N. et al. (2025). Visual Reasoning Evaluation of Grok, Deepseek Janus, Gemini, Qwen, Mistral, and ChatGPT. arXiv:2502.16428.
  • Lepikhin, D. et al. (2020). GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv:2006.16668.
  • Peng, B. et al. (2023). YaRN: Efficient Context Window Extension of Large Language Models. arXiv:2309.00071.
  • Shen, Y. et al. (2025). Long-VITA: Scaling Large Multi-modal Models to 1 Million Tokens with Leading Short-Context Accuracy. arXiv:2502.05177.
  • Su, J. et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. arXiv:2104.09864.
  • Zhong, M. et al. (2024). Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective. arXiv:2406.13282.

References

  1. DeepSeek's low-cost AI spotlights billions spent by US tech // Reuters. 2025-01-27.
  2. Who is Liang Wenfeng, the founder of DeepSeek? // Reuters. 2025-01-28.
  3. Who is Liang Wenfeng, the founder of DeepSeek? // Reuters. 2025-01-28.
  4. DeepSeek's low-cost AI spotlights billions spent by US tech // Reuters. 2025-01-27.
  5. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model // Hugging Face. 2024.
  6. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model // Hugging Face. 2024.
  7. DeepSeek-V3: A Parameter-Efficient MoE Large Language Model with Better Performance // arXiv. 2024.
  8. DeepSeek LLM: Scaling Open-Source Language Models with Longtermism // arXiv. 2024.
  9. DeepSeek-Coder-V2: A More Powerful and Economical Coder // arXiv. 2024.
  10. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model // Hugging Face. 2024.
  11. DeepSeek-V3: A Parameter-Efficient MoE Large Language Model with Better Performance // arXiv. 2024.
  12. DeepSeek-R1: A 671B Parameter MoE LLM with Unprecedented Reasoning Capabilities // arXiv. 2025.
  13. GitHub - deepseek-ai/DeepSeek-VL: Towards Real-World Vision-Language Understanding // GitHub.
  14. DeepSeek-Math: Pushing the Limits of Mathematical Reasoning in Open-Source Models // arXiv. 2024.
  15. DeepSeek-R1: A 671B Parameter MoE LLM with Unprecedented Reasoning Capabilities // arXiv. 2025.
  16. DeepSeek-V3: A Parameter-Efficient MoE Large Language Model with Better Performance // arXiv. 2024.
  17. DeepSeek-Coder-V2: A More Powerful and Economical Coder // arXiv. 2024.
  18. DeepSeek-Math: Pushing the Limits of Mathematical Reasoning in Open-Source Models // arXiv. 2024.
  19. DeepSeek-R1: A 671B Parameter MoE LLM with Unprecedented Reasoning Capabilities // arXiv. 2025.
  20. DeepSeek Explained: Why This AI Model Is Gaining Popularity // DigitalOcean.
  21. DeepSeek's low-cost AI spotlights billions spent by US tech // Reuters. 2025-01-27.
  22. DeepSeek's low-cost AI spotlights billions spent by US tech // Reuters. 2025-01-27.
