The consequences of DeepSeek rippled through the market earlier today as some investors woke up to the possibility of large language models being trained and run at a much lower cost. In my view, the guidance for major chipmakers is rightfully impacted. The overall takeaway is that there is less demand for the high-end chips that makers like Nvidia and Broadcom produce. Ironically, Nvidia also makes the lower-end chips that were likely used to train DeepSeek’s large language model.
Understanding DeepSeek R1: Key Facts and Clarifications
Recent coverage of DeepSeek R1 has led to some misconceptions that warrant clarification:
Regarding training costs and infrastructure, the widely cited $5.5M figure covers only the GPU hours used to train the base model; it excludes additional testing, smaller models, data generation, and the complete DeepSeek R1 training process.
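As a rough sanity check on that figure, the sketch below multiplies a GPU-hour count by a rental rate. Both inputs are illustrative assumptions rather than official DeepSeek numbers, chosen only to show how a large training run can land in the $5.5M range.

```python
# Back-of-the-envelope check on the ~$5.5M base-model compute figure.
# Both inputs are illustrative assumptions, not official DeepSeek numbers.
gpu_hours = 2.788e6        # assumed total GPU hours for the base-model training run
usd_per_gpu_hour = 2.00    # assumed rental rate per GPU hour

cost_millions = gpu_hours * usd_per_gpu_hour / 1e6
print(f"Estimated compute cost: ~${cost_millions:.2f}M")  # ~$5.58M
```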
DeepSeek operates with substantial backing from High-Flyer, a Chinese hedge fund managing over $7 billion as of 2020. Their team includes olympiad medalists in mathematics, physics, and informatics.
Technical specifications:
- Their infrastructure encompasses approximately 50,000 GPUs
- The complete DeepSeek R1 is a 671B parameter MoE model requiring more than 16 H100 GPUs, each with 80GB memory
- They have released six “distilled” versions, created by fine-tuning Qwen and Llama models on 800k samples (supervised fine-tuning only, without reinforcement learning)
- While the smallest 1.5B parameter version can run locally, it differs significantly from the full R1 model (see the rough memory estimate after this list)
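To put the hardware requirements in perspective, the sketch below estimates weights-only memory for the full 671B MoE model and the smallest 1.5B distilled model. The bytes-per-parameter values are assumptions for illustration; real deployments also need memory for the KV cache and activations.

```python
import math

# Weights-only VRAM estimates; bytes-per-parameter values are illustrative assumptions.
# Real deployments need additional memory for KV cache and activations.
GPU_MEMORY_GB = 80  # H100 with 80GB

models = [
    ("Full R1, 671B MoE at 1 byte/param (e.g. FP8)", 671e9, 1),
    ("Full R1, 671B MoE at 2 bytes/param (e.g. BF16)", 671e9, 2),
    ("Distilled 1.5B at 2 bytes/param", 1.5e9, 2),
]

for label, params, bytes_per_param in models:
    weights_gb = params * bytes_per_param / 1e9
    gpus_needed = math.ceil(weights_gb / GPU_MEMORY_GB)
    print(f"{label}: ~{weights_gb:.0f} GB of weights -> at least {gpus_needed} x 80GB GPU(s)")
```

At two bytes per parameter, the full model’s weights alone exceed sixteen 80GB GPUs, consistent with the figure above, while the 1.5B distilled model fits comfortably on a single consumer GPU.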
Users should note that according to the terms of service, the hosted version at chat.deepseek.com may utilize user data for future model training.
The advancement of open science and open-source software will ultimately benefit the broader community. Hugging Face is currently developing a fully open reproduction pipeline.