Improve model card with paper link and pipeline tag

#80 opened by nielsr (HF Staff)

Files changed (1): README.md (+14 -2)

README.md CHANGED
@@ -1,7 +1,9 @@
 ---
-license: mit
 library_name: transformers
+license: mit
+pipeline_tag: text-generation
 ---
+
 # DeepSeek-V3-0324
 <!-- markdownlint-disable first-line-h1 -->
 <!-- markdownlint-disable html -->
@@ -197,5 +199,15 @@ This repository and the model weights are licensed under the [MIT License](LICEN
 }
 ```
 
+## Paper title and link
+
+The model was presented in the paper [Insights into DeepSeek-V3: Scaling Challenges and Reflections on Hardware for AI Architectures](https://huggingface.co/papers/2505.09343).
+
+## Paper abstract
+
+The abstract of the paper is the following:
+
+The rapid scaling of large language models (LLMs) has unveiled critical limitations in current hardware architectures, including constraints in memory capacity, computational efficiency, and interconnection bandwidth. DeepSeek-V3, trained on 2,048 NVIDIA H800 GPUs, demonstrates how hardware-aware model co-design can effectively address these challenges, enabling cost-efficient training and inference at scale. This paper presents an in-depth analysis of the DeepSeek-V3/R1 model architecture and its AI infrastructure, highlighting key innovations such as Multi-head Latent Attention (MLA) for enhanced memory efficiency, Mixture of Experts (MoE) architectures for optimized computation-communication trade-offs, FP8 mixed-precision training to unlock the full potential of hardware capabilities, and a Multi-Plane Network Topology to minimize cluster-level network overhead. Building on the hardware bottlenecks encountered during DeepSeek-V3's development, we engage in a broader discussion with academic and industry peers on potential future hardware directions, including precise low-precision computation units, scale-up and scale-out convergence, and innovations in low-latency communication fabrics. These insights underscore the critical role of hardware and model co-design in meeting the escalating demands of AI workloads, offering a practical blueprint for innovation in next-generation AI systems.
+
 ## Contact
-If you have any questions, please raise an issue or contact us at [[email protected]]([email protected]).
+If you have any questions, please raise an issue or contact us at [[email protected]]([email protected]).