---
library_name: transformers
license: mit
pipeline_tag: text-generation
tags:
- DeepSeek-R1-0528
- GPTQ
- Int4-Int8Mix
- 量化修复
- vLLM
base_model:
  - deepseek-ai/DeepSeek-R1-0528
base_model_relation: quantized
---
# DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact
Base model: [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

This repository contains a mixed-precision (Int4 with selective Int8) GPTQ quantization of DeepSeek-R1-0528 for vLLM. We started from a standard 4-bit (AWQ/GPTQ) conversion that follows vLLM’s default quantization layout, but early tests showed that a fully Int4 model did not retain enough numerical precision for this checkpoint and could produce unstable outputs.

Guided by this preliminary analysis, we introduced targeted, per-layer Int8 refinement: only the layers most sensitive to quantization are stored in Int8 (the Compact version has more Int8 layers), while the rest remain Int4. This keeps the file-size increase over the pure 4-bit baseline small while restoring response quality.
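
The selective refinement described above amounts to a per-layer bit-width lookup. The sketch below illustrates the idea only: the layer names and override patterns are hypothetical and are not the layer selection actually used in this repository.

```python
import re
from typing import Dict

# Hypothetical override table: regex on the layer name -> bit width.
# Illustrative only; NOT the layer selection shipped in this repository.
INT8_OVERRIDES: Dict[str, int] = {
    r"\.self_attn\.": 8,
    r"\.shared_experts\.": 8,
}
DEFAULT_BITS = 4


def bits_for_layer(name: str,
                   overrides: Dict[str, int] = INT8_OVERRIDES,
                   default: int = DEFAULT_BITS) -> int:
    """Return the bit width for a layer: the first matching override wins, else the default."""
    for pattern, bits in overrides.items():
        if re.search(pattern, name):
            return bits
    return default


def average_bits(layer_params: Dict[str, int]) -> float:
    """Parameter-weighted average bit width over a {layer_name: parameter_count} mapping."""
    total = sum(layer_params.values())
    return sum(bits_for_layer(n) * p for n, p in layer_params.items()) / total


if __name__ == "__main__":
    # Toy parameter counts, chosen only to make the arithmetic easy to follow.
    demo = {
        "model.layers.3.self_attn.q_proj": 100,
        "model.layers.3.mlp.experts.0.up_proj": 300,
    }
    for layer in demo:
        print(layer, bits_for_layer(layer))
    print("average bits:", average_bits(demo))
```

Because most parameters stay at 4 bits, promoting a small set of sensitive layers to 8 bits raises the weighted average only moderately, which is why the size increase over a pure Int4 model stays limited.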

Currently, vllm==0.9.0 does not support per-layer quantization settings for the MoE module. We provide a basic implementation by adding a `get_moe_quant_method` function to `gptq_marlin.py`. Until the PR is merged, please replace the corresponding file in your installation with the attached one.

### 【Model Update Date】
``` 
2025-05-31
1. fast commit
```

### 【Dependencies】

```
vllm==0.9.0
transformers==4.52.3
```
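
A quick way to confirm that an environment matches these pins is to compare them against the installed package metadata. This is a minimal sketch using only the standard library; the helper names are illustrative, and the comparison is an exact string match, so local builds with a version suffix would be reported as mismatches.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional

# Versions pinned in the dependency list above.
PINNED = {"vllm": "0.9.0", "transformers": "4.52.3"}


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def mismatches(installed: Dict[str, Optional[str]],
               pinned: Dict[str, str] = PINNED) -> Dict[str, str]:
    """Map each package whose installed version differs from its pin to a readable message."""
    report = {}
    for name, wanted in pinned.items():
        found = installed.get(name)
        if found != wanted:
            report[name] = f"expected {wanted}, found {found or 'not installed'}"
    return report


if __name__ == "__main__":
    current = {name: installed_version(name) for name in PINNED}
    for name, message in mismatches(current).items():
        print(f"{name}: {message}")
```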

<div style="
    background: rgba(255, 193, 61, 0.15);
    padding: 16px;
    border-radius: 6px;
    border: 1px solid rgba(255, 165, 0, 0.3);
    margin: 16px 0;
">
### 【💡Notes on New vLLM Versions💡】

#### 1. Recommend Using V0 Inference Mode
Before launching vLLM, set the following environment variable:
```
export VLLM_USE_V1=0
```
</div>

<div style="
    background: rgba(255, 0, 200, 0.15);
    padding: 16px;
    border-radius: 6px;
    border: 1px solid rgba(255, 0, 200, 0.3);
    margin: 16px 0;
">
### 【💡 Patch for gptq_marlin.py💡】

At present, vllm==0.9.0 lacks support for per-layer quantization configurations for the MoE module, which causes errors when loading the model.
We have implemented a simple fix by adding a `get_moe_quant_method` function to `gptq_marlin.py`.

Until the PR is merged, please replace the `gptq_marlin.py` file in your installation with the attached version, placing it at:
```
.../site-packages/vllm/model_executor/layers/quantization/gptq_marlin.py
```

</div>
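
The replacement step can be scripted so that the original file is backed up first. This is a minimal sketch, assuming the attached `gptq_marlin.py` has been saved locally; the `install_patch` helper and the `.orig` backup name are illustrative and not part of vLLM.

```python
import shutil
import sysconfig
from pathlib import Path

# Location relative to site-packages, taken from the path given above.
RELATIVE_TARGET = Path("vllm/model_executor/layers/quantization/gptq_marlin.py")


def install_patch(patched_file: Path, site_packages: Path) -> Path:
    """Back up the installed gptq_marlin.py, then overwrite it with the patched copy."""
    target = site_packages / RELATIVE_TARGET
    if not target.is_file():
        raise FileNotFoundError(f"{target} not found; is vLLM installed in this environment?")
    backup = target.with_name(target.name + ".orig")
    if not backup.exists():  # keep the first original even if the script is run twice
        shutil.copy2(target, backup)
    shutil.copy2(patched_file, target)
    return target


if __name__ == "__main__":
    # "platlib" is where packages with compiled extensions are installed
    # for the running interpreter.
    site_packages = Path(sysconfig.get_paths()["platlib"])
    print("patch target:", site_packages / RELATIVE_TARGET)
    # Uncomment to apply, pointing at the downloaded patched file:
    # install_patch(Path("gptq_marlin.py"), site_packages)
```

To undo the change, copy `gptq_marlin.py.orig` back over `gptq_marlin.py`.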


### 【Model List】

| FILE SIZE    | LATEST UPDATE TIME       |
|---------|--------------|
| `414GB` | `2025-06-01` |



### 【Model Download】

```python
from huggingface_hub import snapshot_download
snapshot_download('QuantTrio/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact', cache_dir="local_path")
```
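
`snapshot_download` returns the local snapshot directory, which can then be passed to vLLM. The sketch below only builds the environment and command line for `vllm serve` with the V0 engine recommended above; the tensor-parallel size and context length are assumptions that depend on your hardware, and `MODEL_DIR` is a placeholder.

```python
import os
import shlex
from typing import Dict, List, Tuple

MODEL_DIR = "local_path"  # placeholder: use the directory returned by snapshot_download


def build_launch(model_dir: str,
                 tensor_parallel_size: int = 8,   # assumption: one 8-GPU node
                 max_model_len: int = 32768       # assumption: adjust to available memory
                 ) -> Tuple[Dict[str, str], List[str]]:
    """Return (env, argv) for starting a vLLM OpenAI-compatible server with the V0 engine."""
    env = dict(os.environ)
    env["VLLM_USE_V1"] = "0"  # V0 inference mode, as recommended in the notes above
    argv = [
        "vllm", "serve", model_dir,
        "--tensor-parallel-size", str(tensor_parallel_size),
        "--max-model-len", str(max_model_len),
        "--trust-remote-code",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch(MODEL_DIR)
    print("VLLM_USE_V1=" + env["VLLM_USE_V1"], shlex.join(argv))
    # To actually start the server:
    # import subprocess; subprocess.run(argv, env=env, check=True)
```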


## DeepSeek-R1-0528
<!-- markdownlint-disable first-line-h1 -->
<!-- markdownlint-disable html -->
<!-- markdownlint-disable no-duplicate-header -->

<div align="center">
  <img src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/logo.svg?raw=true" width="60%" alt="DeepSeek-V3" />
</div>
<hr>
<div align="center" style="line-height: 1;">
  <a href="https://www.deepseek.com/" target="_blank" style="margin: 2px;">
    <img alt="Homepage" src="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/badge.svg?raw=true" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://chat.deepseek.com/" target="_blank" style="margin: 2px;">
    <img alt="Chat" src="https://img.shields.io/badge/🤖%20Chat-DeepSeek%20R1-536af5?color=536af5&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://huggingface.co/deepseek-ai" target="_blank" style="margin: 2px;">
    <img alt="Hugging Face" src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-DeepSeek%20AI-ffc107?color=ffc107&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center" style="line-height: 1;">
  <a href="https://discord.gg/Tc7c45Zzu5" target="_blank" style="margin: 2px;">
    <img alt="Discord" src="https://img.shields.io/badge/Discord-DeepSeek%20AI-7289da?logo=discord&logoColor=white&color=7289da" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://github.com/deepseek-ai/DeepSeek-V2/blob/main/figures/qr.jpeg?raw=true" target="_blank" style="margin: 2px;">
    <img alt="Wechat" src="https://img.shields.io/badge/WeChat-DeepSeek%20AI-brightgreen?logo=wechat&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
  <a href="https://twitter.com/deepseek_ai" target="_blank" style="margin: 2px;">
    <img alt="Twitter Follow" src="https://img.shields.io/badge/Twitter-deepseek_ai-white?logo=x&logoColor=white" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>

<div align="center" style="line-height: 1;">
  <a href="LICENSE" style="margin: 2px;">
    <img alt="License" src="https://img.shields.io/badge/License-MIT-f5de53?&color=f5de53" style="display: inline-block; vertical-align: middle;"/>
  </a>
</div>
 

<p align="center">
  <a href="https://arxiv.org/pdf/2501.12948"><b>Paper Link</b>👁️</a>
</p>


## 1. Introduction

The DeepSeek R1 model has undergone a minor version upgrade, with the current version being DeepSeek-R1-0528. In the latest update, DeepSeek R1 has significantly improved its depth of reasoning and inference capabilities by leveraging increased computational resources and introducing algorithmic optimization mechanisms during post-training. The model has demonstrated outstanding performance across various benchmark evaluations, including mathematics, programming, and general logic. Its overall performance is now approaching that of leading models, such as O3 and Gemini 2.5 Pro.

<p align="center">
  <img width="80%" src="figures/benchmark.png">
</p>

Compared to the previous version, the upgraded model shows significant improvements in handling complex reasoning tasks. For instance, in the AIME 2025 test, the model’s accuracy has increased from 70% in the previous version to 87.5% in the current version. This advancement stems from enhanced thinking depth during the reasoning process: in the AIME test set, the previous model used an average of 12K tokens per question, whereas the new version averages 23K tokens per question.

Beyond its improved reasoning capabilities, this version also offers a reduced hallucination rate, enhanced support for function calling, and a better experience for vibe coding.

## 2. Evaluation Results

### DeepSeek-R1-0528
For all our models, the maximum generation length is set to 64K tokens. For benchmarks requiring sampling, we use a temperature of $0.6$, a top-p value of $0.95$, and generate 16 responses per query to estimate pass@1.
<div align="center">

| Category | Benchmark (Metric)                 | DeepSeek R1 | DeepSeek R1 0528 |
|----------|------------------------------------|-------------|------------------|
| General  |                                    |             |                  |
|          | MMLU-Redux (EM)                    | 92.9        | 93.4             |
|          | MMLU-Pro (EM)                      | 84.0        | 85.0             |
|          | GPQA-Diamond (Pass@1)              | 71.5        | 81.0             |
|          | SimpleQA (Correct)                 | 30.1        | 27.8             |
|          | FRAMES (Acc.)                      | 82.5        | 83.0             |
|          | Humanity's Last Exam (Pass@1)      | 8.5         | 17.7             |
| Code     |                                    |             |                  |
|          | LiveCodeBench (2408-2505) (Pass@1) | 63.5        | 73.3             |
|          | Codeforces-Div1 (Rating)           | 1530        | 1930             |
|          | SWE Verified (Resolved)            | 49.2        | 57.6             |
|          | Aider-Polyglot (Acc.)              | 53.3        | 71.6             |
| Math     |                                    |             |                  |
|          | AIME 2024 (Pass@1)                 | 79.8        | 91.4             |
|          | AIME 2025 (Pass@1)                 | 70.0        | 87.5             |
|          | HMMT 2025 (Pass@1)                 | 41.7        | 79.4             |
|          | CNMO 2024 (Pass@1)                 | 78.8        | 86.9             |
| Tools    |                                    |             |                  |
|          | BFCL_v3_MultiTurn (Acc)            | -           | 37.0             |
|          | Tau-Bench (Pass@1)                 | -           | 53.5 (Airline) / 63.9 (Retail) |

</div>
Note: We use the Agentless framework to evaluate model performance on SWE-Verified. We evaluate only text-only prompts in the HLE test set. GPT-4.1 plays the user role in the Tau-Bench evaluation.
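
The pass@1 figures are estimated from 16 sampled responses per query. A minimal sketch of that estimate, assuming the usual definition (per-query fraction of correct samples, averaged over queries); the correctness data here is made up for illustration and the function name is not taken from any evaluation harness.

```python
from typing import List


def pass_at_1(samples_correct: List[List[bool]]) -> float:
    """Estimate pass@1 as the mean, over queries, of the fraction of correct samples."""
    per_query = [sum(samples) / len(samples) for samples in samples_correct]
    return sum(per_query) / len(per_query)


if __name__ == "__main__":
    # Two illustrative queries with 16 samples each:
    # the first has 12 correct answers, the second has 16.
    results = [
        [True] * 12 + [False] * 4,
        [True] * 16,
    ]
    print(pass_at_1(results))  # (12/16 + 16/16) / 2 = 0.875
```

Averaging over several samples per query reduces the variance that a single sampled response at temperature 0.6 would introduce.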

## 5. License
This code repository is licensed under the [MIT License](LICENSE). The use of DeepSeek-R1 models is also subject to the [MIT License](LICENSE). The DeepSeek-R1 series (including Base and Chat) supports commercial use and distillation.

## 6. Citation
```
@misc{deepseekai2025deepseekr1incentivizingreasoningcapability,
      title={DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning}, 
      author={DeepSeek-AI},
      year={2025},
      eprint={2501.12948},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.12948}, 
}
```

## 7. Contact
If you have any questions, please raise an issue or contact us at [[email protected]]([email protected]).