tclf90 committed aa85e6a (verified) · 1 parent: 6ddb807

Update README.md

Files changed (1):
  1. README.md +15 -3
README.md CHANGED
@@ -15,11 +15,23 @@ base_model_relation: quantized
# DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact
Base model: [deepseek-ai/DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528)

- This repository contains a mixed-precision (Int4 + selective Int8) GPTQ version of DeepSeek-R1-0528 for vLLM. We began with a standard 4-bit (AWQ/GPTQ) conversion that follows vLLM’s default quantization layout, but early tests showed that a fully Int4 model could not meet the compute demands of this checkpoint and could produce unstable outputs.

- Guided by this preliminary analysis, we introduced targeted, per-layer Int8 refinement: only the layers most sensitive to quantization are stored in Int8 (the Compact variant has more Int8 layers), while the rest remain Int4. This keeps the file-size increase minimal compared with the pure 4-bit baseline while restoring response quality.

- Currently, `vllm==0.9.0` does not support per-layer quantization settings for the MoE module. I've provided a basic implementation by adding the `get_moe_quant_method` function within the `gptq_marlin.py` file. Before the PR is merged, please replace the corresponding file with the attached one.
+ This repository delivers an Int4 + selective Int8 GPTQ build of `DeepSeek-R1-0528`: only the layers that are highly sensitive to quantization are stored in Int8, while the rest stay Int4, preserving generation quality with minimal file-size overhead.

+ Preliminary trials show that converting the entire model to pure Int4 (AWQ/GPTQ) under the quantization layout used in vLLM’s current DeepSeek-R1 implementation degrades inference accuracy and can produce faulty outputs. Layer-wise fine-grained quantization substantially mitigates this issue.

+ Temporary patch:
+ vLLM == 0.9.0 does not yet natively support per-layer quantization for MoE modules.
+ We added `get_moe_quant_method` to `gptq_marlin.py` as an interim fix.
+ Until the upstream PR is merged, please replace the original file with the one provided in this repo.
+
+ ### 【Variant Overview】
+
+ | Variant | Characteristics | File Size | Recommended Scenario |
+ |-------------|-------------------------------------------------------------------------|-----------|----------------------------------------------------------|
+ | **Compact** | More Int8 layers, higher fidelity | 414 GB | Ample GPU memory & strict quality needs (e.g., 8 × A100) |
+ | **Lite** | Only the most critical layers upgraded to Int8; size close to pure Int4 | 355 GB | Resource-constrained, lightweight server deployments |
+
+ Choose the variant that best matches your hardware and quality requirements.

### 【Model Update Date】
```
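
The updated README describes a mixed-precision scheme in which most layers stay Int4 and only quantization-sensitive layers are upgraded to Int8. The sketch below illustrates how such a per-layer bit-width map could be built from name patterns; the pattern list and the `bits_for_layer` helper are hypothetical illustrations, not the layout actually used in this repo.

```python
import fnmatch

# Hypothetical example: glob patterns for layers assumed to be
# quantization-sensitive (NOT the actual list used by this repo).
INT8_PATTERNS = [
    "model.layers.0.*",        # early layers
    "*.mlp.gate*",             # MoE router / gating projections
    "*.self_attn.kv_a_proj*",  # low-rank KV projections
]

def bits_for_layer(name: str, default_bits: int = 4) -> int:
    """Return 8 for layers matching a sensitive pattern, else the Int4 default."""
    if any(fnmatch.fnmatch(name, pat) for pat in INT8_PATTERNS):
        return 8
    return default_bits

if __name__ == "__main__":
    for layer in [
        "model.layers.0.self_attn.q_proj",
        "model.layers.10.mlp.gate",
        "model.layers.42.mlp.experts.7.down_proj",
    ]:
        print(f"{layer}: int{bits_for_layer(layer)}")
```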
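The interim patch adds a `get_moe_quant_method` function inside vLLM's `gptq_marlin.py`; the patched file shipped in this repo is the authoritative version. The sketch below only illustrates the per-layer dispatch idea with made-up config fields and no real vLLM classes, assuming a prefix-based override table.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class QuantOverrides:
    """Hypothetical stand-in for per-layer settings in a quantization
    config (field names are illustrative, not vLLM's actual schema)."""
    default_bits: int = 4
    per_layer_bits: Dict[str, int] = field(default_factory=dict)

def get_moe_quant_method(overrides: QuantOverrides, layer_prefix: str) -> str:
    """Choose a quantization method label for one MoE layer.

    The actual patch in gptq_marlin.py presumably returns a quant-method
    object; a string label is returned here so only the per-layer
    dispatch logic is shown.
    """
    bits = overrides.default_bits
    # Longest matching prefix wins, so fine-grained overrides beat coarse ones.
    for prefix, b in sorted(overrides.per_layer_bits.items(),
                            key=lambda kv: len(kv[0]), reverse=True):
        if layer_prefix.startswith(prefix):
            bits = b
            break
    return f"gptq_marlin_moe_int{bits}"

if __name__ == "__main__":
    cfg = QuantOverrides(per_layer_bits={"model.layers.3.mlp.experts": 8})
    print(get_moe_quant_method(cfg, "model.layers.3.mlp.experts.0.down_proj"))   # ...int8
    print(get_moe_quant_method(cfg, "model.layers.40.mlp.experts.0.down_proj"))  # ...int4
```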
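The variant table recommends the Compact build for nodes with ample GPU memory, e.g. 8 × A100. Assuming the patched `gptq_marlin.py` is already installed, loading the model with vLLM's offline `LLM` API might look like the sketch below; the local model path, context length, and sampling settings are placeholders.

```python
from vllm import LLM, SamplingParams

# Placeholder path to a locally downloaded copy of the Compact variant.
MODEL_PATH = "/models/DeepSeek-R1-0528-GPTQ-Int4-Int8Mix-Compact"

llm = LLM(
    model=MODEL_PATH,
    tensor_parallel_size=8,   # e.g. 8 x A100, as suggested for the Compact variant
    trust_remote_code=True,
    max_model_len=8192,       # conservative placeholder; raise if memory allows
)

sampling = SamplingParams(temperature=0.6, max_tokens=512)
outputs = llm.generate(
    ["Explain mixed Int4/Int8 GPTQ quantization in one paragraph."],
    sampling,
)
print(outputs[0].outputs[0].text)
```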