---
license: apache-2.0
---

# DistilQwen2.5-R1 Series: Advanced Reasoning Models

## 📖 Introduction

As large language models (LLMs) evolve toward deep reasoning capabilities, deploying them in resource-constrained environments (e.g., mobile devices, edge computing) remains challenging. The DistilQwen2.5-R1 series addresses this by transferring reasoning capabilities from ultra-large models (e.g., DeepSeek-R1) to compact models through innovative distillation techniques, achieving high performance while reducing computational costs.

## Key Innovations

### 1. Cognitive Trajectory Adaptation Framework
- **Challenge**: Reasoning paths differ between large and small models; small models struggle to follow the high-level problem-solving logic of their larger teachers.
- **Solutions** (sketched in the code after this list):
  - **Phase 1: CoT Data Optimization**
    - Grade the difficulty of large-model reasoning chains (simple/medium/hard) via LLM-as-a-Judge
    - Adaptively adjust chains: expand simple ones and simplify complex ones, yielding medium-difficulty datasets digestible by small models
  - **Phase 2: Preference Optimization**
    - Generate contrastive data pairs containing correct and incorrect reasoning paths
    - Apply the DPO algorithm with tailored configurations to enhance discrimination between reasoning paths
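
To make Phase 1 concrete, here is a minimal sketch of LLM-as-a-Judge difficulty grading and adaptive chain rewriting. The prompts, labels, and the `call_llm` helper are illustrative assumptions, not the released pipeline.

```python
# Sketch of Phase 1 (CoT data optimization). Prompts and the `call_llm`
# helper (str -> str) are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CoTExample:
    question: str
    chain: str            # teacher model's chain-of-thought
    difficulty: str = ""  # "simple" | "medium" | "hard"

JUDGE_PROMPT = (
    "Rate how hard the following reasoning chain is for a small student "
    "model to follow. Answer with one word: simple, medium, or hard.\n\n"
    "Question: {question}\n\nReasoning: {chain}\n\nDifficulty:"
)

# Expand simple chains with intermediate steps; compress hard ones.
REWRITE_PROMPTS = {
    "simple": "Expand this reasoning with explicit intermediate steps:\n{chain}",
    "hard": "Simplify this reasoning, keeping only the essential steps:\n{chain}",
}

def judge_difficulty(ex: CoTExample, call_llm: Callable[[str], str]) -> str:
    """LLM-as-a-Judge: ask a strong model to grade the chain."""
    label = call_llm(JUDGE_PROMPT.format(question=ex.question, chain=ex.chain))
    label = label.strip().lower()
    return label if label in {"simple", "medium", "hard"} else "medium"

def adapt_chain(ex: CoTExample, call_llm: Callable[[str], str]) -> CoTExample:
    """Rewrite the chain toward medium difficulty; keep medium chains as-is."""
    ex.difficulty = judge_difficulty(ex, call_llm)
    prompt = REWRITE_PROMPTS.get(ex.difficulty)
    if prompt is not None:
        ex.chain = call_llm(prompt.format(chain=ex.chain))
        ex.difficulty = "medium"
    return ex
```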
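
For Phase 2, contrastive pairs like these can be fed to an off-the-shelf DPO implementation. Below is a hedged sketch using TRL's `DPOTrainer`; the example pairs and hyperparameters are assumptions, and `model`/`tokenizer` are loaded as in the Quick Start below.

```python
# Sketch of Phase 2 (preference optimization) with TRL's DPOTrainer.
# The example pairs and hyperparameters are illustrative, not the
# released training configuration.
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Contrastive pairs: a verified-correct reasoning path as "chosen",
# an incorrect path (e.g., sampled from the student) as "rejected".
pairs = Dataset.from_list([
    {
        "prompt": "If 3x + 5 = 20, what is x?",
        "chosen": "3x = 20 - 5 = 15, so x = 15 / 3 = 5. Answer: 5",
        "rejected": "3x = 20 + 5 = 25, so x is about 8. Answer: 8",
    },
])

config = DPOConfig(
    output_dir="distilqwen-dpo",
    beta=0.1,  # assumed KL penalty strength; tune for your setup
    per_device_train_batch_size=1,
)
trainer = DPOTrainer(
    model=model,                 # student model, loaded as in the Quick Start
    args=config,
    train_dataset=pairs,
    processing_class=tokenizer,  # `tokenizer=` in older TRL versions
)
trainer.train()
```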

### 2. Performance Highlights
- **DistilQwen2.5-R1-7B** outperforms comparably sized distilled models (e.g., OpenThinker-7B) across multiple benchmarks
- Successfully transfers high-order reasoning patterns that previously depended on large-model parameter scales

## Technical Advantages
- Dynamic data optimization eliminates cognitive-trajectory discrepancies between teacher and student
- Two-stage training balances reasoning accuracy with computational efficiency
- Enables complex reasoning tasks in edge computing environments

## 🚀 Quick Start

The following snippet shows how to load the tokenizer and model and generate content using `apply_chat_template`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"  # the device to run generation on

# Load the model and tokenizer from the Hugging Face Hub
model = AutoModelForCausalLM.from_pretrained(
    "alibaba-pai/DistilQwen2.5-R1-32B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("alibaba-pai/DistilQwen2.5-R1-32B")

prompt = "Give me a short introduction to large language models."
messages = [
    {"role": "user", "content": prompt},
]

# Format the conversation with the model's chat template
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=2048,
)
# Strip the prompt tokens, keeping only the newly generated ones
generated_ids = [
    output_ids[len(input_ids):]
    for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
```

## 🔍 Evaluation

We compared the DistilQwen2.5-R1 series with leading reasoning models across four benchmarks:

### 7B Model Comparison

| Model                        | Training Data Size | AIME2024 | MATH-500 | GPQA Diamond | LiveCodeBench V2 |
|------------------------------|--------------------|----------|----------|--------------|------------------|
| DeepSeek-R1-Distill-Qwen-7B  | 800k               | 55.5     | 92.8     | 49.1         | -                |
| Bespoke-Stratos-7B           | 17k                | 20.0     | 82.0     | 37.8         | 36.1             |
| OpenThinker-7B               | 114k               | 31.3     | 83.0     | 42.4         | 39.9             |
| **DistilQwen2.5-R1-7B**      | 105k               | 43.33    | 88.4     | 42.93        | 46.38            |

### 32B Model Comparison

| Model                        | Training Data Size | AIME2024 | MATH-500 | GPQA Diamond | LiveCodeBench V2 |
|------------------------------|--------------------|----------|----------|--------------|------------------|
| DeepSeek-R1-Distill-Qwen-32B | 800k               | 72.6     | 94.3     | 62.1         | -                |
| Sky-T1-32B-Preview           | 17k                | 43.3     | 86.4     | 56.8         | -                |
| OpenThinker-32B              | 114k               | 66.0     | 90.6     | 61.6         | 68.9             |
| **DistilQwen2.5-R1-32B**     | 105k               | 70.0     | 93.8     | 62.12        | 65.95            |

Key highlights:
- The DistilQwen2.5-R1 models stay competitive with the DeepSeek-R1-Distill-Qwen series while using roughly **7.6× less training data** (105k vs. 800k examples)
- They maintain an open-source training lineage by using filtered OpenThoughts subsets
- DistilQwen2.5-R1-7B posts the best LiveCodeBench V2 score among the openly trained 7B models above