Transformers
Safetensors
English
lbourdois committed
Commit 8dd4241 · verified · 1 Parent(s): ddbcf79

Improve language tag


Hi! As the model is multilingual, this is a PR to add languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +138 -126
README.md CHANGED
@@ -1,126 +1,138 @@
- ---
- library_name: transformers
- license: mit
- datasets:
- - eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-0.5B
- ---
-
- # Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
-
- This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed using **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs based on reward-driven signals.
-
- ---
-
- ## 🧠 Model Objective
-
- The goal of this fine-tuned model is to enhance short-form math reasoning tasks such as those found in the GSM8K benchmark. The model has been encouraged to produce outputs in a specific format:
-
- ```text
- <think>...reasoning steps...</think><answer>...final answer...</answer>
- ```
-
- ---
-
- ## 📚 Dataset
-
- Training was done on:
-
- - [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
- This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
-
- ---
-
- ## 🛠️ Training Methodology
-
- ### 🔄 LoRA Configuration
-
- The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
-
- ```python
- from peft import get_peft_model, LoraConfig, TaskType
-
- lora_config = LoraConfig(
-     task_type=TaskType.CAUSAL_LM,
-     r=8,
-     lora_alpha=16,
-     target_modules=["q_proj", "v_proj"],
-     lora_dropout=0.1,
-     bias="none",
- )
-
- model = get_peft_model(model, lora_config)
- ```
-
- ---
-
- ### 🎯 Reward Functions for GRPO
-
- We used two custom reward functions to guide generation during training:
-
- - `reward_len`: Encourages completions whose length (as measured by `len(completion)`) stays close to 50, penalizing outputs that are much longer or shorter.
-
- ```python
- def reward_len(completions, **kwargs):
-     return [-abs(50 - len(completion)) for completion in completions]
- ```
-
- - `reward_format`: Enforces formatting in the style `<think>...</think><answer>...</answer>`.
-
- ```python
- import re
-
- def reward_format(completions, **kwargs):
-     pattern = r"^<think>.*?</think><answer>.*?</answer>$"
-     return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
- ```
-
- ---
-
- ### ⚙️ GRPO Training Configuration
-
- ```python
- from trl import GRPOConfig
-
- training_args = GRPOConfig(
-     output_dir="GRPO",
-     learning_rate=2e-5,
-     per_device_train_batch_size=8,
-     gradient_accumulation_steps=1,
-     max_prompt_length=512,
-     max_completion_length=96,
-     num_generations=8,
-     optim="adamw_8bit",
-     num_train_epochs=1,
-     bf16=True,
-     report_to="none",
-     remove_unused_columns=False,
-     logging_steps=1,
- )
- ```
-
- - **Mixed precision training**: Enabled (`bf16=True`)
- - **Optimizer**: `adamw_8bit` for memory-efficient optimization
- - **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
- - **Reward-guided selection**: Candidates are scored using `reward_len` and `reward_format` before backpropagation.
-
- ---
-
- ## 💡 Use Cases
-
- - Educational tutoring systems for math reasoning
- - Automated math assistants and solvers
- - Research on structured reasoning and format-aware generation
-
- ---
-
- ## 🧪 Limitations
-
- This model is fine-tuned on a specialized subset of math problems and may not generalize to other reasoning tasks. It expects prompts aligned with math word problem structure and outputs in a strict reasoning/answer format.
-
- ---
-
- ## 📜 License
-
- This model is released under the **MIT License**.
+ ---
+ library_name: transformers
+ license: mit
+ datasets:
+ - eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-0.5B
+ ---
+
+ # Model Card for `Qwen2.5-0.5B` Fine-Tuned on Enhanced GSM8K
+
+ This model is a fine-tuned version of [`Qwen/Qwen2.5-0.5B`](https://huggingface.co/Qwen/Qwen2.5-0.5B) designed to solve math word problems with improved reasoning structure and response format. Fine-tuning was performed using **LoRA (Low-Rank Adaptation)** for parameter-efficient training and the **GRPO (Group Relative Policy Optimization)** algorithm to optimize outputs based on reward-driven signals.
+
+ ---
+
+ ## 🧠 Model Objective
+
+ The goal of this fine-tuned model is to enhance short-form math reasoning tasks such as those found in the GSM8K benchmark. The model has been encouraged to produce outputs in a specific format:
+
+ ```text
+ <think>...reasoning steps...</think><answer>...final answer...</answer>
+ ```
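+
+ For a quick end-to-end check of this format, here is a minimal inference sketch. It assumes the fine-tuned weights are available as a local LoRA adapter; `./grpo-gsm8k-adapter` is a placeholder path, not a published repo id, and the bfloat16 dtype is an assumption. If the adapter has been merged into the base model, the `PeftModel` step can be skipped.
+
+ ```python
+ import torch
+ from peft import PeftModel
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base_id = "Qwen/Qwen2.5-0.5B"
+ adapter_path = "./grpo-gsm8k-adapter"  # placeholder: point this at the fine-tuned LoRA weights
+
+ tokenizer = AutoTokenizer.from_pretrained(base_id)
+ model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
+ model = PeftModel.from_pretrained(model, adapter_path)  # attach the LoRA adapter
+ model.eval()
+
+ prompt = "Natalia sold clips to 48 of her friends in April, and then she sold half as many clips in May. How many clips did Natalia sell altogether?"
+ inputs = tokenizer(prompt, return_tensors="pt")
+ with torch.no_grad():
+     output_ids = model.generate(**inputs, max_new_tokens=96)
+
+ # The completion is expected to follow <think>...</think><answer>...</answer>.
+ print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
+ ```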
+
+ ---
+
+ ## 📚 Dataset
+
+ Training was done on:
+
+ - [`eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1`](https://huggingface.co/datasets/eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1)
+ This version enhances the original GSM8K dataset with structured reasoning chains and answer tokens for alignment and reward-based learning.
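+
+ To take a quick look at the data, the dataset can be loaded directly from the Hub with 🤗 Datasets. This is a small sketch; the split and column names are assumptions inferred from the dataset name (train8k/test1k), so verify them against the dataset viewer:
+
+ ```python
+ from datasets import load_dataset
+
+ ds = load_dataset("eagle0504/openai-gsm8k-enhanced-using-together-ai-deepseek-train8k-test1k-v1")
+
+ print(ds)              # lists the available splits and their columns
+ print(ds["train"][0])  # peek at one enhanced GSM8K example (assumes a "train" split)
+ ```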
+
+ ---
+
+ ## 🛠️ Training Methodology
+
+ ### 🔄 LoRA Configuration
+
+ The model was fine-tuned using LoRA on attention projection layers (`q_proj`, `v_proj`) for parameter-efficient adaptation:
+
+ ```python
+ from peft import get_peft_model, LoraConfig, TaskType
+
+ lora_config = LoraConfig(
+     task_type=TaskType.CAUSAL_LM,
+     r=8,
+     lora_alpha=16,
+     target_modules=["q_proj", "v_proj"],
+     lora_dropout=0.1,
+     bias="none",
+ )
+
+ model = get_peft_model(model, lora_config)
+ ```
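+
+ Note that the `model` passed to `get_peft_model` above has to be created first. A minimal sketch of that setup is shown below; loading in bfloat16 is an assumption, not something stated in the card:
+
+ ```python
+ import torch
+ from transformers import AutoModelForCausalLM, AutoTokenizer
+
+ base_id = "Qwen/Qwen2.5-0.5B"
+ tokenizer = AutoTokenizer.from_pretrained(base_id)
+ model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)
+
+ # After wrapping with get_peft_model(model, lora_config) as above, only the LoRA
+ # matrices are trainable; model.print_trainable_parameters() reports the fraction.
+ ```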
+
+ ---
+
+ ### 🎯 Reward Functions for GRPO
+
+ We used two custom reward functions to guide generation during training:
+
+ - `reward_len`: Encourages completions whose length (as measured by `len(completion)`) stays close to 50, penalizing outputs that are much longer or shorter.
+
+ ```python
+ def reward_len(completions, **kwargs):
+     return [-abs(50 - len(completion)) for completion in completions]
+ ```
+
+ - `reward_format`: Enforces formatting in the style `<think>...</think><answer>...</answer>`.
+
+ ```python
+ import re
+
+ def reward_format(completions, **kwargs):
+     pattern = r"^<think>.*?</think><answer>.*?</answer>$"
+     return [1.0 if re.match(pattern, c) else 0.0 for c in completions]
+ ```
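+
+ As a sanity check, both reward functions can be called directly on hand-written completions; this snippet is purely illustrative and not part of the training code:
+
+ ```python
+ samples = [
+     "<think>48 + 24 = 72</think><answer>72</answer>",  # well-formatted, length 46
+     "The answer is 72.",                               # no <think>/<answer> tags, length 17
+ ]
+
+ print(reward_len(samples))     # [-4, -33]: lengths closer to 50 score higher (less negative)
+ print(reward_format(samples))  # [1.0, 0.0]: only the tagged completion earns the format reward
+ ```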
+
+ ---
+
+ ### ⚙️ GRPO Training Configuration
+
+ ```python
+ from trl import GRPOConfig
+
+ training_args = GRPOConfig(
+     output_dir="GRPO",
+     learning_rate=2e-5,
+     per_device_train_batch_size=8,
+     gradient_accumulation_steps=1,
+     max_prompt_length=512,
+     max_completion_length=96,
+     num_generations=8,
+     optim="adamw_8bit",
+     num_train_epochs=1,
+     bf16=True,
+     report_to="none",
+     remove_unused_columns=False,
+     logging_steps=1,
+ )
+ ```
+
+ - **Mixed precision training**: Enabled (`bf16=True`)
+ - **Optimizer**: `adamw_8bit` for memory-efficient optimization
+ - **Sampling**: 8 candidate generations per prompt (`num_generations=8`)
+ - **Reward-guided selection**: Candidates are scored using `reward_len` and `reward_format` before backpropagation.
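+
+ Putting these pieces together, the sketch below shows how the configuration and reward functions could be wired into TRL's `GRPOTrainer`. The `dataset` variable is a placeholder for the prepared GSM8K prompts; the rest mirrors the snippets above:
+
+ ```python
+ from trl import GRPOTrainer
+
+ trainer = GRPOTrainer(
+     model=model,                               # the LoRA-wrapped Qwen2.5-0.5B from above
+     args=training_args,                        # the GRPOConfig defined above
+     reward_funcs=[reward_len, reward_format],  # each candidate is scored by both rewards
+     train_dataset=dataset,                     # placeholder: enhanced GSM8K training split
+ )
+ trainer.train()
+ ```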
+
+ ---
+
+ ## 💡 Use Cases
+
+ - Educational tutoring systems for math reasoning
+ - Automated math assistants and solvers
+ - Research on structured reasoning and format-aware generation
+
+ ---
+
+ ## 🧪 Limitations
+
+ This model is fine-tuned on a specialized subset of math problems and may not generalize to other reasoning tasks. It expects prompts aligned with math word problem structure and outputs in a strict reasoning/answer format.
+
+ ---
+
+ ## 📜 License
+
+ This model is released under the **MIT License**.