Safetensors
qwen2
lbourdois committed (verified)
Commit faa31f9 · 1 Parent(s): d2e8290

Improve language tag


Hi! As the model is multilingual, this PR adds languages other than English to the language tag to improve referencing. Note that 29 languages are announced in the README, but only 13 are explicitly listed, so I was only able to add those 13 languages.

Files changed (1)
  1. README.md +109 -97
README.md CHANGED
@@ -1,98 +1,110 @@
- ---
- license: apache-2.0
- datasets:
- - virtuoussy/Math-RLVR
- - virtuoussy/Multi-subject-RLVR
- language:
- - en
- base_model:
- - Qwen/Qwen2.5-7B-Instruct
- ---
-
- ## Model Details
-
- The generative reward model used in the paper "Expanding RL with Verifiable Rewards Across Diverse Domains".
-
- Given a question, a reference label, and a response to be evaluated, the model judges whether the response is correct.
-
- ## **Quick start**
-
- > ```python
- > # Load model directly
- > from transformers import AutoTokenizer, AutoModelForCausalLM
- >
- > tokenizer = AutoTokenizer.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
- > model = AutoModelForCausalLM.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
- >
- > PROMPT = '''
- > Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
- > The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
- > **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**
- >
- > Your task:
- > - Compare the final output of the solution process with the reference answer.
- > - If they **match exactly**, output **YES**.
- > - If they **do not match**, output **NO**.
- > - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.
- >
- > Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.
- >
- > ---
- >
- > **Question:**
- > {question}
- >
- > **Solution Process (Final Step Only):**
- > {response}
- >
- > **Reference Answer:**
- > {reference}
- >
- > **Output:**
- > '''
- >
- > question = "The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (  )."
- > label = "Chen Heqin"
- > answer = "heqin chen"
- >
- > # Fill the judging prompt and wrap it in the chat template.
- > prompt_question = PROMPT.format(question=question, reference=label, response=answer)
- > messages = [
- >     {"role": "system", "content": "You are a helpful assistant."},
- >     {"role": "user", "content": prompt_question},
- > ]
- > # add_generation_prompt=True appends the assistant turn header so generation starts cleanly.
- > input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
- >
- > # Greedy decoding; decode only the newly generated tokens.
- > output = model.generate(input_ids, do_sample=False)
- > judgement = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
- > print("Model judgement: ", judgement)
- > ```
-
- ## Use as a remote reward
-
- ```bash
- # Launch a remote reward server
- bash launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
-
- # MODEL_PATH: the path of our generative reward model.
- # ANSWER_PATH: the path of the training data.
- # METRIC: greedy or prob
- # This will launch a reward server at http://127.0.0.1:8000/get_reward
-
- # Train against the remote reward
- bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
-
- # Both train.sh and launch_reward.sh can be found in the model directory.
- # We will release our GitHub repo soon!
- ```
-
-
- ## Citation
-
- ```bibtex
- @article{su2025expanding,
-   title={Expanding RL with Verifiable Rewards Across Diverse Domains},
-   author={Su, Yi and Yu, Dian and Song, Linfeng and Li, Juntao and Mi, Haitao and Tu, Zhaopeng and Zhang, Min and Yu, Dong},
-   journal={arXiv preprint arXiv:2503.23829},
-   year={2025}
- }
- ```
 
+ ---
+ license: apache-2.0
+ datasets:
+ - virtuoussy/Math-RLVR
+ - virtuoussy/Multi-subject-RLVR
+ language:
+ - zho
+ - eng
+ - fra
+ - spa
+ - por
+ - deu
+ - ita
+ - rus
+ - jpn
+ - kor
+ - vie
+ - tha
+ - ara
+ base_model:
+ - Qwen/Qwen2.5-7B-Instruct
+ ---
+
+ ## Model Details
+
+ The generative reward model used in the paper "Expanding RL with Verifiable Rewards Across Diverse Domains".
+
+ Given a question, a reference label, and a response to be evaluated, the model judges whether the response is correct.
+
+ ## **Quick start**
+
+ > ```python
+ > # Load model directly
+ > from transformers import AutoTokenizer, AutoModelForCausalLM
+ >
+ > tokenizer = AutoTokenizer.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
+ > model = AutoModelForCausalLM.from_pretrained("virtuoussy/Qwen2.5-7B-Instruct-RLVR")
+ >
+ > PROMPT = '''
+ > Given a problem, determine whether the final answer in the provided (incomplete) solution process matches the reference answer.
+ > The reference answer may be one single option character (e.g., A, B, C, D), a numerical value, an expression, or a list of answers if multiple questions are involved.
+ > **The reference answer may be in Chinese or another language, but your evaluation should be language-agnostic.**
+ >
+ > Your task:
+ > - Compare the final output of the solution process with the reference answer.
+ > - If they **match exactly**, output **YES**.
+ > - If they **do not match**, output **NO**.
+ > - If the solution process is unclear, incomplete, or ambiguous, assume it is incorrect and output **NO**.
+ >
+ > Your output must be strictly **'YES'** or **'NO'**, with no additional words, punctuation, or explanation.
+ >
+ > ---
+ >
+ > **Question:**
+ > {question}
+ >
+ > **Solution Process (Final Step Only):**
+ > {response}
+ >
+ > **Reference Answer:**
+ > {reference}
+ >
+ > **Output:**
+ > '''
+ >
+ > question = "The founder of China's first public kindergarten teacher training school - Jiangxi Experimental Kindergarten Teacher School is (  )."
+ > label = "Chen Heqin"
+ > answer = "heqin chen"
+ >
+ > # Fill the judging prompt and wrap it in the chat template.
+ > prompt_question = PROMPT.format(question=question, reference=label, response=answer)
+ > messages = [
+ >     {"role": "system", "content": "You are a helpful assistant."},
+ >     {"role": "user", "content": prompt_question},
+ > ]
+ > # add_generation_prompt=True appends the assistant turn header so generation starts cleanly.
+ > input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
+ >
+ > # Greedy decoding; decode only the newly generated tokens.
+ > output = model.generate(input_ids, do_sample=False)
+ > judgement = tokenizer.decode(output[0][input_ids.shape[1]:], skip_special_tokens=True)
+ > print("Model judgement: ", judgement)
+ > ```
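
For this example, greedy decoding is expected to print `YES`, since "heqin chen" matches the reference "Chen Heqin" up to word order and casing. When a scalar score is more convenient than a hard YES/NO string, one option is to read the probability the model assigns to a YES judgement. The sketch below does this by reusing `model`, `tokenizer`, and `input_ids` from the quick start; note that equating this P(YES) score with the `prob` metric mentioned below is an assumption, not something the model card states.

```python
import torch

# Sketch: score the judgement as P(YES) instead of decoding a hard YES/NO.
# Assumes `model`, `tokenizer`, and `input_ids` from the quick start above.
with torch.no_grad():
    next_token_logits = model(input_ids).logits[0, -1]  # logits for the first generated token

# First token of "YES" / "NO" in the model's vocabulary.
yes_id = tokenizer("YES", add_special_tokens=False).input_ids[0]
no_id = tokenizer("NO", add_special_tokens=False).input_ids[0]

# Normalize over the two candidate judgements only.
probs = torch.softmax(next_token_logits[[yes_id, no_id]], dim=-1)
print("P(YES):", probs[0].item())
```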
+
+ ## Use as a remote reward
+
+ ```bash
+ # Launch a remote reward server
+ bash launch_reward.sh {MODEL_PATH} {ANSWER_PATH} {METRIC}
+
+ # MODEL_PATH: the path of our generative reward model.
+ # ANSWER_PATH: the path of the training data.
+ # METRIC: greedy or prob
+ # This will launch a reward server at http://127.0.0.1:8000/get_reward
+
+ # Train against the remote reward
+ bash train.sh {METHOD} {PRETRAIN_PATH} {DATA_PATH} {REWARD_API}
+
+ # Both train.sh and launch_reward.sh can be found in the model directory.
+ # We will release our GitHub repo soon!
+ ```
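
Once `launch_reward.sh` is running, a training job talks to the reward model over HTTP at the endpoint above. The sketch below is only an illustration of such a client call: the `REWARD_API` URL is taken from the comment above, but the payload and response fields (`question`, `response`, `rewards`) are hypothetical placeholders, not a documented schema, so the released `train.sh`/`launch_reward.sh` scripts remain the reference for the actual interface.

```python
import requests

# Hypothetical client for the reward server started by launch_reward.sh.
# Field names below are illustrative assumptions, not a documented API.
REWARD_API = "http://127.0.0.1:8000/get_reward"

payload = {
    "question": ["The founder of China's first public kindergarten teacher training school - "
                 "Jiangxi Experimental Kindergarten Teacher School is (  )."],
    "response": ["heqin chen"],  # model outputs to be scored
}

resp = requests.post(REWARD_API, json=payload, timeout=60)
resp.raise_for_status()
print(resp.json())  # e.g. {"rewards": [1.0]} for METRIC=greedy, or a P(YES)-style score for METRIC=prob
```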
+
+
+ ## Citation
+
+ ```bibtex
+ @article{su2025expanding,
+   title={Expanding RL with Verifiable Rewards Across Diverse Domains},
+   author={Su, Yi and Yu, Dian and Song, Linfeng and Li, Juntao and Mi, Haitao and Tu, Zhaopeng and Zhang, Min and Yu, Dong},
+   journal={arXiv preprint arXiv:2503.23829},
+   year={2025}
+ }
+ ```