ai-forever committed · commit 02bc6fe (verified) · parent: 0b4ef08

Update README.md

Files changed (1): README.md (+298, −3)
---
license: mit
language:
- ru
base_model:
- t-tech/T-lite-it-1.0
pipeline_tag: text-generation
library_name: transformers
tags:
- pytorch
---
# pollux-judge-7b

![banner](images/logo_pollux_horiz_short_WHITEBG.png)

pollux-judge-7b is a 7-billion-parameter generative language model designed to evaluate the quality of other language models' responses in Russian.
Given an input instruction, a specific criterion, and its scoring rubrics, the model assesses answer quality, providing automated evaluation of LLM performance on Russian-language tasks.

## Model Details

### Model Description

pollux-judge-7b is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
At the heart of the project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies of both generative tasks and evaluation criteria, alongside quantitative and qualitative assessments of responses from top-tier LLMs.

Built upon [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0), pollux-judge-7b is a decoder-based 7-billion-parameter model trained in a sequence-to-sequence fashion.
The model predicts both a numerical score and a detailed textual rationale from the original instruction, the LLM's response, a specific evaluation criterion, its scoring rubrics, and a reference answer when available.

While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized on the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).

- **Model type:** decoder
- **Language(s) (NLP):** Russian
- **License:** MIT
- **Finetuned from model:** [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0)

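Conceptually, a single evaluation request bundles the inputs named in the description above: instruction, response, criterion, rubrics, and an optional reference answer. The record below is a hypothetical sketch; its field names mirror the placeholders of the prompt template shown in the quick-start section, not an official API of the model.

```python
from dataclasses import dataclass

# Hypothetical container for one evaluation request. Field names follow the
# placeholders of the quick-start prompt template; this is not a model API.
@dataclass
class JudgeRequest:
    instruction: str            # original task given to the evaluated LLM
    answer: str                 # the LLM response to be scored
    criteria_name: str          # a single evaluation criterion
    criteria_rubrics: str       # scoring rubrics for that criterion
    reference_answer: str = ""  # optional reference (gold) answer
```
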
### Model Sources

- **Repository:** [POLLUX code base](https://github.com/ai-forever/POLLUX)
- **Paper:** [arXiv preprint](https://arxiv.org/pdf/2505.24616)

## Uses

### Direct Use

pollux-judge-7b is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.

### Out-of-Scope Use

While the model may **technically** process multiple criteria simultaneously, such usage falls outside its intended design and may yield unpredictable results.
Similarly, the model is not designed to determine appropriate evaluation criteria autonomously; it requires an explicitly specified criterion to perform reliable assessments.

For optimal performance and reliable results, users should structure each evaluation around one criterion at a time, providing all necessary contextual components so the model can produce its score and rationale, as in the sketch below.

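The sketch below illustrates this one-criterion-per-call pattern by looping over criteria and issuing a separate model call for each. The `evaluate_response` helper and its `generate_fn` argument are assumptions introduced for illustration; the actual prompt template and generation code appear in the quick-start section below.

```python
# Sketch of the intended usage pattern: one criterion per model call.
# `generate_fn` is any callable mapping a prompt string to the model's output
# text (e.g. the quick-start generation code wrapped in a function);
# `prompt_template` is the PROMPT_TEMPLATE string from the quick-start.
# Both names are illustrative assumptions, not an official API.
def evaluate_response(generate_fn, prompt_template, instruction, answer,
                      criteria, reference_answer=""):
    """Score `answer` once per (criterion_name, criterion_rubrics) pair."""
    results = {}
    for criterion_name, criterion_rubrics in criteria:
        prompt = prompt_template.format(
            instruction=instruction,
            reference_answer=reference_answer,
            answer=answer,
            criteria_name=criterion_name,
            criteria_rubrics=criterion_rubrics,
        )
        # Separate generation per criterion; batching several criteria into a
        # single prompt is outside the model's intended use.
        results[criterion_name] = generate_fn(prompt)
    return results
```
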
## MODEL OUTPUT DISCLAIMER AND LIMITATION OF LIABILITY

All content, responses, and outputs generated by pollux-judge-7b (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
Generated content should not be interpreted as official statements, advice, or guidance from the Developers.

While the Developers employed appropriate data curation practices during fine-tuning and avoided the intentional inclusion of inappropriate content, the Model's responses may reflect patterns present in the underlying pre-training datasets, which were sourced from publicly available internet content and other large-scale text corpora.

The Developers expressly disclaim responsibility for any content generated by the Model. Users acknowledge that:
- Generated outputs are probabilistic and may contain inaccuracies, biases, or inappropriate content
- The Developers cannot guarantee the accuracy, completeness, or appropriateness of any Model output
- Users assume full responsibility for evaluating and using Model-generated content

Users are solely responsible for reviewing, validating, and determining the appropriateness of any Model-generated content before use or distribution.

## How to Get Started with the Model

Use the code below to get started with the model. The prompt template and example are in Russian, matching the model's target language; English glosses are provided in the comments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.manual_seed(42)

# Prompt layout the model was trained on. The Russian section headers read:
# "Task to evaluate", "Reference answer", "Answer to evaluate",
# "Evaluation criterion", "Scoring rubrics for the criterion".
PROMPT_TEMPLATE = '''instruction: |
### Задание для оценки:
{instruction}

reference_answer: |
### Эталонный ответ:
{reference_answer}

response: |
### Ответ для оценки:
{answer}

score_name: |
### Критерий оценки:
{criteria_name}

score_rubrics: |
### Шкала оценивания по критерию:
{criteria_rubrics}
'''

instruction = 'Сколько будет 2+2?'     # "What is 2+2?"
reference_answer = ''                  # no reference answer in this example
answer = 'Будет 4'                     # "It is 4"
criteria_name = 'Правильность ответа'  # "Correctness of the answer"
# Rubrics: 0 - wrong or missing answer; 1 - incomplete answer;
# 2 - answer matches or is equivalent to the reference.
criteria_rubrics = '''0: Дан неправильный ответ или ответ отсутствует.

1: Ответ модели неполный (не на все вопросы задания получен ответ, в формулировке ответа отсутствует часть информации).

2: Ответ модели совпадает с эталонным или эквивалентен ему.'''

prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                reference_answer=reference_answer,
                                answer=answer,
                                criteria_name=criteria_name,
                                criteria_rubrics=criteria_rubrics)

MODEL_PATH = "ai-forever/pollux-judge-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",
    device_map="auto"
)

# Wrap the prompt in the chat template and generate the judgement.
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
# Strip the prompt tokens, keeping only the newly generated continuation.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```

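The model returns a textual rationale together with a numerical score on the rubric scale. This card does not document the exact layout of the output, so the parser below is only a best-effort sketch: it assumes the verdict is the last standalone integer in the response, which holds for rationale-then-score layouts but should be verified against real outputs.

```python
import re

def extract_score(response):
    """Best-effort extraction of the numerical score from the judge's output.

    Assumes the verdict is the last standalone integer in the text (typical
    for rationale-then-score layouts); returns None if nothing is found.
    """
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None

# Example with a hypothetical output shape:
# extract_score("Ответ совпадает с эталонным. Оценка: 2")  # -> 2
```
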
## Training Details

### Training Data

Synthetic data was employed for training, because (i) manually composing a training set of at least the same size as the POLLUX dataset would require a comparable amount of time and expert labor, and (ii) reusing the same panels of experts could lead to data leakage.

### Training Procedure

#### Preprocessing

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

[More Information Needed]

## Model Examination

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

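As a rough illustration of what the calculator estimates, emissions scale with hardware power draw, runtime, datacenter overhead, and the grid's carbon intensity. Every number below is a placeholder assumption for demonstration, not a measurement reported for this model.

```python
# Back-of-the-envelope CO2 estimate in the spirit of Lacoste et al. (2019).
# All values are placeholder assumptions, not figures for pollux-judge-7b.
gpu_power_kw = 0.4      # assumed average draw per accelerator, kW
num_gpus = 8            # assumed device count
hours = 100.0           # assumed total training time, h
pue = 1.1               # assumed datacenter power usage effectiveness
grid_intensity = 0.4    # assumed carbon intensity, kgCO2eq per kWh

energy_kwh = gpu_power_kw * num_gpus * hours * pue
emissions_kg = energy_kwh * grid_intensity
print(f"{energy_kwh:.0f} kWh -> {emissions_kg:.0f} kgCO2eq")
```
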
## Technical Specifications

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

269
+
270
+ ## Citation [optional]
271
+
272
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
273
+
274
+ **BibTeX:**
275
+
276
+ [More Information Needed]
277
+
278
+ **APA:**
279
+
280
+ [More Information Needed]
281
+
282
+ ## Glossary [optional]
283
+
284
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
285
+
286
+ [More Information Needed]
287
+
288
+ ## More Information [optional]
289
+
290
+ [More Information Needed]
291
+
292
+ ## Model Card Authors [optional]
293
+
294
+ [More Information Needed]
295
+
296
+ ## Model Card Contact
297
+
298
+ [More Information Needed]