ai-forever committed · commit 02bc6fe (verified) · parent: 0b4ef08

Update README.md

Files changed (1): README.md (+298, −3)
---
license: mit
language:
- ru
base_model:
- t-tech/T-lite-it-1.0
pipeline_tag: text-generation
library_name: transformers
tags:
- pytorch
---
# pollux-judge-7b

![banner](images/logo_pollux_horiz_short_WHITEBG.png)

pollux-judge-7b is a 7-billion-parameter generative language model designed to evaluate the quality of other language models' responses in Russian.
Given an input instruction, a specific criterion, and its scoring rubrics, the model assesses answer quality, providing automated evaluation of LLM performance on Russian-language tasks.

## Model Details

### Model Description

pollux-judge-7b is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs).
At the heart of the project lies the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX), which introduces systematic taxonomies of both generative tasks and evaluation criteria, alongside quantitative and qualitative assessments of responses from top-tier LLMs.

Built upon [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0), pollux-judge-7b is a decoder-based 7-billion-parameter model trained in a sequence-to-sequence fashion.
The model predicts both a numerical score and a detailed textual rationale from the original instruction, the LLM's response, a specific evaluation criterion, its scoring rubrics, and a reference answer when available.

While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized on the generative tasks and evaluation criteria derived from the taxonomies established within the [POLLUX dataset](https://huggingface.co/datasets/ai-forever/POLLUX).

- **Model type:** decoder
- **Language(s) (NLP):** Russian
- **License:** MIT
- **Finetuned from model:** [t-tech/T-lite-it-1.0](https://huggingface.co/t-tech/T-lite-it-1.0)

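Conceptually, a single evaluation request bundles the inputs named in the description above: instruction, response, criterion, rubrics, and an optional reference answer. The record below is a hypothetical sketch; its field names mirror the placeholders of the prompt template shown in the quick-start section, not an official API of the model.

```python
from dataclasses import dataclass

# Hypothetical container for one evaluation request. Field names follow the
# placeholders of the quick-start prompt template; this is not a model API.
@dataclass
class JudgeRequest:
    instruction: str            # original task given to the evaluated LLM
    answer: str                 # the LLM response to be scored
    criteria_name: str          # a single evaluation criterion
    criteria_rubrics: str       # scoring rubrics for that criterion
    reference_answer: str = ""  # optional reference (gold) answer
```
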
### Model Sources

- **Repository:** [POLLUX code base](https://github.com/ai-forever/POLLUX)
- **Paper:** [arXiv preprint](https://arxiv.org/pdf/2505.24616)

## Uses

### Direct Use

pollux-judge-7b is specifically designed for assessing text responses against a single, predefined criterion per evaluation run.
The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.

### Out-of-Scope Use

While the model may **technically** process multiple criteria simultaneously, such usage falls outside its intended design and may yield unpredictable results.
Similarly, the model is not designed to determine appropriate evaluation criteria autonomously; it requires an explicitly specified criterion to perform reliable assessments.

For optimal performance and reliable results, users should structure each evaluation around one criterion at a time, providing all necessary contextual components so the model can produce its score and rationale, as in the sketch below.

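The sketch below illustrates this one-criterion-per-call pattern by looping over criteria and issuing a separate model call for each. The `evaluate_response` helper and its `generate_fn` argument are assumptions introduced for illustration; the actual prompt template and generation code appear in the quick-start section below.

```python
# Sketch of the intended usage pattern: one criterion per model call.
# `generate_fn` is any callable mapping a prompt string to the model's output
# text (e.g. the quick-start generation code wrapped in a function);
# `prompt_template` is the PROMPT_TEMPLATE string from the quick-start.
# Both names are illustrative assumptions, not an official API.
def evaluate_response(generate_fn, prompt_template, instruction, answer,
                      criteria, reference_answer=""):
    """Score `answer` once per (criterion_name, criterion_rubrics) pair."""
    results = {}
    for criterion_name, criterion_rubrics in criteria:
        prompt = prompt_template.format(
            instruction=instruction,
            reference_answer=reference_answer,
            answer=answer,
            criteria_name=criterion_name,
            criteria_rubrics=criterion_rubrics,
        )
        # Separate generation per criterion; batching several criteria into a
        # single prompt is outside the model's intended use.
        results[criterion_name] = generate_fn(prompt)
    return results
```
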
## MODEL OUTPUT DISCLAIMER AND LIMITATION OF LIABILITY

All content, responses, and outputs generated by pollux-judge-7b (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data.
Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers.
Generated content should not be interpreted as official statements, advice, or guidance from the Developers.

While the Developers employed appropriate data curation practices during fine-tuning and avoided the intentional inclusion of inappropriate content, the Model's responses may reflect patterns present in the underlying pre-training datasets, which were sourced from publicly available internet content and other large-scale text corpora.

The Developers expressly disclaim responsibility for any content generated by the Model. Users acknowledge that:
- Generated outputs are probabilistic and may contain inaccuracies, biases, or inappropriate content
- The Developers cannot guarantee the accuracy, completeness, or appropriateness of any Model output
- Users assume full responsibility for evaluating and using Model-generated content

Users are solely responsible for reviewing, validating, and determining the appropriateness of any Model-generated content before use or distribution.

## How to Get Started with the Model

Use the code below to get started with the model. The prompt template and example are in Russian, matching the model's target language; English glosses are provided in the comments.

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.manual_seed(42)

# Prompt layout the model was trained on. The Russian section headers read:
# "Task to evaluate", "Reference answer", "Answer to evaluate",
# "Evaluation criterion", "Scoring rubrics for the criterion".
PROMPT_TEMPLATE = '''instruction: |
### Задание для оценки:
{instruction}

reference_answer: |
### Эталонный ответ:
{reference_answer}

response: |
### Ответ для оценки:
{answer}

score_name: |
### Критерий оценки:
{criteria_name}

score_rubrics: |
### Шкала оценивания по критерию:
{criteria_rubrics}
'''

instruction = 'Сколько будет 2+2?'     # "What is 2+2?"
reference_answer = ''                  # no reference answer in this example
answer = 'Будет 4'                     # "It is 4"
criteria_name = 'Правильность ответа'  # "Correctness of the answer"
# Rubrics: 0 - wrong or missing answer; 1 - incomplete answer;
# 2 - answer matches or is equivalent to the reference.
criteria_rubrics = '''0: Дан неправильный ответ или ответ отсутствует.

1: Ответ модели неполный (не на все вопросы задания получен ответ, в формулировке ответа отсутствует часть информации).

2: Ответ модели совпадает с эталонным или эквивалентен ему.'''

prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                reference_answer=reference_answer,
                                answer=answer,
                                criteria_name=criteria_name,
                                criteria_rubrics=criteria_rubrics)

MODEL_PATH = "ai-forever/pollux-judge-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",
    device_map="auto"
)

# Wrap the prompt in the chat template and generate the judgement.
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
# Strip the prompt tokens, keeping only the newly generated continuation.
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
```

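The model returns a textual rationale together with a numerical score on the rubric scale. This card does not document the exact layout of the output, so the parser below is only a best-effort sketch: it assumes the verdict is the last standalone integer in the response, which holds for rationale-then-score layouts but should be verified against real outputs.

```python
import re

def extract_score(response):
    """Best-effort extraction of the numerical score from the judge's output.

    Assumes the verdict is the last standalone integer in the text (typical
    for rationale-then-score layouts); returns None if nothing is found.
    """
    matches = re.findall(r"-?\d+", response)
    return int(matches[-1]) if matches else None

# Example with a hypothetical output shape:
# extract_score("Ответ совпадает с эталонным. Оценка: 2")  # -> 2
```
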
## Training Details

### Training Data

Synthetic data was employed for training, because (i) manually composing a training set of at least the same size as the POLLUX dataset would require a comparable amount of time and expert labor, and (ii) reusing the same panels of experts could lead to data leakage.

### Training Procedure

#### Preprocessing

[More Information Needed]

#### Training Hyperparameters

- **Training regime:** [More Information Needed]

#### Speeds, Sizes, Times

[More Information Needed]

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

[More Information Needed]

#### Factors

[More Information Needed]

#### Metrics

[More Information Needed]

### Results

[More Information Needed]

#### Summary

[More Information Needed]

## Model Examination

[More Information Needed]

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** [More Information Needed]
- **Hours used:** [More Information Needed]
- **Cloud Provider:** [More Information Needed]
- **Compute Region:** [More Information Needed]
- **Carbon Emitted:** [More Information Needed]

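As a rough illustration of what the calculator estimates, emissions scale with hardware power draw, runtime, datacenter overhead, and the grid's carbon intensity. Every number below is a placeholder assumption for demonstration, not a measurement reported for this model.

```python
# Back-of-the-envelope CO2 estimate in the spirit of Lacoste et al. (2019).
# All values are placeholder assumptions, not figures for pollux-judge-7b.
gpu_power_kw = 0.4      # assumed average draw per accelerator, kW
num_gpus = 8            # assumed device count
hours = 100.0           # assumed total training time, h
pue = 1.1               # assumed datacenter power usage effectiveness
grid_intensity = 0.4    # assumed carbon intensity, kgCO2eq per kWh

energy_kwh = gpu_power_kw * num_gpus * hours * pue
emissions_kg = energy_kwh * grid_intensity
print(f"{energy_kwh:.0f} kWh -> {emissions_kg:.0f} kgCO2eq")
```
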
## Technical Specifications

### Model Architecture and Objective

[More Information Needed]

### Compute Infrastructure

[More Information Needed]

#### Hardware

[More Information Needed]

#### Software

[More Information Needed]

269
+
270
+ ## Citation [optional]
271
+
272
+ <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
273
+
274
+ **BibTeX:**
275
+
276
+ [More Information Needed]
277
+
278
+ **APA:**
279
+
280
+ [More Information Needed]
281
+
282
+ ## Glossary [optional]
283
+
284
+ <!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
285
+
286
+ [More Information Needed]
287
+
288
+ ## More Information [optional]
289
+
290
+ [More Information Needed]
291
+
292
+ ## Model Card Authors [optional]
293
+
294
+ [More Information Needed]
295
+
296
+ ## Model Card Contact
297
+
298
+ [More Information Needed]