metadata
license: mit
language:
  - ru
base_model:
  - t-tech/T-lite-it-1.0
pipeline_tag: text-generation
library_name: transformers
tags:
  - pytorch

pollux-judge-7b


pollux-judge-7b is a 7-billion-parameter generative language model designed to evaluate the quality of other language models' responses in Russian. Given an input instruction, a specific criterion, and its scoring rubrics, the model assesses answer quality, enabling automated evaluation of LLM performance on Russian-language tasks.

Model Details

Model Description

pollux-judge-7b is an integral component of the POLLUX project, a comprehensive initiative dedicated to evaluating the generative capabilities of Large Language Models (LLMs). At the heart of this project lies the POLLUX dataset, which introduces systematic taxonomies for both generative tasks and evaluation criteria, providing quantitative and qualitative assessments of responses from top-tier LLMs.

Built upon the t-tech/T-lite-it-1.0 architecture, pollux-judge-7b is a decoder-based 7 billion parameter model trained in a sequence-to-sequence fashion. The model is designed to predict both numerical scores and detailed textual rationales based on the original instruction, the LLM's response, specific evaluation criteria, scoring rubrics, and reference answers when available.

While the model is technically capable of processing any type of instruction and criterion when properly formatted, its training has been specifically optimized using the generative tasks and evaluation criteria derived from the taxonomies established within the POLLUX dataset.

  • Model type: decoder
  • Language(s) (NLP): Russian
  • License: MIT
  • Finetuned from model: t-tech/T-lite-it-1.0

Model Sources

Uses

Direct Use

pollux-judge-7b is specifically designed for assessing text responses against a single, predefined criterion per evaluation run. The model operates optimally when provided with all essential components: the source instruction, the response to be evaluated (typically generated by another LLM), the specific evaluation criterion, and its corresponding scoring rubrics.

Out-of-Scope Use

While the model may technically process multiple criteria simultaneously, such usage falls outside its intended design and may yield unpredictable results. Similarly, the model is not designed to function autonomously in determining appropriate evaluation criteria—it requires explicit criterion specification to perform reliable assessments.

For optimal performance and reliable results, users should structure each evaluation session around one criterion at a time, providing all necessary contextual components to enable the model's comprehensive scoring and rationale generation capabilities.
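Concretely, "one criterion at a time" means building a separate prompt, and issuing a separate generation call, for each criterion. The sketch below illustrates this with a simplified stand-in template and made-up criteria (the criterion names and rubrics here are illustrative only, not the official POLLUX taxonomy); the full prompt format is shown in the usage example later in this card.

```python
# Build one prompt per criterion/rubric pair instead of packing several
# criteria into a single evaluation call.
# Simplified stand-in for the full Russian prompt template shown below.
PROMPT_TEMPLATE = """### Task: {instruction}
### Response: {answer}
### Criterion: {criteria_name}
### Rubrics: {criteria_rubrics}
"""

# Illustrative criteria; real evaluations should use criteria and rubrics
# from the POLLUX taxonomy.
criteria = [
    ("Correctness", "0: wrong. 1: partially correct. 2: correct."),
    ("Completeness", "0: missing parts. 1: mostly complete. 2: complete."),
]

prompts = [
    PROMPT_TEMPLATE.format(
        instruction="What is 2+2?",
        answer="It is 4",
        criteria_name=name,
        criteria_rubrics=rubrics,
    )
    for name, rubrics in criteria
]
# Each prompt is then sent to the model in its own generate() call.
```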

MODEL OUTPUT DISCLAIMER AND LIMITATION OF LIABILITY

All content, responses, and outputs generated by pollux-judge-7b (the "Model") are produced through automated computational processes based on statistical patterns learned from pre-training data. Such outputs do not constitute statements, opinions, recommendations, or positions of the model developers, publishers, or affiliated entities (collectively, the "Developers").

The Model's outputs do not represent, reflect, or endorse any views, beliefs, policies, or positions held by the Developers. Generated content should not be interpreted as official statements, advice, or guidance from the Developers.

While the Developers employed appropriate data curation practices during fine-tuning and avoided the intentional inclusion of inappropriate content, the Model's responses may reflect patterns present in the underlying pre-training datasets, which were sourced from publicly available internet content and other large-scale text corpora.

The Developers expressly disclaim responsibility for any content generated by the Model. Users acknowledge that:

  • Generated outputs are probabilistic and may contain inaccuracies, biases, or inappropriate content
  • The Developers cannot guarantee the accuracy, completeness, or appropriateness of any Model output
  • Users assume full responsibility for evaluating and using Model-generated content

Users are solely responsible for reviewing, validating, and determining the appropriateness of any Model-generated content before use or distribution.

How to Get Started with the Model

Use the code below to get started with the model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

torch.manual_seed(42)

# Prompt template expected by the judge. The section headings are in Russian:
# "Task to evaluate", "Reference answer", "Response to evaluate",
# "Evaluation criterion", "Rubric scale for the criterion".
PROMPT_TEMPLATE = '''instruction: |
  ### Задание для оценки:
  {instruction}

reference_answer: |
  ### Эталонный ответ:
  {reference_answer}

response: |
  ### Ответ для оценки:
  {answer}

score_name: |
  ### Критерий оценки:
  {criteria_name}

score_rubrics: |
  ### Шкала оценивания по критерию:
  {criteria_rubrics}
'''

# Example inputs (in Russian): instruction "What is 2+2?", response "It is 4",
# criterion "Answer correctness"; the reference answer is left empty.
instruction = 'Сколько будет 2+2?'
reference_answer = ''
answer = 'Будет 4'
criteria_name = 'Правильность ответа'
# Rubrics (in Russian): 0 - wrong or missing answer; 1 - incomplete answer;
# 2 - answer matches or is equivalent to the reference.
criteria_rubrics = '''0: Дан неправильный ответ или ответ отсутствует.

1: Ответ модели неполный (не на все вопросы задания получен ответ, в формулировке ответа отсутствует часть информации).

2: Ответ модели совпадает с эталонным или эквивалентен ему.'''

prompt = PROMPT_TEMPLATE.format(instruction=instruction,
                                reference_answer=reference_answer,
                                answer=answer,
                                criteria_name=criteria_name,
                                criteria_rubrics=criteria_rubrics)

# Load the judge model and tokenizer from the Hugging Face Hub
MODEL_PATH = "ai-forever/pollux-judge-7b"
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    torch_dtype="auto",
    device_map="auto"
)

messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=4096
)
# Keep only the newly generated tokens (strip the echoed prompt)
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

print(response)
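The generated response is free-form text containing a rationale and a numeric score. The exact output format is not specified in this card, so any parsing must be treated as an assumption: the sketch below simply takes the last standalone integer in the text as the rubric score. Verify this against real model outputs before relying on it.

```python
import re

def extract_score(response_text):
    # Hypothetical parser (the output format is not documented here):
    # take the last standalone integer in the generated text as the score.
    matches = re.findall(r"\b(\d+)\b", response_text)
    return int(matches[-1]) if matches else None
```

For example, `extract_score("Ответ совпадает с эталонным. Оценка: 2")` returns 2, and texts with no digits return `None`.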

Training Details

Training Data

Synthetic data was used for training because (i) manually composing a training set at least the size of the POLLUX dataset would require a comparable amount of time and labor, and (ii) reusing the same panels of experts could lead to data leakage.

Training Procedure

Preprocessing [optional]

[More Information Needed]

Training Hyperparameters

  • Training regime: [More Information Needed]

Speeds, Sizes, Times [optional]

[More Information Needed]

Evaluation

Testing Data, Factors & Metrics

Testing Data

[More Information Needed]

Factors

[More Information Needed]

Metrics

[More Information Needed]

Results

[More Information Needed]

Summary

Model Examination [optional]

[More Information Needed]

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: [More Information Needed]
  • Hours used: [More Information Needed]
  • Cloud Provider: [More Information Needed]
  • Compute Region: [More Information Needed]
  • Carbon Emitted: [More Information Needed]

Technical Specifications [optional]

Model Architecture and Objective

[More Information Needed]

Compute Infrastructure

[More Information Needed]

Hardware

[More Information Needed]

Software

[More Information Needed]

Citation [optional]

BibTeX:

[More Information Needed]

APA:

[More Information Needed]

Glossary [optional]

[More Information Needed]

More Information [optional]

[More Information Needed]

Model Card Authors [optional]

[More Information Needed]

Model Card Contact

[More Information Needed]