prometheus-2-bloom-560m

A fine-tuned bloom-560m counterpart of prometheus-7b-v2.0, using bigscience/bloom-560m as the base model.

Training hyperparameters:

3 epochs
Learning rate 1e-5
Effective batch size 4
Cosine annealing
~5% warmup
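
As a rough illustration of how the learning-rate settings above fit together, the sketch below computes a per-step learning rate. It assumes linear warmup over the first ~5% of steps and cosine decay to zero afterwards; the card does not state the exact warmup shape or the final learning rate, so both are assumptions, as is the dataset size (`num_examples` is a hypothetical placeholder).

```python
import math

def lr_at_step(step, total_steps, peak_lr=1e-5, warmup_frac=0.05):
    """Assumed schedule: linear warmup to peak_lr, then cosine decay to zero."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Hypothetical dataset size, for illustration only (not from the model card).
num_examples = 4000
epochs = 3
effective_batch_size = 4
total_steps = math.ceil(num_examples / effective_batch_size) * epochs

# Print the learning rate at a few points: start, mid-warmup, peak, mid-decay, end.
for step in (0, 75, 150, 1575, total_steps):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

The effective batch size of 4 only enters through the number of optimizer steps; how it was split between per-device batch size and gradient accumulation is not specified.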

Supports both feedback (Likert-scale) evaluation and preference evaluation. Uses the same prompts as prometheus-7b-v2.0. See the examples below.

Feedback Evaluation

ABSOLUTE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, a reference answer that gets a score of 5, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of the response strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, write a score that is an integer between 1 and 5. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (an integer number between 1 and 5)"
4. Please do not generate any other opening, closing, and explanations.

###The instruction to evaluate:
{}

###Response to evaluate:
{}

###Reference Answer (Score 5):
{}

###Score Rubrics:
{}

###Feedback: """

from transformers import AutoModelForCausalLM, AutoTokenizer

device = 'cuda:0'
model = AutoModelForCausalLM.from_pretrained("zli12321/prometheus2-560M").to(device)
tokenizer = AutoTokenizer.from_pretrained("zli12321/prometheus2-560M")

# Define your own instruction, response, reference, and rubric (all strings) before this line
prompt = ABSOLUTE_PROMPT.format(instruction, response, reference, rubric)

input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
input_length = input_ids.shape[1]
outputs = model.generate(input_ids, output_logits=True, return_dict_in_generate=True, max_new_tokens=4096)
# Decode only the newly generated tokens, skipping the echoed prompt
print(tokenizer.decode(outputs.sequences[0][input_length:], skip_special_tokens=True))
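
The prompt asks for output of the form "Feedback: ... [RESULT] N", so the decoded text usually needs to be split into feedback and score. The helper below is a minimal sketch of one way to do that; `parse_feedback` is a hypothetical name, not part of the model or of any library, and a small model may not always follow the format, in which case the score comes back as `None`.

```python
import re

def parse_feedback(generated_text):
    """Return (feedback, score) from decoded output; score is None if missing."""
    # Drop an echoed prompt, if present, by keeping only the text after
    # the last "###Feedback:" marker.
    completion = generated_text.split("###Feedback:")[-1]
    feedback, marker, tail = completion.partition("[RESULT]")
    if not marker:
        return feedback.strip(), None
    # Accept a single digit from 1 to 5 directly after the marker.
    match = re.match(r"\s*([1-5])\b", tail)
    score = int(match.group(1)) if match else None
    return feedback.strip(), score

# Hand-written sample text for illustration, not real model output.
sample = "###Feedback: The response is accurate but omits one detail. [RESULT] 4"
print(parse_feedback(sample))
```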

Preference Evaluation Template

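To run preference evaluation, follow the feedback evaluation code above, substituting the preference evaluation template below and passing the instruction, both responses, the reference answer, and the score rubric.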

###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.

###Instruction:
{}

###Response A:
{}

###Response B:
{}

###Reference Answer:
{}

###Score Rubric:
{}

###Feedback:
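
For convenience, the template above can be wrapped as a Python string and filled in the same way as `ABSOLUTE_PROMPT`. This is a sketch only: the names `RELATIVE_PROMPT`, `build_preference_prompt`, and `parse_preference` are hypothetical, and the trailing space after `###Feedback:` is an assumption that mirrors the feedback template.

```python
import re

RELATIVE_PROMPT = """###Task Description:
An instruction (might include an Input inside it), a response to evaluate, and a score rubric representing a evaluation criteria are given.
1. Write a detailed feedback that assess the quality of two responses strictly based on the given score rubric, not evaluating in general.
2. After writing a feedback, choose a better response between Response A and Response B. You should refer to the score rubric.
3. The output format should look as follows: "Feedback: (write a feedback for criteria) [RESULT] (A or B)"
4. Please do not generate any other opening, closing, and explanations.

###Instruction:
{}

###Response A:
{}

###Response B:
{}

###Reference Answer:
{}

###Score Rubric:
{}

###Feedback: """

def build_preference_prompt(instruction, response_a, response_b, reference, rubric):
    """Fill the five slots in the order they appear in the template."""
    return RELATIVE_PROMPT.format(instruction, response_a, response_b, reference, rubric)

def parse_preference(generated_text):
    """Return (feedback, 'A' or 'B'); the choice is None if the format is not followed."""
    completion = generated_text.split("###Feedback:")[-1]
    feedback, marker, tail = completion.partition("[RESULT]")
    if not marker:
        return feedback.strip(), None
    match = re.match(r"\s*([AB])\b", tail)
    return feedback.strip(), (match.group(1) if match else None)

# Placeholder inputs for illustration only.
prompt = build_preference_prompt("Q", "answer a", "answer b", "ref", "rubric")
print(parse_preference("###Feedback: Response B is more complete. [RESULT] B"))
```

The resulting `prompt` can be tokenized and passed to `model.generate` exactly as in the feedback evaluation example.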

Citations

@misc{kim2024prometheus,
    title={Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models},
    author={Seungone Kim and Juyoung Suk and Shayne Longpre and Bill Yuchen Lin and Jamin Shin and Sean Welleck and Graham Neubig and Moontae Lee and Kyungjae Lee and Minjoon Seo},
    year={2024},
    eprint={2405.01535},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}

@inproceedings{li-etal-2024-pedants,
    title = "{PEDANTS}: Cheap but Effective and Interpretable Answer Equivalence",
    author = "Li, Zongxia  and
      Mondal, Ishani  and
      Nghiem, Huy  and
      Liang, Yijun  and
      Boyd-Graber, Jordan Lee",
    editor = "Al-Onaizan, Yaser  and
      Bansal, Mohit  and
      Chen, Yun-Nung",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2024",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.findings-emnlp.548/",
    doi = "10.18653/v1/2024.findings-emnlp.548",
    pages = "9373--9398",
    abstract = "Question answering (QA) can only make progress if we know if an answer is correct, but current answer correctness (AC) metrics struggle with verbose, free-form answers from large language models (LLMs). There are two challenges with current short-form QA evaluations: a lack of diverse styles of evaluation data and an over-reliance on expensive and slow LLMs. LLM-based scorers correlate better with humans, but this expensive task has only been tested on limited QA datasets. We rectify these issues by providing rubrics and datasets for evaluating machine QA adopted from the Trivia community. We also propose an efficient, and interpretable QA evaluation that is more stable than an exact match and neural methods (BERTScore)."
}
