Model Description

This model was fine-tuned on reward modeling data in two stages: Supervised Fine-Tuning (SFT) followed by Direct Preference Optimization (DPO). The released checkpoint is the post-DPO model, optimized for reasoning and text generation tasks. Its chat format places a "reason" turn between the user and assistant turns:

chat_message = [
  {"role": "user", "content": ...},
  {"role": "reason", "content": ...},
  {"role": "assistant", "content": ...},
]
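As a minimal sketch, the three-role format above could be assembled with a small helper like the one below. The helper names (`build_chat_message`, `has_expected_roles`) and the sample strings are hypothetical illustrations, not part of the model's released code, and the card does not specify exactly what each role's content should hold.

```python
from typing import Dict, List

# Role order taken from the chat_message layout shown in this card.
ROLE_ORDER = ("user", "reason", "assistant")


def build_chat_message(prompt: str, reasoning: str, answer: str) -> List[Dict[str, str]]:
    """Assemble one user/reason/assistant exchange (hypothetical helper)."""
    contents = (prompt, reasoning, answer)
    return [{"role": role, "content": text} for role, text in zip(ROLE_ORDER, contents)]


def has_expected_roles(chat_message: List[Dict[str, str]]) -> bool:
    """Check that the roles appear exactly in user, reason, assistant order."""
    return tuple(m["role"] for m in chat_message) == ROLE_ORDER


# Sample strings are placeholders chosen for illustration only.
chat_message = build_chat_message(
    prompt="What is 2 + 2?",
    reasoning="Adding 2 and 2 gives 4.",
    answer="4",
)
```

With Hugging Face Transformers, a list like this would typically be passed to `tokenizer.apply_chat_template`. That assumes the tokenizer shipped with this model defines a template that handles the "reason" role, which this card does not confirm.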

Intended Use

This model is designed primarily for reward modeling, but it can also be applied to general-purpose tasks, where it retains a degree of correctness and reliability.

Limitations

  • The model’s performance may vary depending on the domain and specificity of the input.
  • It may inherit biases present in the training data.

Code and Resources

The code and additional resources for this model are available on GitHub.

Model size: 14.8B params
Tensor type: BF16 (Safetensors)

Model tree for jiulaikankan/Qwen2.5-14B-ReasonGenRM

Base model: Qwen/Qwen2.5-14B
Finetuned: this model