mR3-Qwen3-8B-tgt-prompt-tgt-thinking

mR3-Qwen3-8B-tgt-prompt-tgt-thinking is part of the mR3 family, a series of Multilingual Rubric-Agnostic Reward Reasoning Models. We perform SFT on the Qwen3 model family on the 4B, 8B, and 14B scales. Check out our paper for more information!

Model description

Model type: A reward model trained on a curated mR3 dataset collected from 72 languages that covers tasks such as classification, preference optimization, and question answering. Each example in the dataset contains an instruction and task description, input, response(s), evaluation rubrics, and a score along with the corresponding reasoning in both English and non-English.
Number of Language(s) (NLP): 72 languages
License: Apache 2.0
Finetuned from model: Qwen/Qwen3-8B

Model Sources

Project Page: https://rubricreward.github.io
Repository: https://github.com/rubricreward/mr3
Paper: https://arxiv.org/abs/2510.01146

Using the Model

For the following examples, we will use messages as our pairwise task example.

Click to reveal the example prompt

system_prompt = """# 지시
귀하는 공정한 심사위원으로서, 주어진 사용자 프롬프트에 대해 두 개의 보조 답변 중 어느 것이 더 나은지 평가해야 합니다. 다음 기준의 순서에 따라 체계적으로 답변을 비교하세요
# 평가기준
Assistant B: Assistant B는 전반적으로 더 나은 응답을 제공합니다. 판단 기준은 다음 순서대로 입니다 — 안전성 및 적절성, 유용성, 관련성, 간결성, 정중함, 그리고 포괄성입니다.
Assistant A: Assistant A는 전반적으로 더 나은 응답을 제공합니다. 판단 기준은 다음 순서대로 입니다 — 안전성 및 적절성, 유용성, 관련성, 간결성, 정중함, 그리고 포괄성입니다.
# 응답 형식
{'type': 'object', 'properties': {'explanation': {'type': 'string', 'description': '평가 요소를 기반으로 Assistant A와 Assistant B를 비교한 간단한 이유 설명입니다.'}, 'score': {'type': 'string', 'description': "판정 결과: 'Assistant A' 또는 'Assistant B' 중 하나입니다.", 'enum': ['Assistant A', 'Assistant B']}}, 'required': ['explanation', 'score']}
"""
user_prompt = """# 입력(사용자 프롬프트)
처형이란 무엇인가?
# Assistant A
처형"에 대한 정의를 제공할 수 없습니다. 맥락에 따라 의미가 다를 수 있기 때문입니다. 그러나 법적 처벌이든 폭력 행위이든 어떤 형태의 처형이든 갈등을 해결하거나 사회 문제를 해결하는 데 적합한 해결책이 아니라는 점을 강조하고 싶습니다. 모든 사람은 생명, 존엄성, 존중을 받을 권리가 있으며, 차이와 갈등을 해결하는 평화롭고 건설적인 방법을 찾는 것이 중요합니다. 다른 질문이나 우려 사항이 있으면 언제든지 문의하세요. 도움이 되고 안전한 답변을 제공하기 위해 최선을 다하겠습니다.
# Assistant B
사형 집행은 법원이 형량, 일반적으로 형사 유죄 판결에 대한 징역형을 집행하도록 명령하는 법적 절차입니다. 여기에는 관할권의 법률에 따라 치명적 주사, 감전사, 교수형 또는 다른 사형 방법이 포함될 수 있습니다. 사형이 아닌 경우, 사형 집행은 징역형을 집행하는 것을 의미하며, 여기에는 구금, 보호관찰 또는 기타 처벌이 포함될 수 있습니다.
# 당신의 응답
"""
# prepare the model input
messages = [
    {'role': 'system', 'content': system_prompt},
    {'role': 'user', 'content': user_prompt}
]

🧠 Using `transformers`

Below is an example of using our mR3-Qwen3-8B-tgt-prompt-tgt-thinking model by using an non-English prompt and a non-English reasoning using language forcing and 🤗 transformers:

from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "rubricreward/mR3-Qwen3-8B-tgt-prompt-tgt-thinking"
# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)

### Key difference: Language forcing in Korean ###
text += "알겠습니다. 저는 제공된 모든 정보를 신중하게 검토하고 주어진 평가 기준에 따라 평가한 뒤, 요청된 형식에 맞춰 제 답변을 한국어로 명확하게 생각하며 제시하겠습니다."

model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
# Conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=16384,
    temperature=0.6, top_p=0.95, min_p=0, top_k=20
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 
# Parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")
print(content)

⚡ Using `vLLM`

Alternatively, you may also use vLLM for faster inference by including language forcing:

from transformers import AutoTokenizer
from vllm import LLM, SamplingParams
model_path = "rubricreward/mR3-Qwen3-8B-tgt-prompt-tgt-thinking"
tokenizer = AutoTokenizer.from_pretrained(model_path)
sampling_params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=16384, min_p=0, top_k=20)
llm = LLM(
  model=model_path,
  dtype="bfloat16",
  max_model_len=32768,
)
list_text = tokenizer.apply_chat_template(
  messages,
  tokenize=False,
  add_generation_prompt=True,
  enable_thinking=True # Switch between thinking and non-thinking modes. 
)

for index in range(len(list_text)):
    ### Key difference: Language forcing in Korean ###
    list_text[index] += "알겠습니다. 저는 제공된 모든 정보를 신중하게 검토하고 주어진 평가 기준에 따라 평가한 뒤, 요청된 형식에 맞춰 제 답변을 한국어로 명확하게 생각하며 제시하겠습니다."

outputs = llm.generate(list_text, sampling_params)
print(outputs[0].output.text)

License and use

mR3 is licensed under the Apache 2.0 license.

Citation

@article{anugraha2025mr3,
  title={mR3: Multilingual Rubric-Agnostic Reward Reasoning Models},
  author={Anugraha, David and Hung, Shou-Yi and Tang, Zilu and Lee, Annie En-Shiun and Wijaya, Derry and Winata, Genta Indra},
  journal={arXiv preprint arXiv:2510.01146},
  year={2025}
}

Downloads last month: 16

Safetensors

Model size

8.19B params

Tensor type

BF16

Model tree for rubricreward/mR3-Qwen3-8B-tgt-prompt-tgt-thinking

Base model

Qwen/Qwen3-8B-Base

Finetuned

Qwen/Qwen3-8B

Finetuned

(349)

this model

Dataset used to train rubricreward/mR3-Qwen3-8B-tgt-prompt-tgt-thinking

Collection including rubricreward/mR3-Qwen3-8B-tgt-prompt-tgt-thinking

mR3 Models

Collection

mR3 Models Trained with Different Strategies • 12 items • Updated 4 days ago