Model Card for Qwen3-0.6B-MNLP-DPO

This model is a Direct Preference Optimization (DPO) fine-tuned version of Qwen3-0.6B-Base using the Mehdi-Zogh/MNLP_M2_dpo_dataset. The goal was to improve the alignment of the base model's outputs with human preferences for educational assistance use cases.


Model Details

Model Description

This model was fine-tuned with the DPO (Direct Preference Optimization) algorithm on top of Qwen3-0.6B-Base. The dataset used for preference learning consists of prompts paired with a preferred (chosen) and a rejected response, teaching the model to generate more helpful and appropriate answers in instructional contexts.
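
For reference, DPO trains the policy directly on these preference pairs with the standard DPO objective, where $\pi_\theta$ is the fine-tuned policy, $\pi_{\text{ref}}$ the frozen Qwen3-0.6B-Base reference, $(x, y_w, y_l)$ a prompt with its chosen and rejected responses, $\sigma$ the logistic function, and $\beta$ the preference-strength coefficient (the specific $\beta$ used for this model is not stated here):

$$
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}\left[ \log \sigma\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]
$$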


Uses

Direct Use

This model is intended to serve as an AI tutor specialized in EPFL course content.

Downstream Use

It can serve as a base model for further alignment, personalization, or integration into interactive educational platforms or tutoring systems.

Out-of-Scope Use

  • Not recommended for use in high-stakes settings.
  • Not intended for use in languages other than English.
  • Not intended for generating factual or up-to-date information (base model was not trained for retrieval-based tasks).

Get Started with the Model

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Mehdi-Zogh/MNLP_M2_dpo_model"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input
prompt = "explain gradient descent in simple terms."
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32768
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist() 

# parsing thinking content
try:
    # rindex finding 151668 (</think>)
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0

thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip("\n")
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip("\n")

print("thinking content:", thinking_content)
print("content:", content)

Training Details

Training Data

The training data is the Mehdi-Zogh/MNLP_M2_dpo_dataset, which contains instructional prompts with ranked preferred and rejected completions. The dataset is specifically designed for alignment research using preference optimization methods.
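
For illustration, a single record typically has the shape below. The field names follow the common prompt/chosen/rejected convention for DPO datasets and are an assumption rather than a guaranteed description of this dataset's schema; the texts are placeholders.

# Illustrative preference record (field names and contents are assumptions)
example = {
    "prompt": "Explain the bias-variance trade-off.",
    "chosen": "A clear, well-structured explanation preferred by annotators ...",
    "rejected": "A less helpful or less accurate completion ...",
}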

Training Procedure

The model was fine-tuned using trl's DPOTrainer with the hyperparameters listed below; a minimal training sketch follows the table.

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Learning rate | 1e-5 |
| Epochs | 3 |
| Per-device train batch size | 1 |
| Per-device eval batch size | 1 |
| Gradient accumulation steps | 4 |
| Precision | bf16 |
| Early stopping patience | 3 |
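
The following is a minimal sketch of how such a setup can be reproduced with trl, assuming a recent trl release (DPOConfig and processing_class; older versions use tokenizer=) and a dataset with prompt/chosen/rejected columns. Split names, output paths, and the evaluation schedule are illustrative assumptions, not the authors' exact script.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, EarlyStoppingCallback
from trl import DPOConfig, DPOTrainer

base = "Qwen/Qwen3-0.6B-Base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

dataset = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset")

config = DPOConfig(
    output_dir="qwen3-0.6b-mnlp-dpo",       # illustrative output path
    learning_rate=1e-5,
    num_train_epochs=3,
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    bf16=True,
    eval_strategy="epoch",                  # "evaluation_strategy" in older transformers
    save_strategy="epoch",
    load_best_model_at_end=True,            # required for early stopping
    metric_for_best_model="eval_loss",
)

trainer = DPOTrainer(
    model=model,                            # ref_model defaults to a frozen copy of the base
    args=config,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],     # hypothetical split name (see Evaluation below)
    processing_class=tokenizer,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
)
trainer.train()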

Evaluation

A held-out set of 320 samples from the dataset was used for validation.
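
The exact split procedure is not stated; one way to reproduce a 320-sample held-out split with the datasets library (split name and seed are assumptions) would be:

from datasets import load_dataset

ds = load_dataset("Mehdi-Zogh/MNLP_M2_dpo_dataset", split="train")  # assumed split name
splits = ds.train_test_split(test_size=320, seed=42)                # seed is illustrative
train_ds, val_ds = splits["train"], splits["test"]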

Testing Data, Factors & Metrics

Testing Data

The model was tested on zechen-nlp/MNLP_dpo_demo.

Metrics

  • Preference accuracy: the fraction of held-out pairs for which the model ranks the preferred (chosen) response above the rejected one. This is a standard metric in DPO training for evaluating how well the model aligns with human preferences; a hedged computation sketch follows this list.
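
As an illustration only, and not the authors' evaluation code: preference accuracy can be approximated by scoring each chosen/rejected pair with the policy's summed token log-probabilities and counting how often the chosen response wins. A full DPO reward accuracy would also involve the reference model and the β coefficient, which are omitted here for brevity; the helper names below are hypothetical.

import torch

def sequence_logprob(model, tokenizer, prompt, response):
    """Summed log-probability the model assigns to `response` given `prompt`.

    Assumes tokenizing prompt + response keeps the prompt tokens as a prefix,
    which is an approximation for some tokenizers.
    """
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # log-probability of each next token given the preceding context
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_logps = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)
    # keep only the log-probs of the response tokens
    return token_logps[:, prompt_ids.shape[1] - 1:].sum().item()

def preference_accuracy(model, tokenizer, pairs):
    """pairs: iterable of (prompt, chosen, rejected) strings."""
    wins = sum(
        sequence_logprob(model, tokenizer, p, chosen)
        > sequence_logprob(model, tokenizer, p, rejected)
        for p, chosen, rejected in pairs
    )
    return wins / len(pairs)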

Results

  • The model achieved a preference accuracy of 84% ± 5.2% on the test set.
  • This indicates strong alignment between the model's outputs and the preferred responses provided in the dataset.