safouaneelg/gpt-oss-20b_DPO_ultrafeedback

This is the full-precision model.

If you prefer the LoRA adapter only, see my other repo: safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora

Model Description

This is a Direct Preference Optimization (DPO) fine-tuned version of the openai/gpt-oss-20b base model, aligned using the argilla/ultrafeedback-binarized-preferences-cleaned dataset. The fine-tuning used LoRA adapters. The model retains the causal language modeling capabilities of the base model while improving alignment with human preferences.
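For readers new to DPO, the per-pair objective can be sketched in plain Python. This is an illustrative sketch of the standard DPO loss, not the training code used for this model. The function name `dpo_loss` and the example log-probabilities are hypothetical; the default `beta=0.1` matches the Beta (DPO) value listed under the training hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (illustrative sketch).

    Each argument is the summed log-probability of a full response under
    either the policy being trained or the frozen reference model.
    """
    # How much more the policy likes each response than the reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Scaled preference margin between the chosen and rejected responses.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)), written in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


if __name__ == "__main__":
    # Hypothetical log-probabilities: the policy favors the chosen response
    # more than the reference does, so the loss drops below log(2).
    print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log(2); the loss falls as the policy shifts probability toward the chosen response relative to the reference.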

Compute infrastructure: training was conducted on a Linux server with 3x A6000 GPUs (48 GB each) using the following frameworks:

transformers==4.56.2
trl==0.21.0
peft==0.17.1
torch==2.8.0+cu128
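To compare a local environment against these pins, a small check along the following lines may help. It is a sketch, not part of the original setup; the helper names `installed_versions` and `find_mismatches` are hypothetical, and matching these exact versions is not claimed to be required for inference.

```python
from importlib import metadata

# Versions listed in this model card.
PINNED = {
    "transformers": "4.56.2",
    "trl": "0.21.0",
    "peft": "0.17.1",
    "torch": "2.8.0+cu128",
}


def installed_versions(names):
    """Map each package name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def find_mismatches(pinned, installed):
    """Return {package: (expected, found)} for every differing version."""
    return {
        name: (want, installed.get(name))
        for name, want in pinned.items()
        if installed.get(name) != want
    }


if __name__ == "__main__":
    mismatches = find_mismatches(PINNED, installed_versions(PINNED))
    for name, (want, found) in mismatches.items():
        print(f"{name}: expected {want}, found {found}")
```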

Loading and Inference with Transformers (Full Precision)

Run the code below to use this model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# The model is already loaded in bfloat16 and placed across devices above,
# so the pipeline only needs the model and tokenizer objects.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

prompt = """
  Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
  #include <iostream>
  #include <string>
  using namespace std;
  int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
  }.
"""

outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)
print(outputs[0]["generated_text"])

Local Run Tips:

  • Requires ~40 GB VRAM (multi-GPU recommended; use device_map="auto" for sharding).
  • Chat-style: Wrap prompts in tokenizer.apply_chat_template([{"role": "user", "content": prompt}]).
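The chat-style tip above can be sketched as follows. This is a minimal sketch that assumes `model` and `tokenizer` were loaded as in the inference snippet earlier in this card; the helper names `build_messages` and `chat_generate` are hypothetical.

```python
def build_messages(user_prompt):
    """Wrap a raw prompt as a single-turn chat conversation."""
    # Surrounding whitespace is stripped so triple-quoted prompts stay clean.
    return [{"role": "user", "content": user_prompt.strip()}]


def chat_generate(user_prompt, model, tokenizer, max_new_tokens=100):
    """Apply the tokenizer's chat template, generate, and decode the reply."""
    inputs = tokenizer.apply_chat_template(
        build_messages(user_prompt),
        add_generation_prompt=True,  # append the assistant turn marker
        return_tensors="pt",
        return_dict=True,
    ).to(model.device)

    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)

    # Keep only the newly generated tokens, not the echoed prompt.
    prompt_length = inputs["input_ids"].shape[-1]
    return tokenizer.decode(output_ids[0][prompt_length:], skip_special_tokens=True)
```

Usage would look like `print(chat_generate(prompt, model, tokenizer))`. Slicing off the prompt tokens before decoding returns only the model's reply.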

Training details

Training data & results

The model was fine-tuned on the argilla/ultrafeedback-binarized-preferences-cleaned dataset, a cleaned subset of UltraFeedback containing ~60k binarized preference pairs (prompt, chosen response, rejected response) for alignment. Preprocessing: examples were filtered by length (more than 20 characters, 10 to 512 tokens) and formatted with chat templates. The full dataset card is available on Hugging Face.
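The length filter described above might look roughly like the sketch below. It is not the author's preprocessing script. It assumes each example exposes `prompt`, `chosen`, and `rejected` as plain strings and that all three fields must pass the same check; the helper names and the whitespace token counter are hypothetical stand-ins (a real run would count tokens with the model tokenizer).

```python
def within_length_limits(text, count_tokens,
                         min_chars=20, min_tokens=10, max_tokens=512):
    """True if text has more than min_chars characters and a token count
    between min_tokens and max_tokens (inclusive)."""
    if len(text) <= min_chars:
        return False
    return min_tokens <= count_tokens(text) <= max_tokens


def keep_pair(example, count_tokens):
    """Assumption: prompt, chosen, and rejected must all pass the check."""
    return all(
        within_length_limits(example[field], count_tokens)
        for field in ("prompt", "chosen", "rejected")
    )


def whitespace_token_count(text):
    """Stand-in token counter that splits on whitespace."""
    return len(text.split())


if __name__ == "__main__":
    sample = {
        "prompt": "Explain in a few sentences how a binary search works on a sorted list.",
        "chosen": "Binary search repeatedly halves the sorted list until the target is found.",
        "rejected": "It looks at every element one by one from the start of the list to the end.",
    }
    print(keep_pair(sample, whitespace_token_count))
```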

Training Hyperparameters:

| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Per device batch size | 1 |
| Gradient accumulation steps | 16 |
| Learning rate | 5e-6 |
| Number of epochs | 1 |
| Warmup ratio | 0.1 |
| Beta (DPO) | 0.1 |
| Max sequence length | 512 |
| Optimizer | adamw_torch |
| LR scheduler | cosine |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | True |
| BF16 | True |
| Seed | 42 |
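As a rough guide, these hyperparameters could map onto `peft.LoraConfig` and `trl.DPOConfig` as sketched below. This is a reconstruction from the values listed here, not the author's training script. The mapping of "Max sequence length" to `max_length`, the `task_type` value, and the `output_dir` name are assumptions, as are the helper names `effective_batch_size` and `build_configs`.

```python
# Values copied from the hyperparameters listed in this card.
LORA_KWARGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.1,
    "bias": "none",
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

DPO_KWARGS = {
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 5e-6,
    "num_train_epochs": 1,
    "warmup_ratio": 0.1,
    "beta": 0.1,
    "max_length": 512,  # assumption: "Max sequence length" maps here
    "optim": "adamw_torch",
    "lr_scheduler_type": "cosine",
    "weight_decay": 0.01,
    "max_grad_norm": 1.0,
    "gradient_checkpointing": True,
    "bf16": True,
    "seed": 42,
}


def effective_batch_size(per_device, grad_accum, num_data_parallel_gpus):
    """Preference pairs consumed per optimizer step."""
    return per_device * grad_accum * num_data_parallel_gpus


def build_configs(output_dir="dpo-output"):
    """Build the config objects; requires peft and trl to be installed."""
    from peft import LoraConfig
    from trl import DPOConfig

    return LoraConfig(**LORA_KWARGS), DPOConfig(output_dir=output_dir, **DPO_KWARGS)
```

The effective batch size depends on how the three GPUs were used, which this card does not state: `effective_batch_size(1, 16, 3)` gives 48 pairs per step if they ran data-parallel, while a model sharded across the GPUs would give 16.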

The training curves are shown below. Training lasted ~37 hours.

[Figure: training logs]

Model Card Authors

Safouane El Ghazouali
