safouaneelg/gpt-oss-20b_DPO_ultrafeedback

This is the full-precision model.

If you only need the LoRA adapter, see my other repo: safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora
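For the adapter route, attaching the LoRA weights to the base model with peft might look like the sketch below. It assumes the adapter repo is stored in standard PEFT format; `load_with_adapter` is a hypothetical helper name, and nothing is downloaded until the function is called.

```python
BASE_ID = "openai/gpt-oss-20b"
ADAPTER_ID = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora"


def load_with_adapter(base_id=BASE_ID, adapter_id=ADAPTER_ID):
    """Load the base model and attach the LoRA adapter (hypothetical helper)."""
    # Heavy imports stay local so that importing this snippet is cheap.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id,
        torch_dtype=torch.bfloat16,
        device_map="auto",
    )
    # Assumption: the adapter repo follows the usual PEFT layout.
    model = PeftModel.from_pretrained(base, adapter_id)
    return tokenizer, model
```

Calling `tokenizer, model = load_with_adapter()` should give a model usable in the same way as the full-precision checkpoint in this repo.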

Model Description

This is a Direct Preference Optimization (DPO) fine-tuned version of the openai/gpt-oss-20b base model, aligned on the argilla/ultrafeedback-binarized-preferences-cleaned dataset. Fine-tuning used LoRA adapters. The model retains the causal language modeling capabilities of the base while improving alignment to human preferences.
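For readers new to DPO, the per-pair objective can be sketched in a few lines. This is the standard DPO loss written out for illustration, not code from this training run; the function name and its arguments are hypothetical, and `beta=0.1` matches the value listed under Training Hyperparameters.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one preference pair (sequence log-probs as floats)."""
    # How much more the policy likes each response than the frozen reference does.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)), written as log(1 + exp(-margin)).
    return math.log1p(math.exp(-margin))
```

The loss equals log 2 when the policy and the reference agree, and it falls as the policy raises the chosen response relative to the rejected one.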

Compute infrastructure: Training ran on a Linux server with 3x A6000 GPUs (48 GB each), using the frameworks below:

transformers==4.56.2
trl==0.21.0
peft==0.17.1
torch==2.8.0+cu128

Loading and Inference with Transformers (Full Precision)

Run the code below to use this model.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
)

# The model is already loaded in bfloat16 and sharded by device_map="auto",
# so the pipeline only needs the model and tokenizer objects.
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer
)

prompt = """
  Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
  #include <iostream>
  #include <string>
  using namespace std;
  int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
  }.
"""

outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id
)
print(outputs[0]["generated_text"])

Local Run Tips:

Requires ~40 GB VRAM (multi-GPU recommended; use device_map="auto" for sharding).
Chat-style: Wrap prompts in tokenizer.apply_chat_template([{"role": "user", "content": prompt}]).
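One way to apply the chat-style tip is to build the message list once and reuse it. The sketch below uses two hypothetical helpers, `to_messages` and `chat`. It assumes `generator` is the text-generation pipeline created above, which accepts a list of role/content messages and applies the tokenizer's chat template itself.

```python
def to_messages(prompt, system=None):
    """Wrap a raw prompt in the role/content format chat templates expect."""
    messages = []
    if system:
        messages.append({"role": "system", "content": system})
    messages.append({"role": "user", "content": prompt.strip()})
    return messages


def chat(generator, prompt, **gen_kwargs):
    """Send one user turn through a text-generation pipeline and return the reply."""
    outputs = generator(to_messages(prompt), **gen_kwargs)
    # With chat input, "generated_text" holds the whole conversation;
    # the last entry is the newly generated assistant message.
    return outputs[0]["generated_text"][-1]["content"]
```

A call such as `chat(generator, "Explain DPO in two sentences.", max_new_tokens=100)` returns only the assistant text rather than the prompt plus continuation.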

Training details

Training data & results

The model was fine-tuned on the argilla/ultrafeedback-binarized-preferences-cleaned dataset, a cleaned subset of UltraFeedback containing ~60k binarized preference pairs (prompt, chosen response, rejected response) for alignment. Preprocessing: examples were filtered by length (>20 chars, 10-512 tokens) and formatted with chat templates. Full dataset card: Hugging Face.
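The length filter described above could look roughly like this. The card does not say which fields the limits apply to, so the sketch assumes the 20-character minimum applies to the prompt and the 10-512 token window to each response. `keep_pair` is a hypothetical name, and the whitespace token counter is only a stand-in for the model tokenizer.

```python
def keep_pair(prompt, chosen, rejected,
              count_tokens=lambda text: len(text.split()),
              min_chars=20, min_tokens=10, max_tokens=512):
    """Return True if a preference pair passes the assumed length filter."""
    # Assumption: ">20 chars" refers to the prompt.
    if len(prompt) <= min_chars:
        return False
    # Assumption: "10-512 tokens" applies to both responses.
    return all(
        min_tokens <= count_tokens(response) <= max_tokens
        for response in (chosen, rejected)
    )
```

Passing `count_tokens=lambda text: len(tokenizer(text)["input_ids"])` would swap in real token counts from a loaded tokenizer.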

Training Hyperparameters:

Parameter	Value
LoRA rank (r)	16
LoRA alpha	32
LoRA dropout	0.1
Bias	none
Target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Per device batch size	1
Gradient accumulation steps	16
Learning rate	5e-6
Number of epochs	1
Warmup ratio	0.1
Beta (DPO)	0.1
Max sequence length	512
Optimizer	adamw_torch
LR scheduler	cosine
Weight decay	0.01
Max grad norm	1.0
Gradient checkpointing	True
BF16	True
Seed	42
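As a rough guide, the table above maps onto peft and trl configuration objects as sketched below. This is a reconstruction from the listed values, not the original training script: `task_type`, `output_dir`, the mapping of "Max sequence length" to `max_length`, and the helper names are assumptions. The card also does not state whether the three GPUs ran data-parallel, so the effective batch size is left as a function of the process count.

```python
LORA_KWARGS = dict(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",  # assumption: not listed in the table
)

DPO_KWARGS = dict(
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.1,
    beta=0.1,
    max_length=512,  # assumption: "Max sequence length" maps to max_length
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,
    seed=42,
)


def build_configs(output_dir="dpo-output"):
    """Build (LoraConfig, DPOConfig); output_dir is a placeholder path."""
    from peft import LoraConfig
    from trl import DPOConfig
    return LoraConfig(**LORA_KWARGS), DPOConfig(output_dir=output_dir, **DPO_KWARGS)


def effective_batch_size(num_processes):
    """Pairs consumed per optimizer step for a given number of data-parallel processes."""
    return (DPO_KWARGS["per_device_train_batch_size"]
            * DPO_KWARGS["gradient_accumulation_steps"]
            * num_processes)
```

With a single process the effective batch is 16 pairs per optimizer step; if all three GPUs ran data-parallel it would be 48.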

The curves below show the training results. Training lasted ~37 hours.

Model Card Authors

Safouane El Ghazouali ([email protected])

Downloads last month: 13

Safetensors
Model size: 20.9B params
Tensor type: BF16

Model tree for safouaneelg/gpt-oss-20b_DPO_ultrafeedback
Base model: openai/gpt-oss-20b