safouaneelg/gpt-oss-20b_DPO_ultrafeedback
This is the full-precision model.
If you prefer the LoRA adapter only, see my other repo: safouaneelg/gpt-oss-20b_DPO_ultrafeedback-lora
Model Description
This is a Direct Preference Optimization (DPO) fine-tuned version of the openai/gpt-oss-20b
base model, aligned on the argilla/ultrafeedback-binarized-preferences-cleaned
dataset (loaded in streaming mode). The fine-tuning uses LoRA adapters; this repository contains the full model weights. The model retains the causal language modeling capabilities of the base while improving alignment to human preferences.
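For background, DPO optimizes the policy directly on binarized preference pairs using the standard objective (notation is the usual one from the DPO paper, shown here only for reference):

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} \;-\; \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}\right)\right]
$$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen base model, and $\beta$ is the DPO temperature (0.1 in this run; see the hyperparameter table below).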
Compute Infrastructure: Training was conducted on a Linux server with 3x NVIDIA A6000 GPUs (48 GB each) using the following frameworks:
transformers==4.56.2
trl==0.21.0
peft==0.17.1
torch==2.8.0+cu128
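To check that your environment matches these pins (a quick sanity check, not required):

import transformers, trl, peft, torch
# Print installed versions; they should match the pins listed above.
print(transformers.__version__, trl.__version__, peft.__version__, torch.__version__)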
Loading and Inference with Transformers (Full Precision)
Run the code below to use this model.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "safouaneelg/gpt-oss-20b_DPO_ultrafeedback"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",          # shard across available GPUs
    trust_remote_code=True,
)

# The model is already loaded and sharded, so the pipeline reuses it as-is
# (no need to pass torch_dtype/device_map again).
generator = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
)

prompt = """
Can you write a C++ program that prompts the user to enter the name of a country and checks if it borders the Mediterranean Sea? Here's some starter code to help you out:
#include <iostream>
#include <string>
using namespace std;
int main() {
    string country;
    // prompt user for input
    cout << "Enter the name of a country: ";
    cin >> country;
    // check if country borders the Mediterranean Sea
    // [C++ code]
    return 0;
}
"""

outputs = generator(
    prompt,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])
Local Run Tips:
- Requires ~40 GB VRAM (multi-GPU recommended; use device_map="auto" for sharding).
- Chat-style: wrap prompts with tokenizer.apply_chat_template([{"role": "user", "content": prompt}]), as sketched below.
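A minimal sketch of the chat-style path, assuming the tokenizer ships the base model's chat template (the user message here is only an example):

messages = [{"role": "user", "content": "Name three countries that border the Mediterranean Sea."}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant-turn marker before generating
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))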
Training details
Training data & results
The model was fine-tuned on the argilla/ultrafeedback-binarized-preferences-cleaned
dataset, a cleaned subset of UltraFeedback containing ~60k binarized preference pairs (prompt, chosen response, rejected response) for alignment. Preprocessing: filtered for length (>20 chars, 10-512 tokens) and formatted with chat templates. The full dataset card is available on Hugging Face.
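As an illustration only (the exact preprocessing script is not reproduced here), a length filter along the lines described above could look like this; the prompt column name follows the dataset card, the bounds mirror the numbers quoted, and the tokenizer is the one loaded earlier:

from datasets import load_dataset

ds = load_dataset("argilla/ultrafeedback-binarized-preferences-cleaned", split="train")

def keep(example):
    # Hypothetical reconstruction: drop very short prompts and keep
    # only prompts whose token length falls in the 10-512 range.
    prompt = example["prompt"]
    n_tokens = len(tokenizer(prompt)["input_ids"])
    return len(prompt) > 20 and 10 <= n_tokens <= 512

ds = ds.filter(keep)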
Training Hyperparameters:

| Parameter | Value |
|---|---|
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Bias | none |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Per-device batch size | 1 |
| Gradient accumulation steps | 16 |
| Learning rate | 5e-6 |
| Number of epochs | 1 |
| Warmup ratio | 0.1 |
| Beta (DPO) | 0.1 |
| Max sequence length | 512 |
| Optimizer | adamw_torch |
| LR scheduler | cosine |
| Weight decay | 0.01 |
| Max grad norm | 1.0 |
| Gradient checkpointing | True |
| BF16 | True |
| Seed | 42 |
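For readability, here is a sketch of how these settings map onto peft and trl configuration objects under the pinned versions above (a reconstruction, not the exact training script; output_dir is a placeholder):

from peft import LoraConfig
from trl import DPOConfig

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    bias="none",
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)

training_args = DPOConfig(
    output_dir="gpt-oss-20b-dpo",  # placeholder
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=5e-6,
    num_train_epochs=1,
    warmup_ratio=0.1,
    beta=0.1,                      # DPO temperature
    max_length=512,
    optim="adamw_torch",
    lr_scheduler_type="cosine",
    weight_decay=0.01,
    max_grad_norm=1.0,
    gradient_checkpointing=True,
    bf16=True,
    seed=42,
)

Both objects would then be passed to trl's DPOTrainer together with the base model and the filtered preference dataset.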
Below are the resulting training curves. The training lasted ~37 hours.
Model Card Authors
Safouane El Ghazouali ([email protected])
Model tree for safouaneelg/gpt-oss-20b_DPO_ultrafeedback
Base model: openai/gpt-oss-20b