Llama-3.2-1B DPO PairRM

This model is a LoRA adapter for meta-llama/Llama-3.2-1B-Instruct, fine-tuned with Direct Preference Optimization (DPO) on preference pairs judged by PairRM.

Model Details

  • Base Model: meta-llama/Llama-3.2-1B-Instruct
  • Training Method: Direct Preference Optimization (DPO)
  • Preference Source: PairRM
  • LoRA Configuration (see the config sketch after this list):
    • r: 8
    • alpha: 16
    • target_modules: ['q_proj', 'k_proj', 'v_proj', 'o_proj']
  • Training Steps: 250
  • Learning Rate: 0.0002
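
For reference, the adapter configuration above corresponds roughly to the following peft LoraConfig. This is a minimal sketch: the dropout value and task type are assumptions, not settings taken from the training run.

from peft import LoraConfig

lora_config = LoraConfig(
    r=8,                          # rank, as listed above
    lora_alpha=16,                # alpha, as listed above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,            # assumption: dropout is not stated in the model details
    task_type="CAUSAL_LM",        # assumption: standard setting for causal LM adapters
)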

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach this repository's LoRA adapter on top of it
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-pairrm")

# The adapter uses the base model's tokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
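
A minimal generation example with the adapter loaded as above; the prompt and generation settings are illustrative only:

# Build a chat-formatted prompt and generate a reply
messages = [{"role": "user", "content": "Explain LoRA fine-tuning in one sentence."}]
input_ids = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
output_ids = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))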

Training Details

  • Dataset: 50 instructions from LIMA
  • Responses per instruction: 5
  • Preference judgment: PairRM
  • Training framework: TRL DPOTrainer (see the sketch below)
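
The pipeline above can be sketched as follows: PairRM (via the llm-blender package) ranks the five candidate responses per instruction, the top- and bottom-ranked responses become the chosen/rejected pair, and TRL's DPOTrainer fine-tunes the LoRA adapter. The variable names `instructions` and `candidates`, the batch size, and the exact DPOTrainer keyword arguments (which differ between TRL versions) are assumptions; the learning rate, step count, and LoRA settings come from the details above.

import llm_blender
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

# 1) Rank the 5 candidate responses per instruction with PairRM.
#    `instructions` (50 LIMA prompts) and `candidates` (5 responses each) are assumed to exist.
blender = llm_blender.Blender()
blender.loadranker("llm-blender/PairRM")
ranks = blender.rank(instructions, candidates)  # rank 1 = most preferred

# 2) Keep the best-ranked response as "chosen" and the worst as "rejected".
train_dataset = Dataset.from_dict({
    "prompt": instructions,
    "chosen": [c[list(r).index(min(r))] for c, r in zip(candidates, ranks)],
    "rejected": [c[list(r).index(max(r))] for c, r in zip(candidates, ranks)],
})

# 3) Run DPO with TRL, training a LoRA adapter on top of the base model.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
peft_config = LoraConfig(r=8, lora_alpha=16,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                         task_type="CAUSAL_LM")
args = DPOConfig(output_dir="llama3-dpo-pairrm",
                 max_steps=250, learning_rate=2e-4,
                 per_device_train_batch_size=1)  # batch size is an assumption
trainer = DPOTrainer(model=model, args=args, train_dataset=train_dataset,
                     processing_class=tokenizer,  # older TRL releases use tokenizer= instead
                     peft_config=peft_config)
trainer.train()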

Performance

See evaluation results in the repository for detailed performance metrics.
