# Llama-3.2-1B DPO PairRM
This model is a LoRA fine-tune of meta-llama/Llama-3.2-1B-Instruct, trained with Direct Preference Optimization (DPO) on preference pairs judged by PairRM.
## Model Details
- Base Model: meta-llama/Llama-3.2-1B-Instruct
- Training Method: Direct Preference Optimization (DPO)
- Preference Source: PairRM
- LoRA Configuration (see the sketch below):
  - r: 8
  - alpha: 16
  - target_modules: `['q_proj', 'k_proj', 'v_proj', 'o_proj']`
- Training Steps: 250
- Learning Rate: 0.0002
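
For reference, a minimal PEFT configuration matching the hyperparameters above might look like the following. This is an illustrative sketch, not the exact training script; dropout, bias handling, and task type are assumptions not stated in this card.

```python
from peft import LoraConfig

# Hypothetical reconstruction of the adapter config from the values listed above.
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,   # assumption: not stated in the card
    bias="none",         # assumption
    task_type="CAUSAL_LM",
)
```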
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the DPO-trained LoRA adapter from this repo.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
model = PeftModel.from_pretrained(base_model, "pyamy/llama3-dpo-pairrm")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
```
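
A short generation example using the base model's chat template; the prompt and generation settings are illustrative.

```python
# Illustrative prompt; any instruction-style input works.
messages = [{"role": "user", "content": "Explain what Direct Preference Optimization does in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids=input_ids, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens.
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```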
## Training Details
- Dataset: 50 instructions from LIMA
- Responses per instruction: 5
- Preference judgment: PairRM
- Training framework: TRL `DPOTrainer` (see the sketch below)
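
The training script is not included in this card. The following is a minimal sketch of how PairRM-judged pairs could be fed to TRL's `DPOTrainer` under the settings above; the example data, `beta`, batch size, and the exact argument names (which differ across TRL versions) are assumptions.

```python
from datasets import Dataset
from trl import DPOConfig, DPOTrainer

# Hypothetical preference data: for each LIMA instruction, the PairRM-preferred
# response becomes "chosen" and a lower-ranked one becomes "rejected".
preference_data = Dataset.from_dict({
    "prompt":   ["Write a haiku about the sea."],
    "chosen":   ["Grey waves fold on sand / ..."],
    "rejected": ["The sea is big and wet."],
})

training_args = DPOConfig(
    output_dir="llama3-dpo-pairrm",
    max_steps=250,                   # from the card
    learning_rate=2e-4,              # from the card
    beta=0.1,                        # assumption: common DPO default
    per_device_train_batch_size=2,   # assumption
)

trainer = DPOTrainer(
    model=base_model,              # base model loaded as in the Usage section
    args=training_args,
    train_dataset=preference_data,
    processing_class=tokenizer,    # `tokenizer=` in older TRL releases
    peft_config=lora_config,       # LoRA config sketched above
)
trainer.train()
```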
## Performance
Detailed evaluation results are provided in the repository.