opt-125m-cluster-v2

This model is a fine-tuned version of facebook/opt-125m, trained on a mixed dataset consisting of OpenWebText, WikiText, and BookCorpus. It was trained on a single GPU (Quadro RTX 8000, 48GB VRAM) using Hugging Face Transformers and PyTorch.
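
A minimal usage sketch (assuming the published repo id mahojo/opt-125m-cluster-v2; the prompt and generation settings are illustrative, not from the training setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mahojo/opt-125m-cluster-v2")
model = AutoModelForCausalLM.from_pretrained("mahojo/opt-125m-cluster-v2")

# Plain next-token generation; the model is not instruction-tuned.
inputs = tokenizer("The history of the printing press", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```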

πŸ“ˆ Evaluation Results

  • Final Training Loss: 2.9084
  • Final Perplexity (Eval): 19.10
  • Evaluation Steps: Every 5,000 training steps
  • Total Training Steps: 50,000

🧠 Model Description

This model was fine-tuned to reduce perplexity on general English text using causal language modeling (next-token prediction). Training covered 1 million samples at a sequence length of 1024, optimized with AdamW and a cosine learning-rate schedule.
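
The objective described above is the standard causal-LM setup in Transformers; a minimal sketch of how it is typically wired up (not the author's exact script):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Start from the base checkpoint named above.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# mlm=False selects the causal (next-token prediction) objective:
# labels are the input ids, shifted inside the model's loss computation.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```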

βœ… Intended Uses & Limitations

Intended uses:

  • Perplexity benchmarking
  • Research on training dynamics and convergence
  • Fine-tuning base for instruction tuning or domain adaptation

Limitations:

  • Not instruction-tuned
  • Not aligned for safe deployment
  • May reflect biases from internet text

πŸ“Š Training & Evaluation Data

A shuffled dataset combining:

  • 60% OpenWebText
  • 30% WikiText
  • 10% BookCorpus

All data was pre-tokenized using the OPT tokenizer and capped at 1024 tokens per sample.
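
One way such a 60/30/10 mixture can be built with the datasets library is sketched below; the specific hub datasets and configs (openwebtext, wikitext-103-raw-v1, bookcorpus) are assumptions, since the card does not name exact versions:

```python
from datasets import load_dataset, interleave_datasets
from transformers import AutoTokenizer

# Hub dataset names/configs are assumptions for illustration.
owt = load_dataset("openwebtext", split="train", streaming=True)
wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
books = load_dataset("bookcorpus", split="train", streaming=True)

# Sample according to the 60/30/10 mix described above.
mixed = interleave_datasets([owt, wiki, books], probabilities=[0.6, 0.3, 0.1], seed=42)

# Pre-tokenize with the OPT tokenizer, capping each sample at 1024 tokens.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
tokenized = mixed.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)
```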

βš™οΈ Training Procedure

  • Batch size: 5 per device (effective batch size 40 via gradient_accumulation_steps=8)
  • Learning rate: 2e-4
  • Optimizer: AdamW with betas (0.9, 0.999), eps 1e-8
  • LR scheduler: Cosine decay with 1,000 warmup steps
  • Precision: Mixed (fp16 with AMP)
  • Steps: 50,000
  • Framework: Transformers 4.49.0, PyTorch 2.6.0
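
The settings above map onto Transformers TrainingArguments roughly as follows (a reconstruction from the listed values, not the original launch script):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="opt-125m-cluster-v2",
    per_device_train_batch_size=5,
    per_device_eval_batch_size=3,
    gradient_accumulation_steps=8,   # effective batch size 40
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    max_steps=50_000,
    fp16=True,                       # native AMP mixed precision
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    eval_strategy="steps",
    eval_steps=5_000,
    seed=42,
)
```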

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 5
  • eval_batch_size: 3
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 40
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 1000
  • training_steps: 50000
  • mixed_precision_training: Native AMP

Training results

πŸ“Š Training Results

| Steps | Perplexity | Cross-Entropy Loss |
|-------|------------|--------------------|
| 5k    | 24.07      | 3.1811             |
| 10k   | 23.28      | 3.1476             |
| 15k   | 22.44      | 3.1110             |
| 20k   | 21.63      | 3.0742             |
| 25k   | 20.97      | 3.0432             |
| 30k   | 20.33      | 3.0121             |
| 35k   | 19.73      | 2.9819             |
| 40k   | 19.32      | 2.9611             |
| 45k   | 19.11      | 2.9500             |
| 50k   | 19.10      | 2.9498             |
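
The perplexity column is simply the exponential of the eval cross-entropy loss, e.g. for the final checkpoint:

```python
import math

eval_loss = 2.9498          # final eval cross-entropy from the table above
print(math.exp(eval_loss))  # ~19.10, matching the reported perplexity
```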

Framework versions

  • Transformers 4.49.0
  • PyTorch 2.6.0+cu124
  • Datasets 3.3.2
  • Tokenizers 0.21.1