opt-125m-cluster-v2

This model is a fine-tuned version of facebook/opt-125m, trained on a mixed dataset consisting of OpenWebText, WikiText, and BookCorpus. It was trained on a single GPU (Quadro RTX 8000, 48GB VRAM) using Hugging Face Transformers and PyTorch.
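
A minimal usage sketch (assuming the published repo id mahojo/opt-125m-cluster-v2; the prompt and generation settings are illustrative, not from the training setup):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mahojo/opt-125m-cluster-v2")
model = AutoModelForCausalLM.from_pretrained("mahojo/opt-125m-cluster-v2")

# Plain next-token generation; the model is not instruction-tuned.
inputs = tokenizer("The history of the printing press", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```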

πŸ“ˆ Evaluation Results

  • Final Training Loss: 2.9084
  • Final Perplexity (Eval): 19.10
  • Evaluation Steps: Every 5,000 training steps
  • Total Training Steps: 50,000

🧠 Model Description

This model was fine-tuned to reduce perplexity on general English text using causal language modeling (next-token prediction). Training covered 1 million samples at a sequence length of 1024, optimized with AdamW and a cosine learning-rate schedule.
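
The objective described above is the standard causal-LM setup in Transformers; a minimal sketch of how it is typically wired up (not the author's exact script):

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
)

# Start from the base checkpoint named above.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
model = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")

# mlm=False selects the causal (next-token prediction) objective:
# labels are the input ids, shifted inside the model's loss computation.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)
```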

βœ… Intended Uses & Limitations

Intended uses:

  • Perplexity benchmarking
  • Research on training dynamics and convergence
  • Fine-tuning base for instruction tuning or domain adaptation

Limitations:

  • Not instruction-tuned
  • Not aligned for safe deployment
  • May reflect biases from internet text

πŸ“Š Training & Evaluation Data

A shuffled dataset combining:

  • 60% OpenWebText
  • 30% WikiText
  • 10% BookCorpus

All data was pre-tokenized using the OPT tokenizer and capped at 1024 tokens per sample.
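
One way such a 60/30/10 mixture can be built with the datasets library is sketched below; the specific hub datasets and configs (openwebtext, wikitext-103-raw-v1, bookcorpus) are assumptions, since the card does not name exact versions:

```python
from datasets import load_dataset, interleave_datasets
from transformers import AutoTokenizer

# Hub dataset names/configs are assumptions for illustration.
owt = load_dataset("openwebtext", split="train", streaming=True)
wiki = load_dataset("wikitext", "wikitext-103-raw-v1", split="train", streaming=True)
books = load_dataset("bookcorpus", split="train", streaming=True)

# Sample according to the 60/30/10 mix described above.
mixed = interleave_datasets([owt, wiki, books], probabilities=[0.6, 0.3, 0.1], seed=42)

# Pre-tokenize with the OPT tokenizer, capping each sample at 1024 tokens.
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-125m")
tokenized = mixed.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True,
    remove_columns=["text"],
)
```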

βš™οΈ Training Procedure

  • Batch size: 5 per device (effective batch size 40 via gradient_accumulation_steps=8)
  • Learning rate: 2e-4
  • Optimizer: AdamW with betas (0.9, 0.999), eps 1e-8
  • LR scheduler: Cosine decay with 1,000 warmup steps
  • Precision: Mixed (fp16 with AMP)
  • Steps: 50,000
  • Framework: Transformers 4.49.0, PyTorch 2.6.0
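
The settings above map onto Transformers TrainingArguments roughly as follows (a reconstruction from the listed values, not the original launch script):

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="opt-125m-cluster-v2",
    per_device_train_batch_size=5,
    per_device_eval_batch_size=3,
    gradient_accumulation_steps=8,   # effective batch size 40
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    max_steps=50_000,
    fp16=True,                       # native AMP mixed precision
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    eval_strategy="steps",
    eval_steps=5_000,
    seed=42,
)
```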

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 0.0002
  • train_batch_size: 5
  • eval_batch_size: 3
  • seed: 42
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 40
  • optimizer: AdamW (adamw_torch) with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 1000
  • training_steps: 50000
  • mixed_precision_training: Native AMP

Training results

πŸ“Š Training Results

| Steps | Perplexity | Cross-Entropy Loss |
|-------|------------|--------------------|
| 5k    | 24.07      | 3.1811             |
| 10k   | 23.28      | 3.1476             |
| 15k   | 22.44      | 3.1110             |
| 20k   | 21.63      | 3.0742             |
| 25k   | 20.97      | 3.0432             |
| 30k   | 20.33      | 3.0121             |
| 35k   | 19.73      | 2.9819             |
| 40k   | 19.32      | 2.9611             |
| 45k   | 19.11      | 2.9500             |
| 50k   | 19.10      | 2.9498             |
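
The perplexity column is simply the exponential of the eval cross-entropy loss, e.g. for the final checkpoint:

```python
import math

eval_loss = 2.9498          # final eval cross-entropy from the table above
print(math.exp(eval_loss))  # ~19.10, matching the reported perplexity
```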

Framework versions

  • Transformers 4.49.0
  • PyTorch 2.6.0+cu124
  • Datasets 3.3.2
  • Tokenizers 0.21.1