See axolotl config
axolotl version: 0.13.0.dev0
# llama-8B-training.yaml
# =========================
# Model Configuration
# =========================
base_model: meta-llama/Llama-3.1-8B-Instruct
load_in_4bit: true # Use 4-bit quantization (saves VRAM on smaller GPUs like A100 40GB or L4)
adapter: qlora
bnb_4bit_use_double_quant: true # recommended for stability
bnb_4bit_quant_type: nf4
bnb_4bit_compute_dtype: bfloat16 # compute in bf16
trust_remote_code: true # Allow loading models with custom HF code
tokenizer_name: meta-llama/Llama-3.1-8B-Instruct
tokenizer_use_fast: true # Faster tokenization
# =========================
# Dataset Configuration
# =========================
datasets:
  - path: Ivoyant/attr-mappings-training-v2
    split: train
    type: chat_template
    chat_template: llama3 # Use built-in Llama 3 chat template
    field_messages: conversations # Column containing conversation array
    # Optional: Control which roles to train on (default: assistant only)
    roles_to_train: ["assistant"]
    # Optional: Control EOS token training
    train_on_eos: turn # Options: "turn", "all", "last"
# val_set_size: 0.1
test_datasets:
  - path: Ivoyant/attr-mappings-training-v2
    split: validation
    type: chat_template
    chat_template: llama3 # Use built-in Llama 3 chat template
    field_messages: conversations # Column containing conversation array
    # Optional: Control which roles to train on (default: assistant only)
    roles_to_train: ["assistant"]
    # Optional: Control EOS token training
    train_on_eos: turn # Options: "turn", "all", "last"
seed: 42 # Ensures reproducible splits
dataset_prepared_path: /workspace/data/prepared_dataset_v2
# =========================
# LoRA Configuration
# =========================
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules:
  - q_proj
  - v_proj
  - k_proj
  - o_proj
  - gate_proj
  - up_proj
  - down_proj
lora_fan_in_fan_out: false
# =========================
# Training Configuration
# =========================
micro_batch_size: 2
gradient_accumulation_steps: 8 # effective batch size of 16 (micro_batch_size 2 × 8)
learning_rate: 5e-5 # standard LoRA LR
num_epochs: 8
lr_scheduler: cosine # smooth decay
warmup_steps: 100 # Add warmup for stability
save_strategy: steps
save_steps: 500
# saves_per_epoch: 1
# evals_per_epoch: 1
eval_strategy: steps # Evaluate more frequently
eval_steps: 50
save_total_limit: 3 # Keep more checkpoints for experimentation
bf16: true # A40 supports BF16
fp16: false # don't mix with bf16
optim: adamw_torch
gradient_checkpointing: true # saves VRAM at cost of compute
max_grad_norm: 1.0
weight_decay: 0.01
dataloader_num_workers: 2
# =========================
# Sequence Configuration
# =========================
sequence_len: 768
sample_packing: true
pad_to_sequence_len: true
special_tokens:
  pad_token: "<|eot_id|>"
  eos_token: "<|eot_id|>"
# =========================
# Output & Logging Configuration
# =========================
output_dir: /workspace/data/outputs/lora-llama-8b-activity-mappings_v2
logging_steps: 50
use_tensorboard: true
logging_strategy: steps
# =========================
# Memory & Performance Optimization
# =========================
dataloader_pin_memory: true # usually improves dataloader throughput unless CPU RAM is constrained
remove_unused_columns: true # drop dataset columns the model's forward pass does not use
# Early stopping for efficiency
early_stopping_patience: 3
load_best_model_at_end: true
metric_for_best_model: eval_loss
greater_is_better: false
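The quantization and adapter settings above map onto the Hugging Face transformers/peft APIs roughly as sketched below. This is a minimal illustration, not the code axolotl actually runs (a config like this is normally launched with the axolotl CLI, e.g. `axolotl train llama-8B-training.yaml`, with the exact entry point depending on the installed version); the model id, target modules, and LoRA hyperparameters are copied from the config, everything else is illustrative.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization, mirroring the bnb_4bit_* settings in the config
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct", use_fast=True
)

# QLoRA adapter matching lora_r / lora_alpha / lora_dropout / lora_target_modules
model = prepare_model_for_kbit_training(model)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```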
workspace/data/outputs/lora-llama-8b-activity-mappings_v2
This model is a fine-tuned version of meta-llama/Llama-3.1-8B-Instruct on the Ivoyant/attr-mappings-training-v2 dataset. It achieves the following results on the evaluation set:
- Loss: 0.0405
- Memory/max Active (GiB): 7.8
- Memory/max Allocated (GiB): 7.8
- Memory/device Reserved (GiB): 9.25
Model description
More information needed
Intended uses & limitations
More information needed
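As a usage sketch only (not supplied by the model authors), the LoRA adapter can be loaded on top of the base Instruct model with peft and queried through the Llama 3 chat template. The adapter repository id is taken from the model tree at the end of this card; the prompt content is a placeholder.

```python
import torch
from peft import AutoPeftModelForCausalLM
from transformers import AutoTokenizer

adapter_id = "Ivoyant/attr-mappings-llama-3.1-8b-lora-r16-v2"  # adapter repo (see model tree below)

model = AutoPeftModelForCausalLM.from_pretrained(
    adapter_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Placeholder conversation; real prompts depend on the attribute-mapping task format
messages = [{"role": "user", "content": "..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```

For deployment without peft at inference time, the adapter weights can also be folded into the base model via `merge_and_unload()` and saved with `save_pretrained`.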
Training and evaluation data
More information needed
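Going by the dataset settings in the config (`type: chat_template`, `field_messages: conversations`), each record is expected to provide a `conversations` list of role/content turns, with the training loss restricted to assistant turns. The record below is purely illustrative of that shape; the actual contents of Ivoyant/attr-mappings-training-v2 are not reproduced here.

```python
# Hypothetical record shape inferred from the chat_template settings above;
# the role/content key names follow axolotl's defaults and are an assumption.
example_record = {
    "conversations": [
        {"role": "system", "content": "..."},
        {"role": "user", "content": "..."},
        {"role": "assistant", "content": "..."},
    ]
}
```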
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 2
- eval_batch_size: 2
- seed: 42
- gradient_accumulation_steps: 8
- total_train_batch_size: 16
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08; no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 1920
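For clarity: total_train_batch_size = train_batch_size × gradient_accumulation_steps × number of devices = 2 × 8 × 1 = 16, and the 1920 training steps correspond to roughly 240 optimizer steps per epoch over the 8 configured epochs (consistent with the step/epoch pairs in the table below).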
Training results
| Training Loss | Epoch | Step | Validation Loss | Active (GiB) | Allocated (GiB) | Reserved (GiB) |
|---|---|---|---|---|---|---|
| No log | 0 | 0 | 1.6090 | 7.48 | 7.48 | 9.71 |
| 1.2574 | 0.2082 | 50 | 0.5494 | 7.8 | 7.8 | 9.52 |
| 0.3452 | 0.4164 | 100 | 0.1971 | 7.8 | 7.8 | 9.25 |
| 0.1527 | 0.6247 | 150 | 0.1101 | 7.8 | 7.8 | 9.25 |
| 0.0954 | 0.8329 | 200 | 0.0801 | 7.8 | 7.8 | 9.25 |
| 0.0723 | 1.0375 | 250 | 0.0698 | 7.8 | 7.8 | 9.25 |
| 0.06 | 1.2457 | 300 | 0.0665 | 7.8 | 7.8 | 9.25 |
| 0.0529 | 1.4539 | 350 | 0.0555 | 7.8 | 7.8 | 9.25 |
| 0.0452 | 1.6622 | 400 | 0.0524 | 7.8 | 7.8 | 9.25 |
| 0.0456 | 1.8704 | 450 | 0.0470 | 7.8 | 7.8 | 9.25 |
| 0.0417 | 2.0750 | 500 | 0.0422 | 7.8 | 7.8 | 9.25 |
| 0.0283 | 2.2832 | 550 | 0.0418 | 7.8 | 7.8 | 9.25 |
| 0.0318 | 2.4914 | 600 | 0.0417 | 7.8 | 7.8 | 9.25 |
| 0.0328 | 2.6996 | 650 | 0.0426 | 7.8 | 7.8 | 9.25 |
| 0.0269 | 2.9079 | 700 | 0.0396 | 7.8 | 7.8 | 9.25 |
| 0.0247 | 3.1124 | 750 | 0.0413 | 7.8 | 7.8 | 9.25 |
| 0.0199 | 3.3207 | 800 | 0.0405 | 7.8 | 7.8 | 9.25 |
| 0.0207 | 3.5289 | 850 | 0.0405 | 7.8 | 7.8 | 9.25 |
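Logging stops at step 850 of the 1920 scheduled steps; with `early_stopping_patience: 3` and the best validation loss of 0.0396 reached at step 700, the three subsequent evaluations without improvement would trigger early stopping at that point.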
Framework versions
- PEFT 0.17.1
- Transformers 4.56.1
- PyTorch 2.7.1+cu126
- Datasets 4.0.0
- Tokenizers 0.22.1
Model tree for Ivoyant/attr-mappings-llama-3.1-8b-lora-r16-v2
- Base model: meta-llama/Llama-3.1-8B
- Finetuned: meta-llama/Llama-3.1-8B-Instruct