See axolotl config

axolotl version: 0.9.2

```yaml
# Name 0615-sft_info_wc_multi_attrs-qwen3_8b_base
# axolotl train red_team_agent/run/t0615/sft_info_wc_multi_attrs-qwen3_8b_base.yaml
base_model: Qwen/Qwen3-8B-Base
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: false
# --- Dataset Configuration ---
datasets:
  - path: nate-rahn/0615-wc_multi_attrs_info_sft_dset
    type: chat_template # Use the chat_template processing strategy
    # --- Template & Role Mapping ---
    chat_template: chatml # Use the built-in ChatML template
    field_messages: messages # The dataset stores each conversation under a "messages" key (see the record sketch after this config)
    message_property_mappings: # Each message dict carries "role" and "content" keys
      role: role
      content: content
    roles: # Map the role names in the data onto axolotl's internal roles
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
    # --- Training Target ---
    roles_to_train: ["assistant"]
    train_on_eos: turn # Train on the EOS token at the end of each trained (assistant) turn
dataset_prepared_path: /home/ubuntu/out/red-team-agent/data/last_run_prepared
# --- Training Hyperparameters ---
sequence_len: 2048 # Adjust based on your dataset and GPU memory
sample_packing: true # Pack multiple sequences into one example for efficiency
eval_sample_packing: false
pad_to_sequence_len: true # Pad sequences to sequence_len
# Full Parameter Finetuning (No adapter specified)
# adapter: # This is intentionally left blank/removed for full finetuning
# Performance & Precision (H100s excel with bf16)
bf16: true
tf32: true
flash_attention: true # for qwen
# Batching (Adjust based on GPU memory)
# Effective global batch size = micro_batch_size * gradient_accumulation_steps * num_gpus
# Here: 2 * 32 * 8 = 512; start lower for full finetuning if memory is tight (e.g. 1 * 16 * 4 = 64)
micro_batch_size: 2
gradient_accumulation_steps: 32
eval_batch_size: 16 # Can often be slightly higher than micro_batch_size
# Optimizer & Scheduler
optimizer: adamw_torch_fused # Good choice for newer GPUs
learning_rate: 1e-5 # Common starting point for full SFT
weight_decay: 0.01
lr_scheduler: cosine # Standard scheduler
warmup_steps: 50
max_grad_norm: 1.0
# Training Duration & Evaluation/Saving
num_epochs: 7 # Adjust as needed, start with 1-3 for SFT
val_set_size: 0.005
logging_steps: 1
evals_per_epoch: 5
saves_per_epoch: 1 # Save once per epoch (adjust based on dataset size)
save_total_limit: 1 # Keep only the most recent checkpoint
# Memory Saving
gradient_checkpointing: true # Essential for full finetuning
gradient_checkpointing_kwargs:
  use_reentrant: false # Prefer non-reentrant checkpointing if possible
# --- FSDP Configuration (multi-GPU H100 training) ---
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false # Should not be needed with H100 VRAM
  fsdp_sync_module_states: true # Important for correctness
  fsdp_use_orig_params: false # Recommended for memory savings with FSDP
  fsdp_state_dict_type: SHARDED_STATE_DICT # FULL_STATE_DICT or SHARDED_STATE_DICT (sharded avoids gathering the full model on one rank at save time)
  # fsdp_transformer_layer_cls_to_wrap: 'Gemma3DecoderLayer'
  fsdp_transformer_layer_cls_to_wrap: 'Qwen3DecoderLayer'
  # fsdp_activation_checkpointing: true # Alternative way to enable activation checkpointing under FSDP
# --- Special Tokens ---
# Define based on the chat template's turn terminator; ChatML (and Qwen) use <|im_end|>
special_tokens:
  eos_token: "<|im_end|>"
  # eos_token: "<end_of_turn>"
# --- Logging & Saving ---
output_dir: /home/ubuntu/out/red-team-agent/runs/0615-sft_info_wc_multi_attrs-qwen3_8b_base # Local output directory
# W&B Logging
wandb_project: "red-team-agent" # Name your W&B project
wandb_entity: "nate" # IMPORTANT: Replace with your W&B username or team name
wandb_name: "0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs" # Descriptive run name
# wandb_log_model: "checkpoint" # Log model checkpoints to W&B Artifacts
# Hugging Face Hub Upload
hub_model_id: "nate-rahn/0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs" # IMPORTANT: Replace with your desired HF repo ID
hub_strategy: "end" # Push checkpoints to the Hub (`"end"` pushes only the final model)
hf_use_auth_token: true # Required for pushing to the Hub (ensure you're logged in)
# --- Misc ---
seed: 42
```
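For reference, a minimal sketch of the record shape this dataset configuration expects: each row carries a `messages` list of role/content dicts, and with `roles_to_train: ["assistant"]` plus `train_on_eos: turn`, only the assistant turns and their closing `<|im_end|>` tokens are unmasked in the labels. The conversation below is hypothetical, not taken from the actual dataset.

```python
# Hypothetical example record, matching field_messages / message_property_mappings
# in the config above (the real dataset contents are not shown here).
example_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "List the main attributes covered in the report."},
        {"role": "assistant", "content": "The report covers three attributes: ..."},
    ]
}

# chat_template: chatml renders each turn roughly as
#   <|im_start|>{role}\n{content}<|im_end|>\n
def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(render_chatml(example_record["messages"]))
```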
0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs
This model is a fine-tuned version of Qwen/Qwen3-8B-Base on the nate-rahn/0615-wc_multi_attrs_info_sft_dset dataset. It achieves the following results on the evaluation set:
- Loss: 1.7348
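Assuming this is the mean token-level cross-entropy, the loss corresponds to a perplexity of roughly exp(1.7348) ≈ 5.67 on the held-out split:

```python
import math

eval_loss = 1.7348  # final validation loss from the table below
print(f"perplexity ≈ {math.exp(eval_loss):.2f}")  # ≈ 5.67
```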
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 32
- total_train_batch_size: 512
- total_eval_batch_size: 128
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- num_epochs: 7.0
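The derived totals above follow directly from the per-device settings; a quick check of the arithmetic:

```python
micro_batch_size = 2               # train_batch_size per device
gradient_accumulation_steps = 32
num_devices = 8
eval_batch_size = 16               # per device

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size)  # 512
print(total_eval_batch_size)   # 128
```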
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
3.491 | 0.0120 | 1 | 3.5178 |
2.7799 | 0.2034 | 17 | 2.7559 |
2.3441 | 0.4067 | 34 | 2.3219 |
2.1435 | 0.6101 | 51 | 2.1353 |
1.973 | 0.8135 | 68 | 1.9836 |
1.9045 | 1.0239 | 85 | 1.9076 |
1.8501 | 1.2273 | 102 | 1.8612 |
1.8185 | 1.4307 | 119 | 1.8336 |
1.7946 | 1.6340 | 136 | 1.8113 |
1.7609 | 1.8374 | 153 | 1.7963 |
1.7365 | 2.0479 | 170 | 1.7824 |
1.7313 | 2.2512 | 187 | 1.7742 |
1.6962 | 2.4546 | 204 | 1.7668 |
1.702 | 2.6579 | 221 | 1.7558 |
1.6764 | 2.8613 | 238 | 1.7509 |
1.6688 | 3.0718 | 255 | 1.7453 |
1.6398 | 3.2751 | 272 | 1.7431 |
1.6158 | 3.4785 | 289 | 1.7399 |
1.6124 | 3.6819 | 306 | 1.7426 |
1.6182 | 3.8852 | 323 | 1.7356 |
1.6039 | 4.0957 | 340 | 1.7307 |
1.5667 | 4.2991 | 357 | 1.7386 |
1.5722 | 4.5024 | 374 | 1.7351 |
1.5989 | 4.7058 | 391 | 1.7323 |
1.5634 | 4.9092 | 408 | 1.7307 |
1.5628 | 5.1196 | 425 | 1.7321 |
1.5357 | 5.3230 | 442 | 1.7395 |
1.5577 | 5.5264 | 459 | 1.7363 |
1.556 | 5.7297 | 476 | 1.7299 |
1.5365 | 5.9331 | 493 | 1.7327 |
1.541 | 6.1436 | 510 | 1.7345 |
1.5164 | 6.3469 | 527 | 1.7430 |
1.5554 | 6.5503 | 544 | 1.7398 |
1.5391 | 6.7536 | 561 | 1.7349 |
1.5248 | 6.9570 | 578 | 1.7348 |
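Validation loss falls steeply during the first epoch and plateaus around 1.73 from roughly epoch 4 onward. A quick way to eyeball the curve, using a handful of (epoch, validation loss) pairs copied from the table above:

```python
import matplotlib.pyplot as plt

# (epoch, validation loss) pairs taken from selected rows of the table above
points = [
    (0.01, 3.5178), (0.81, 1.9836), (2.05, 1.7824),
    (4.10, 1.7307), (5.93, 1.7327), (6.96, 1.7348),
]
epochs, val_losses = zip(*points)

plt.plot(epochs, val_losses, marker="o")
plt.xlabel("epoch")
plt.ylabel("validation loss")
plt.title("Validation loss vs. epoch")
plt.show()
```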
Framework versions
- Transformers 4.51.3
- PyTorch 2.6.0+cu126
- Datasets 3.5.1
- Tokenizers 0.21.1
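The card ships without a usage snippet; below is a minimal inference sketch, assuming the pushed tokenizer carries the ChatML chat template and `<|im_end|>` EOS token configured above (if not, the ChatML prompt would need to be built manually):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nate-rahn/0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical prompt; the intended use case is not documented on this card.
messages = [{"role": "user", "content": "Hello!"}]

# Renders the ChatML turns and appends the assistant header for generation.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```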