See axolotl config

axolotl version: 0.9.2

```yaml
# Name 0615-sft_info_wc_multi_attrs-qwen3_8b_base
# axolotl train red_team_agent/run/t0615/sft_info_wc_multi_attrs-qwen3_8b_base.yaml
base_model: Qwen/Qwen3-8B-Base
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: false
# --- Dataset Configuration ---
datasets:
  - path: nate-rahn/0615-wc_multi_attrs_info_sft_dset
    type: chat_template # Use the chat_template processing strategy
    # --- Template & Role Mapping ---
    chat_template: chatml # Use the built-in ChatML template
    field_messages: messages # The dataset stores each conversation under a "messages" key (see the record sketch after this config)
    message_property_mappings: # Each message dict carries "role" and "content" keys
      role: role
      content: content
    roles: # Map the role names in the data onto axolotl's internal roles
      user: ["user"]
      assistant: ["assistant"]
      system: ["system"]
    # --- Training Target ---
    roles_to_train: ["assistant"]
    train_on_eos: turn # Train on the EOS token at the end of each trained (assistant) turn
dataset_prepared_path: /home/ubuntu/out/red-team-agent/data/last_run_prepared
# --- Training Hyperparameters ---
sequence_len: 2048 # Adjust based on your dataset and GPU memory
sample_packing: true # Pack multiple sequences into one example for efficiency
eval_sample_packing: false
pad_to_sequence_len: true # Pad sequences to sequence_len
# Full Parameter Finetuning (No adapter specified)
# adapter: # This is intentionally left blank/removed for full finetuning
# Performance & Precision (H100s excel with bf16)
bf16: true
tf32: true
flash_attention: true # for qwen
# Batching (Adjust based on GPU memory)
# Effective global batch size = micro_batch_size * gradient_accumulation_steps * num_gpus
# Here: 2 * 32 * 8 = 512; start lower for full finetuning if memory is tight (e.g. 1 * 16 * 4 = 64)
micro_batch_size: 2
gradient_accumulation_steps: 32
eval_batch_size: 16 # Can often be slightly higher than micro_batch_size
# Optimizer & Scheduler
optimizer: adamw_torch_fused # Good choice for newer GPUs
learning_rate: 1e-5 # Common starting point for full SFT
weight_decay: 0.01
lr_scheduler: cosine # Standard scheduler
warmup_steps: 50
max_grad_norm: 1.0
# Training Duration & Evaluation/Saving
num_epochs: 7 # Adjust as needed, start with 1-3 for SFT
val_set_size: 0.005
logging_steps: 1
evals_per_epoch: 5
saves_per_epoch: 1 # Save once per epoch (adjust based on dataset size)
save_total_limit: 1 # Keep only the most recent checkpoint
# Memory Saving
gradient_checkpointing: true # Essential for full finetuning
gradient_checkpointing_kwargs:
  use_reentrant: false # Prefer non-reentrant checkpointing if possible
# --- FSDP Configuration (multi-GPU H100 training) ---
fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_offload_params: false # Should not be needed with H100 VRAM
  fsdp_sync_module_states: true # Important for correctness
  fsdp_use_orig_params: false # Recommended for memory savings with FSDP
  fsdp_state_dict_type: SHARDED_STATE_DICT # FULL_STATE_DICT or SHARDED_STATE_DICT (sharded avoids gathering the full model on one rank at save time)
  # fsdp_transformer_layer_cls_to_wrap: 'Gemma3DecoderLayer'
  fsdp_transformer_layer_cls_to_wrap: 'Qwen3DecoderLayer'
  # fsdp_activation_checkpointing: true # Alternative way to enable activation checkpointing under FSDP
# --- Special Tokens ---
# Define based on the chat template's turn terminator; ChatML (and Qwen) use <|im_end|>
special_tokens:
  eos_token: "<|im_end|>"
  # eos_token: "<end_of_turn>"
# --- Logging & Saving ---
output_dir: /home/ubuntu/out/red-team-agent/runs/0615-sft_info_wc_multi_attrs-qwen3_8b_base # Local output directory
# W&B Logging
wandb_project: "red-team-agent" # Name your W&B project
wandb_entity: "nate" # IMPORTANT: Replace with your W&B username or team name
wandb_name: "0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs" # Descriptive run name
# wandb_log_model: "checkpoint" # Log model checkpoints to W&B Artifacts
# Hugging Face Hub Upload
hub_model_id: "nate-rahn/0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs" # IMPORTANT: Replace with your desired HF repo ID
hub_strategy: "end" # Push checkpoints to the Hub (`"end"` pushes only the final model)
hf_use_auth_token: true # Required for pushing to the Hub (ensure you're logged in)
# --- Misc ---
seed: 42
```
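For reference, a minimal sketch of the record shape this dataset configuration expects: each row carries a `messages` list of role/content dicts, and with `roles_to_train: ["assistant"]` plus `train_on_eos: turn`, only the assistant turns and their closing `<|im_end|>` tokens are unmasked in the labels. The conversation below is hypothetical, not taken from the actual dataset.

```python
# Hypothetical example record, matching field_messages / message_property_mappings
# in the config above (the real dataset contents are not shown here).
example_record = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "List the main attributes covered in the report."},
        {"role": "assistant", "content": "The report covers three attributes: ..."},
    ]
}

# chat_template: chatml renders each turn roughly as
#   <|im_start|>{role}\n{content}<|im_end|>\n
def render_chatml(messages):
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

print(render_chatml(example_record["messages"]))
```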
0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs
This model is a fine-tuned version of Qwen/Qwen3-8B-Base on the nate-rahn/0615-wc_multi_attrs_info_sft_dset dataset. It achieves the following results on the evaluation set:
- Loss: 1.7348
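Assuming this is the mean token-level cross-entropy, the loss corresponds to a perplexity of roughly exp(1.7348) ≈ 5.67 on the held-out split:

```python
import math

eval_loss = 1.7348  # final validation loss from the table below
print(f"perplexity ≈ {math.exp(eval_loss):.2f}")  # ≈ 5.67
```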
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 2
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 8
- gradient_accumulation_steps: 32
- total_train_batch_size: 512
- total_eval_batch_size: 128
- optimizer: adamw_torch_fused (betas=(0.9, 0.999), epsilon=1e-08, no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 50
- num_epochs: 7.0
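The derived totals above follow directly from the per-device settings; a quick check of the arithmetic:

```python
micro_batch_size = 2               # train_batch_size per device
gradient_accumulation_steps = 32
num_devices = 8
eval_batch_size = 16               # per device

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = eval_batch_size * num_devices

print(total_train_batch_size)  # 512
print(total_eval_batch_size)   # 128
```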
Training results
| Training Loss | Epoch | Step | Validation Loss |
|---------------|-------|------|-----------------|
3.491 | 0.0120 | 1 | 3.5178 |
2.7799 | 0.2034 | 17 | 2.7559 |
2.3441 | 0.4067 | 34 | 2.3219 |
2.1435 | 0.6101 | 51 | 2.1353 |
1.973 | 0.8135 | 68 | 1.9836 |
1.9045 | 1.0239 | 85 | 1.9076 |
1.8501 | 1.2273 | 102 | 1.8612 |
1.8185 | 1.4307 | 119 | 1.8336 |
1.7946 | 1.6340 | 136 | 1.8113 |
1.7609 | 1.8374 | 153 | 1.7963 |
1.7365 | 2.0479 | 170 | 1.7824 |
1.7313 | 2.2512 | 187 | 1.7742 |
1.6962 | 2.4546 | 204 | 1.7668 |
1.702 | 2.6579 | 221 | 1.7558 |
1.6764 | 2.8613 | 238 | 1.7509 |
1.6688 | 3.0718 | 255 | 1.7453 |
1.6398 | 3.2751 | 272 | 1.7431 |
1.6158 | 3.4785 | 289 | 1.7399 |
1.6124 | 3.6819 | 306 | 1.7426 |
1.6182 | 3.8852 | 323 | 1.7356 |
1.6039 | 4.0957 | 340 | 1.7307 |
1.5667 | 4.2991 | 357 | 1.7386 |
1.5722 | 4.5024 | 374 | 1.7351 |
1.5989 | 4.7058 | 391 | 1.7323 |
1.5634 | 4.9092 | 408 | 1.7307 |
1.5628 | 5.1196 | 425 | 1.7321 |
1.5357 | 5.3230 | 442 | 1.7395 |
1.5577 | 5.5264 | 459 | 1.7363 |
1.556 | 5.7297 | 476 | 1.7299 |
1.5365 | 5.9331 | 493 | 1.7327 |
1.541 | 6.1436 | 510 | 1.7345 |
1.5164 | 6.3469 | 527 | 1.7430 |
1.5554 | 6.5503 | 544 | 1.7398 |
1.5391 | 6.7536 | 561 | 1.7349 |
1.5248 | 6.9570 | 578 | 1.7348 |
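Validation loss falls steeply during the first epoch and plateaus around 1.73 from roughly epoch 4 onward. A quick way to eyeball the curve, using a handful of (epoch, validation loss) pairs copied from the table above:

```python
import matplotlib.pyplot as plt

# (epoch, validation loss) pairs taken from selected rows of the table above
points = [
    (0.01, 3.5178), (0.81, 1.9836), (2.05, 1.7824),
    (4.10, 1.7307), (5.93, 1.7327), (6.96, 1.7348),
]
epochs, val_losses = zip(*points)

plt.plot(epochs, val_losses, marker="o")
plt.xlabel("epoch")
plt.ylabel("validation loss")
plt.title("Validation loss vs. epoch")
plt.show()
```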
Framework versions
- Transformers 4.51.3
- PyTorch 2.6.0+cu126
- Datasets 3.5.1
- Tokenizers 0.21.1
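The card ships without a usage snippet; below is a minimal inference sketch, assuming the pushed tokenizer carries the ChatML chat template and `<|im_end|>` EOS token configured above (if not, the ChatML prompt would need to be built manually):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "nate-rahn/0615-sft_info_wc_multi_attrs-qwen3_8b_base-7_epochs"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical prompt; the intended use case is not documented on this card.
messages = [{"role": "user", "content": "Hello!"}]

# Renders the ChatML turns and appends the assistant header for generation.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(
    input_ids, max_new_tokens=256, eos_token_id=tokenizer.eos_token_id
)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```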