See axolotl config

axolotl version: `0.9.2`

```yaml
base_model: giux78/test_544000
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

strict: false

chat_template: qwen3
datasets:
  - path: FairMind/bank-gpt-sft-alpha-v0.1.3
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value

val_set_size: 0.01
output_dir: ./ale_outputs/pre-bankgpt-v1

#do_bench_eval: true
#bench_dataset: /leonardo_work/EUHPC_A04_045/training/examples/qwen3/eval_mix_train.json

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true

gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
#max_steps: 50
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 4e-5

bf16: auto
tf32: true

wandb_mode: "offline"
wandb_project: pre-bankgpt-v1
wandb_entity: mii-llm
wandb_name: sft

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 5
saves_per_epoch: 5
weight_decay: 0.01

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD

special_tokens:
  pad_token: <|end_of_text|>
```
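The `datasets` stanza above feeds axolotl's chat-template loader: `field_messages` names the column holding the conversation turns, and `message_property_mappings` renames the per-turn keys. As a hedged illustration (the exact schema of FairMind/bank-gpt-sft-alpha-v0.1.3 is an assumption inferred from this config; real rows may carry extra fields), one training example is expected to look roughly like this:

```python
# Hypothetical shape of one row in FairMind/bank-gpt-sft-alpha-v0.1.3,
# inferred from `field_messages: conversations` and the
# `message_property_mappings` (role <- from, content <- value) above.
example = {
    "conversations": [
        {"from": "user", "value": "How do I block my debit card?"},
        {"from": "assistant", "value": "You can block it from the banking app or by calling customer support."},
    ]
}

# axolotl maps "from" -> role and "value" -> content, renders the turns with
# the qwen3 chat template, and (with sample_packing) packs the rendered
# conversations into 4096-token sequences.
```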
# ale_outputs/pre-bankgpt-v1
This model is a fine-tuned version of giux78/test_544000 on the FairMind/bank-gpt-sft-alpha-v0.1.3 dataset. It achieves the following results on the evaluation set:
- Loss: nan
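For a quick smoke test of the checkpoint, a minimal inference sketch is given below. It assumes the fine-tuned weights are available locally in `./ale_outputs/pre-bankgpt-v1` (substitute a published Hub repo id if one exists) and that the qwen3 chat template was saved with the tokenizer; this is not part of the training code.

```python
# Minimal sketch, assuming the fine-tuned weights sit in ./ale_outputs/pre-bankgpt-v1.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "./ale_outputs/pre-bankgpt-v1"  # or the published Hub repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Apply the qwen3 chat template configured during training.
messages = [{"role": "user", "content": "What documents do I need to open an account?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```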
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters
The following hyperparameters were used during training (the derived batch sizes are sanity-checked in the short sketch after this list):
- learning_rate: 4e-05
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- distributed_type: multi-GPU
- num_devices: 32
- gradient_accumulation_steps: 8
- total_train_batch_size: 256
- total_eval_batch_size: 32
- optimizer: ADAMW_TORCH_FUSED with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 34
- num_epochs: 2.0
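As a sanity check, the effective batch sizes listed above follow directly from the per-device settings in the axolotl config; the sketch below only reproduces that arithmetic and is not part of the training code.

```python
# Derivation of the effective batch sizes reported above, from the
# per-device settings in the axolotl config.
micro_batch_size = 1                 # per-device train/eval batch size
gradient_accumulation_steps = 8
num_devices = 32                     # multi-GPU (FSDP) run

total_train_batch_size = micro_batch_size * gradient_accumulation_steps * num_devices
total_eval_batch_size = micro_batch_size * num_devices

print(total_train_batch_size)  # 256
print(total_eval_batch_size)   # 32
```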
### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 3.5521        | 0.0057 | 1    | nan             |
| 3.4578        | 0.2001 | 35   | nan             |
| 3.0539        | 0.4003 | 70   | nan             |
| 2.9865        | 0.6004 | 105  | nan             |
| 2.8058        | 0.8006 | 140  | nan             |
| 5.5672        | 1.0057 | 175  | nan             |
| 2.7383        | 1.2059 | 210  | nan             |
| 2.7784        | 1.4060 | 245  | nan             |
| 2.744         | 1.6061 | 280  | nan             |
| 2.6877        | 1.8063 | 315  | nan             |
### Framework versions
- Transformers 4.51.3
- Pytorch 2.5.1+cu121
- Datasets 3.5.1
- Tokenizers 0.21.1