Built with Axolotl

See axolotl config

axolotl version: 0.9.2

base_model: giux78/test_544000
# Automatically upload checkpoint and final model to HF
# hub_model_id: username/custom_model_name

strict: false

chat_template: qwen3
datasets:
  - path: FairMind/bank-gpt-sft-alpha-v0.1.3
    type: chat_template
    field_messages: conversations
    message_property_mappings:
      role: from
      content: value
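    # NOTE (illustrative assumption, not part of the original config): given the
    # mapping above, each record's `conversations` field is expected to hold turns
    # shaped like
    #   [{"from": "user", "value": "..."}, {"from": "assistant", "value": "..."}]
    # which are remapped to the standard `role`/`content` keys before the qwen3
    # chat template is applied.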


val_set_size: 0.01
output_dir: ./ale_outputs/pre-bankgpt-v1

#do_bench_eval: true
#bench_dataset: /leonardo_work/EUHPC_A04_045/training/examples/qwen3/eval_mix_train.json

sequence_len: 4096
sample_packing: true
eval_sample_packing: false
pad_to_sequence_len: true


gradient_accumulation_steps: 8
micro_batch_size: 1
num_epochs: 2
#max_steps: 50
optimizer: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 4e-5

bf16: auto
tf32: true

wandb_mode: "offline"
wandb_project: pre-bankgpt-v1
wandb_entity: mii-llm
wandb_name: sft

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
resume_from_checkpoint:
logging_steps: 1
flash_attention: true

warmup_ratio: 0.1
evals_per_epoch: 5
saves_per_epoch: 5
weight_decay: 0.01

fsdp:
  - full_shard
  - auto_wrap
fsdp_config:
  fsdp_limit_all_gathers: true
  fsdp_sync_module_states: true
  fsdp_offload_params: true
  fsdp_use_orig_params: false
  fsdp_cpu_ram_efficient_loading: true
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer
  fsdp_state_dict_type: FULL_STATE_DICT
  fsdp_sharding_strategy: FULL_SHARD
special_tokens:
  pad_token: <|end_of_text|>

ale_outputs/pre-bankgpt-v1

This model is a fine-tuned version of giux78/test_544000 on the FairMind/bank-gpt-sft-alpha-v0.1.3 dataset. It achieves the following results on the evaluation set:

  • Loss: nan
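
Below is a minimal inference sketch with Transformers. It assumes the repository id giux78/pre-bgpt-v.0.1 this card is published under and that the tokenizer ships the qwen3 chat template used during fine-tuning; treat it as a starting point rather than a reference implementation, and swap in your own checkpoint path or generation settings as needed.

```python
# Minimal inference sketch (repo id and prompt are illustrative assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "giux78/pre-bgpt-v.0.1"  # assumed from this card; replace if you use a local checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# One user turn; apply_chat_template formats it with the tokenizer's qwen3 template.
messages = [{"role": "user", "content": "Che cos'è un conto deposito?"}]
input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,  # append the assistant header so the model starts a reply
    return_tensors="pt",
)

outputs = model.generate(input_ids, max_new_tokens=256)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][input_ids.shape[-1]:], skip_special_tokens=True))
```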

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 4e-05
  • train_batch_size: 1
  • eval_batch_size: 1
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 32
  • gradient_accumulation_steps: 8
  • total_train_batch_size: 256
  • total_eval_batch_size: 32
  • optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 34
  • num_epochs: 2.0
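
As a sanity check on the derived values above, the total batch sizes are simply products of the per-device settings; a short sketch of the arithmetic (numbers copied from this list):

```python
# Cross-check of the effective batch sizes reported in the hyperparameter list.
micro_batch_size = 1      # per-device train/eval batch size
grad_accum_steps = 8      # gradient_accumulation_steps
num_devices = 32          # multi-GPU world size

total_train_batch_size = micro_batch_size * grad_accum_steps * num_devices
total_eval_batch_size = micro_batch_size * num_devices  # no accumulation during evaluation

assert total_train_batch_size == 256
assert total_eval_batch_size == 32
```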

Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| 3.5521        | 0.0057 | 1    | nan             |
| 3.4578        | 0.2001 | 35   | nan             |
| 3.0539        | 0.4003 | 70   | nan             |
| 2.9865        | 0.6004 | 105  | nan             |
| 2.8058        | 0.8006 | 140  | nan             |
| 5.5672        | 1.0057 | 175  | nan             |
| 2.7383        | 1.2059 | 210  | nan             |
| 2.7784        | 1.4060 | 245  | nan             |
| 2.744         | 1.6061 | 280  | nan             |
| 2.6877        | 1.8063 | 315  | nan             |

Framework versions

  • Transformers 4.51.3
  • PyTorch 2.5.1+cu121
  • Datasets 3.5.1
  • Tokenizers 0.21.1