See axolotl config
axolotl version: 0.5.0
#base_model: mistralai/Mistral-7b-v0.1
base_model: Qwen/Qwen2.5-1.5B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true
# load_in_8bit: true
# load_in_4bit: false
# strict: false
datasets:
- path: open-ita-llms/OpenSFT-ita
  type: chat_template
  field_messages: messages
  message_field_role: role
  message_field_content: content
chat_template: chatml
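# For reference, each dataset row is expected to carry a "messages" list shaped
# as configured above (illustrative sketch; the contents are placeholders):
#   messages:
#     - role: user
#       content: "..."
#     - role: assistant
#       content: "..."
# The chatml template renders each turn as <|im_start|>{role} ... <|im_end|>.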
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/qwen15B-opensft
# adapter: lora
# lora_model_dir:
sequence_len: 16392
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
unfrozen_parameters:  # only parameters matching these patterns are trained; see the sketch after the config
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
- model.layers.6.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.2.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.1.mlp.down_proj
- model.layers.27.mlp.down_proj
- model.layers.3.mlp.down_proj
- model.layers.0.mlp.down_proj
- model.layers.6.mlp.down_proj
# mlp.gate_proj layers
- model.layers.6.mlp.gate_proj
- model.layers.1.mlp.gate_proj
- model.layers.4.mlp.gate_proj
- model.layers.3.mlp.gate_proj
- model.layers.7.mlp.gate_proj
- model.layers.2.mlp.gate_proj
- model.layers.9.mlp.gate_proj
# mlp.up_proj layers
- model.layers.6.mlp.up_proj
- model.layers.5.mlp.up_proj
- model.layers.3.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.4.mlp.up_proj
- model.layers.2.mlp.up_proj
- model.layers.14.mlp.up_proj
# model.embed_tokens layers
# model.norm layers
# post_attention_layernorm layers
- model.layers.0.post_attention_layernorm
- model.layers.1.post_attention_layernorm
- model.layers.2.post_attention_layernorm
- model.layers.3.post_attention_layernorm
- model.layers.4.post_attention_layernorm
- model.layers.5.post_attention_layernorm
- model.layers.6.post_attention_layernorm
# self_attn.k_proj layers
- model.layers.25.self_attn.k_proj
- model.layers.4.self_attn.k_proj
- model.layers.2.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.3.self_attn.k_proj
- model.layers.0.self_attn.k_proj
- model.layers.6.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.0.self_attn.o_proj
- model.layers.14.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.8.self_attn.o_proj
- model.layers.22.self_attn.o_proj
- model.layers.7.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.14.self_attn.q_proj
- model.layers.20.self_attn.q_proj
- model.layers.26.self_attn.q_proj
- model.layers.17.self_attn.q_proj
- model.layers.18.self_attn.q_proj
- model.layers.27.self_attn.q_proj
- model.layers.9.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.0.self_attn.v_proj
- model.layers.2.self_attn.v_proj
- model.layers.3.self_attn.v_proj
- model.layers.4.self_attn.v_proj
- model.layers.5.self_attn.v_proj
- model.layers.8.self_attn.v_proj
- model.layers.10.self_attn.v_proj
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name: qwen2.5-1.5B-opensft
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit  # alternative: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 1.0e-04  # varies between 1e-3 and 1e-6
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 20
xformers_attention:
flash_attention: true
# loss_watchdog_threshold: 5.0
# loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 256
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|im_end|>"
  eos_token: "<|im_end|>"
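This config is launched through axolotl's CLI (for the 0.5.x releases, typically something like `accelerate launch -m axolotl.cli.train config.yml`). The unfrozen_parameters list turns the run into a selective full fine-tune: every weight is frozen except those whose names match one of the listed patterns. Below is a minimal sketch of that selection idea, not axolotl's actual implementation; the pattern subset and the print-out are illustrative only.

```python
import re

import torch
from transformers import AutoModelForCausalLM

# A few of the patterns from the unfrozen_parameters section above
# (in practice the full list from the config would be used).
UNFROZEN_PATTERNS = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.0.input_layernorm",
    r"model.layers.2.mlp.down_proj",
    r"model.layers.0.self_attn.v_proj",
]

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze everything, then re-enable gradients only where a pattern matches.
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in UNFROZEN_PATTERNS)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```

Only the matched tensors receive gradients and optimizer state, which reduces memory use compared with updating every weight.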
outputs/qwen15B-opensft
This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct on the open-ita-llms/OpenSFT-ita dataset. It achieves the following results on the evaluation set:
- Loss: 0.6571
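A minimal inference sketch with transformers, assuming the fine-tuned weights are loaded from the config's output_dir (the path below is an assumption; substitute the published repository id if the model was pushed to the Hub). The example prompt is only a placeholder; the ChatML template configured above is applied via the tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the local output_dir from the axolotl config; replace with the
# Hub repository id if the checkpoint was published.
model_id = "./outputs/qwen15B-opensft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The model was trained on ChatML-formatted conversations, so the tokenizer's
# chat template produces the expected <|im_start|>...<|im_end|> structure.
messages = [{"role": "user", "content": "Qual è la capitale d'Italia?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```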
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a rough transformers equivalent is sketched after the list):
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: adamw_bnb_8bit with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
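For orientation, here is a rough transformers-only approximation of these settings (a sketch; axolotl adds behaviour such as sample packing and chat-template collation that is not reproduced here):

```python
from transformers import TrainingArguments

# Effective batch size: micro_batch_size (1) x gradient_accumulation_steps (4) = 4.
args = TrainingArguments(
    output_dir="./outputs/qwen15B-opensft",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    num_train_epochs=3,
    weight_decay=0.01,
    optim="adamw_bnb_8bit",  # 8-bit AdamW from bitsandbytes
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=20,
    seed=42,
)
```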
Training results
| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| No log        | 0.0005 | 1    | 0.8033          |
| 0.8489        | 0.2503 | 538  | 0.6900          |
| 0.8416        | 0.5005 | 1076 | 0.6753          |
| 0.7929        | 0.7508 | 1614 | 0.6673          |
| 0.8003        | 1.0005 | 2152 | 0.6572          |
| 0.7125        | 1.2507 | 2690 | 0.6583          |
| 0.7049        | 1.5010 | 3228 | 0.6528          |
| 0.6987        | 1.7513 | 3766 | 0.6529          |
| 0.7025        | 2.0009 | 4304 | 0.6498          |
| 0.6387        | 2.2512 | 4842 | 0.6575          |
| 0.6495        | 2.5015 | 5380 | 0.6568          |
| 0.6711        | 2.7517 | 5918 | 0.6571          |
Framework versions
- Transformers 4.48.0.dev0
- Pytorch 2.5.0+cu124
- Datasets 3.1.0
- Tokenizers 0.21.0