See axolotl config
axolotl version: 0.5.0
#base_model: mistralai/Mistral-7b-v0.1
base_model: Qwen/Qwen2.5-1.5B-Instruct
model_type: AutoModelForCausalLM
tokenizer_type: AutoTokenizer
trust_remote_code: true
# load_in_8bit: true
# load_in_4bit: false
# strict: false
datasets:
- path: open-ita-llms/OpenSFT-ita
  type: chat_template
  field_messages: messages
  message_field_role: role
  message_field_content: content
chat_template: chatml
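# For reference, each dataset row is expected to carry a "messages" list shaped
# as configured above (illustrative sketch; the contents are placeholders):
#   messages:
#     - role: user
#       content: "..."
#     - role: assistant
#       content: "..."
# The chatml template renders each turn as <|im_start|>{role} ... <|im_end|>.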
dataset_prepared_path: last_run_prepared
val_set_size: 0.1
output_dir: ./outputs/qwen15B-opensft
# adapter: lora
# lora_model_dir:
sequence_len: 16392
sample_packing: true
eval_sample_packing: true
pad_to_sequence_len: true
unfrozen_parameters:  # only parameters matching these patterns are trained; see the sketch after the config
- ^lm_head.weight$
- ^model.embed_tokens.weight$
# input_layernorm layers
- model.layers.0.input_layernorm
- model.layers.1.input_layernorm
- model.layers.2.input_layernorm
- model.layers.3.input_layernorm
- model.layers.4.input_layernorm
- model.layers.5.input_layernorm
- model.layers.6.input_layernorm
# lm_head layers
# mlp.down_proj layers
- model.layers.2.mlp.down_proj
- model.layers.19.mlp.down_proj
- model.layers.1.mlp.down_proj
- model.layers.27.mlp.down_proj
- model.layers.3.mlp.down_proj
- model.layers.0.mlp.down_proj
- model.layers.6.mlp.down_proj
# mlp.gate_proj layers
- model.layers.6.mlp.gate_proj
- model.layers.1.mlp.gate_proj
- model.layers.4.mlp.gate_proj
- model.layers.3.mlp.gate_proj
- model.layers.7.mlp.gate_proj
- model.layers.2.mlp.gate_proj
- model.layers.9.mlp.gate_proj
# mlp.up_proj layers
- model.layers.6.mlp.up_proj
- model.layers.5.mlp.up_proj
- model.layers.3.mlp.up_proj
- model.layers.7.mlp.up_proj
- model.layers.4.mlp.up_proj
- model.layers.2.mlp.up_proj
- model.layers.14.mlp.up_proj
# model.embed_tokens layers
# model.norm layers
# post_attention_layernorm layers
- model.layers.0.post_attention_layernorm
- model.layers.1.post_attention_layernorm
- model.layers.2.post_attention_layernorm
- model.layers.3.post_attention_layernorm
- model.layers.4.post_attention_layernorm
- model.layers.5.post_attention_layernorm
- model.layers.6.post_attention_layernorm
# self_attn.k_proj layers
- model.layers.25.self_attn.k_proj
- model.layers.4.self_attn.k_proj
- model.layers.2.self_attn.k_proj
- model.layers.22.self_attn.k_proj
- model.layers.3.self_attn.k_proj
- model.layers.0.self_attn.k_proj
- model.layers.6.self_attn.k_proj
# self_attn.o_proj layers
- model.layers.0.self_attn.o_proj
- model.layers.14.self_attn.o_proj
- model.layers.19.self_attn.o_proj
- model.layers.18.self_attn.o_proj
- model.layers.8.self_attn.o_proj
- model.layers.22.self_attn.o_proj
- model.layers.7.self_attn.o_proj
# self_attn.q_proj layers
- model.layers.14.self_attn.q_proj
- model.layers.20.self_attn.q_proj
- model.layers.26.self_attn.q_proj
- model.layers.17.self_attn.q_proj
- model.layers.18.self_attn.q_proj
- model.layers.27.self_attn.q_proj
- model.layers.9.self_attn.q_proj
# self_attn.v_proj layers
- model.layers.0.self_attn.v_proj
- model.layers.2.self_attn.v_proj
- model.layers.3.self_attn.v_proj
- model.layers.4.self_attn.v_proj
- model.layers.5.self_attn.v_proj
- model.layers.8.self_attn.v_proj
- model.layers.10.self_attn.v_proj
wandb_project: axolotl
wandb_entity:
wandb_watch:
wandb_name: qwen2.5-1.5B-opensft
wandb_log_model:
gradient_accumulation_steps: 4
micro_batch_size: 1
num_epochs: 3
optimizer: adamw_bnb_8bit  # alternative: adamw_torch_fused
lr_scheduler: cosine
learning_rate: 1.0e-04  # varies between 1e-3 and 1e-6
train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false
gradient_checkpointing: true
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 20
xformers_attention:
flash_attention: true
# loss_watchdog_threshold: 5.0
# loss_watchdog_patience: 3
warmup_steps: 10
evals_per_epoch: 4
eval_table_size:
eval_max_new_tokens: 256
saves_per_epoch: 1
debug:
deepspeed:
weight_decay: 0.01
fsdp:
fsdp_config:
special_tokens:
  pad_token: "<|im_end|>"
  eos_token: "<|im_end|>"
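This config is launched through axolotl's CLI (for the 0.5.x releases, typically something like `accelerate launch -m axolotl.cli.train config.yml`). The unfrozen_parameters list turns the run into a selective full fine-tune: every weight is frozen except those whose names match one of the listed patterns. Below is a minimal sketch of that selection idea, not axolotl's actual implementation; the pattern subset and the print-out are illustrative only.

```python
import re

import torch
from transformers import AutoModelForCausalLM

# A few of the patterns from the unfrozen_parameters section above
# (in practice the full list from the config would be used).
UNFROZEN_PATTERNS = [
    r"^lm_head.weight$",
    r"^model.embed_tokens.weight$",
    r"model.layers.0.input_layernorm",
    r"model.layers.2.mlp.down_proj",
    r"model.layers.0.self_attn.v_proj",
]

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16
)

# Freeze everything, then re-enable gradients only where a pattern matches.
for name, param in model.named_parameters():
    param.requires_grad = any(re.search(p, name) for p in UNFROZEN_PATTERNS)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable parameters: {trainable:,} of {total:,}")
```

Only the matched tensors receive gradients and optimizer state, which reduces memory use compared with updating every weight.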
outputs/qwen15B-opensft
This model is a fine-tuned version of Qwen/Qwen2.5-1.5B-Instruct on the open-ita-llms/OpenSFT-ita dataset. It achieves the following results on the evaluation set:
- Loss: 0.6571
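A minimal inference sketch with transformers, assuming the fine-tuned weights are loaded from the config's output_dir (the path below is an assumption; substitute the published repository id if the model was pushed to the Hub). The example prompt is only a placeholder; the ChatML template configured above is applied via the tokenizer.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: the local output_dir from the axolotl config; replace with the
# Hub repository id if the checkpoint was published.
model_id = "./outputs/qwen15B-opensft"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# The model was trained on ChatML-formatted conversations, so the tokenizer's
# chat template produces the expected <|im_start|>...<|im_end|> structure.
messages = [{"role": "user", "content": "Qual è la capitale d'Italia?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```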
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training (a rough transformers equivalent is sketched after the list):
- learning_rate: 0.0001
- train_batch_size: 1
- eval_batch_size: 1
- seed: 42
- gradient_accumulation_steps: 4
- total_train_batch_size: 4
- optimizer: adamw_bnb_8bit with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 10
- num_epochs: 3
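For orientation, here is a rough transformers-only approximation of these settings (a sketch; axolotl adds behaviour such as sample packing and chat-template collation that is not reproduced here):

```python
from transformers import TrainingArguments

# Effective batch size: micro_batch_size (1) x gradient_accumulation_steps (4) = 4.
args = TrainingArguments(
    output_dir="./outputs/qwen15B-opensft",
    per_device_train_batch_size=1,
    per_device_eval_batch_size=1,
    gradient_accumulation_steps=4,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_steps=10,
    num_train_epochs=3,
    weight_decay=0.01,
    optim="adamw_bnb_8bit",  # 8-bit AdamW from bitsandbytes
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=20,
    seed=42,
)
```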
Training results
| Training Loss | Epoch  | Step | Validation Loss |
|---------------|--------|------|-----------------|
| No log        | 0.0005 | 1    | 0.8033          |
| 0.8489        | 0.2503 | 538  | 0.6900          |
| 0.8416        | 0.5005 | 1076 | 0.6753          |
| 0.7929        | 0.7508 | 1614 | 0.6673          |
| 0.8003        | 1.0005 | 2152 | 0.6572          |
| 0.7125        | 1.2507 | 2690 | 0.6583          |
| 0.7049        | 1.5010 | 3228 | 0.6528          |
| 0.6987        | 1.7513 | 3766 | 0.6529          |
| 0.7025        | 2.0009 | 4304 | 0.6498          |
| 0.6387        | 2.2512 | 4842 | 0.6575          |
| 0.6495        | 2.5015 | 5380 | 0.6568          |
| 0.6711        | 2.7517 | 5918 | 0.6571          |
Framework versions
- Transformers 4.48.0.dev0
- Pytorch 2.5.0+cu124
- Datasets 3.1.0
- Tokenizers 0.21.0