See axolotl config
axolotl version: 0.10.0.dev0
```yaml
base_model: Qwen/Qwen3-0.6B-Base
hub_model_id: cyberbabooshka/base_noreasoning2
wandb_name: base_noreasoning2

tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false

num_processes: 64
dataset_processes: 64
dataset_prepared_path: last_run_prepared

chat_template: jinja
chat_template_jinja: >-
  {%- for message in messages %}
  {{- '<|im_start|>' + message.role + '\n' + message.content.lstrip('\n') + '<|im_end|>' + '\n' }}
  {%- endfor %}
  {%- if add_generation_prompt %}
  {{- '<|im_start|>assistant\n' }}
  {%- endif %}

datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    split: train
    type: chat_template
    field_messages: messages
    train_on_eos: turn
    train_on_eot: turn
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

test_datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    split: test
    type: chat_template
    field_messages: messages
    train_on_eos: turn
    train_on_eot: turn
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

output_dir: ./outputs

sequence_len: 2048
batch_flattening: true
sample_packing: false

wandb_project: mnlp
wandb_entity: aleksandr-dremov-epfl
wandb_watch:
wandb_log_model:

gradient_accumulation_steps: 1
eval_batch_size: 16
micro_batch_size: 16

optimizer: ademamix_8bit
weight_decay: 0.01
learning_rate: 0.00005
warmup_steps: 500

wsd_final_lr_factor: 0.0
wsd_init_div_factor: 100
wsd_fract_decay: 0.2
wsd_decay_type: "sqrt"
wsd_sqrt_power: 0.5
wsd_cooldown_start_lr_factor: 1.0

bf16: auto
tf32: false
torch_compile: true
flash_attention: true
gradient_checkpointing: false

resume_from_checkpoint:
auto_resume_from_checkpoints: true

logging_steps: 16
eval_steps: 2000
save_steps: 1000
max_steps: 10000000
num_epochs: 2
save_total_limit: 2

special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|endoftext|>"

eot_tokens:
  - <|im_end|>

plugins:
  - axolotl_wsd.WSDSchedulerPlugin
```
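The `wsd_*` keys configure a warmup-stable-decay (WSD) learning-rate schedule supplied by the `axolotl_wsd.WSDSchedulerPlugin`. The sketch below illustrates how such a schedule typically behaves given these values; the parameter semantics are assumptions here, and the actual implementation lives in the plugin.

```python
# Minimal sketch of a warmup-stable-decay (WSD) schedule using the wsd_* values
# above. Parameter semantics are assumed; this is NOT the plugin's actual code.
def wsd_lr(step, total_steps, base_lr=5e-5, warmup_steps=500,
           init_div_factor=100, fract_decay=0.2, final_lr_factor=0.0,
           cooldown_start_lr_factor=1.0, sqrt_power=0.5):
    decay_steps = int(total_steps * fract_decay)
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from base_lr / init_div_factor up to base_lr.
        start = base_lr / init_div_factor
        return start + (base_lr - start) * step / warmup_steps
    if step < stable_end:
        # Stable phase: hold the peak learning rate.
        return base_lr
    # Cooldown over the final fract_decay of training: sqrt-shaped decay from
    # cooldown_start_lr_factor * base_lr down to final_lr_factor * base_lr.
    progress = (step - stable_end) / max(decay_steps, 1)
    start = cooldown_start_lr_factor * base_lr
    end = final_lr_factor * base_lr
    return start - (start - end) * progress ** sqrt_power
```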
base_noreasoning2
This model is a fine-tuned version of Qwen/Qwen3-0.6B-Base on the cyberbabooshka/MNLP_M2_mcqa_dataset dataset. It achieves the following results on the evaluation set:
- Loss: 0.7643
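A minimal inference sketch, assuming the standard transformers API and that the tokenizer saved with this checkpoint carries the chat template from the config above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberbabooshka/base_noreasoning2"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

# The training template wraps each turn in <|im_start|>{role} ... <|im_end|>
# and appends <|im_start|>assistant\n when add_generation_prompt is set.
messages = [{"role": "user", "content": "Which planet is known as the Red Planet?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output = model.generate(input_ids, max_new_tokens=64)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```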
Model description
More information needed
Intended uses & limitations
More information needed
Training and evaluation data
More information needed
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 16
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 32
- total_eval_batch_size: 32
- optimizer: AdEMAMix (8-bit), no additional optimizer arguments
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 500
- training_steps: 55761
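The total batch sizes above follow from micro_batch_size × num_devices × gradient_accumulation_steps = 16 × 2 × 1 = 32 for training, and 16 × 2 = 32 for evaluation.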
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
No log | 0.0000 | 1 | 0.9809 |
0.8514 | 0.0717 | 2000 | 0.8555 |
0.8304 | 0.1435 | 4000 | 0.8445 |
0.8246 | 0.2152 | 6000 | 0.8379 |
0.8393 | 0.2869 | 8000 | 0.8315 |
0.8216 | 0.3587 | 10000 | 0.8278 |
0.848 | 0.4304 | 12000 | 0.8235 |
0.8166 | 0.5021 | 14000 | 0.8206 |
0.8344 | 0.5739 | 16000 | 0.8175 |
0.827 | 0.6456 | 18000 | 0.8140 |
0.8287 | 0.7173 | 20000 | 0.8123 |
0.8234 | 0.7891 | 22000 | 0.8089 |
0.7958 | 0.8608 | 24000 | 0.8076 |
0.825 | 0.9325 | 26000 | 0.8057 |
0.7736 | 1.0043 | 28000 | 0.8037 |
0.7528 | 1.0760 | 30000 | 0.8037 |
0.8106 | 1.1477 | 32000 | 0.8021 |
0.7787 | 1.2195 | 34000 | 0.8016 |
0.7483 | 1.2912 | 36000 | 0.8008 |
0.7535 | 1.3629 | 38000 | 0.7993 |
0.7994 | 1.4347 | 40000 | 0.7987 |
0.7475 | 1.5064 | 42000 | 0.7982 |
0.7844 | 1.5781 | 44000 | 0.7970 |
0.7743 | 1.5999 | 44608 | 0.7964 |
0.7672 | 1.6499 | 46000 | 0.7807 |
0.7147 | 1.7216 | 48000 | 0.7714 |
0.7784 | 1.7933 | 50000 | 0.7670 |
0.7582 | 1.8651 | 52000 | 0.7650 |
0.7778 | 1.9368 | 54000 | 0.7643 |
Framework versions
- Transformers 4.52.1
- PyTorch 2.7.0+cu126
- Datasets 3.5.0
- Tokenizers 0.21.1