See axolotl config

axolotl version: `0.10.0.dev0`

```yaml
base_model: cyberbabooshka/base_noreasoning
hub_model_id: cyberbabooshka/MNLP_M2_mcqa_model
wandb_name: base

tokenizer_type: AutoTokenizer

load_in_8bit: false
load_in_4bit: false

num_processes: 64
dataset_processes: 64
dataset_prepared_path: last_run_prepared

chat_template: jinja
chat_template_jinja: >-
  {%- for message in messages %}
  {{- message.content.strip('\n') + '\n' }}
  {%- endfor %}
  {%- if not add_generation_prompt %}
  {{- '<|im_end|>' }}
  {%- endif %}

datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    name: cooldown
    split: train
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    train_on_eos: all
    train_on_eot: all
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

test_datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    name: mcqa
    split: test
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    train_on_eos: all
    train_on_eot: all
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

output_dir: ./outputs_mcqa

sequence_len: 2048
batch_flattening: true
sample_packing: false

wandb_project: mnlp
wandb_entity: aleksandr-dremov-epfl
wandb_watch:
wandb_log_model:

gradient_accumulation_steps: 1
eval_batch_size: 16
micro_batch_size: 12
optimizer: ademamix_8bit
weight_decay: 0.01
learning_rate: 0.00001

warmup_steps: 100
wsd_final_lr_factor: 0.0
wsd_init_div_factor: 100
wsd_fract_decay: 0.2
wsd_decay_type: "sqrt"
wsd_sqrt_power: 0.5
wsd_cooldown_start_lr_factor: 1.0

bf16: auto
tf32: false
torch_compile: true
flash_attention: true
gradient_checkpointing: false

resume_from_checkpoint:
auto_resume_from_checkpoints: true

logging_steps: 16
eval_steps: 500
save_steps: 500
max_steps: 1000000
num_epochs: 1
save_total_limit: 2

special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|endoftext|>"

eot_tokens:
  - <|im_end|>

plugins:
  - axolotl_wsd.WSDSchedulerPlugin
```
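The `wsd_*` options above drive a warmup-stable-decay (WSD) learning-rate schedule through `axolotl_wsd.WSDSchedulerPlugin`. The snippet below is only an illustrative sketch of how these knobs typically combine (linear warmup from `learning_rate / wsd_init_div_factor`, a constant plateau, then a sqrt-shaped cooldown over the final `wsd_fract_decay` fraction of steps); it is not the plugin's actual implementation.

```python
# Hedged sketch of the warmup-stable-decay (WSD) schedule implied by the wsd_* options
# above; the exact curve produced by axolotl_wsd.WSDSchedulerPlugin may differ.
# wsd_cooldown_start_lr_factor: 1.0 means the cooldown starts from the full base_lr,
# which is what this sketch assumes.
def wsd_lr(step, total_steps, base_lr=1e-5, warmup_steps=100,
           init_div_factor=100, fract_decay=0.2, sqrt_power=0.5,
           final_lr_factor=0.0):
    decay_steps = int(total_steps * fract_decay)   # cooldown covers the last 20% of steps
    stable_end = total_steps - decay_steps
    if step < warmup_steps:
        # linear warmup from base_lr / wsd_init_div_factor up to base_lr
        start = base_lr / init_div_factor
        return start + (base_lr - start) * step / warmup_steps
    if step < stable_end:
        # stable phase: constant learning rate
        return base_lr
    # cooldown: sqrt-shaped decay toward wsd_final_lr_factor * base_lr
    frac = (step - stable_end) / max(decay_steps, 1)
    final_lr = final_lr_factor * base_lr
    return final_lr + (base_lr - final_lr) * (1.0 - frac) ** sqrt_power

# With the ~8438 steps reported below:
# wsd_lr(0, 8438) == 1e-7, wsd_lr(4000, 8438) == 1e-5,
# and the rate decays toward 0 over the final ~1687 steps.
```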
# MNLP_M2_mcqa_model
This model is a fine-tuned version of cyberbabooshka/base_noreasoning on the cyberbabooshka/MNLP_M2_mcqa_dataset dataset. It achieves the following results on the evaluation set:
- Loss: 0.6772
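A minimal inference sketch (not part of the original card) for loading the published checkpoint with `transformers`; it assumes the tokenizer ships the chat template shown in the config above, so `apply_chat_template` reproduces the training-time formatting.

```python
# Hedged usage sketch: load the published checkpoint and generate an answer.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberbabooshka/MNLP_M2_mcqa_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

# Illustrative MCQA-style prompt; the actual dataset format may differ.
messages = [
    {"role": "user", "content": "Question: ...\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer:"}
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=8)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```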
## Model description

More information needed

## Intended uses & limitations

More information needed

## Training and evaluation data

More information needed

## Training procedure

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 12
- eval_batch_size: 16
- seed: 42
- distributed_type: multi-GPU
- num_devices: 2
- total_train_batch_size: 24
- total_eval_batch_size: 32
- optimizer: ADEMAMIX_8BIT (no additional optimizer arguments)
- lr_scheduler_type: cosine
- lr_scheduler_warmup_steps: 100
- training_steps: 8438
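For reference, the reported totals follow directly from the per-device settings and the two-GPU setup; a quick sanity check with the values listed above:

```python
# Sanity check for the reported effective batch sizes (values from the list above).
micro_batch_size = 12              # per-device train batch size
eval_batch_size = 16               # per-device eval batch size
num_devices = 2
gradient_accumulation_steps = 1

total_train_batch_size = micro_batch_size * num_devices * gradient_accumulation_steps
total_eval_batch_size = eval_batch_size * num_devices
assert (total_train_batch_size, total_eval_batch_size) == (24, 32)
```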
### Training results

| Training Loss | Epoch  | Step | Validation Loss |
|:-------------:|:------:|:----:|:---------------:|
| No log        | 0.0001 | 1    | 2.2371          |
| 0.8956        | 0.0593 | 500  | 0.7674          |
| 0.9093        | 0.1185 | 1000 | 0.7335          |
| 0.8544        | 0.1778 | 1500 | 0.7159          |
| 0.8503        | 0.2370 | 2000 | 0.7074          |
| 0.8781        | 0.2963 | 2500 | 0.7016          |
| 0.8171        | 0.3555 | 3000 | 0.6968          |
| 0.9179        | 0.4148 | 3500 | 0.6930          |
| 0.845         | 0.4740 | 4000 | 0.6895          |
| 0.8885        | 0.5333 | 4500 | 0.6865          |
| 0.9432        | 0.5926 | 5000 | 0.6844          |
| 0.7451        | 0.6518 | 5500 | 0.6825          |
| 0.8675        | 0.7111 | 6000 | 0.6811          |
| 0.8606        | 0.7703 | 6500 | 0.6793          |
| 0.8602        | 0.8000 | 6750 | 0.6793          |
| 0.8458        | 0.8296 | 7000 | 0.6778          |
| 0.9051        | 0.8888 | 7500 | 0.6772          |
| 0.8589        | 0.9481 | 8000 | 0.6772          |
### Framework versions

- Transformers 4.52.1
- PyTorch 2.7.0+cu126
- Datasets 3.5.0
- Tokenizers 0.21.1