Built with Axolotl

The axolotl config used for training (axolotl version: 0.10.0.dev0):

base_model: cyberbabooshka/base_noreasoning
hub_model_id: cyberbabooshka/MNLP_M2_mcqa_model
wandb_name: base

tokenizer_type: AutoTokenizer
load_in_8bit: false
load_in_4bit: false

num_processes: 64
dataset_processes: 64
dataset_prepared_path: last_run_prepared

chat_template: jinja
chat_template_jinja: >-
  {%- for message in messages %}
    {{- message.content.strip('\n') + '\n' }}
  {%- endfor %}
  {%- if not add_generation_prompt %}
    {{- '<|im_end|>' }}
  {%- endif %}


datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    name: cooldown
    split: train
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    train_on_eos: all
    train_on_eot: all
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

test_datasets:
  - path: cyberbabooshka/MNLP_M2_mcqa_dataset
    name: mcqa
    split: test
    type: chat_template
    chat_template: tokenizer_default
    field_messages: messages
    train_on_eos: all
    train_on_eot: all
    message_property_mappings:
      role: role
      content: content
    roles:
      user:
        - user
      assistant:
        - assistant

output_dir: ./outputs_mcqa

sequence_len: 2048
batch_flattening: true
sample_packing: false

wandb_project: mnlp
wandb_entity: aleksandr-dremov-epfl
wandb_watch:
wandb_log_model:

gradient_accumulation_steps: 1
eval_batch_size: 16
micro_batch_size: 12

optimizer: ademamix_8bit
weight_decay: 0.01

learning_rate: 0.00001
warmup_steps: 100

wsd_final_lr_factor: 0.0
wsd_init_div_factor: 100
wsd_fract_decay: 0.2
wsd_decay_type: "sqrt"
wsd_sqrt_power: 0.5
wsd_cooldown_start_lr_factor: 1.0

bf16: auto
tf32: false

torch_compile: true
flash_attention: true
gradient_checkpointing: false

resume_from_checkpoint:
auto_resume_from_checkpoints: true

logging_steps: 16
eval_steps: 500
save_steps: 500
max_steps: 1000000
num_epochs: 1
save_total_limit: 2

special_tokens:
  eos_token: "<|im_end|>"
  pad_token: "<|endoftext|>"

eot_tokens:
  - <|im_end|>

plugins:
  - axolotl_wsd.WSDSchedulerPlugin
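
The wsd_* options above configure the warmup-stable-decay learning-rate schedule supplied by the axolotl_wsd.WSDSchedulerPlugin listed under plugins. The plugin's actual implementation is not reproduced in this card; the function below is only a sketch of one common reading of these parameters (linear warmup from learning_rate / wsd_init_div_factor, a constant phase, then a sqrt-shaped cooldown over the final wsd_fract_decay of training, ending at wsd_final_lr_factor * learning_rate):

def wsd_lr(step, total_steps, base_lr=1e-5, warmup_steps=100,
           init_div_factor=100, fract_decay=0.2, sqrt_power=0.5,
           final_lr_factor=0.0, cooldown_start_lr_factor=1.0):
    """Hypothetical warmup-stable-decay (WSD) schedule; defaults mirror the config."""
    decay_steps = max(int(total_steps * fract_decay), 1)
    cooldown_start = total_steps - decay_steps
    if step < warmup_steps:
        # Linear warmup from base_lr / init_div_factor up to base_lr.
        start = base_lr / init_div_factor
        return start + (base_lr - start) * step / warmup_steps
    if step < cooldown_start:
        # Stable phase: hold the peak learning rate.
        return base_lr
    # Sqrt-shaped cooldown (wsd_decay_type: sqrt, wsd_sqrt_power: 0.5) from
    # cooldown_start_lr_factor * base_lr down to final_lr_factor * base_lr.
    progress = (step - cooldown_start) / decay_steps
    start = cooldown_start_lr_factor * base_lr
    end = final_lr_factor * base_lr
    return start - (start - end) * progress ** sqrt_power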
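
The chat_template_jinja in the config above simply concatenates the newline-stripped contents of each message and appends <|im_end|> when no generation prompt is requested. Axolotl applies the template internally during preprocessing; the snippet below is only an illustration of the text it produces, using a made-up pair of messages in the role/content format the dataset entries describe:

from jinja2 import Template

# Template copied verbatim from chat_template_jinja above.
CHAT_TEMPLATE = (
    "{%- for message in messages %}"
    "{{- message.content.strip('\\n') + '\\n' }}"
    "{%- endfor %}"
    "{%- if not add_generation_prompt %}"
    "{{- '<|im_end|>' }}"
    "{%- endif %}"
)

# Illustrative conversation; field names follow message_property_mappings.
messages = [
    {"role": "user", "content": "Question: 2 + 2 = ?\nA) 3\nB) 4"},
    {"role": "assistant", "content": "B) 4"},
]

print(Template(CHAT_TEMPLATE).render(messages=messages, add_generation_prompt=False))
# Question: 2 + 2 = ?
# A) 3
# B) 4
# B) 4
# <|im_end|>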

MNLP_M2_mcqa_model

This model is a fine-tuned version of cyberbabooshka/base_noreasoning on the cyberbabooshka/MNLP_M2_mcqa_dataset dataset. It achieves the following results on the evaluation set:

  • Loss: 0.6772
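
The card ships no usage snippet; the following is a minimal inference sketch, assuming the standard transformers auto classes and that the tokenizer on the Hub carries the chat template shown in the config above (the prompt text is a placeholder, not an example from the dataset):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "cyberbabooshka/MNLP_M2_mcqa_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# Placeholder multiple-choice prompt in the user/assistant format used for training.
messages = [{"role": "user", "content": "Question: ...\nA) ...\nB) ...\nC) ...\nD) ..."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
)
output_ids = model.generate(input_ids, max_new_tokens=16)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))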

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 12
  • eval_batch_size: 16
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 2
  • total_train_batch_size: 24
  • total_eval_batch_size: 32
  • optimizer: ADEMAMIX_8BIT (no additional optimizer arguments)
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 100
  • training_steps: 8438
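
For reference, the reported totals follow directly from the per-device settings: total_train_batch_size = micro_batch_size × num_devices × gradient_accumulation_steps = 12 × 2 × 1 = 24, and total_eval_batch_size = eval_batch_size × num_devices = 16 × 2 = 32.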

Training results

Training Loss | Epoch  | Step | Validation Loss
No log        | 0.0001 |    1 | 2.2371
0.8956        | 0.0593 |  500 | 0.7674
0.9093        | 0.1185 | 1000 | 0.7335
0.8544        | 0.1778 | 1500 | 0.7159
0.8503        | 0.2370 | 2000 | 0.7074
0.8781        | 0.2963 | 2500 | 0.7016
0.8171        | 0.3555 | 3000 | 0.6968
0.9179        | 0.4148 | 3500 | 0.6930
0.845         | 0.4740 | 4000 | 0.6895
0.8885        | 0.5333 | 4500 | 0.6865
0.9432        | 0.5926 | 5000 | 0.6844
0.7451        | 0.6518 | 5500 | 0.6825
0.8675        | 0.7111 | 6000 | 0.6811
0.8606        | 0.7703 | 6500 | 0.6793
0.8602        | 0.8000 | 6750 | 0.6793
0.8458        | 0.8296 | 7000 | 0.6778
0.9051        | 0.8888 | 7500 | 0.6772
0.8589        | 0.9481 | 8000 | 0.6772

Framework versions

  • Transformers 4.52.1
  • PyTorch 2.7.0+cu126
  • Datasets 3.5.0
  • Tokenizers 0.21.1