Built with Axolotl

See axolotl config

axolotl version: 0.5.2

base_model: mistralai/Mistral-7B-v0.1
model_type: AutoModelForCausalLM
tokenizer_config: Open-Orca/Mistral-7B-OpenOrca
tokenizer_type: AutoTokenizer
tokenizer_use_fast: false
resize_token_embeddings_to_32x: false

flash_attention: true
xformers_attention:

load_in_8bit: false
load_in_4bit: false
strict: false

chat_template: chatml
datasets:
  - path: skymizer/open-orca-conversations
    type: chat_template
    field_messages: messages
    train_on_split: train

test_datasets:
  - path: skymizer/open-orca-conversations
    type: chat_template
    field_messages: messages
    split: test

hf_use_auth_token: true
dataset_prepared_path: /mnt/home/model-team/dataset/pretokenized/mistral-open-orca
output_dir: /mnt/home/model-team/models/mistral-7B-v0.1-open-orca-q-sparse-wo-relu2

sequence_len: 2048
sample_packing: true
pad_to_sequence_len: true

eval_sample_packing: false
# eval_causal_lm_metrics: ["perplexity"]

wandb_project: "axolotl_q_sparse_sft"
wandb_entity:
wandb_watch:
wandb_name: "mistral-7B-v0.1-open-orca-q-sparse-wo-relu2"
wandb_log_model:

gradient_accumulation_steps: 2
micro_batch_size: 8
eval_batch_size: 
num_epochs: 1
optimizer: adamw_torch
lr_scheduler: cosine
learning_rate: 0.000005
weight_decay: 0.0
adam_beta1: 0.9
adam_beta2: 0.95
adam_eps: 0.000001
max_grad_norm: 1.0

train_on_inputs: false
group_by_length: false
bf16: true
fp16:
tf32: false

hub_model_id: "skymizer/mistral-7B-v0.1-open-orca-q-sparse-wo-relu2"

save_strategy: "steps"
save_steps: 500

gradient_checkpointing: true
gradient_checkpointing_kwargs:
  use_reentrant: false
early_stopping_patience:
resume_from_checkpoint:
local_rank:
logging_steps: 1

warmup_ratio: 0.03
eval_steps: 500
eval_table_size:
eval_max_new_tokens: 2048
debug:
deepspeed: /root/train/axolotl/deepspeed_configs/zero3_bf16.json
fsdp:
fsdp_config:
seed: 42

mistral-7B-v0.1-open-orca-q-sparse-wo-relu2

This model is a fine-tuned version of mistralai/Mistral-7B-v0.1 on the None dataset. It achieves the following results on the evaluation set:

  • Loss: 1.6786

Model description

More information needed

Intended uses & limitations

More information needed

Training and evaluation data

More information needed

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 5e-06
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • distributed_type: multi-GPU
  • num_devices: 8
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 128
  • total_eval_batch_size: 64
  • optimizer: Use OptimizerNames.ADAMW_TORCH with betas=(0.9,0.95) and epsilon=1e-08 and optimizer_args=No additional optimizer arguments
  • lr_scheduler_type: cosine
  • lr_scheduler_warmup_steps: 182
  • num_epochs: 1

Training results

Training Loss Epoch Step Validation Loss
8.2253 0.0002 1 8.1308
3.2047 0.0824 500 3.1735
2.6902 0.1648 1000 2.6022
2.378 0.2472 1500 2.2960
2.1837 0.3296 2000 2.1155
2.0304 0.4120 2500 1.9842
1.9425 0.4944 3000 1.8926
1.8618 0.5768 3500 1.8224
1.7589 0.6592 4000 1.7663
1.746 0.7416 4500 1.7255
1.8055 0.8240 5000 1.6973
1.8027 0.9064 5500 1.6817
1.7091 0.9888 6000 1.6786

Framework versions

  • Transformers 4.46.3
  • Pytorch 2.5.1+cu124
  • Datasets 3.1.0
  • Tokenizers 0.20.3
Downloads last month
2
Safetensors
Model size
7.24B params
Tensor type
BF16
·
Inference Providers NEW
This model is not currently available via any of the supported third-party Inference Providers, and the model is not deployed on the HF Inference API.

Model tree for skymizer/mistral-7B-v0.1-open-orca-q-sparse-wo-relu2

Finetuned
(819)
this model