This is a 3-bit AutoRound GPTQ quantization of Mistral-Large-Instruct-2407. The conversion used the original model-*.safetensors shards.

The quantized model needs at least ~50 GB of VRAM for the weights, plus ~5 GB for context. I quantized it so that it fits within 64 GB of VRAM.
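
For a rough sense of where the ~50 GB figure comes from, here is a back-of-the-envelope sketch (assumptions: ~123B parameters, counting only the packed 3-bit weights plus one FP16 scale per group of 128; zero points, unquantized layers, and runtime buffers are ignored):

# Back-of-the-envelope weight-memory estimate for 3-bit, group_size=128 (rough sketch)
params = 123e9                              # Mistral-Large-Instruct-2407 has ~123B parameters
weights_gb = params * 3 / 8 / 1e9           # packed 3-bit weights
scales_gb = params / 128 * 2 / 1e9          # one FP16 scale per 128-weight group (assumption)
print(f"~{weights_gb:.1f} GB weights + ~{scales_gb:.1f} GB scales ≈ {weights_gb + scales_gb:.0f} GB")
# -> roughly 46 + 2 ≈ 48 GB, consistent with the ~50 GB figure once remaining overhead is added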

Quantization script (the conversion takes roughly 520 GB of system RAM and about 20 hours on a 48 GB A40 GPU):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

from auto_round import AutoRound

model_name = "mistralai/Mistral-Large-Instruct-2407"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# 3-bit, group size 128, symmetric quantization
bits, group_size, sym = 3, 128, True

autoround = AutoRound(
    model, tokenizer,
    bits=bits, group_size=group_size, sym=sym,
    nsamples=256, iters=512, batch_size=4,
    low_gpu_mem_usage=True, device='cuda',
)
autoround.quantize()

output_dir = "./Mistral-Large-Instruct-2407-3bit"
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)
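
For reference, here is a minimal sketch of loading the resulting checkpoint for inference with transformers (this is not part of the conversion script; it assumes auto-gptq and optimum are installed, and 3-bit GPTQ kernel support varies by backend):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

quantized_id = "MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit"

# transformers picks up the GPTQ quantization config stored in the checkpoint
model = AutoModelForCausalLM.from_pretrained(
    quantized_id,
    device_map="auto",          # spread the ~50 GB of weights across available GPUs
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(quantized_id)

messages = [{"role": "user", "content": "Summarize GPTQ quantization in one paragraph."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=200)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))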

Evals were run with lm-eval-harness.

Example command (substitute the model id in m to evaluate other checkpoints):
# !pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git auto-gptq optimum
m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
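
The corresponding command for the 3-bit GPTQ checkpoint in this repo (same environment; batch size 2 to match the run reported below) would presumably be:

m="MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 2 --output_path ./eval/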

This model (3-bit AutoRound GPTQ): hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2

| Tasks    | Version | Filter | n-shot | Metric            | Value  | Stderr |
|----------|---------|--------|--------|-------------------|--------|--------|
| wikitext | 2       | none   | 0      | bits_per_byte ↓   | 0.4103 | ± N/A  |
|          |         | none   | 0      | byte_perplexity ↓ | 1.3290 | ± N/A  |
|          |         | none   | 0      | word_perplexity ↓ | 4.5765 | ± N/A  |

vs. 3-bit VPTQ: hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric            | Value  | Stderr |
|----------|---------|--------|--------|-------------------|--------|--------|
| wikitext | 2       | none   | 0      | bits_per_byte ↓   | 0.4017 | ± N/A  |
|          |         | none   | 0      | byte_perplexity ↓ | 1.3211 | ± N/A  |
|          |         | none   | 0      | word_perplexity ↓ | 4.4324 | ± N/A  |

vs. 4-bit GPTQ: hf (pretrained=ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric            | Value  | Stderr |
|----------|---------|--------|--------|-------------------|--------|--------|
| wikitext | 2       | none   | 0      | bits_per_byte ↓   | 0.3536 | ± N/A  |
|          |         | none   | 0      | byte_perplexity ↓ | 1.2777 | ± N/A  |
|          |         | none   | 0      | word_perplexity ↓ | 3.7082 | ± N/A  |

vs. 4-bit VPTQ: hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-65536-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1

| Tasks    | Version | Filter | n-shot | Metric            | Value  | Stderr |
|----------|---------|--------|--------|-------------------|--------|--------|
| wikitext | 2       | none   | 0      | bits_per_byte ↓   | 0.3415 | ± N/A  |
|          |         | none   | 0      | byte_perplexity ↓ | 1.2671 | ± N/A  |
|          |         | none   | 0      | word_perplexity ↓ | 3.5463 | ± N/A  |

vs. EXL2 4.00 bpw (I believe these numbers come from a different evaluation setup, so they are not directly comparable):

|               | Wikitext | C4    | FineWeb | Max VRAM |
|---------------|----------|-------|---------|----------|
| EXL2 4.00 bpw | 2.885    | 6.484 | 6.246   | 60.07 GB |