---
library_name: transformers
license: mit
tags:
- torchao
---
# Quantization Recipe
We used the following code to produce the quantized model:
```python
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model_id = "microsoft/Phi-4-mini-instruct"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8-bit dynamic activations with 4-bit grouped weights (8dq4w), group size 32
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_mapping_type=MappingType.SYMMETRIC,
)

# Quantize the model's linear layers in place
quantize_(model, linear_config)

# Save the quantized weights for later conversion to ExecuTorch
state_dict = model.state_dict()
torch.save(state_dict, "phi4-mini-8dq4w.pt")
```
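
Before saving, it can be useful to smoke-test the quantized model with a short generation. A minimal sketch (the prompt is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id)

# Illustrative smoke test: generate a few tokens from the quantized model
inputs = tokenizer("What is in a california roll?", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```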
# Model Quality
We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.
## baseline
```sh
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```
## 8dq4w
```python
import lm_eval
from lm_eval import evaluator
from lm_eval.utils import make_table

# `model` is the quantized model from the recipe above,
# i.e. after calling quantize_(model, linear_config)
lm_eval_model = lm_eval.models.huggingface.HFLM(pretrained=model, batch_size=8)
results = evaluator.simple_evaluate(
    lm_eval_model, tasks=["hellaswag"], device="cuda:0", batch_size="auto"
)
print(make_table(results))
```
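
Beyond the printed table, individual metrics can be read from the returned results dict. A minimal sketch (key names may vary across lm-evaluation-harness versions):

```python
# Illustrative: pull HellaSwag metrics out of the results dict
hellaswag = results["results"]["hellaswag"]
print(f"acc: {hellaswag['acc,none']:.4f}, acc_norm: {hellaswag['acc_norm,none']:.4f}")
```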
| Benchmark | Phi-4 mini-Ins | phi4-mini-8dq4w |
|-----------|----------------|-----------------|
| HellaSwag | 54.57          | 53.19           |
# Exporting to ExecuTorch
Exporting to ExecuTorch requires you to clone and install [ExecuTorch](https://github.com/pytorch/executorch).
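A rough sketch of the setup (the exact install script may differ between releases; see the ExecuTorch README for the current, authoritative steps):

```sh
# Clone ExecuTorch and install it from source; check the repo's README
# for the up-to-date install instructions for your release
git clone https://github.com/pytorch/executorch.git
cd executorch
git submodule update --init --recursive
./install_executorch.sh
```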
## Convert quantized checkpoint to ExecuTorch's format
```sh
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```
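
As a quick sanity check, the converted file should load as a plain PyTorch checkpoint. A minimal sketch, assuming the converter writes a torch.save'd state dict:

```python
import torch

# Illustrative: confirm the converted checkpoint loads and count its tensors
state_dict = torch.load("phi4-mini-8dq4w-converted.pt", map_location="cpu")
print(f"{len(state_dict)} tensors in converted checkpoint")
```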
## Export to an ExecuTorch *.pte with XNNPACK
```sh
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```
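
The export writes the program to the current directory; a quick check that it was produced:

```sh
ls -lh phi4-mini-8dq4w.pte
```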
## Run model with pybindings
```sh
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>What is in a california roll?<|end|><|assistant|>"
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```
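
For lower-level access than the runner script, the exported program can also be loaded with ExecuTorch's Python runtime API. A minimal sketch, assuming a recent ExecuTorch build with pybindings enabled (token preparation and the generation loop are omitted; the runner above handles those):

```python
# Illustrative: load the exported program with ExecuTorch's Python runtime
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("phi4-mini-8dq4w.pte")
method = program.load_method("forward")  # generation repeatedly executes this method
```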