---
library_name: transformers
license: mit
tags:
- torchao
---

# Quantization Recipe

We used the following code to produce the quantized model:

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model_id = "microsoft/Phi-4-mini-instruct"

# Load the original model in its native dtype.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8dq4w: int8 dynamic activation quantization with int4 weights,
# quantized symmetrically in groups of 32.
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_mapping_type=MappingType.SYMMETRIC,
)
# quantize_ modifies the model in place.
quantize_(model, linear_config)

# Save the quantized weights for the ExecuTorch export steps.
torch.save(model.state_dict(), "phi4-mini-8dq4w.pt")
```
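
To give some intuition for what `PerGroup(32)` with a symmetric mapping means, here is a minimal pure-Python sketch of group-wise symmetric quantization to a signed 4-bit range. It is not torchao's implementation: the helper names are hypothetical, and the scale choice (`max|w| / 7`) is an assumption made for illustration.

```python
# Illustrative only: symmetric per-group quantization to a signed 4-bit range.
# NOT torchao's kernel; helper names and the scale rule (max|w| / 7) are assumptions.

def quantize_group_symmetric(weights, qmin=-8, qmax=7):
    """Quantize one group of floats to integers in [qmin, qmax] with one shared scale."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero group to avoid dividing by zero.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [min(qmax, max(qmin, round(w / scale))) for w in weights]
    return q, scale


def dequantize_group(q, scale):
    """Map the integers back to approximate float values."""
    return [v * scale for v in q]


def quantize_per_group(row, group_size=32):
    """Split a weight row into consecutive groups and quantize each independently."""
    assert len(row) % group_size == 0, "row length must be a multiple of group_size"
    return [
        quantize_group_symmetric(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


# A synthetic 64-element "weight row", so we get two groups of 32.
row = [((i * 37) % 64 - 32) / 100.0 for i in range(64)]
groups = quantize_per_group(row, group_size=32)

# Reconstruct the row and measure the worst-case rounding error.
recon = [x for q, scale in groups for x in dequantize_group(q, scale)]
max_err = max(abs(a - b) for a, b in zip(row, recon))
print(f"groups: {len(groups)}, max abs error: {max_err:.5f}")
```

Each group stores its own scale, so a smaller group size tracks local weight ranges more closely at the cost of storing more scales.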

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

## Baseline
```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## 8dq4w
```
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from lm_eval.utils import make_table

# `model` is the quantized model, i.e. after calling
# quantize_(model, linear_config) as in the recipe above.
# The batch size is set on the HFLM wrapper, matching the baseline run.
lm_eval_model = HFLM(pretrained=model, batch_size=8)
results = evaluator.simple_evaluate(lm_eval_model, tasks=["hellaswag"])
print(make_table(results))
```

| Benchmark                        | Phi-4 mini-Ins | phi4-mini-8dq4w |
|----------------------------------|----------------|-----------------|
| **Popular aggregated benchmark** |                |                 |
| **Reasoning**                    |                |                 |
| HellaSwag                        | 54.57          | 53.19           |
| **Multilingual**                 |                |                 |
| **Math**                         |                |                 |
| **Overall**                      | **TODO**       | **TODO**        |
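
As a quick reading of the HellaSwag row, the snippet below computes the absolute and relative accuracy change between the two models. It only uses the two numbers reported in the table; the variable names are illustrative.

```python
# HellaSwag scores taken from the table above.
baseline_acc = 54.57   # Phi-4 mini-Ins
quantized_acc = 53.19  # phi4-mini-8dq4w

absolute_drop = baseline_acc - quantized_acc
relative_drop = absolute_drop / baseline_acc * 100

print(f"absolute drop: {absolute_drop:.2f} points")  # absolute drop: 1.38 points
print(f"relative drop: {relative_drop:.2f}%")        # relative drop: 2.53%
```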


# Exporting to ExecuTorch

Exporting to ExecuTorch requires that you first clone and install [ExecuTorch](https://github.com/pytorch/executorch).


## Convert quantized checkpoint to ExecuTorch's format
```
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```

## Export to an ExecuTorch *.pte with XNNPACK
```
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```

## Run model with pybindings
```
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>What is in a california roll?<|end|><|assistant|>"
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```
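
The `PROMPT` string above follows a simple chat layout: a system turn, a user turn, then an open assistant turn. If you want to try other questions, a small helper such as the one below can assemble the string. The function name `build_prompt` is hypothetical, and the layout is copied from the example prompt rather than taken from a tokenizer chat template.

```python
def build_prompt(user_message, system_message=""):
    """Assemble a prompt in the same layout as the PROMPT example above.

    Hypothetical helper: the special tokens are copied from that example,
    not generated from the tokenizer's chat template.
    """
    return (
        f"<|system|>{system_message}<|end|>"
        f"<|user|>{user_message}<|end|>"
        "<|assistant|>"
    )


prompt = build_prompt("What is in a california roll?")
print(prompt)
```

The resulting string can be passed to the runner through `--prompt`.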