---
library_name: transformers
license: mit
tags:
- torchao
---

# Quantization Recipe

We used the following code to get the quantized model:

```
import torch
from transformers import AutoModelForCausalLM
from torchao.quantization.quant_api import (
    Int8DynamicActivationIntxWeightConfig,
    MappingType,
    quantize_,
)
from torchao.quantization.granularity import PerGroup

model_id = "microsoft/Phi-4-mini-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# 8-bit dynamic activations, 4-bit symmetric weights in groups of 32 (8dq4w)
linear_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    weight_mapping_type=MappingType.SYMMETRIC,
)

# quantize_ modifies the model in place
quantize_(
    model,
    linear_config,
)

# Save the quantized weights for the ExecuTorch conversion step
state_dict = model.state_dict()
torch.save(state_dict, "phi4-mini-8dq4w.pt")
```

# Model Quality

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

## baseline

```
lm_eval --model hf --model_args pretrained=microsoft/Phi-4-mini-instruct --tasks hellaswag --device cuda:0 --batch_size 8
```

## 8dq4w

```
from lm_eval import evaluator
from lm_eval.models.huggingface import HFLM
from lm_eval.utils import (
    make_table,
)

# `model` is the in-memory model after calling quantize_, as in the recipe:
# quantize_(
#     model,
#     linear_config,
# )
lm_eval_model = HFLM(pretrained=model, batch_size=8)
results = evaluator.simple_evaluate(
    lm_eval_model, tasks=["hellaswag"], device="cuda:0", batch_size="auto"
)
print(make_table(results))
```

| Benchmark                        | Phi-4 mini-Ins | phi4-mini-8dq4w |
|----------------------------------|----------------|-----------------|
| **Popular aggregated benchmark** |                |                 |
| **Reasoning**                    |                |                 |
| HellaSwag                        | 54.57          | 53.19           |
| **Multilingual**                 |                |                 |
| **Math**                         |                |                 |
| **Overall**                      | **TODO**       | **TODO**        |

# Exporting to ExecuTorch

Exporting to ExecuTorch requires that you clone and install [ExecuTorch](https://github.com/pytorch/executorch).
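
Before running the export commands, it can help to confirm that the `executorch` package is visible to the Python environment you plan to use. The snippet below is a minimal sketch using only the standard library. The helper name `is_installed` is hypothetical and not part of ExecuTorch, and the check only confirms that the top-level package can be found, not that the install is complete.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be located without importing it."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    if is_installed("executorch"):
        print("executorch is visible to this Python environment.")
    else:
        print("executorch was not found; install it from the ExecuTorch repository first.")
```

If the check fails inside a virtual environment, the usual cause is that ExecuTorch was installed with a different interpreter than the one running `python -m ...`.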

## Convert quantized checkpoint to ExecuTorch's format

```
python -m executorch.examples.models.phi_4_mini.convert_weights phi4-mini-8dq4w.pt phi4-mini-8dq4w-converted.pt
```

## Export to an ExecuTorch *.pte with XNNPACK

```
PARAMS="executorch/examples/models/phi_4_mini/config.json"
python -m executorch.examples.models.llama.export_llama \
  --model "phi_4_mini" \
  --checkpoint "phi4-mini-8dq4w-converted.pt" \
  --params "$PARAMS" \
  -kv \
  --use_sdpa_with_kv_cache \
  -X \
  --metadata '{"get_bos_id":128000, "get_eos_ids":[128009, 128001]}' \
  --output_name="phi4-mini-8dq4w.pte"
```

## Run model with pybindings

```
export TOKENIZER="/path/to/tokenizer.json"
export TOKENIZER_CONFIG="/path/to/tokenizer_config.json"
export PROMPT="<|system|><|end|><|user|>What is in a california roll?<|end|><|assistant|>"
python -m executorch.examples.models.llama.runner.native \
  --model phi_4_mini \
  --pte phi4-mini-8dq4w.pte \
  -kv \
  --tokenizer ${TOKENIZER} \
  --tokenizer_config ${TOKENIZER_CONFIG} \
  --prompt "${PROMPT}" \
  --params "${PARAMS}" \
  --max_len 128 \
  --temperature 0
```
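
The `PROMPT` value above wraps the user message in the `<|system|>`, `<|user|>`, `<|end|>`, and `<|assistant|>` tags, with an empty system slot. To try other prompts, the same layout can be produced programmatically. The following is a small sketch: the function `build_prompt` is a hypothetical helper, not part of ExecuTorch or transformers, and it assumes the tag layout shown in `PROMPT` is all the runner needs.

```python
def build_prompt(user_message: str, system_message: str = "") -> str:
    """Wrap messages in the same tag layout as the PROMPT variable above."""
    return (
        f"<|system|>{system_message}<|end|>"
        f"<|user|>{user_message}<|end|>"
        "<|assistant|>"
    )


if __name__ == "__main__":
    # Reproduces the PROMPT string used in the run command
    print(build_prompt("What is in a california roll?"))
```

The resulting string can be passed to the runner through `--prompt`. Leaving `system_message` empty reproduces the prompt used in this card.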