---
license: mit
license_link: https://huggingface.co/microsoft/phi-4/resolve/main/LICENSE
language:
- en
pipeline_tag: text-generation
base_model: microsoft/phi-4
tags:
- phi
- nlp
- math
- code
- chat
- conversational
- neuralmagic
- redhat
- llmcompressor
- quantized
- int8
---

# phi-4-quantized.w8a8

## Model Overview

- **Model Architecture:** Phi3ForCausalLM
  - **Input:** Text
  - **Output:** Text
- **Model Optimizations:**
  - **Activation quantization:** INT8
  - **Weight quantization:** INT8
- **Intended Use Cases:** This model is designed to accelerate research on language models and to serve as a building block for generative AI powered features. It is intended for general-purpose AI systems and applications (primarily in English) that require:
  1. Memory/compute constrained environments.
  2. Latency bound scenarios.
  3. Reasoning and logic.
- **Out-of-scope:** This model is not specifically designed or evaluated for all downstream purposes, thus:
  1. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly for high-risk scenarios.
  2. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case, including the model’s focus on English.
  3. Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.
- **Release Date:** 03/03/2025
- **Version:** 1.0
- **Model Developers:** Red Hat (Neural Magic)

### Model Optimizations

This model was obtained by quantizing the activations and weights of [phi-4](https://huggingface.co/microsoft/phi-4) to the INT8 data type.
This optimization reduces the number of bits used to represent weights and activations from 16 to 8, reducing GPU memory requirements (by approximately 50%) and increasing matrix-multiply compute throughput (by approximately 2x).
Weight quantization also reduces disk size requirements by approximately 50%.

Only the weights and activations of the linear operators within transformer blocks are quantized.
Weights are quantized with a symmetric static per-channel scheme, whereas activations are quantized with a symmetric dynamic per-token scheme.
A combination of the [SmoothQuant](https://arxiv.org/abs/2211.10438) and [GPTQ](https://arxiv.org/abs/2210.17323) algorithms is applied for quantization, as implemented in the [llm-compressor](https://github.com/vllm-project/llm-compressor) library.

## Deployment

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "neuralmagic-ent/phi-4-quantized.w8a8"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.7, top_p=0.8, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "user", "content": "Give me a short introduction to large language model."},
]

# Render the chat template to a plain string. add_generation_prompt=True appends
# the assistant turn header so the model answers instead of continuing the user turn.
prompts = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

### Deploy on Red Hat AI Inference Server

```bash
$ podman run --rm -it --device nvidia.com/gpu=all -p 8000:8000 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
  --env "HF_HUB_OFFLINE=0" -v ~/.cache/vllm:/home/vllm/.cache \
  --name=vllm \
  registry.access.redhat.com/rhaiis/rh-vllm-cuda \
  vllm serve \
  --tensor-parallel-size 8 \
  --max-model-len 32768 \
  --enforce-eager --model RedHatAI/phi-4-quantized.w8a8
```

See [Red Hat AI Inference Server documentation](https://docs.redhat.com/en/documentation/red_hat_ai_inference_server/) for more details.
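
Once the container above is running, the server should expose vLLM's OpenAI-compatible API on the mapped port. The following is a minimal client sketch using only the Python standard library. It assumes the server is reachable at `http://localhost:8000` (from the `-p 8000:8000` mapping) and that the model is served under the name passed to `--model`; `build_chat_request` and `chat` are hypothetical helper names introduced for illustration.

```python
import json
import urllib.request

# Assumptions: the server from the podman command is reachable on
# localhost:8000 and serves the model under the name given to --model.
BASE_URL = "http://localhost:8000"
MODEL = "RedHatAI/phi-4-quantized.w8a8"


def build_chat_request(prompt, max_tokens=256, temperature=0.7):
    """Build a urllib Request for the /v1/chat/completions endpoint."""
    payload = {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt)) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
# print(chat("Give me a short introduction to large language models."))
```

Building the request separately from sending it keeps the payload easy to inspect or log before any network call is made.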

### Deploy on Red Hat Enterprise Linux AI

```bash
# Download model from Red Hat Registry via docker
# Note: This downloads the model to ~/.cache/instructlab/models unless --model-dir is specified.
ilab model download --repository docker://registry.redhat.io/rhelai1/phi-4-quantized-w8a8:1.5
```

```bash
# Serve model via ilab
ilab model serve --model-path ~/.cache/instructlab/models/phi-4-quantized-w8a8

# Chat with model
ilab model chat --model ~/.cache/instructlab/models/phi-4-quantized-w8a8
```

See [Red Hat Enterprise Linux AI documentation](https://docs.redhat.com/en/documentation/red_hat_enterprise_linux_ai/1.4) for more details.

### Deploy on Red Hat OpenShift AI

```yaml
# Setting up vllm server with ServingRuntime
# Save as: vllm-servingruntime.yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: vllm-cuda-runtime # OPTIONAL CHANGE: set a unique name
  annotations:
    openshift.io/display-name: vLLM NVIDIA GPU ServingRuntime for KServe
    opendatahub.io/recommended-accelerators: '["nvidia.com/gpu"]'
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  annotations:
    prometheus.io/port: '8080'
    prometheus.io/path: '/metrics'
  multiModel: false
  supportedModelFormats:
    - autoSelect: true
      name: vLLM
  containers:
    - name: kserve-container
      image: quay.io/modh/vllm:rhoai-2.20-cuda # CHANGE if needed. If AMD: quay.io/modh/vllm:rhoai-2.20-rocm
      command:
        - python
        - -m
        - vllm.entrypoints.openai.api_server
      args:
        - "--port=8080"
        - "--model=/mnt/models"
        - "--served-model-name={{.Name}}"
      env:
        - name: HF_HOME
          value: /tmp/hf_home
      ports:
        - containerPort: 8080
          protocol: TCP
```

```yaml
# Attach model to vllm server. This is an NVIDIA template
# Save as: inferenceservice.yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  annotations:
    openshift.io/display-name: phi-4-quantized.w8a8 # OPTIONAL CHANGE
    serving.kserve.io/deploymentMode: RawDeployment
  name: phi-4-quantized.w8a8 # specify model name. This value will be used to invoke the model in the payload
  labels:
    opendatahub.io/dashboard: 'true'
spec:
  predictor:
    maxReplicas: 1
    minReplicas: 1
    model:
      modelFormat:
        name: vLLM
      name: ''
      resources:
        limits:
          cpu: '2' # this is model specific
          memory: 8Gi # this is model specific
          nvidia.com/gpu: '1' # this is accelerator specific
        requests: # same comment for this block
          cpu: '1'
          memory: 4Gi
          nvidia.com/gpu: '1'
      runtime: vllm-cuda-runtime # must match the ServingRuntime name above
      storageUri: oci://registry.redhat.io/rhelai1/modelcar-phi-4-quantized-w8a8:1.5
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists
```

```bash
# Make sure you are in the project where you want to deploy the model
# oc project <project-name>

# Apply both resources to run the model

# Apply the ServingRuntime
oc apply -f vllm-servingruntime.yaml

# Apply the InferenceService
oc apply -f inferenceservice.yaml
```

```bash
# Replace <inference-service-name> and <domain> below:
#   - Run `oc get inferenceservice` to find your URL if unsure.

# Call the server using curl:
curl https://<inference-service-name>-predictor-default.<domain>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "phi-4-quantized.w8a8",
    "stream": true,
    "stream_options": {
      "include_usage": true
    },
    "max_tokens": 1,
    "messages": [
      {
        "role": "user",
        "content": "How can a bee fly when its wings are so small?"
      }
    ]
  }'
```

See [Red Hat OpenShift AI documentation](https://docs.redhat.com/en/documentation/red_hat_openshift_ai/2025) for more details.
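
Because the curl request sets `"stream": true`, the response arrives as a sequence of server-sent events rather than a single JSON document. The sketch below shows one way to reassemble the text from such a stream, assuming the usual OpenAI-style chunk layout (`data: {...}` lines ending with `data: [DONE]`). The helper name `iter_stream_text` and the sample lines are illustrative only and are not real model output.

```python
import json


def iter_stream_text(lines):
    """Yield text fragments from OpenAI-style streaming response lines.

    `lines` is any iterable of decoded lines from the HTTP response body.
    """
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and other SSE fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(data)
        # With include_usage enabled, the final chunk has an empty "choices"
        # list and carries only the token counts in "usage".
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Hand-written sample lines in the shape of a streamed response
# (illustrative only, not captured from a real server).
sample = [
    'data: {"choices": [{"index": 0, "delta": {"role": "assistant", "content": ""}}]}',
    "",
    'data: {"choices": [{"index": 0, "delta": {"content": "Bees"}}]}',
    'data: {"choices": [{"index": 0, "delta": {"content": " flap fast."}}]}',
    'data: {"choices": [], "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15}}',
    "data: [DONE]",
]
print("".join(iter_stream_text(sample)))
```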

## Creation


This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot
from datasets import load_dataset

# Load model
model_stub = "microsoft/phi-4"
model_name = model_stub.split("/")[-1]

num_samples = 1024
max_seq_len = 8192

tokenizer = AutoTokenizer.from_pretrained(model_stub)

model = AutoModelForCausalLM.from_pretrained(
    model_stub,
    device_map="auto",
    torch_dtype="auto",
)

def preprocess_fn(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], add_generation_prompt=False, tokenize=False)}

ds = load_dataset("neuralmagic/LLM_compression_calibration", split="train")
ds = ds.map(preprocess_fn)

# Configure the quantization algorithm and scheme
recipe = [
    SmoothQuantModifier(
        smoothing_strength=0.7,
        mappings=[
            [["re:.*qkv_proj"], "re:.*input_layernorm"],
            [["re:.*gate_up_proj"], "re:.*post_attention_layernorm"],
        ],
    ),
    GPTQModifier(
        ignore=["lm_head"],
        sequential_targets=["Phi3DecoderLayer"],
        dampening_frac=0.01,
        targets="Linear",
        scheme="W8A8",
    ),
]

# Apply quantization
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=max_seq_len,
    num_calibration_samples=num_samples,
)

# Save to disk in compressed-tensors format
save_path = model_name + "-quantized.w8a8"
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f"Model and tokenizer saved to: {save_path}")
```
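
To make the `W8A8` scheme in the recipe more concrete, here is a small numerical sketch of symmetric per-channel weight quantization combined with dynamic per-token activation quantization. It is not the llm-compressor implementation: plain round-to-nearest stands in for GPTQ, SmoothQuant is omitted, the `[-127, 127]` integer range is an assumption, and `quantize_rows` and `w8a8_linear` are hypothetical helper names.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row.

    Returns (int_rows, scales) such that each row is approximately
    scale * int_row.
    """
    q_rows, scales = [], []
    for row in matrix:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / 127 if max_abs > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, q_weight, w_scales):
    """Compute x @ W.T from INT8 operands plus a floating-point rescale."""
    # Dynamic per-token: activation scales are derived from the live input.
    q_x, x_scales = quantize_rows(x)
    out = []
    for q_tok, s_x in zip(q_x, x_scales):
        out.append([
            # Integer dot product, then rescale by both scales.
            s_x * s_w * sum(a * b for a, b in zip(q_tok, q_ch))
            for q_ch, s_w in zip(q_weight, w_scales)
        ])
    return out


# Toy values for illustration only.
weight = [[0.50, -0.25, 0.10], [0.02, 0.04, -0.08]]  # (out_features, in_features)
x = [[1.0, 2.0, -3.0], [0.1, 0.0, 0.2]]              # (tokens, in_features)

# Static per-channel: weight scales are computed once, offline.
q_weight, w_scales = quantize_rows(weight)

approx = w8a8_linear(x, q_weight, w_scales)
exact = [[sum(a * b for a, b in zip(tok, ch)) for ch in weight] for tok in x]
print("quantized:", approx)
print("reference:", exact)
```

Each output channel of the weight matrix and each token of the input gets its own scale, so an outlier in one row does not reduce the resolution available to the others.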

## Evaluation

The model was evaluated on the OpenLLM leaderboard tasks (version 1) with the [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and the [vLLM](https://docs.vllm.ai/en/stable/) engine, using the following command:

```
lm_eval \
  --model vllm \
  --model_args pretrained="neuralmagic-ent/phi-4-quantized.w8a8",dtype=auto,gpu_memory_utilization=0.6,max_model_len=4096,enable_chunked_prefill=True,tensor_parallel_size=1 \
  --tasks openllm \
  --batch_size auto
```

### Accuracy

#### Open LLM Leaderboard evaluation scores

| Benchmark | phi-4 | phi-4-quantized.w8a8 (this model) | Recovery |
|---|---|---|---|
| MMLU (5-shot) | 80.30 | 80.39 | 100.1% |
| ARC Challenge (25-shot) | 64.42 | 64.33 | 99.9% |
| GSM-8K (5-shot, strict-match) | 90.07 | 90.30 | 100.3% |
| Hellaswag (10-shot) | 84.37 | 84.30 | 99.9% |
| Winogrande (5-shot) | 80.58 | 79.95 | 99.2% |
| TruthfulQA (0-shot, mc2) | 59.37 | 58.82 | 99.1% |
| Average | 76.52 | 76.35 | 99.8% |
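
The card does not define the Recovery column explicitly, but the reported values are consistent with the quantized score divided by the baseline score, expressed as a percentage. A short sketch that recomputes the column from the scores in the table (the `recovery` helper is a hypothetical name):

```python
# Scores copied from the table above: (baseline phi-4, quantized model).
scores = {
    "MMLU (5-shot)": (80.30, 80.39),
    "ARC Challenge (25-shot)": (64.42, 64.33),
    "GSM-8K (5-shot, strict-match)": (90.07, 90.30),
    "Hellaswag (10-shot)": (84.37, 84.30),
    "Winogrande (5-shot)": (80.58, 79.95),
    "TruthfulQA (0-shot, mc2)": (59.37, 58.82),
}


def recovery(baseline, quantized):
    """Quantized score as a percentage of the baseline score."""
    return 100.0 * quantized / baseline


for name, (base, quant) in scores.items():
    print(f"{name}: {recovery(base, quant):.1f}%")

# The Average row is the unweighted mean over the six benchmarks.
avg_base = sum(b for b, _ in scores.values()) / len(scores)
avg_quant = sum(q for _, q in scores.values()) / len(scores)
print(f"Average: {avg_base:.2f} vs {avg_quant:.2f} ({recovery(avg_base, avg_quant):.1f}%)")
```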