Update README.md
README.md CHANGED
@@ -13,7 +13,7 @@ This repository hosts the **Qwen3-8B** model quantized with [torchao](https://hu
 using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
 This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for a 53% VRAM reduction (7.82 GB needed)
 and a 1.34x speedup on H100 GPUs at batch size 1. The model is calibrated with 10 samples from the `mmlu_abstract_algebra` task to recover accuracy on `mmlu_abstract_algebra` specifically.
-AWQ-INT4 improves the `mmlu_abstract_algebra` accuracy of INT4 from 55 to 56, while the
+AWQ-INT4 improves the `mmlu_abstract_algebra` accuracy of INT4 from 55 to 56, while the bfloat16 baseline is 58.

 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:
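The hunk ends at the install step, so for orientation, a minimal sketch of what "vllm nightly and torchao nightly" typically means; the index URLs and the CUDA tag are assumptions rather than commands taken from this README:

```bash
# Assumed nightly install commands; check the vLLM and torchao docs for the current index URLs.
pip install vllm --pre --extra-index-url https://wheels.vllm.ai/nightly
pip install --pre torchao --index-url https://download.pytorch.org/whl/nightly/cu126  # adjust the CUDA tag to your GPU/driver
```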
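Since the context lines advertise serving the quantized checkpoint with vLLM, here is a minimal offline-inference sketch under stated assumptions: the repo id `pytorch/Qwen3-8B-AWQ-INT4` is a hypothetical placeholder for this repository's actual id, and the sampling parameters are illustrative, not tuned values from the README:

```python
# Minimal vLLM offline-inference sketch for the AWQ-INT4 checkpoint.
# NOTE: "pytorch/Qwen3-8B-AWQ-INT4" is a placeholder repo id, not confirmed by this diff.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-8B-AWQ-INT4")  # torchao-quantized weights load like any HF checkpoint
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=128)
outputs = llm.generate(["Briefly explain AWQ quantization."], params)
print(outputs[0].outputs[0].text)
```

Batch size 1 on an H100 is the setting for which the README claims the 1.34x speedup and 7.82 GB memory footprint.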