Update README.md
README.md CHANGED

@@ -12,7 +12,7 @@ language:
 This repository hosts the **Qwen3-8B** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
 using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
 This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for 53% VRAM reduction (7.82 GB needed)
-and
+and a 1.34x speedup on H100 GPUs for batch size 1. The model is calibrated with 5 samples from the `bbh` task to recover accuracy on `bbh` specifically.
 
 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:

@@ -303,7 +303,7 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | Benchmark (Latency)    |               |                       |                           |
 |------------------------|---------------|-----------------------|---------------------------|
 |                        | Qwen/Qwen3-8B | pytorch/Qwen3-8B-INT4 | pytorch/Qwen3-8B-AWQ-INT4 |
-| latency (batch_size=1) | 2.46s         | 1.40s (1.76x speedup) |
+| latency (batch_size=1) | 2.46s         | 1.40s (1.76x speedup) | 1.83s (1.34x speedup)     |
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
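The speedup figures in the benchmark row above are just the bf16 baseline latency divided by each quantized variant's latency. A quick sanity check of the arithmetic, using the numbers from the table:

```python
# Latencies in seconds at batch_size=1, taken from the benchmark table.
baseline = 2.46   # Qwen/Qwen3-8B (bf16)
int4 = 1.40       # pytorch/Qwen3-8B-INT4
awq_int4 = 1.83   # pytorch/Qwen3-8B-AWQ-INT4

# Speedup = baseline latency / quantized latency.
print(f"INT4 speedup:     {baseline / int4:.2f}x")      # 1.76x
print(f"AWQ-INT4 speedup: {baseline / awq_int4:.2f}x")  # 1.34x
```

Note that the AWQ-INT4 checkpoint is slightly slower than plain INT4; per the added README text, the point of the AWQ calibration step is recovering accuracy on the `bbh` task, not additional speed.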