Update README.md
README.md CHANGED

@@ -12,7 +12,7 @@ language:
 This repository hosts the **Qwen3-8B** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
 using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
 This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for 53% VRAM reduction (7.82 GB needed)
-and
+and a 1.34x speedup on H100 GPUs for batch size 1. The model is calibrated with 5 samples from the `bbh` task to recover accuracy on `bbh` specifically.
 
 # Inference with vLLM
 Install vllm nightly and torchao nightly to get some recent changes:

@@ -303,7 +303,7 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
 | Benchmark (Latency)    |               |                       |                           |
 |------------------------|---------------|-----------------------|---------------------------|
 |                        | Qwen/Qwen3-8B | pytorch/Qwen3-8B-INT4 | pytorch/Qwen3-8B-AWQ-INT4 |
-| latency (batch_size=1) | 2.46s         | 1.40s (1.76x speedup) |
+| latency (batch_size=1) | 2.46s         | 1.40s (1.76x speedup) | 1.83s (1.34x speedup)     |
 
 <details>
 <summary> Reproduce Model Performance Results </summary>
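The speedup figures in the benchmark row above are just the bf16 baseline latency divided by each quantized variant's latency. A quick sanity check of the arithmetic, using the numbers from the table:

```python
# Latencies in seconds at batch_size=1, taken from the benchmark table.
baseline = 2.46   # Qwen/Qwen3-8B (bf16)
int4 = 1.40       # pytorch/Qwen3-8B-INT4
awq_int4 = 1.83   # pytorch/Qwen3-8B-AWQ-INT4

# Speedup = baseline latency / quantized latency.
print(f"INT4 speedup:     {baseline / int4:.2f}x")      # 1.76x
print(f"AWQ-INT4 speedup: {baseline / awq_int4:.2f}x")  # 1.34x
```

Note that the AWQ-INT4 checkpoint is slightly slower than plain INT4; per the added README text, the point of the AWQ calibration step is recovering accuracy on the `bbh` task, not additional speed.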