jerryzh168 committed
Commit d3cbc97 · verified · 1 Parent(s): a150c6d

Update README.md

Files changed (1)
  1. README.md +2 -2
README.md CHANGED
@@ -12,7 +12,7 @@ language:
  This repository hosts the **Qwen3-8B** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao)
  using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.
  This work is brought to you by the PyTorch team. This model can be used directly or served using [vLLM](https://docs.vllm.ai/en/latest/) for 53% VRAM reduction (7.82 GB needed)
- and TODOx speedup on H100 GPUs. The model is calibrated with 2 samples from `bbh` task to recover the accuracy for `bbh` specifically.
+ and 1.34x speedup on H100 GPUs for batch size 1. The model is calibrated with 5 samples from the `bbh` task to recover accuracy on `bbh` specifically.

  # Inference with vLLM
  Install vllm nightly and torchao nightly to get some recent changes:
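The install commands themselves are truncated in this hunk. As a minimal, hedged sketch (not the model card's own snippet), offline inference against this checkpoint through vLLM's Python API could look like the following; the sampling values are illustrative placeholders.

```python
# Sketch: offline inference with vLLM's Python API, assuming the vllm and
# torchao nightlies are installed as the README's install step requires.
from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-8B-AWQ-INT4")
params = SamplingParams(temperature=0.6, top_p=0.95, max_tokens=256)

outputs = llm.generate(["Briefly explain int4 weight-only quantization."], params)
print(outputs[0].outputs[0].text)
```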
@@ -303,7 +303,7 @@ print(f"Peak Memory Usage: {mem:.02f} GB")
  | Benchmark (Latency)    |               |                       |                           |
  |------------------------|---------------|-----------------------|---------------------------|
  |                        | Qwen/Qwen3-8B | pytorch/Qwen3-8B-INT4 | pytorch/Qwen3-8B-AWQ-INT4 |
- | latency (batch_size=1) | 2.46s         | 1.40s (1.76x speedup) |                           |
+ | latency (batch_size=1) | 2.46s         | 1.40s (1.76x speedup) | 1.83s (1.34x speedup)     |

  <details>
  <summary> Reproduce Model Performance Results </summary>
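The exact reproduction steps live in the details block above, which this diff truncates. A hedged sketch of how a batch_size=1 latency figure like those in the table could be measured with vLLM's Python API; the warmup call and the fixed 128-token decode length are assumptions, not settings taken from the README.

```python
# Sketch: wall-clock latency for a single request with a fixed decode
# length (ignore_eos keeps the output length constant across models).
import time

from vllm import LLM, SamplingParams

llm = LLM(model="pytorch/Qwen3-8B-AWQ-INT4")
params = SamplingParams(max_tokens=128, ignore_eos=True)

llm.generate(["warmup"], params)  # warm up kernels before timing
start = time.perf_counter()
llm.generate(["The capital of France is"], params)
print(f"latency (batch_size=1): {time.perf_counter() - start:.2f}s")
```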
 
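The first hunk notes the model "can be used directly". A minimal sketch of that direct path with plain transformers, assuming (as is typical for torchao checkpoints on the Hub) that the quantization config ships inside the checkpoint, so no extra quantization arguments are needed at load time:

```python
# Sketch: direct use with transformers; the saved torchao quantization
# config is assumed to be picked up automatically from the checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "pytorch/Qwen3-8B-AWQ-INT4"
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("What does AWQ calibration do?", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```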