This repository hosts the **Qwen3-8B** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.

This work is brought to you by the PyTorch team. The model can be used directly or served with [vLLM](https://docs.vllm.ai/en/latest/) for a 53% VRAM reduction (7.82 GB needed) and a TODOx speedup on H100 GPUs. The model is calibrated with 2 samples from the `bbh` task to recover accuracy on `bbh` specifically.

# Inference with vLLM
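A minimal sketch of serving the model with vLLM's OpenAI-compatible server. `MODEL_ID` is a placeholder, since this repository's Hub id is not stated above; substitute the actual repository id.

```shell
# Launch an OpenAI-compatible server (assumes `pip install vllm`).
# MODEL_ID is a hypothetical placeholder for this repository's Hub id.
vllm serve MODEL_ID --max-model-len 8192

# From another terminal, send a chat completion request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MODEL_ID",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

vLLM loads torchao-quantized checkpoints through its Hugging Face integration, so no extra quantization flags should be needed at serve time.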