This repository hosts the **Qwen3-8B** model quantized with [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao) using int4 weight-only quantization and the [awq](https://arxiv.org/abs/2306.00978) algorithm.

This work is brought to you by the PyTorch team. The model can be used directly or served with [vLLM](https://docs.vllm.ai/en/latest/) for a 53% VRAM reduction (7.82 GB needed) and a TODOx speedup on H100 GPUs. The model is calibrated with 2 samples from the `bbh` task to recover accuracy on `bbh` specifically.

# Inference with vLLM
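A minimal sketch of serving the model with vLLM's OpenAI-compatible server. `MODEL_ID` is a placeholder, since this repository's Hub id is not stated above; substitute the actual repository id.

```shell
# Launch an OpenAI-compatible server (assumes `pip install vllm`).
# MODEL_ID is a hypothetical placeholder for this repository's Hub id.
vllm serve MODEL_ID --max-model-len 8192

# From another terminal, send a chat completion request:
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "MODEL_ID",
        "messages": [{"role": "user", "content": "Hello"}]
      }'
```

vLLM loads torchao-quantized checkpoints through its Hugging Face integration, so no extra quantization flags should be needed at serve time.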