4-bit quantized models using more memory than they should
#2 opened by sovitrath
As I understand it, a 4B model quantized to 4-bit should use somewhere around 2.5 GB of VRAM. But I am seeing higher VRAM usage with INT4 quantization across all the Qwen3 models. For example, the 4B model uses 5 GB of VRAM when quantized with BitsAndBytes directly, and around 4.2 GB with the Unsloth bnb model. Any ideas what architectural changes they made that could account for this?
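For reference, a minimal sketch of the kind of load/measurement I mean (the model ID and the NF4/compute-dtype settings here are assumptions on my part, not necessarily the exact config behind the 5 GB figure):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Assumed model ID and 4-bit settings; adjust to your exact checkpoint.
model_id = "Qwen/Qwen3-4B"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="cuda:0",
)

# GPU memory actually held by the weights after loading.
torch.cuda.synchronize()
print(f"allocated: {torch.cuda.memory_allocated() / 1e9:.2f} GB")
print(f"reserved:  {torch.cuda.memory_reserved() / 1e9:.2f} GB")
```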
The 8-bit models, on the other hand, are quite stable and use almost exactly as many GB of VRAM as they have billions of parameters, e.g. the 4B model ~4.5 GB and the 8B model ~9 GB.
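The back-of-the-envelope math I am basing the expectation on (weights only; it ignores the KV cache, activations, and the per-block scales that quantization adds, which is why the real numbers run a bit higher):

```python
def expected_weight_gb(params_billion: float, bits_per_param: float) -> float:
    # Weight storage only: params * bits / 8, expressed in GB.
    # Ignores KV cache, activations, and quantization metadata.
    return params_billion * 1e9 * bits_per_param / 8 / 1e9

print(expected_weight_gb(4, 4))  # ~2.0 GB -> roughly 2.5 GB with overhead
print(expected_weight_gb(4, 8))  # ~4.0 GB, close to the observed ~4.5 GB
print(expected_weight_gb(8, 8))  # ~8.0 GB, close to the observed ~9 GB
```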