Quantization issues
Has anybody succeeded in quantizing this model?
I'm struggling to fit this model into 64 GB of VRAM with vLLM. Even with on-the-fly FP8 quantization it fails to start unless I enable some CPU offloading (on-the-fly FP8 needs the BF16 weights to fit in memory temporarily at startup), which slows down inference. I also tried bitsandbytes quantization, but that apparently isn't supported for this architecture.
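Roughly what I'm launching, in case it helps (a minimal sketch; the offload size and context length are just values I've been experimenting with):

```python
# Sketch of the on-the-fly FP8 attempt in vLLM (values illustrative, adjust to your setup)
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    quantization="fp8",     # quantized at load time, so the BF16 weights must briefly fit
    cpu_offload_gb=16,      # needed to get past the startup OOM, but it hurts throughput
    max_model_len=8192,
)
```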
I'm trying to compress it with llm-compressor, but I get the following error when calling oneshot():
... in oneshot: one_shot = Oneshot(**local_args, **kwargs)
... File ".../compressed_tensors_utils.py", line 131, in untie_word_embeddings
... input_embed = model.get_input_embeddings()
NotImplementedError: `get_input_embeddings` not auto-handled for Qwen3OmniMoeForConditionalGeneration; please override in the subclass.
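For reference, this is roughly the call that triggers it (a minimal sketch; the FP8_DYNAMIC recipe is just what I'm trying, and the crash happens before quantization even starts, while llm-compressor tries to untie the word embeddings):

```python
# Sketch of the oneshot() call (recipe/scheme illustrative)
from transformers import Qwen3OmniMoeForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct", torch_dtype="auto", device_map="auto"
)

# FP8 dynamic quantization needs no calibration data
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)  # raises NotImplementedError in untie_word_embeddings
```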
Can anybody suggest a solution?
You could use the quantization code for Qwen2.5-Omni as a reference.
I haven't tried it yet, but this guy seems to have managed:
cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit
I use a lot of his quants.
@gghfez
I've tried this quant and I think it's a good option. Loading the weights consumes roughly 20 GB of memory in vLLM.
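For anyone else trying it, this is roughly how I load it (a sketch; max_model_len is just what I happened to use):

```python
# Loading the prequantized AWQ checkpoint in vLLM (settings illustrative)
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit",
    max_model_len=8192,  # vLLM picks up the AWQ quantization config from the checkpoint
)
out = llm.generate(["Give me a one-sentence summary of AWQ quantization."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```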