Quantization issues
Has anybody succeeded in quantizing this model?
I'm struggling to fit this model into 64 GB of VRAM with vLLM. Even with on-the-fly FP8 quantization it fails to start unless I enable some CPU offloading (on-the-fly FP8 needs the BF16 weights to fit in memory temporarily at startup), which slows down inference. I also tried bitsandbytes quantization, but that apparently isn't supported for this architecture.
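Roughly what I'm launching, in case it helps (a minimal sketch; the offload size and context length are just values I've been experimenting with):

```python
# Sketch of the on-the-fly FP8 attempt in vLLM (values illustrative, adjust to your setup)
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3-Omni-30B-A3B-Instruct",
    quantization="fp8",     # quantized at load time, so the BF16 weights must briefly fit
    cpu_offload_gb=16,      # needed to get past the startup OOM, but it hurts throughput
    max_model_len=8192,
)
```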
I'm trying to compress it with llm-compressor, but I get the following error when calling oneshot():
... in oneshot: one_shot = Oneshot(**local_args, **kwargs)
... File ".../compressed_tensors_utils.py", line 131, in untie_word_embeddings
... input_embed = model.get_input_embeddings()
NotImplementedError: `get_input_embeddings` not auto-handled for Qwen3OmniMoeForConditionalGeneration; please override in the subclass.
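For reference, this is roughly the call that triggers it (a minimal sketch; the FP8_DYNAMIC recipe is just what I'm trying, and the crash happens before quantization even starts, while llm-compressor tries to untie the word embeddings):

```python
# Sketch of the oneshot() call (recipe/scheme illustrative)
from transformers import Qwen3OmniMoeForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = Qwen3OmniMoeForConditionalGeneration.from_pretrained(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct", torch_dtype="auto", device_map="auto"
)

# FP8 dynamic quantization needs no calibration data
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)  # raises NotImplementedError in untie_word_embeddings
```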
Can anybody suggest a solution?
You could use the quantization code for Qwen2.5-Omni as a reference.
I haven't tried it yet, but this guy seems to have managed:
cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit
I use a lot of his quants.
@gghfez
I've tried this quant and I think it's a good option. Loading the weights consumes roughly 20 GB of memory in vLLM.
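For anyone else trying it, this is roughly how I load it (a sketch; max_model_len is just what I happened to use):

```python
# Loading the prequantized AWQ checkpoint in vLLM (settings illustrative)
from vllm import LLM, SamplingParams

llm = LLM(
    model="cpatonn/Qwen3-Omni-30B-A3B-Instruct-AWQ-4bit",
    max_model_len=8192,  # vLLM picks up the AWQ quantization config from the checkpoint
)
out = llm.generate(["Give me a one-sentence summary of AWQ quantization."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```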