Thank you and a couple QQs
FYI, I am able to use this model successfully on vLLM with 4x 3090s, even though other quantization approaches do not work with vLLM at a tensor-parallel size of 4. This gave me the evidence I needed that I can get my finetune quantized and running on this system.
With that said, I was able to quantize my finetune using your recipe and then load the quantized model using the config.json you provide.
A couple questions:
- Why do you quantize in bfloat16 but then load using float16?
- My model does not generate thinking content after quantization; was there anything you did to avoid that problem with your model?
- My model goes off the rails after replying for a bit: at first it responds to the query, but eventually it drifts into unrelated, random sentences... Have you encountered such an issue before?
Any pointers would be super appreciated. Quantization script below.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot  # in older releases: from llmcompressor.transformers import oneshot

model_id = "/path/to/my-finetune-bf16"    # finetuned bf16 model to quantize
recipe_path = "/path/to/awq_recipe.yaml"  # AWQ recipe
output_dir = "/path/to/my-finetune-awq"   # where the compressed model is written

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="balanced",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
    use_safetensors=True,
    max_memory={0: "23GB", 1: "23GB", 2: "23GB", 3: "23GB", "cpu": "256GB"},
)

# Load nvidia/HelpSteer2 for calibration
dataset = load_dataset("nvidia/HelpSteer2", split="train[:512]")
dataset = dataset.shuffle(seed=42)

# Preprocess dataset to produce a single "text" field per sample
def preprocess_fn(examples):
    return {"text": [p + " " + r for p, r in zip(examples["prompt"], examples["response"])]}

dataset = dataset.map(preprocess_fn, batched=True, remove_columns=dataset.column_names)

# Apply AWQ quantization with calibration
oneshot(
    model=model,
    recipe=recipe_path,
    dataset=dataset,
    num_calibration_samples=512,
    output_dir=output_dir,
    save_compressed=True,
    # overwrite_output_dir removed - not a valid oneshot parameter
)
For reference, the following works for your model without issue (including thinking):
docker run --gpus all -it --rm --shm-size=16g -v /ssd:/models vllm/vllm-openai:v0.10.2 --model /models/GLM-4.5-Air-AWQ-4bit/ --tensor-parallel-size 4 --max-model-len 16384 --enable-expert-parallel
Thank you for your interest.
- The float16 dtype is required at load time so the checkpoint is compatible with the kernels vLLM uses to load and run the quantized weights. The dtype used during quantization itself is insignificant.
- I am not sure, but this might be due to your choice of calibration dataset: nvidia/HelpSteer2 contains no thinking content (see the sketch after this list).
- I think the rambling and the random sentences come from higher perplexity, i.e. quantization loss. I don't remember the extent, but it also happens in my model, at least inside its chain of thought. The most recent model update reduces the rambling by leaving some of the active parameters unquantized, as shown in the ignore list inside config.json.
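To make the calibration point concrete, here is a rough sketch of building calibration text that keeps the thinking content by rendering full conversations with the model's chat template. The dataset name "my_org/reasoning-traces" and its "messages" column are placeholders, not something I publish; use any reasoning-trace data that matches what your finetune is supposed to produce, and note that depending on the template the thinking may sit inside the assistant content as <think>...</think> or in a separate reasoning field:

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Placeholder dataset: conversations whose assistant turns include the thinking
# content the finetuned model is expected to emit.
dataset = load_dataset("my_org/reasoning-traces", split="train[:512]").shuffle(seed=42)

def to_text(example):
    # Render the whole conversation with the model's own chat template so the
    # calibration text matches the token layout (including the thinking blocks)
    # that the model sees at inference time.
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

dataset = dataset.map(to_text, remove_columns=dataset.column_names)
# Pass this dataset to oneshot() exactly as in your script.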
In general, to get the quantized model behaving like the original bf16 model, in addition to the quantization config (e.g., group_size=32, ignore=["re:.*shared_experts.*", ...]; see the recipe sketch below), I would recommend using a calibration dataset synthesized by, or originating from, the original bf16 model. In my case, I use datasets synthesized by larger models such as DeepSeek-R1, since there is a high chance that GLM-4.5-Air was also trained on outputs from those larger models.
A good calibration dataset should increase accuracy and overall model quality, and decrease perplexity and the random rambling.
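To make the quantization-config part concrete, a rough llm-compressor recipe sketch with group_size=32 and an ignore list might look like the following. The argument names follow the compressed-tensors / QuantizationMixin interface of recent llm-compressor releases, and the ignore patterns are illustrative rather than the exact list from my config.json, so please check both against your installed version:

# Hypothetical AWQ recipe sketch for llm-compressor; verify the exact modifier
# arguments against the version you have installed.
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.awq import AWQModifier

recipe = AWQModifier(
    # Keep the most quantization-sensitive weights in their original precision
    # (illustrative patterns only).
    ignore=["lm_head", "re:.*shared_experts.*", "re:.*mlp.gate$"],
    config_groups={
        "group_0": QuantizationScheme(
            targets=["Linear"],
            weights=QuantizationArgs(
                num_bits=4,
                type="int",
                symmetric=True,
                strategy="group",
                group_size=32,
            ),
        )
    },
)

# The recipe object can be passed to oneshot() directly in place of a YAML path:
# oneshot(model=model, recipe=recipe, dataset=dataset, ...)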
Hope this helps.