Quantized MCQA Model – W8A8

Model Summary

This model is a quantized version of our MCQA model. It was produced with post-training quantization (PTQ) in the LLMCompressor framework, targeting both weights and activations (W8A8). A sketch of the recipe is given under Technical Details below.

Technical Details

  • Base model: hssawhney/mnlp-model
  • Quantization method: SmoothQuant + GPTQ
  • Precision: INT8 weights + INT8 activations (W8A8); non-quantized tensors (e.g., the excluded lm_head) remain BF16
  • Calibration data: 512 samples from zay25/quantization-dataset
  • Excluded layers: lm_head (to preserve output logits)
  • Parameters: 752M
  • Final model size: ~717 MB
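
For reference, the snippet below sketches how a SmoothQuant + GPTQ W8A8 recipe of this kind is typically expressed with LLMCompressor's one-shot PTQ API. It is an illustrative reconstruction, not the exact script used for this model: the sequence length, smoothing strength, and output directory are assumed values.

```python
# Illustrative W8A8 PTQ sketch with LLMCompressor (not the exact script
# used for this model). Assumes `llmcompressor` and `transformers` are
# installed and the calibration dataset is usable as-is.
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier
from llmcompressor.transformers import oneshot

recipe = [
    # SmoothQuant migrates activation outliers into the weights so that
    # activations quantize cleanly to INT8 (strength is an assumed value).
    SmoothQuantModifier(smoothing_strength=0.8),
    # GPTQ then quantizes the smoothed Linear weights to INT8, skipping
    # lm_head to preserve output logits.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="hssawhney/mnlp-model",
    dataset="zay25/quantization-dataset",  # calibration data
    recipe=recipe,
    max_seq_length=2048,                   # assumed value
    num_calibration_samples=512,
    output_dir="MNLP_M2_quantized_model",  # assumed output location
)
```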

Evaluation

The quantized model was evaluated on the full MCQA demo dataset using the LightEval framework. Accuracy decreased by only 0.02 compared to the full-precision (FP32) version.

Intended Use

This model is optimized for efficient inference in multiple-choice question answering tasks, particularly in the context of STEM tutoring. It is well-suited for low-resource deployment environments where latency and memory usage are critical.
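
As a minimal inference sketch along these lines, the snippet below loads the quantized checkpoint with transformers and picks the most likely answer letter from the next-token logits. The prompt template and letter-scoring scheme are illustrative assumptions, not the evaluation protocol above; loading a compressed-tensors checkpoint additionally requires the compressed-tensors package.

```python
# Illustrative MCQA inference with the quantized checkpoint. Assumes
# `transformers`, `torch`, and `compressed-tensors` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zay25/MNLP_M2_quantized_model"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# device_map="auto" requires `accelerate`; drop it to load on CPU.
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
model.eval()

# Hypothetical MCQA prompt; real tutoring prompts may be formatted differently.
prompt = (
    "Question: Which planet is closest to the Sun?\n"
    "A. Venus\nB. Mercury\nC. Earth\nD. Mars\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

# Score the four answer letters and print the most likely one.
choices = ["A", "B", "C", "D"]
ids = [tokenizer.encode(f" {c}", add_special_tokens=False)[0] for c in choices]
print(choices[next_token_logits[ids].argmax().item()])
```

The same checkpoint format can typically also be served with vLLM, which runs compressed-tensors W8A8 models with INT8 kernels for lower-latency inference.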
