Gpt-OSS-20B-MXFP4-GGUF
GGUF MXFP4_MOE quant of openai/gpt-oss-20b. This GGUF was quantized from a dequantized/upcast F32 copy of the model, excluding the MoE layers, which are simply converted from HuggingFace to GGUF, per ggerganov's assertion: "we don't mess with the bits and their placement. We just trust that OpenAI did a good job." This was done to help preserve the model's accuracy and precision after quantization.
Note: After further experimentation, it turns out it is best to keep the MXFP4 MoE layers in their given state rather than fully dequantizing/upcasting them to F32; for the reason quoted above from ggerganov, upcasting them leads to a regression in performance. This is only the case because llama.cpp simply converts the MoE layers from HuggingFace to GGUF. If that weren't so, dequantizing/upcasting the model weights to F32 and then quantizing would remain the best method. Once llama.cpp supports imatrix calibration/training for the MXFP4 MoE layers, we should be able to fully dequantize/upcast the weights, calibrate an imatrix on them, and then quantize using that imatrix to further improve the quant's accuracy and model preservation. Unfortunately, that PR has yet to materialize, so in the meantime this is the next best option. A sketch of the current workflow follows below.
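For reference, here is a minimal sketch of that workflow, assuming a local llama.cpp checkout with `convert_hf_to_gguf.py` and a built `llama-quantize` binary. The paths, the local model snapshot, and the availability of the `MXFP4_MOE` type name are assumptions and should be checked against `llama-quantize --help` for your build.

```python
# Minimal sketch of the conversion + quantization flow described above.
# Paths and the "MXFP4_MOE" type name are illustrative assumptions; verify
# them against your own llama.cpp checkout and build.
import subprocess
from pathlib import Path

LLAMA_CPP = Path("~/llama.cpp").expanduser()   # assumed checkout location
HF_MODEL  = Path("./gpt-oss-20b")              # local snapshot of openai/gpt-oss-20b
F32_GGUF  = Path("./gpt-oss-20b-f32.gguf")
OUT_GGUF  = Path("./gpt-oss-20b-mxfp4_moe.gguf")

# 1. Convert HF -> GGUF, upcasting the non-MoE tensors to F32.
#    Per the note above, the MXFP4 MoE expert tensors are carried over
#    as-is by the converter rather than being upcast.
subprocess.run(
    ["python", str(LLAMA_CPP / "convert_hf_to_gguf.py"),
     str(HF_MODEL), "--outtype", "f32", "--outfile", str(F32_GGUF)],
    check=True,
)

# 2. Requantize the F32 GGUF to the MXFP4_MOE file type, leaving the MoE
#    layers in their original MXFP4 form.
subprocess.run(
    [str(LLAMA_CPP / "build/bin/llama-quantize"),
     str(F32_GGUF), str(OUT_GGUF), "MXFP4_MOE"],
    check=True,
)
```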
File size: ~12.11 GB