GGUF uploaded now + Chat template Fixes!
Edit: Reuploaded due to OpenAI's chat template change & our new chat template fixes. Please redownload
It's uploaded now!! With some of our chat template fixes!
The FP4 version. Please update whichever inference engine you're using!
Dynamic GGUFs with different sizes will come later, once llama.cpp updates to support them!!
Let us know if you encounter any issues!
I don't get it. They released only MXFP4 prequantized versions on huggingface.
How:
- how can yours be F16?
- how can you apply dynamic quantization to an already 4-bit quantized model?
Brain is crashing now
@owao (if that's even how you grab someone's attention on this site),
I had downloaded a file called "gpt-oss-20b-MXFP4.gguf" shortly before it was renamed to "gpt-oss-20b-F16.gguf".
It has the same SHA256 hash as the F16 file, so they're the same.
llama.cpp hasn't released binaries that support this GGUF model yet, and I'm too lazy to compile anyway, so I'm using LM Studio.
We named it F16 so it can appear on the HF repo page, but yes, it's mostly the same.
Push the imatrix to the repo @shimmyshimmer, please.
We're waiting for llama.cpp to support it first
Damn it... So, no imatrix training yet? Also: did you boys use the new MXFP4_MOE ggml type for your quant or no?
This one is the new FP4 MOE quant.
That's odd. I can't run yours, but I can run the one from lmstudio-community/gpt-oss-20b-GGUF.
Why is yours larger than LM Studio's?
Quant Size Comparison:
Unsloth's quant: 13.8 GB
LM Studio's quant: 12.1 GB
Never mind the memory error: renaming the file to say f16.gguf fixed it. Still odd, though...
Ours is converted purely from f16; LM Studio's is 8-bit. We haven't verified the accuracy degradation when casting from 16-bit to 8-bit, hence why we did 16-bit.
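If anyone wants to check what's actually inside the two files, here is a minimal sketch (not official tooling) that counts tensors per ggml type, assuming the `gguf` Python package from the llama.cpp repo (`pip install gguf`); the filenames are assumptions, so adjust to whatever you downloaded:

```python
# Count tensors per ggml quantization type in a GGUF file, to see whether the
# non-MXFP4 tensors are stored as F16 (Unsloth) or Q8_0 (LM Studio).
# Needs a gguf package recent enough to know the new MXFP4 type.
from collections import Counter
from gguf import GGUFReader

def tensor_type_summary(path: str) -> Counter:
    reader = GGUFReader(path)
    # Each tensor exposes its quantization type as an enum (F16, Q8_0, MXFP4, ...).
    return Counter(t.tensor_type.name for t in reader.tensors)

print(tensor_type_summary("gpt-oss-20b-F16.gguf"))   # Unsloth's upload
print(tensor_type_summary("gpt-oss-20b-Q8_0.gguf"))  # assumed name for the LM Studio file
```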
llama.cpp binaries have just been released with support for the new gpt-oss stuff! I'm running it now.
https://github.com/ggml-org/llama.cpp/releases/tag/b6096
Good luck to the Unsloth team on releasing imatrix quants and stuff like that, if it's possible!
Wait, please redownload the F16 versions since we fixed some chat template issues!
this was fast!
So far sst/opencode and qwen-code crash llama.cpp with "Unexpected content at end of input". Open WebUI seems fine (besides not yet detecting the "thinking" tokens), but I'm not using tool calling there.
@shimmyshimmer @DigitalFauna Thanks for the effort trying to spark the neural pathway I was missing! But unfortunately it didn't fully initialize.
@shimmyshimmer
Apologies for my assumption that they released only 4-bit weights. I assumed that because of the size in GB of the model in their repo, but I just saw that most of the weights are actually BF16! So now further quantization makes more sense to me. But!
I don't get how their published BF16 version can be so small. I mean, if we compare to Mistral 24B, its BF16 version is 42 GB+. Is it because a significant part of the weights are in U8? (I still don't even know what U8 is, but I'm going to educate myself on that, I promise.) But still, shouldn't it be something like ~30 GB?
Sorry if my question is dumb, but I'm so confused here... I might be missing essential parts.
I now saw your other message saying they actually trained it in BF16 and only post-trained it in 4-bit. I'm now even more confused, lol! Why is it released as BF16 then??
I hope I'm not alone in such a state of confusion, and any explanation could serve some others!
Oh wait! Is it called BF16 because there are 4 active experts, so 4*4 = 16?
help me
Best GGUF quants of OpenAI/OSS-20B, bar none. Unsloth's Dynamic 2.0 GGUF calibration dataset wins again.
Does Tool Calling work?
My llama.cpp crashes on any quant when working with Qwen Code. =(
I found this thread: https://huggingface.co/openai/gpt-oss-120b/discussions/69
Maybe it is still possible to fix the chat template?
gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-GGUF/snapshots/ff0f965518cc8b299d10c1318d42c6e15689f11e/./gpt-oss-20b-F16.gguf
llama_model_load_from_file_impl: failed to load model
I hit this in Colab although I had updated the version of llama_cpp. What should I do to run the model correctly?
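That "invalid ggml type 39 (NONE)" usually means the llama.cpp code inside your bindings predates MXFP4 support, so just bumping the PyPI wheel may not be enough yet. A rough, untested sketch of one possible Colab workaround (the git source spec below is an assumption, and building takes a while):

```python
# Untested guess, not a confirmed fix: rebuild llama-cpp-python against current
# llama.cpp source so the loader recognizes the MXFP4 (ggml type 39) tensors.
!pip install --upgrade --force-reinstall --no-cache-dir \
    "llama-cpp-python @ git+https://github.com/abetlen/llama-cpp-python.git"
```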
Tool calling fails silently for me when trying to use the Python interpreter tool in llama-server (version: 3 (c4f5356)) - no response, no terminal output.
There's work being done at https://github.com/ggml-org/llama.cpp/pull/15158 to get tool calling to work. Kinda works right now with some issues.
@owao BF16 is a data type used to store weights; it has nothing to do with the architecture of a model or its MoE expert configuration.
FP16 uses 5 bits for the exponent and 10 bits for the mantissa, while BF16 uses 8 bits for the exponent and 7 bits for the mantissa. So FP16 is technically more "precise", while BF16 allows for a greater range of values. Check the following Wikipedia article (or maybe just ask any half-decent LLM, even just a tiny one) for a detailed explanation:
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
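To make the bit split concrete, here is a tiny sketch (mine, not from the article) that prints the raw 16-bit patterns; it assumes numpy plus the ml_dtypes package for a bfloat16 numpy dtype:

```python
# FP16 = 1 sign | 5 exponent | 10 mantissa, BF16 = 1 sign | 8 exponent | 7 mantissa.
import numpy as np
import ml_dtypes  # provides a numpy-compatible bfloat16 dtype

x = np.array([3.14159265], dtype=np.float32)
fp16 = x.astype(np.float16)
bf16 = x.astype(ml_dtypes.bfloat16)

# Reinterpret each 16-bit value as an unsigned int to look at the raw bits.
print(f"fp16 bits: {int(fp16.view(np.uint16)[0]):016b}  value: {float(fp16[0])}")
print(f"bf16 bits: {int(bf16.view(np.uint16)[0]):016b}  value: {float(bf16[0])}")
# BF16 keeps FP32's 8-bit exponent (same dynamic range, coarser precision);
# FP16 spends those bits on the mantissa instead (finer precision, smaller range).
```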
@mingyi456
Thanks, I would never have thought of asking my LM to explain, nor of using Wikipedia! Sorry, but mate... lol, that's a bit condescending.
My questions are not really answerable with my LMs.
Also, you only answered a question you wrongly inferred from my last one (which surely was the dumbest among all the ones I asked).
Anyway, thanks.
Now I guess I shouldn't hold out any hope of getting an answer because I behaved like an asshole; I'll deal with that...
gpt-oss-20b-F16.gguf
Thanks for uploading, but I'm unable to get it working in Ollama.
Error: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-bc4d52a46e1d89088ff3cbb4be21a7c99f0bb68b53514d7d50679c9f07e33a41
The above error pops up even with the latest model
But it works in LM Studio.
@mashriram
Patience, it's still not supported for now:
https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/17#68958418964bc7263fe13adf