GGUF uploaded now + Chat template Fixes!
Edit: Reuploaded due to OpenAI's chat template change & our new chat template fixes. Please redownload
It's uploaded now!! With some of our chat template fixes!
The FP4 version. Please update whichever inference engine you're using!
Dynamic GGUFs with different sizes will come later, once llama.cpp updates to support them!!
Let us know if you encounter any issues!
I don't get it. They released only MXFP4 prequantized versions on huggingface.
How:
- how can yours be F16?
- how can you apply dynamic quantization to an already 4-bit quantized model?
Brain is crashing now
@owao (if that's even how you grab someone's attention on this site),
I had downloaded a file called "gpt-oss-20b-MXFP4.gguf" shortly before it was renamed to "gpt-oss-20b-F16.gguf".
It has the same SHA256 hash as the F16 file, so they're the same.
llama.cpp hasn't released binaries that support this GGUF model yet, and I'm too lazy to compile anyway, so I'm using LM Studio.
We named it F16 so it can appear on the HF repo page, but yes, it's mostly the same.
Push the imatrix to the repo @shimmyshimmer, please.
We're waiting for llama.cpp to support it first
Damn it... So, no imatrix training yet? Also: did you boys use the new MXFP4_MOE ggml type for your quant or no?
This one is the new FP4 MOE quant.
That's odd. I can't run yours, but I can run the one from lmstudio-community/gpt-oss-20b-GGUF.
Why is yours larger than LM Studio's?
Quant Size Comparison:
Unsloth's quant: 13.8 GB
LM Studio's quant: 12.1 GB
Never mind the memory error: renaming the file to say f16.gguf fixed it. Still odd, though...
Ours is converted purely from f16; LM Studio's is 8-bit. We haven't verified the accuracy degradation when casting from 16-bit to 8-bit, hence why we did 16-bit.
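If anyone wants to check what's actually inside the two files, here is a minimal sketch (not official tooling) that counts tensors per ggml type, assuming the `gguf` Python package from the llama.cpp repo (`pip install gguf`); the filenames are assumptions, so adjust to whatever you downloaded:

```python
# Count tensors per ggml quantization type in a GGUF file, to see whether the
# non-MXFP4 tensors are stored as F16 (Unsloth) or Q8_0 (LM Studio).
# Needs a gguf package recent enough to know the new MXFP4 type.
from collections import Counter
from gguf import GGUFReader

def tensor_type_summary(path: str) -> Counter:
    reader = GGUFReader(path)
    # Each tensor exposes its quantization type as an enum (F16, Q8_0, MXFP4, ...).
    return Counter(t.tensor_type.name for t in reader.tensors)

print(tensor_type_summary("gpt-oss-20b-F16.gguf"))   # Unsloth's upload
print(tensor_type_summary("gpt-oss-20b-Q8_0.gguf"))  # assumed name for the LM Studio file
```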
llama.cpp binaries have just been released with support for the new gpt-oss stuff! I'm running it now.
https://github.com/ggml-org/llama.cpp/releases/tag/b6096
Good luck to the Unsloth team on releasing imatrix quants and stuff like that, if it's possible!
Wait, please redownload the F16 versions since we fixed some chat template issues!
this was fast!
So far sst/opencode and qwen-code crash llama.cpp with "Unexpected content at end of input". Open WebUI seems fine (besides not yet detecting the "thinking" tokens), but I'm not using tool calling there.
@shimmyshimmer @DigitalFauna Thanks for the effort trying to spark the neural pathway I was missing! But unfortunately it didn't fully initialize.
@shimmyshimmer
Apologies for my assumption that they released only 4-bit weights. I assumed that because of the size in GB of the model in their repo, but I just saw that most of the weights are actually BF16! So now further quantization makes more sense to me. But!
I don't get how their published BF16 version can be so small. I mean, if we compare to Mistral 24B, its BF16 version is 42 GB+. Is it because a significant part of the weights are in U8? (I still don't even know what U8 is, but I'm going to educate myself on that, I promise.) But still, shouldn't it be something like ~30 GB?
Sorry if my question is dumb, but I'm so confused here... I might be missing essential parts.
I now saw your other message saying they actually trained it in BF16 and only post-trained it in 4-bit. I'm now even more confused, lol! Why is it released as BF16 then??
I hope I'm not alone in such a state of confusion, and any explanation could serve some others!
Oh wait! Is it called BF16 because there are 4 active experts, so 4*4 = 16?
help me
Best GGUF quants of OpenAI/OSS-20B, bar none. Unsloth's Dynamic 2.0 GGUF calibration dataset wins again.
Does Tool Calling work?
My llama.cpp crashes on any quant when working with Qwen Code. =(
I found this thread: https://huggingface.co/openai/gpt-oss-120b/discussions/69
Maybe it is still possible to fix the chat template?
gguf_init_from_file_impl: tensor 'blk.0.ffn_down_exps.weight' has invalid ggml type 39 (NONE)
gguf_init_from_file_impl: failed to read tensor info
llama_model_load: error loading model: llama_model_loader: failed to load model from /root/.cache/huggingface/hub/models--unsloth--gpt-oss-20b-GGUF/snapshots/ff0f965518cc8b299d10c1318d42c6e15689f11e/./gpt-oss-20b-F16.gguf
llama_model_load_from_file_impl: failed to load model
I hit this in Colab although I had updated the version of llama_cpp. What should I do to run the model correctly?
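That "invalid ggml type 39 (NONE)" usually means the llama.cpp code inside your bindings predates MXFP4 support, so just bumping the PyPI wheel may not be enough yet. A rough, untested sketch of one possible Colab workaround (the git source spec below is an assumption, and building takes a while):

```python
# Untested guess, not a confirmed fix: rebuild llama-cpp-python against current
# llama.cpp source so the loader recognizes the MXFP4 (ggml type 39) tensors.
!pip install --upgrade --force-reinstall --no-cache-dir \
    "llama-cpp-python @ git+https://github.com/abetlen/llama-cpp-python.git"
```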
Tool calling fails silently for me when trying to use the Python interpreter tool in llama-server (version: 3 (c4f5356)) - no response, no terminal output.
There's work being done at https://github.com/ggml-org/llama.cpp/pull/15158 to get tool calling to work. Kinda works right now with some issues.
@owao BF16 is a data type used to store weights; it has nothing to do with the architecture of a model or its MoE expert configuration.
FP16 uses 5 bits for the exponent and 10 bits for the mantissa, while BF16 uses 8 bits for the exponent and 7 bits for the mantissa. So FP16 is technically more "precise", while BF16 allows for a greater range of values. Check the following Wikipedia article (or maybe just ask any half-decent LLM, even just a tiny one) for a detailed explanation:
https://en.wikipedia.org/wiki/Bfloat16_floating-point_format
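To make the bit split concrete, here is a tiny sketch (mine, not from the article) that prints the raw 16-bit patterns; it assumes numpy plus the ml_dtypes package for a bfloat16 numpy dtype:

```python
# FP16 = 1 sign | 5 exponent | 10 mantissa, BF16 = 1 sign | 8 exponent | 7 mantissa.
import numpy as np
import ml_dtypes  # provides a numpy-compatible bfloat16 dtype

x = np.array([3.14159265], dtype=np.float32)
fp16 = x.astype(np.float16)
bf16 = x.astype(ml_dtypes.bfloat16)

# Reinterpret each 16-bit value as an unsigned int to look at the raw bits.
print(f"fp16 bits: {int(fp16.view(np.uint16)[0]):016b}  value: {float(fp16[0])}")
print(f"bf16 bits: {int(bf16.view(np.uint16)[0]):016b}  value: {float(bf16[0])}")
# BF16 keeps FP32's 8-bit exponent (same dynamic range, coarser precision);
# FP16 spends those bits on the mantissa instead (finer precision, smaller range).
```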
@mingyi456
Thanks, I would never have thought of asking my LM to explain, nor of using Wikipedia! Sorry, but mate... lol, that's a bit condescending.
My questions are not really answerable with my LMs.
Also, you only answered a question you wrongly inferred from my last one (which surely was the dumbest among all the ones I asked).
Anyway, thanks.
Now I guess I shouldn't hold out any hope of getting an answer because I behaved like an asshole; I'll deal with that...
gpt-oss-20b-F16.gguf
Thanks for uploading, but I'm unable to get it working in Ollama.
Error: unable to load model: /usr/share/ollama/.ollama/models/blobs/sha256-bc4d52a46e1d89088ff3cbb4be21a7c99f0bb68b53514d7d50679c9f07e33a41
The above error pops up even with the latest model
But it works in LM Studio.
@mashriram
Patience, it's still not supported for now:
https://huggingface.co/unsloth/gpt-oss-20b-GGUF/discussions/17#68958418964bc7263fe13adf