Don't think this works on 4090 with transformers, for me anyway

#53
by Fancellu - opened

I got

MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16

It ended up taking all my VRAM.

I then tried it in LM Studio, and it seems to load, but every time I asked it a question such as

How many rs are in the word 'strawberry'?

all I got back was:

This message contains no content. The AI has nothing to say.

What is LM Studio doing that makes it at least load, which transformers can't do?

Anyone know the magic to get it to work on a 4090 in transformers?

It also works on Ollama.

So my 4090 CAN do MXFP4, as it only takes 16 GB, and it answers questions just fine. Why does it work in Ollama, then, and not in transformers?

Thanks

Ollama seems to be the only reliable way to run the models. It's a compatibility issue with the latest version of triton not allowing the MXFP4 quantization; not sure what the issue with LM Studio is, though.

Anyone know the magic to get it to work on a 4090 in transformers?

Please try with transformers from source; we recently merged this PR!
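In the meantime, a quick sanity check you can run (just a sketch, nothing official) to confirm that the pieces the MXFP4 path looks for are actually importable in your environment:

import importlib.util

import transformers
import triton

# MXFP4 loading needs a recent transformers build, triton >= 3.4.0,
# and the separate triton_kernels package; otherwise it falls back to bf16.
print("transformers:", transformers.__version__)
print("triton:", triton.__version__)
print("triton_kernels found:", importlib.util.find_spec("triton_kernels") is not None)

If any of those are missing or too old, you will keep seeing the dequantize-to-bf16 warning.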

Does not work on 5070 Ti

$ nvidia-smi 
Wed Aug  6 22:27:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.03              Driver Version: 575.64.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 ...    Off |   00000000:02:00.0 Off |                  N/A |
| N/A   45C    P8              8W /   30W |      15MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            4032      G   /usr/bin/gnome-shell                      2MiB |
+-----------------------------------------------------------------------------------------+


$ python try.py 
MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
Loading checkpoint shards:   0%|                                                                                   | 0/3 [00:00<?, ?it/s]

Still trying to load bf16

@saucam Have you downloaded the model? If this is your first time loading it, the download takes a long time and the progress bar stays the same until the model is completely downloaded, so you won't get any feedback; give it an hour or so, depending on your internet connection.

@kosiasuzu Yes, I downloaded the model. What I was trying to show is that this problem is still there even though I am using the latest transformers with a 5070 GPU:

MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16

$ pip list | grep triton
triton                        3.4.0
$ pip list | grep transformers
transformers                  4.56.0.dev0

Anyone know the magic to get it to work on a 4090 in transformers?

Please try with transformers from source; we recently merged this PR!

Not for me

pip install git+https://github.com/huggingface/transformers

print(transformers.__version__)

4.56.0.dev0

But when I run my code

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

=>
MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16

nvidia-smi
Thu Aug  7 09:20:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.88                 Driver Version: 580.88         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  |   00000000:01:00.0  On |                  Off |
|  0%   42C    P8             24W /  450W |    1120MiB /  24564MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

This happens because you don't have triton 3.4 and the kernels installed, as you can see in the logs you shared.

The following runs smoothly on a T4 on Google Colab, so hopefully it also works on your 4090:
https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing
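A rough way to confirm which path you actually ended up on (again just a sketch, using the model object from the snippet above): the bf16 fallback needs roughly 40 GB for the 20B weights, while the MXFP4 path sits around 16 GB, so the memory footprint tells you which one loaded.

# After from_pretrained has finished, see how much memory the weights take.
# Roughly 16 GB suggests the MXFP4 kernels were used; roughly 40 GB means
# the weights were dequantized to bf16.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")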

I tried

pip install triton==3.4

ERROR: Could not find a version that satisfies the requirement triton==3.4 (from versions: none)
ERROR: No matching distribution found for triton==3.4

In fact, even a plain

pip install triton

failed.

BTW I'm running on Win11

I see that Triton is Linux-centric.

I tried

https://www.wheelodex.org/projects/triton-windows/

https://www.wheelodex.org/projects/triton-windows/wheels/triton_windows-3.4.0.post20-cp312-cp312-win_amd64.whl/

pip install --no-index --find-links=. triton-windows

But my code still couldn't see triton

Do you have to do something on your side to see triton-windows? Has ANYONE ever got gpt-oss-20b working on Windows?
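In case it helps anyone else stuck on Windows, here is the check I would run (my own sketch): it shows which interpreter pip actually installed the wheel into and whether triton is importable from it.

import sys

# Make sure this is the same Python that ran the pip install for triton-windows.
print(sys.executable)

try:
    import triton
    print("triton:", triton.__version__)
except ImportError as err:
    print("triton not importable:", err)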

Works for me

pip uninstall transformers
pip install git+https://github.com/Tsumugii24/transformers

It now runs fast and takes 16 GB of VRAM on the 4090.

I made some changes to marcsun13's code to make it work properly: 🧑‍💻
https://colab.research.google.com/drive/1bHrdFob-K49DgEhtPqcb5AIV5VUyEGru?usp=sharing

I’m trying to load gpt-oss-20b on an RTX A6000.

I installed all requirements:

!pip install -q --upgrade torch
!pip uninstall -q torchvision torchaudio -y
!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Then I load the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
)

But I get this error:

ValueError: The model is quantized with Mxfp4Config but you are passing a NoneType config. Please make sure to pass the same quantization config class to `from_pretrained` with different loading attributes.

How should I correctly load this quantized model and pass the proper Mxfp4Config?
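Purely as a sketch based on what the error message asks for (the reply below suggests this was actually a bug on the transformers side), passing the quantization config explicitly would look something like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
    # Pass the same quantization config class the checkpoint was saved with,
    # instead of leaving it as None.
    quantization_config=Mxfp4Config(),
)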

Sorry for this bug, we'll fix it!

@weege007
Thanks, it’s working now. However, there seems to be quite a bit of latency during inference.

Some maintainer needs to approve this

https://github.com/huggingface/transformers/pull/39986

Works for me on my 4090

OK, it has been approved for the next patch.

Fancellu changed discussion status to closed
