Don't think this works on 4090 with transformers, for me anyway

#53
by Fancellu - opened

I got

MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16

It ended up taking all my VRAM.

I then tried it in LM Studio, and it seems to load, but every time I asked it a question such as

How many rs are in the word 'strawberry'?

all I got back was:

This message contains no content. The AI has nothing to say.

What is LM Studio doing that makes it at least load, which transformers can't do?

Anyone know the magic to get it to work on a 4090 in transformers?

It also works on Ollama.

So my 4090 CAN do MXFP4, as it only takes 16 GB, and it answers questions just fine. Why does it work in Ollama, then, and not in transformers?

Thanks

Ollama seems to be the only reliable way to run the models. It's a compatibility issue with the latest version of triton not allowing the MXFP4 quantization; not sure what the issue with LM Studio is, though.

Anyone know the magic to get it to work on a 4090 in transformers?

Please try with transformers from source; we recently merged this PR!
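In the meantime, a quick sanity check you can run (just a sketch, nothing official) to confirm that the pieces the MXFP4 path looks for are actually importable in your environment:

import importlib.util

import transformers
import triton

# MXFP4 loading needs a recent transformers build, triton >= 3.4.0,
# and the separate triton_kernels package; otherwise it falls back to bf16.
print("transformers:", transformers.__version__)
print("triton:", triton.__version__)
print("triton_kernels found:", importlib.util.find_spec("triton_kernels") is not None)

If any of those are missing or too old, you will keep seeing the dequantize-to-bf16 warning.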

Does not work on 5070 Ti

$ nvidia-smi 
Wed Aug  6 22:27:01 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.03              Driver Version: 575.64.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 ...    Off |   00000000:02:00.0 Off |                  N/A |
| N/A   45C    P8              8W /   30W |      15MiB /  12227MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            4032      G   /usr/bin/gnome-shell                      2MiB |
+-----------------------------------------------------------------------------------------+


$ python try.py 
MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
Loading checkpoint shards:   0%|                                                                                   | 0/3 [00:00<?, ?it/s]

Still trying to load bf16

@saucam Have you downloaded the model? If this is your first time loading it, the download takes a long time and the progress bar stays the same until the model is completely downloaded, so you won't get any feedback; give it an hour or so, depending on your internet connection.

@kosiasuzu Yes, I downloaded the model. What I was trying to show is that this problem is still there even though I am using the latest transformers with a 5070 GPU:

MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16

$ pip list | grep triton
triton                        3.4.0
$ pip list | grep transformers
transformers                  4.56.0.dev0

Anyone know the magic to get it to work on a 4090 in transformers?

Please try with transformers from source; we recently merged this PR!

Not for me

pip install git+https://github.com/huggingface/transformers

print(transformers.__version__)

4.56.0.dev0

But when I run my code

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

=>
MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16

nvidia-smi
Thu Aug  7 09:20:03 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.88                 Driver Version: 580.88         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  |   00000000:01:00.0  On |                  Off |
|  0%   42C    P8             24W /  450W |    1120MiB /  24564MiB |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

This happens because you don't have triton 3.4 and the kernels installed, as you can see in the logs you shared.

The following runs smoothly on a T4 on Google Colab, so hopefully it also works on your 4090:
https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing
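A rough way to confirm which path you actually ended up on (again just a sketch, using the model object from the snippet above): the bf16 fallback needs roughly 40 GB for the 20B weights, while the MXFP4 path sits around 16 GB, so the memory footprint tells you which one loaded.

# After from_pretrained has finished, see how much memory the weights take.
# Roughly 16 GB suggests the MXFP4 kernels were used; roughly 40 GB means
# the weights were dequantized to bf16.
print(f"{model.get_memory_footprint() / 1e9:.1f} GB")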

I tried

pip install triton==3.4

ERROR: Could not find a version that satisfies the requirement triton==3.4 (from versions: none)
ERROR: No matching distribution found for triton==3.4

In fact, even a plain

pip install triton

failed.

BTW I'm running on Win11

I see that Triton is Linux-centric.

I tried

https://www.wheelodex.org/projects/triton-windows/

https://www.wheelodex.org/projects/triton-windows/wheels/triton_windows-3.4.0.post20-cp312-cp312-win_amd64.whl/

pip install --no-index --find-links=. triton-windows

But my code still couldn't see triton

Do you have to do something on your side to see triton-windows? Has ANYONE ever got gpt-oss-20b working on Windows?
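In case it helps anyone else stuck on Windows, here is the check I would run (my own sketch): it shows which interpreter pip actually installed the wheel into and whether triton is importable from it.

import sys

# Make sure this is the same Python that ran the pip install for triton-windows.
print(sys.executable)

try:
    import triton
    print("triton:", triton.__version__)
except ImportError as err:
    print("triton not importable:", err)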

Works for me

pip uninstall transformers
pip install git+https://github.com/Tsumugii24/transformers

It now runs fast and takes 16 GB of VRAM on the 4090.

I made some changes to marcsun13's code to make it work properly: 🧑‍💻
https://colab.research.google.com/drive/1bHrdFob-K49DgEhtPqcb5AIV5VUyEGru?usp=sharing

I’m trying to load gpt-oss-20b on an RTX A6000.

I installed all requirements:

!pip install -q --upgrade torch
!pip uninstall -q torchvision torchaudio -y
!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels

Then I load the model:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
)

But I get this error:

ValueError: The model is quantized with Mxfp4Config but you are passing a NoneType config. Please make sure to pass the same quantization config class to `from_pretrained` with different loading attributes.

How should I correctly load this quantized model and pass the proper Mxfp4Config?
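Purely as a sketch based on what the error message asks for (the reply below suggests this was actually a bug on the transformers side), passing the quantization config explicitly would look something like this:

from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
    # Pass the same quantization config class the checkpoint was saved with,
    # instead of leaving it as None.
    quantization_config=Mxfp4Config(),
)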

Sorry for this bug, we'll fix it!

@weege007
Thanks, it’s working now. However, there seems to be quite a bit of latency during inference.

Some maintainer needs to approve this

https://github.com/huggingface/transformers/pull/39986

Works for me on my 4090

OK, it has been approved for the next patch.

Fancellu changed discussion status to closed
