I don't think this works on a 4090 with transformers, for me anyway.
I got:
MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
It ended up taking all my VRAM.
I then tried it in LM Studio, and it seems to load, but every time I asked it a question like
How many rs are in the word 'strawberry'?
all I got back was:
This message contains no content. The AI has nothing to say.
What is LM Studio doing to at least load it that transformers can't do?
Anyone know the magic to get it to work on a 4090 in transformers?
It also works in Ollama.
So my 4090 CAN do MXFP4, since it only takes 16 GB and answers questions just fine. Why does it work in Ollama, then, and not in transformers?
Thanks
Ollama seems to be the only reliable way to run these models. It's a compatibility issue with the latest version of triton not allowing the MXFP4 quantization; not sure what the issue with LM Studio is, though.
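For anyone else hitting the bf16 fallback, here is a quick check of what the warning is actually asking for (a minimal sketch based only on the warning text, nothing transformers-specific):

import importlib.util

# The fallback message asks for triton >= 3.4.0 ...
try:
    import triton
    print("triton version:", triton.__version__)
except ImportError:
    print("triton is not importable in this environment")

# ... and for the separate triton_kernels package.
print("triton_kernels available:", importlib.util.find_spec("triton_kernels") is not None)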
Does not work on 5070 Ti
$ nvidia-smi
Wed Aug  6 22:27:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.64.03              Driver Version: 575.64.03      CUDA Version: 12.9     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 5070 ...    Off |   00000000:02:00.0 Off |                  N/A |
| N/A   45C    P8              8W /  30W  |      15MiB / 12227MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|    0   N/A  N/A            4032      G   /usr/bin/gnome-shell                     2MiB  |
+-----------------------------------------------------------------------------------------+
$ python try.py
MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
Loading checkpoint shards: 0%| | 0/3 [00:00<?, ?it/s]
Still trying to load bf16
@kosiasuzu yes, I downloaded the model. What I was trying to show is that this problem is still there even though I am using the latest transformers with a 5070 GPU:
MXFP4 quantization requires triton >= 3.4.0 and triton_kernels installed, we will default to dequantizing the model to bf16
$ pip list | grep triton
triton 3.4.0
$ pip list | grep transformers
transformers 4.56.0.dev0
Anyone know the magic to get it to work on a 4090 in transformers?
Please try with transformers from source; we recently merged this PR!
Not for me
pip install git+https://github.com/huggingface/transformers
print(transformers.__version__)
4.56.0.dev0
But when I run my code
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the model; with triton >= 3.4.0 and the kernels package available this
# should keep the MXFP4 weights instead of dequantizing to bf16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

# Build the chat prompt and move the input tensors to the model's device.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
=>
MXFP4 quantization requires triton >= 3.4.0 and kernels installed, we will default to dequantizing the model to bf16
nvidia-smi
Thu Aug  7 09:20:03 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 580.88                 Driver Version: 580.88         CUDA Version: 13.0     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                  Driver-Model | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 4090      WDDM  |   00000000:01:00.0  On |                  Off |
|  0%   42C    P8             24W / 450W  |    1120MiB / 24564MiB  |      5%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
This happens because you don't have triton 3.4 and kernels installed, as you can see in the logs you shared.
The following runs smoothly on a T4 on Google Colab, so hopefully it also works on your 4090:
https://colab.research.google.com/drive/15DJv6QWgc49MuC7dlNS9ifveXBDjCWO5?usp=sharing
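For reference, the setup it relies on is roughly the following (a sketch; the exact pins may differ from the notebook):
pip install -q git+https://github.com/huggingface/transformers
pip install -q triton==3.4.0 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels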
I tried
pip install triton==3.4
ERROR: Could not find a version that satisfies the requirement triton==3.4 (from versions: none)
ERROR: No matching distribution found for triton==3.4
In fact even
pip install triton
failed.
BTW, I'm running on Win11.
I see that triton is Linux-centric.
I tried
https://www.wheelodex.org/projects/triton-windows/
https://www.wheelodex.org/projects/triton-windows/wheels/triton_windows-3.4.0.post20-cp312-cp312-win_amd64.whl/
pip install --no-index --find-links=. triton-windows
But my code still couldn't see triton
Do you have to do something on your side to see triton-windows? Has ANYONE ever got gpt-oss-20b working on Windows?
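In case it helps narrow it down, here is a quick sanity check to run from the same interpreter that loads the model (a minimal sketch; as far as I know, triton-windows installs under the usual triton module name):

import sys

# Confirm which Python environment is actually running the script.
print(sys.executable)

try:
    import triton
    print("triton", triton.__version__, "loaded from", triton.__file__)
except ImportError as exc:
    print("triton is not importable here:", exc)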
Works for me
pip uninstall transformers
pip install git+https://github.com/Tsumugii24/transformers
Now it runs fast and takes 16 GB of VRAM on the 4090.
Made some changes to marcsun13's code to make it work properly: 🧑‍💻
https://colab.research.google.com/drive/1bHrdFob-K49DgEhtPqcb5AIV5VUyEGru?usp=sharing
I’m trying to load gpt-oss-20b on an RTX A6000.
I installed all requirements:
!pip install -q --upgrade torch
!pip uninstall -q torchvision torchaudio -y
!pip install -q git+https://github.com/huggingface/transformers triton==3.4 git+https://github.com/triton-lang/triton.git@main#subdirectory=python/triton_kernels
Then I load the model:
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the MXFP4-quantized checkpoint directly onto the GPU.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
)
But I get this error:
ValueError: The model is quantized with Mxfp4Config but you are passing a NoneType config. Please make sure to pass the same quantization config class to `from_pretrained` with different loading attributes.
How should I correctly load this quantized model and pass the proper Mxfp4Config?
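Until the fix lands, one thing worth trying, going by what the error message asks for, is passing the quantization config explicitly (a minimal sketch, not verified on an A6000):

from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Pass the MXFP4 config explicitly instead of letting from_pretrained infer it from the checkpoint.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="cuda",
    quantization_config=Mxfp4Config(),
)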
Sorry for this bug, we're fixing it!
Some maintainer needs to approve this
https://github.com/huggingface/transformers/pull/39986
Works for me on my 4090
OK, it has been approved for the next patch.