# Qwen3-30B-A3B-NVFP4
An NVFP4-quantized version of Qwen/Qwen3-30B-A3B, produced with llmcompressor.
## Notes
- Quantization scheme: NVFP4 (applied to linear layers; `lm_head` excluded)
- Calibration samples: 512
- Max sequence length during calibration: 2048
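For reference, a minimal llmcompressor sketch that would produce a checkpoint matching the settings above. The exact recipe used for this model is not included here, so the calibration dataset and any MoE-specific handling are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping lm_head in full precision
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="ultrachat_200k",     # assumption: any chat-style calibration set works
    recipe=recipe,
    max_seq_length=2048,          # matches the settings listed above
    num_calibration_samples=512,  # matches the settings listed above
)

model.save_pretrained("Qwen3-30B-A3B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-30B-A3B-NVFP4")
```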
## Deployment
### Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "llmat/Qwen3-30B-A3B-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
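For example, you can launch a server with `vllm serve` and query it with the official `openai` client. The port and sampling parameters below are illustrative:

```python
# Start the server first (default port is 8000):
#   vllm serve llmat/Qwen3-30B-A3B-NVFP4
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llmat/Qwen3-30B-A3B-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```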