# Qwen3-30B-A3B-NVFP4
An NVFP4-quantized version of Qwen/Qwen3-30B-A3B, produced with llmcompressor.
## Notes
- Quantization scheme: NVFP4 (applied to linear layers; `lm_head` excluded)
- Calibration samples: 512
- Max sequence length during calibration: 2048
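For reference, a minimal llmcompressor sketch that would produce a checkpoint matching the settings above. The exact recipe used for this model is not included here, so the calibration dataset and any MoE-specific handling are assumptions:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-30B-A3B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping lm_head in full precision
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="ultrachat_200k",     # assumption: any chat-style calibration set works
    recipe=recipe,
    max_seq_length=2048,          # matches the settings listed above
    num_calibration_samples=512,  # matches the settings listed above
)

model.save_pretrained("Qwen3-30B-A3B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-30B-A3B-NVFP4")
```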
## Deployment
### Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "llmat/Qwen3-30B-A3B-NVFP4"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
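For example, you can launch a server with `vllm serve` and query it with the official `openai` client. The port and sampling parameters below are illustrative:

```python
# Start the server first (default port is 8000):
#   vllm serve llmat/Qwen3-30B-A3B-NVFP4
from openai import OpenAI

# vLLM's OpenAI-compatible server does not require a real API key
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llmat/Qwen3-30B-A3B-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```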