Qwen3-14B-NVFP4

An NVFP4-quantized version of Qwen/Qwen3-14B, produced with llmcompressor.

Notes

  • Quantization scheme: NVFP4 (linear layers, lm_head excluded)
  • Calibration samples: 512
  • Max sequence length during calibration: 2048
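These settings correspond to llmcompressor's one-shot quantization flow. The snippet below is a minimal sketch of how such a checkpoint could be produced; the scheme, ignored module, sample count, and sequence length follow the notes above, while the calibration dataset (ultrachat_200k) and its preprocessing are assumptions.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-14B"
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# Calibration data (dataset choice is an assumption, not stated in the notes)
ds = load_dataset(
    "HuggingFaceH4/ultrachat_200k",
    split=f"train_sft[:{NUM_CALIBRATION_SAMPLES}]",
).shuffle(seed=42)

# Render each conversation with the model's chat template, then tokenize
def preprocess(example):
    return {"text": tokenizer.apply_chat_template(example["messages"], tokenize=False)}

ds = ds.map(preprocess)

def tokenize(sample):
    return tokenizer(
        sample["text"],
        padding=False,
        max_length=MAX_SEQUENCE_LENGTH,
        truncation=True,
        add_special_tokens=False,
    )

ds = ds.map(tokenize, remove_columns=ds.column_names)

# NVFP4 on all Linear layers, keeping lm_head in higher precision
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

model.save_pretrained("Qwen3-14B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-14B-NVFP4")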

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "llmat/Qwen3-14B-NVFP4"
number_gpus = 2  # tensor-parallel degree; adjust to the GPUs available

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
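As a minimal sketch, assuming the server was started locally with vllm serve llmat/Qwen3-14B-NVFP4 --tensor-parallel-size 2 (the endpoint URL and placeholder API key below reflect a default local deployment), a client request could look like:

from openai import OpenAI

# Point the client at the local vLLM server (default port 8000)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llmat/Qwen3-14B-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)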

Safetensors

  • Model size: 8.99B params
  • Tensor types: BF16, F32, F8_E4M3, U8