---
language: en
license: apache-2.0
pipeline_tag: text-generation
tags:
  - quantization
  - nvfp4
  - qwen
base_model: Qwen/Qwen3-14B
model_name: Qwen3-14B-NVFP4
---

# Qwen3-14B-NVFP4

An NVFP4-quantized version of [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B), produced with [llmcompressor](https://github.com/vllm-project/llm-compressor).

## Notes

- Quantization scheme: NVFP4 (applied to linear layers; `lm_head` excluded)
- Calibration samples: 512
- Maximum sequence length during calibration: 2048
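
For reference, a one-shot quantization run with these settings typically looks like the sketch below. This is a minimal example under assumed defaults, not the exact script used for this checkpoint; in particular, the calibration dataset shown (`open_platypus`, from the llmcompressor examples) is illustrative, since the card does not state which dataset was used.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "Qwen/Qwen3-14B"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

# NVFP4 on all Linear layers, keeping lm_head in higher precision.
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

oneshot(
    model=model,
    dataset="open_platypus",      # illustrative calibration set (assumption)
    recipe=recipe,
    max_seq_length=2048,          # matches the calibration length listed above
    num_calibration_samples=512,  # matches the sample count listed above
)

model.save_pretrained("Qwen3-14B-NVFP4", save_compressed=True)
tokenizer.save_pretrained("Qwen3-14B-NVFP4")
```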

## Deployment

### Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "llmat/Qwen3-14B-NVFP4"
number_gpus = 2

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat into a single prompt string using the model's chat template.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

# Shard the model across GPUs with tensor parallelism.
llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```

vLLM also supports OpenAI-compatible serving. See the [vLLM documentation](https://docs.vllm.ai/) for more details.
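
As a sketch, the model can be served with `vllm serve` and queried with any OpenAI client; the host, port, and API key below are the usual local defaults, not values from this card.

```python
# Launch the server first (shell):
#   vllm serve llmat/Qwen3-14B-NVFP4 --tensor-parallel-size 2
from openai import OpenAI

# Assumes vLLM is serving on localhost:8000; the server ignores the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="llmat/Qwen3-14B-NVFP4",
    messages=[
        {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```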