# Qwen3-1.7B-NVFP4A16
## Model Overview
- Model Architecture: Qwen/Qwen3-1.7B
- Input: Text
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4 (NVFP4) with group size 16 (illustrated in the sketch below)
  - Activation quantization: None (activations remain in FP16)
- Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
- Release Date: 8/4/2025
- Version: 1.0
- Model Developers: [Your Organization]
This model is a quantized version of Qwen/Qwen3-1.7B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
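For intuition on the weight scheme, the sketch below fake-quantizes one group of 16 weights onto the signed FP4 (E2M1) grid using a single absmax scale per group. It only illustrates the per-group-16 idea; it is not the exact NVFP4 procedure used by LLM Compressor, which additionally quantizes the group scales to FP8.

```python
import numpy as np

# Signed FP4 (E2M1) representable magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
FP4_GRID = np.concatenate([-FP4_GRID[::-1], FP4_GRID])

def fake_quantize_group(weights: np.ndarray) -> np.ndarray:
    """Round one group of 16 weights to the FP4 grid using an absmax group scale."""
    scale = np.abs(weights).max() / 6.0  # 6 is the largest FP4 magnitude
    idx = np.abs(weights[:, None] / scale - FP4_GRID[None, :]).argmin(axis=1)
    return FP4_GRID[idx] * scale         # dequantized ("fake-quantized") values

group = np.random.randn(16).astype(np.float32)
print(fake_quantize_group(group))
```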
## Installation
Install the required dependencies:

```bash
pip install llmcompressor==0.6.0.1 vllm==0.9.0
```
## Deployment
### Use with vLLM
This model can be deployed efficiently using the vLLM backend, as shown in the example below.
```python
from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "path/to/Qwen3-1.7B-NVFP4A16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

prompts = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompts, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)
```
vLLM also supports OpenAI-compatible serving. See the documentation for more details.
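For example, after starting a server with `vllm serve path/to/Qwen3-1.7B-NVFP4A16` (the path is a placeholder and the default port 8000 is assumed), the model can be queried with the OpenAI Python client:

```python
from openai import OpenAI

# Assumes a local server started with: vllm serve path/to/Qwen3-1.7B-NVFP4A16
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)
```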
## Creation
This model was created by applying LLM Compressor with post-training quantization (PTQ) without calibration samples, as presented in the code snippet below.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
MODEL_ID = "Qwen/Qwen3-1.7B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP4 with group size 16 via PTQ
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Apply quantization.
oneshot(model=model, recipe=recipe)

print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")

# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
```
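As a quick sanity check, the saved directory can be inspected for the quantization metadata that the compressed-tensors format embeds in `config.json` (a minimal sketch; exact field names depend on the library version):

```python
import json
import os

# SAVE_DIR matches the directory used in the snippet above.
SAVE_DIR = "../../../model/Qwen3-1.7B-NVFP4A16"

with open(os.path.join(SAVE_DIR, "config.json")) as f:
    config = json.load(f)

# The NVFP4 scheme and ignored modules (e.g. lm_head) should appear here.
print(json.dumps(config.get("quantization_config", {}), indent=2))
```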
## Evaluation
This model was evaluated on the MMLU-Redux, Math500, IFEval, and RULER-NIAH benchmarks. All evaluations were conducted using a custom evaluation framework with the vLLM backend.
### Accuracy
| Category | Metric | Qwen/Qwen3-1.7B | Qwen3-1.7B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (default, non-thinking) | 64.4% | 55.23% | 85.8% |
| Mathematical Reasoning | Math500 (default, thinking) | 93.4% | 89.6% | 95.9% |
| Instruction Following | IFEval (strict prompt-level accuracy) | 68.2% | 66.17% | 97.0% |
| Long Context | RULER-NIAH-32k | TBD | 76.21% | TBD |
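Recovery is the quantized model's score expressed as a percentage of the unquantized baseline, for example:

```python
# Recovery (%) = quantized score / baseline score * 100
# Values taken from the MMLU-Redux row above.
baseline, quantized = 64.4, 55.23
print(f"Recovery: {quantized / baseline * 100:.1f}%")  # 85.8%
```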
### Reproduction
The results were obtained using the following commands:
#### MMLU-Redux

```bash
# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=mmlu-redux \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/mmlu_redux_nvfp4_1.7vllm \
    debug=false
```
#### Math500

```bash
# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=math500 \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/math500_nvfp4_1.7vllm_thinking \
    debug=false
```
#### IFEval

```bash
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ifeval \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/ifeval_nvfp4_1.7vllm \
    debug=false
```
#### RULER-NIAH-32k

```bash
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ruler-niah-32k \
    eval_dataset_config=ruler-32k \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.5 \
    predictor_conf.vllm.max_num_seqs=1 \
    predictor_conf.vllm.max_num_batched_tokens=16384 \
    predictor_conf.vllm.max_seq_len=32768 \
    predictor_conf.vllm.enable_prefix_caching=false \
    +predictor_conf.vllm.cpu_offload_gb=8 \
    +predictor_conf.vllm.device=auto \
    output_dir=/app/outputs/ruler_niah_nvfp4_1.7vllm \
    debug=false
```
## Technical Details
- Quantization Scheme: NVFP4A16 (FP4 weights with FP16 activations, group size 16)
- Excluded Layers: The language model head (`lm_head`) is not quantized
- Memory Reduction: Approximately 75% reduction in model size (see the estimate below)
- Inference Backend: Optimized for vLLM with tensor parallelism support
- Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
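The ~75% figure is a rough estimate. A back-of-envelope calculation that also accounts for the per-group scales (assuming 4-bit values plus one 8-bit scale per group of 16, and ignoring unquantized layers such as `lm_head`) lands slightly lower:

```python
# Rough weight-memory estimate for NVFP4 vs. 16-bit weights.
# Assumption: 4-bit values plus one 8-bit group scale per 16 weights;
# per-tensor scales and unquantized layers (e.g. lm_head) are ignored.
bits_per_weight_bf16 = 16
bits_per_weight_nvfp4 = 4 + 8 / 16  # value bits + amortized group-scale bits

reduction = 1 - bits_per_weight_nvfp4 / bits_per_weight_bf16
print(f"Approximate weight-size reduction: {reduction:.1%}")  # ~71.9%
```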
## Configuration Notes
- GPU memory utilization can be adjusted between 0.5 and 0.9 depending on available hardware
- For long-context evaluation (32k), reduced memory utilization (0.5) and CPU offload (8 GB) are recommended (see the example configuration below)
- Prefix caching can be disabled for memory-constrained environments
- Tensor parallel size of 1 is sufficient for the 1.7B parameter model
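For reference, the long-context settings from the RULER-NIAH command above translate roughly into vLLM's Python engine arguments as follows. This is a sketch under the assumption that the evaluation framework forwards these options directly to vLLM; the model path is a placeholder.

```python
from vllm import LLM

# Illustrative long-context configuration mirroring the RULER-NIAH reproduction command.
llm = LLM(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.5,   # leave headroom for the 32k KV cache
    max_model_len=32768,          # context length tested with RULER-NIAH
    max_num_seqs=1,
    max_num_batched_tokens=16384,
    enable_prefix_caching=False,
    cpu_offload_gb=8,             # offload part of the weights to CPU RAM
)
```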