Qwen3-1.7B-NVFP4A16

Model Overview

  • Model Architecture: Qwen/Qwen3-1.7B
  • Input: Text
  • Output: Text
  • Model Optimizations:
    • Weight quantization: FP4 (NVFP4) with a group size of 16
    • Activation quantization: none (activations remain in FP16)
  • Out-of-scope: Use in any manner that violates applicable laws or regulations (including trade compliance laws). Use in languages other than English.
  • Release Date: 8/4/2025
  • Version: 1.0
  • Model Developers: [Your Organization]

This model is a quantized version of Qwen/Qwen3-1.7B. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
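The NVFP4A16 scheme stores weights as 4-bit floating-point values with one shared scale per group of 16 consecutive weights, while activations stay in 16-bit precision. The snippet below is a simplified, self-contained illustration of group-wise 4-bit fake quantization; the helper fake_quant_fp4_groupwise is a hypothetical sketch and not the LLM Compressor implementation (real NVFP4 also quantizes the per-group scales to FP8 and keeps a per-tensor scale).

import torch

# Representable magnitudes of the FP4 (E2M1) format used for NVFP4 weights.
FP4_LEVELS = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fake_quant_fp4_groupwise(w: torch.Tensor, group_size: int = 16) -> torch.Tensor:
    """Simulate per-group FP4 weight quantization (quantize, then dequantize)."""
    orig_shape = w.shape
    groups = w.reshape(-1, group_size)                       # one scale per 16 weights
    scales = groups.abs().amax(dim=1, keepdim=True) / FP4_LEVELS.max()
    scales = torch.clamp(scales, min=1e-12)                  # avoid division by zero
    scaled = groups / scales                                  # map each group into the FP4 range
    # Round each value to the nearest representable FP4 magnitude, keeping the sign.
    idx = (scaled.abs().unsqueeze(-1) - FP4_LEVELS).abs().argmin(dim=-1)
    quant = FP4_LEVELS[idx] * scaled.sign()
    return (quant * scales).reshape(orig_shape)               # dequantized weights

w = torch.randn(4, 32)
w_q = fake_quant_fp4_groupwise(w)
print("max abs error:", (w - w_q).abs().max().item())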

Installation

Install the required dependencies:

pip install llmcompressor==0.6.0.1 vllm==0.9.0
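An optional sanity check that the pinned versions are actually installed:

from importlib.metadata import version

# Confirm the versions pinned in the install step above.
for pkg in ("llmcompressor", "vllm"):
    print(pkg, version(pkg))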

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend, as shown in the example below.

from vllm import LLM, SamplingParams
from transformers import AutoTokenizer

model_id = "path/to/Qwen3-1.7B-NVFP4A16"
number_gpus = 1

sampling_params = SamplingParams(temperature=0.6, top_p=0.9, max_tokens=256)

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who are you?"},
]

# Render the chat messages into a single prompt string.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

llm = LLM(model=model_id, tensor_parallel_size=number_gpus)

outputs = llm.generate(prompt, sampling_params)

generated_text = outputs[0].outputs[0].text
print(generated_text)

vLLM also supports OpenAI-compatible serving. See the documentation for more details.
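As a sketch, the model can be served with the OpenAI-compatible server and queried with the standard openai Python client; the model path and port below are placeholders.

# Start the server first (shell):
#   vllm serve path/to/Qwen3-1.7B-NVFP4A16 --port 8000
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    top_p=0.9,
    max_tokens=256,
)
print(response.choices[0].message.content)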

Creation

This model was created by applying LLM Compressor with post-training quantization (PTQ) without calibration samples, as presented in the code snippet below.

from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation

# Load model.
MODEL_ID = "Qwen/Qwen3-1.7B"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Configure the quantization algorithm and scheme.
# In this case, we:
#   * quantize the weights to FP4 with a group size of 16 via data-free PTQ
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4A16", ignore=["lm_head"])

# Apply quantization.
oneshot(model=model, recipe=recipe)

print("\n\n========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
input_ids = tokenizer("Hello my name is", return_tensors="pt").input_ids.to(
    model.device
)
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
print("==========================================\n\n")

# Save to disk in compressed-tensors format.
SAVE_DIR = f"../../../model/{MODEL_ID.split('/')[-1]}-NVFP4A16"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
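After saving, the compressed-tensors quantization metadata is written into the model's config.json. A minimal sketch for inspecting it, reusing SAVE_DIR from the script above:

import json
import os

# Read back the quantization_config that LLM Compressor wrote alongside the weights.
with open(os.path.join(SAVE_DIR, "config.json")) as f:
    cfg = json.load(f)
print(json.dumps(cfg.get("quantization_config", {}), indent=2))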

Evaluation

This model was evaluated on the MMLU-Redux, Math500, IFEval, and RULER-NIAH benchmarks. All evaluations were conducted using a custom evaluation framework with the vLLM backend.

Accuracy

| Category | Metric | Qwen/Qwen3-1.7B | Qwen3-1.7B-NVFP4A16 (this model) | Recovery (%) |
|---|---|---|---|---|
| General Knowledge | MMLU-Redux (non-thinking mode) | 64.4% | 55.23% | 85.8% |
| Mathematical Reasoning | Math500 (thinking mode) | 93.4% | 89.6% | 95.9% |
| Instruction Following | IFEval (strict, prompt-level accuracy) | 68.2% | 66.17% | 97.0% |
| Long Context | RULER-NIAH-32k | TBD | 76.21% | TBD |
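Recovery is the quantized score expressed as a percentage of the baseline score. For example, using the Math500 row above:

# Recovery (%) = quantized score / baseline score * 100
baseline, quantized = 93.4, 89.6          # Math500 scores from the table above
recovery = quantized / baseline * 100
print(f"{recovery:.1f}%")                 # -> 95.9%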

Reproduction

The results were obtained using the following commands:

MMLU-Redux

# MMLU-Redux without thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=mmlu-redux \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/mmlu_redux_nvfp4_1.7vllm \
    debug=false

Math500

# Math500 with thinking
HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=math500 \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/math500_nvfp4_1.7vllm_thinking \
    debug=false

IFEval

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ifeval \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.9 \
    output_dir=/app/outputs/ifeval_nvfp4_1.7vllm \
    debug=false

RULER-NIAH-32k

HYDRA_FULL_ERROR=1 python3 infer.py \
    --config-name=qwen3_1.7B_bf16 \
    model_name=/app/model/Qwen3-1.7B-NVFP4A16 \
    eval_dataset=ruler-niah-32k \
    eval_dataset_config=ruler-32k \
    eval_predictor=vllm \
    predictor_conf.vllm.tensor_parallel_size=1 \
    predictor_conf.vllm.gpu_memory_utilization=0.5 \
    predictor_conf.vllm.max_num_seqs=1 \
    predictor_conf.vllm.max_num_batched_tokens=16384 \
    predictor_conf.vllm.max_seq_len=32768 \
    predictor_conf.vllm.enable_prefix_caching=false \
    +predictor_conf.vllm.cpu_offload_gb=8 \
    +predictor_conf.vllm.device=auto \
    output_dir=/app/outputs/ruler_niah_nvfp4_1.7vllm \
    debug=false

Technical Details

  • Quantization Scheme: NVFP4A16 (FP4 weights with a group size of 16; FP16 activations)
  • Excluded Layers: Language model head (lm_head) is not quantized
  • Memory Reduction: Roughly 70-75% reduction in weight memory relative to BF16 for the quantized layers (a rough estimate is sketched after this list)
  • Inference Backend: Optimized for vLLM with tensor parallelism support
  • Context Length: Supports up to 32k tokens (as tested with RULER-NIAH)
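The memory-reduction figure can be sanity-checked with a back-of-the-envelope estimate, assuming one 8-bit scale per group of 16 weights (the exact on-disk layout may differ, and unquantized layers such as lm_head and the embeddings are ignored):

# Rough weight-memory estimate for NVFP4A16 vs. BF16.
bits_bf16 = 16
bits_nvfp4 = 4 + 8 / 16          # 4-bit values + amortized 8-bit group scale = 4.5 bits
reduction = 1 - bits_nvfp4 / bits_bf16
print(f"~{reduction:.0%} smaller weights in quantized layers")   # -> ~72% smaller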

Configuration Notes

  • GPU memory utilization can be adjusted between 0.5 and 0.9 depending on available hardware
  • For long context evaluation (32k), reduced memory utilization (0.5) and CPU offload (8 GB) are recommended (see the example configuration after this list)
  • Prefix caching can be disabled for memory-constrained environments
  • Tensor parallel size of 1 is sufficient for the 1.7B parameter model
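Assuming these notes map onto vLLM's offline LLM constructor, a long-context configuration mirroring the RULER-NIAH command above might look like this (values are illustrative; max_seq_len in the command corresponds to max_model_len here):

from vllm import LLM

# Long-context (32k) configuration: lower GPU memory utilization, CPU offload,
# prefix caching disabled, and a single sequence in flight.
llm = LLM(
    model="path/to/Qwen3-1.7B-NVFP4A16",
    tensor_parallel_size=1,
    gpu_memory_utilization=0.5,
    cpu_offload_gb=8,
    max_model_len=32768,
    max_num_seqs=1,
    max_num_batched_tokens=16384,
    enable_prefix_caching=False,
)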