This repository contains a TensorRT-optimized build of the QwQ-32B model, based on the original QwQ-32B weights. Installation, benchmark results, and recommended build settings are summarized below.
```bash
pip install tensorrt transformers polygraphy
```
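After installing, a quick sanity check such as the one below can confirm that the TensorRT Python bindings import correctly and that the tokenizer loads. The model id `Qwen/QwQ-32B` is an assumption (the original upstream repository); replace it with wherever your copy of the weights and tokenizer lives.

```python
import tensorrt as trt
from transformers import AutoTokenizer

# Confirm the TensorRT Python bindings are importable and report their version.
print("TensorRT version:", trt.__version__)

# Load the tokenizer of the original model. The repo id is an assumption;
# point it at the actual location of the tokenizer files if different.
tokenizer = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
print("Tokenizer vocab size:", tokenizer.vocab_size)
```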
| Environment | Throughput (tokens/sec) | Latency (ms/token) | VRAM Usage |
|---|---|---|---|
| Original (A100 80GB) | 45 | 22 | 58 GB |
| TensorRT (A100 80GB) | 80 | 12.5 | 52 GB |
Test conditions: FP16 precision, input length 512, output length 128, batch size 1.
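The benchmarking harness is not included in this repository; the sketch below shows one way such per-token numbers can be measured for the original Hugging Face model (PyTorch and a CUDA GPU are assumed, and the model id and prompt are placeholders). The TensorRT engine can be timed the same way around its own generation call.

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/QwQ-32B"  # assumption: the original upstream weights

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16
).to("cuda")

# Match the table's conditions: 512 input tokens, 128 output tokens, batch size 1.
prompt = "Hello " * 600  # placeholder text, truncated to 512 tokens below
inputs = tokenizer(prompt, return_tensors="pt",
                   truncation=True, max_length=512).to("cuda")

torch.cuda.synchronize()
start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128, do_sample=False)
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

new_tokens = outputs.shape[1] - inputs["input_ids"].shape[1]
print(f"Throughput: {new_tokens / elapsed:.1f} tokens/sec")
print(f"Latency:    {1000 * elapsed / new_tokens:.1f} ms/token")
```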
**Precision Selection:** FP16 is used for the benchmarks above and is the recommended precision.

**Optimization Configuration:** recommended settings when building the TRT engine (a build sketch follows the configuration block):
```python
# Recommended configuration when building the TRT engine
config = {
    "precision": "fp16",
    "max_input_length": 8192,
    "opt_batch_size": [1, 2, 4],
    "max_output_length": 2048,
}
```
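How these settings map onto an engine build depends on the toolchain; the sketch below uses the raw TensorRT Python API and assumes an ONNX export of the model is available at `model.onnx` with an input tensor named `input_ids` (both are hypothetical placeholders, not files shipped with this repository). With TensorRT-LLM, the equivalent values would instead be passed to its build tooling.

```python
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)

# Parse the (hypothetical) ONNX export of the model.
parser = trt.OnnxParser(network, logger)
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        raise RuntimeError("Failed to parse model.onnx")

build_config = builder.create_builder_config()
build_config.set_flag(trt.BuilderFlag.FP16)  # precision: "fp16"

# Optimization profile reflecting opt_batch_size=[1, 2, 4] and max_input_length=8192.
profile = builder.create_optimization_profile()
profile.set_shape("input_ids",
                  (1, 1),      # min batch / sequence length
                  (2, 2048),   # typical ("opt") batch / sequence length
                  (4, 8192))   # max batch / max_input_length
build_config.add_optimization_profile(profile)

# Build and serialize the engine to disk.
engine_bytes = builder.build_serialized_network(network, build_config)
with open("qwq32b_fp16.plan", "wb") as f:
    f.write(engine_bytes)
```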
**Long Sequence Handling:** set `max_input_length` appropriately when building the TRT engine.

**Model Differences:**

**Version Compatibility:**

**Quantization Information:**
This optimized version is based on the following original work:
```bibtex
@misc{qwq32b,
  title  = {QwQ-32B: Embracing the Power of Reinforcement Learning},
  url    = {https://qwenlm.github.io/blog/qwq-32b/},
  author = {Qwen Team},
  month  = {March},
  year   = {2025}
}
```
For technical issues, please submit an issue via:
Note: Use of this model is subject to the original model's Apache 2.0 License.