QwQ-32B TensorRT Optimized Version

Model Introduction

This repository contains a TensorRT-optimized version of the original QwQ-32B model, with the following features:

  • TensorRT Acceleration: Optimized for inference using NVIDIA TensorRT
  • Performance Boost: Roughly 1.8× the inference throughput of the original PyTorch implementation (see benchmarks below)
  • Hardware Optimization: Engine builds tuned for specific NVIDIA GPU architectures
  • Precision Retention: The FP16 build maintains the same inference accuracy as the original model

System Requirements

Hardware Requirements

  • GPU: NVIDIA GPU (Ampere architecture or newer recommended, e.g., A100, H100, RTX 3090/4090)
  • VRAM: At least 64GB GPU memory (FP16 precision)

Software Requirements

  • CUDA: Version 11.8 or higher
  • TensorRT: Version 8.6 or higher
  • Python: 3.8-3.10
  • Dependencies:
    pip install tensorrt transformers polygraphy
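
After installing, the toolchain can be sanity-checked in a few lines (a minimal sketch; assumes PyTorch is also present, as it is for the baseline comparison below):

    import tensorrt as trt
    import torch  # assumed available for the PyTorch baseline

    print("TensorRT:", trt.__version__)          # expect >= 8.6
    print("CUDA runtime:", torch.version.cuda)   # expect >= 11.8
    print("GPU:", torch.cuda.get_device_name(0))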
    

Performance Benchmarks

Environment            Throughput (tokens/sec)   Latency (ms/token)   VRAM Usage
Original (A100 80GB)   45                        22                   58GB
TensorRT (A100 80GB)   80                        12.5                 52GB

Test conditions: FP16 precision, input length 512, output length 128, batch size=1
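
For reference, the baseline row can be reproduced approximately with the original PyTorch model (a minimal timing sketch, assuming the upstream Qwen/QwQ-32B checkpoint and sufficient VRAM; absolute numbers vary with hardware and driver):

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/QwQ-32B", torch_dtype=torch.float16, device_map="auto"
    )

    # Match the test conditions above: 512 input tokens, 128 output tokens, batch 1.
    prompt = "Explain the benefits of TensorRT optimization. " * 64
    inputs = tok(prompt, return_tensors="pt",
                 truncation=True, max_length=512).to(model.device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128,
                         min_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens / elapsed:.1f} tokens/sec, "
          f"{1000 * elapsed / new_tokens:.1f} ms/token")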

Deployment Recommendations

  1. Precision Selection:

    • FP16: Recommended for most scenarios, balancing precision and performance
    • INT8: Requires additional quantization calibration, further reducing VRAM usage
  2. Optimization Configuration (a builder sketch follows this list):

    # Recommended configuration when building the TRT engine
    config = {
        "precision": "fp16",           # FP16: best accuracy/speed balance
        "max_input_length": 8192,      # enable YaRN before raising this
        "opt_batch_size": [1, 2, 4],   # batch sizes the engine is tuned for
        "max_output_length": 2048      # longest generation per request
    }
    
  3. Long Sequence Handling:

    • If processing sequences longer than 8K tokens, ensure the YaRN context extension is enabled
    • Set appropriate max_input_length when building the TRT engine
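
A minimal sketch of how the recommended configuration maps onto the TensorRT Python builder API. The ONNX file name and the "input_ids" tensor name are placeholders for illustration; the actual export pipeline used for this repository is not shown here:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )

    # "qwq-32b.onnx" is a hypothetical exported model file.
    parser = trt.OnnxParser(network, logger)
    with open("qwq-32b.onnx", "rb") as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # precision: fp16

    # One profile covering opt_batch_size [1, 2, 4] and max_input_length 8192.
    # Shapes are (batch_size, sequence_length).
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids", min=(1, 1), opt=(2, 512), max=(4, 8192))
    config.add_optimization_profile(profile)

    engine = builder.build_serialized_network(network, config)
    with open("qwq-32b-fp16.engine", "wb") as f:
        f.write(engine)

Because the precision flag and shape profiles are baked into the serialized engine, a separate build is needed for each target GPU architecture (see Notes below).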

Notes

  1. Model Differences:

    • This version is optimized for inference and does not support training or fine-tuning
    • Dynamic shape ranges (e.g., batch size) are fixed at engine-build time via optimization profiles and cannot be changed afterward
  2. Version Compatibility:

    • Ensure the TensorRT version matches the CUDA version
    • Different GPU architectures require separate engine builds
  3. Quantization Information:

    • FP16 version maintains the original model's precision
    • INT8 version may incur a slight accuracy loss; it requires the calibration step sketched after this list
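
A minimal sketch of the extra calibration step INT8 requires, using TensorRT's entropy-calibrator interface. The calibration data here is hypothetical; any representative set of prompts works, and the cache file lets later builds skip recalibration:

    import tensorrt as trt

    class PromptCalibrator(trt.IInt8EntropyCalibrator2):
        """Feeds representative batches to TensorRT during INT8 calibration."""

        def __init__(self, device_batches, cache_file="int8.cache"):
            super().__init__()
            # device_batches: GPU pointers (ints) to pre-staged calibration batches
            self.batches = iter(device_batches)
            self.cache_file = cache_file

        def get_batch_size(self):
            return 1

        def get_batch(self, names):
            # Return a list of device pointers, or None when data is exhausted.
            try:
                return [next(self.batches)]
            except StopIteration:
                return None

        def read_calibration_cache(self):
            try:
                with open(self.cache_file, "rb") as f:
                    return f.read()
            except FileNotFoundError:
                return None

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

    # During engine building (see the FP16 builder sketch above):
    # config.set_flag(trt.BuilderFlag.INT8)
    # config.int8_calibrator = PromptCalibrator(device_batches)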

Acknowledgments

This optimized version is based on the following original work:

@misc{qwq32b,
    title = {QwQ-32B: Embracing the Power of Reinforcement Learning},
    url = {https://qwenlm.github.io/blog/qwq-32b/},
    author = {Qwen Team},
    month = {March},
    year = {2025}
}

Issue Reporting

For technical issues, please open an issue in this repository.

Note: Use of this model is subject to the original model's Apache 2.0 license.
