QwQ-32B TensorRT Optimized Version

Model Introduction

This repository contains a TensorRT-optimized version of the original QwQ-32B model, with the following features:

  • TensorRT Acceleration: Optimized for inference using NVIDIA TensorRT
  • Performance Boost: Roughly 1.8× the inference throughput of the original PyTorch implementation (see benchmarks below)
  • Hardware Optimization: Engine builds tuned for specific NVIDIA GPU architectures
  • Precision Retention: The FP16 build maintains the same inference accuracy as the original model

System Requirements

Hardware Requirements

  • GPU: NVIDIA GPU (Ampere architecture or newer recommended, e.g., A100, H100, RTX 3090/4090)
  • VRAM: At least 64GB GPU memory (FP16 precision)

Software Requirements

  • CUDA: Version 11.8 or higher
  • TensorRT: Version 8.6 or higher
  • Python: 3.8-3.10
  • Dependencies:
    pip install tensorrt transformers polygraphy
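
After installing, the toolchain can be sanity-checked in a few lines (a minimal sketch; assumes PyTorch is also present, as it is for the baseline comparison below):

    import tensorrt as trt
    import torch  # assumed available for the PyTorch baseline

    print("TensorRT:", trt.__version__)          # expect >= 8.6
    print("CUDA runtime:", torch.version.cuda)   # expect >= 11.8
    print("GPU:", torch.cuda.get_device_name(0))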
    

Performance Benchmarks

Environment            Throughput (tokens/sec)   Latency (ms/token)   VRAM Usage
Original (A100 80GB)   45                        22                   58GB
TensorRT (A100 80GB)   80                        12.5                 52GB

Test conditions: FP16 precision, input length 512, output length 128, batch size=1
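
For reference, the baseline row can be reproduced approximately with the original PyTorch model (a minimal timing sketch, assuming the upstream Qwen/QwQ-32B checkpoint and sufficient VRAM; absolute numbers vary with hardware and driver):

    import time
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/QwQ-32B")
    model = AutoModelForCausalLM.from_pretrained(
        "Qwen/QwQ-32B", torch_dtype=torch.float16, device_map="auto"
    )

    # Match the test conditions above: 512 input tokens, 128 output tokens, batch 1.
    prompt = "Explain the benefits of TensorRT optimization. " * 64
    inputs = tok(prompt, return_tensors="pt",
                 truncation=True, max_length=512).to(model.device)

    torch.cuda.synchronize()
    start = time.perf_counter()
    out = model.generate(**inputs, max_new_tokens=128,
                         min_new_tokens=128, do_sample=False)
    torch.cuda.synchronize()
    elapsed = time.perf_counter() - start

    new_tokens = out.shape[1] - inputs["input_ids"].shape[1]
    print(f"{new_tokens / elapsed:.1f} tokens/sec, "
          f"{1000 * elapsed / new_tokens:.1f} ms/token")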

Deployment Recommendations

  1. Precision Selection:

    • FP16: Recommended for most scenarios, balancing precision and performance
    • INT8: Requires additional quantization calibration, further reducing VRAM usage
  2. Optimization Configuration (a builder sketch follows this list):

    # Recommended configuration when building the TRT engine
    config = {
        "precision": "fp16",           # FP16: best accuracy/speed balance
        "max_input_length": 8192,      # enable YaRN before raising this
        "opt_batch_size": [1, 2, 4],   # batch sizes the engine is tuned for
        "max_output_length": 2048      # longest generation per request
    }
    
  3. Long Sequence Handling:

    • If processing sequences longer than 8K tokens, ensure the YaRN context extension is enabled
    • Set appropriate max_input_length when building the TRT engine
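
A minimal sketch of how the recommended configuration maps onto the TensorRT Python builder API. The ONNX file name and the "input_ids" tensor name are placeholders for illustration; the actual export pipeline used for this repository is not shown here:

    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )

    # "qwq-32b.onnx" is a hypothetical exported model file.
    parser = trt.OnnxParser(network, logger)
    with open("qwq-32b.onnx", "rb") as f:
        parser.parse(f.read())

    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)  # precision: fp16

    # One profile covering opt_batch_size [1, 2, 4] and max_input_length 8192.
    # Shapes are (batch_size, sequence_length).
    profile = builder.create_optimization_profile()
    profile.set_shape("input_ids", min=(1, 1), opt=(2, 512), max=(4, 8192))
    config.add_optimization_profile(profile)

    engine = builder.build_serialized_network(network, config)
    with open("qwq-32b-fp16.engine", "wb") as f:
        f.write(engine)

Because the precision flag and shape profiles are baked into the serialized engine, a separate build is needed for each target GPU architecture (see Notes below).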

Notes

  1. Model Differences:

    • This version is optimized for inference and does not support training or fine-tuning
    • Dynamic shape ranges (e.g., batch size) are fixed at engine-build time via optimization profiles and cannot be changed afterward
  2. Version Compatibility:

    • Ensure the TensorRT version matches the CUDA version
    • Different GPU architectures require separate engine builds
  3. Quantization Information:

    • FP16 version maintains the original model's precision
    • INT8 version may incur a slight accuracy loss; it requires the calibration step sketched after this list
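
A minimal sketch of the extra calibration step INT8 requires, using TensorRT's entropy-calibrator interface. The calibration data here is hypothetical; any representative set of prompts works, and the cache file lets later builds skip recalibration:

    import tensorrt as trt

    class PromptCalibrator(trt.IInt8EntropyCalibrator2):
        """Feeds representative batches to TensorRT during INT8 calibration."""

        def __init__(self, device_batches, cache_file="int8.cache"):
            super().__init__()
            # device_batches: GPU pointers (ints) to pre-staged calibration batches
            self.batches = iter(device_batches)
            self.cache_file = cache_file

        def get_batch_size(self):
            return 1

        def get_batch(self, names):
            # Return a list of device pointers, or None when data is exhausted.
            try:
                return [next(self.batches)]
            except StopIteration:
                return None

        def read_calibration_cache(self):
            try:
                with open(self.cache_file, "rb") as f:
                    return f.read()
            except FileNotFoundError:
                return None

        def write_calibration_cache(self, cache):
            with open(self.cache_file, "wb") as f:
                f.write(cache)

    # During engine building (see the FP16 builder sketch above):
    # config.set_flag(trt.BuilderFlag.INT8)
    # config.int8_calibrator = PromptCalibrator(device_batches)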

Acknowledgments

This optimized version is based on the following original work:

@misc{qwq32b,
    title = {QwQ-32B: Embracing the Power of Reinforcement Learning},
    url = {https://qwenlm.github.io/blog/qwq-32b/},
    author = {Qwen Team},
    month = {March},
    year = {2025}
}

Issue Reporting

For technical issues, please open an issue in this repository.

Note: Use of this model is subject to the original model's Apache 2.0 license.
