# Qwen3-32B W8A8 Quantized
This is a W8A8 quantized version of Qwen/Qwen3-32B using LLM-Compressor.
## Quantization Details
- Base Model: Qwen/Qwen3-32B
- Quantization Method: W8A8 (8-bit weights, 8-bit activations)
- Quantization Framework: LLM-Compressor (a recipe sketch follows this list)
- Model Size: Roughly half the size of the original FP16/BF16 checkpoint (the 32.8B parameter count is unchanged)
- Precision: INT8 for both weights and activations
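The exact recipe used to produce this checkpoint is not documented here. As a rough illustration only, a typical one-shot W8A8 (INT8) quantization with LLM-Compressor looks something like the sketch below; the calibration dataset, sample count, and SmoothQuant/GPTQ settings are assumptions, not the actual values used for this repository.

```python
# Hypothetical W8A8 recipe sketch; the actual settings for this checkpoint may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

model_id = "Qwen/Qwen3-32B"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

recipe = [
    # Shift activation outliers into the weights so INT8 activations quantize cleanly.
    SmoothQuantModifier(smoothing_strength=0.8),
    # INT8 weights and activations for all Linear layers; keep lm_head unquantized.
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model=model,
    dataset="open_platypus",        # assumed calibration set
    recipe=recipe,
    max_seq_length=2048,            # assumed
    num_calibration_samples=512,    # assumed
    output_dir="Qwen3-32B-W8A8",
)
```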
## Performance Considerations
- Memory Usage: Significantly reduced memory footprint compared to the original FP16/BF16 model (see the rough estimate below)
- Inference Speed: Faster inference due to reduced precision and smaller weight size
- Accuracy: Minimal accuracy loss is expected relative to the original model; no benchmark results are reported for this checkpoint
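As a back-of-the-envelope check on the memory claim (weights only; activations, KV cache, and runtime overhead excluded):

```python
# Rough weight memory: 2 bytes/param at BF16 vs 1 byte/param at INT8.
params = 32.8e9
print(f"BF16 weights ~= {params * 2 / 1e9:.1f} GB")   # ~65.6 GB
print(f"INT8 weights ~= {params * 1 / 1e9:.1f} GB")   # ~32.8 GB
```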
## Hardware Requirements
This quantized model has lower hardware requirements than the original (a serving sketch follows the list below):
- Memory: Roughly half the GPU memory of the original FP16/BF16 model
- Compute: Compatible with INT8 tensor operations
- Recommended: GPUs with tensor core support for optimal INT8 performance
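One common way to run an LLM-Compressor (compressed-tensors) W8A8 checkpoint is vLLM. The sketch below is an assumption about how this repository is packaged, not a tested command; adjust the model ID and memory settings for your hardware.

```python
# Hedged sketch: serving the quantized checkpoint with vLLM, which can load
# compressed-tensors (LLM-Compressor) W8A8 models. Untested for this exact repo.
from vllm import LLM, SamplingParams

llm = LLM(
    model="ramblingpolymath/Qwen3-32B-W8A8",  # this repository
    max_model_len=32768,                      # native context length
)

params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, max_tokens=1024)
outputs = llm.generate(["Explain grouped-query attention in one paragraph."], params)
print(outputs[0].outputs[0].text)
```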
## All Original Features Preserved
This quantized model retains all the capabilities of the original Qwen3-32B:
- Thinking Mode Support: Seamless switching between thinking and non-thinking modes
- Enhanced Reasoning: Superior performance in mathematics, code generation, and logical reasoning
- Multilingual Support: 100+ languages and dialects
- Agent Capabilities: Tool calling and external integration
- Long Context: Native 32,768 token support, extensible to 131,072 with YaRN (see the configuration sketch below)
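To go beyond the native 32,768 tokens, the original Qwen3 card describes enabling YaRN rope scaling, and notes that static YaRN should only be enabled when long contexts are actually needed. A minimal Transformers sketch is shown below; the factor 4.0 targets 131,072 tokens, and it is an assumption that this quantized checkpoint keeps the standard Qwen3 config layout and loads through Transformers (the compressed-tensors package may be required).

```python
# Sketch: enabling YaRN rope scaling for long-context use with Transformers.
# Assumes the quantized checkpoint keeps the standard Qwen3 config keys.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

model_id = "ramblingpolymath/Qwen3-32B-W8A8"
config = AutoConfig.from_pretrained(model_id)
config.rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,                              # 4.0 x 32,768 = 131,072 tokens
    "original_max_position_embeddings": 32768,
}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, config=config, device_map="auto")
```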
## Switching Between Thinking and Non-Thinking Mode
The quantized model supports the same thinking mode controls as the original:
### `enable_thinking=True` (default)

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # default; the model emits a <think>...</think> block before the answer
)
```
### `enable_thinking=False`

```python
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # disable the <think> block for faster, direct responses
)
```
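For context, here is a minimal end-to-end sketch: load the model, build the prompt, generate, and split off the thinking content. The `messages` value is a placeholder, the sampling values follow the Best Practices section below, and the string-based `</think>` split is a simplification of the token-id lookup used in the original Qwen3 card. Loading this W8A8 checkpoint with plain Transformers is an assumption (the compressed-tensors package may be required).

```python
# Minimal end-to-end sketch, assuming the checkpoint loads through Transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ramblingpolymath/Qwen3-32B-W8A8"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=4096,                     # raise toward 32,768 for hard problems
    do_sample=True,
    temperature=0.6, top_p=0.95, top_k=20,   # thinking-mode settings from Best Practices
)
generated = tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True)

# Split the reasoning from the final answer on the closing </think> tag.
thinking, _, answer = generated.partition("</think>")
print("thinking:", thinking.strip())
print("answer:", answer.strip())
```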
## Best Practices
Follow the same best practices as the original model:
- Sampling Parameters (see the preset sketch after this list):
  - Thinking mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0
  - Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0
- Output Length: Use 32,768 tokens for most queries and 38,912 for complex problems
- Avoid Greedy Decoding: Do not use greedy decoding in thinking mode
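As a quick reference, the two recommended presets expressed as vLLM `SamplingParams` (values taken from the list above; the choice of vLLM here is purely illustrative):

```python
from vllm import SamplingParams

# Recommended presets from the list above.
THINKING = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, max_tokens=32768)
NON_THINKING = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0, max_tokens=32768)
```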
## Original Model Information
For complete documentation, benchmarks, and detailed usage instructions, please refer to the original Qwen3-32B model card.
Key Specifications (from original model):
- Type: Causal Language Models
- Parameters: 32.8B total, 31.2B non-embedding
- Layers: 64
- Attention Heads: 64 for Q, 8 for KV (GQA)
- Context Length: 32,768 tokens natively, 131,072 with YaRN
## Citation
If you use this quantized model, please cite the original Qwen3 work.
```bibtex
@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}
```
## Disclaimer
This is an unofficial quantized version. For the official model and support, please refer to the original Qwen3-32B repository.