Qwen3-32B W8A8 Quantized

This is a W8A8 quantized version of Qwen/Qwen3-32B using LLM-Compressor.

Quantization Details

  • Base Model: Qwen/Qwen3-32B
  • Quantization Method: W8A8 (8-bit weights, 8-bit activations)
  • Quantization Framework: LLM-Compressor
  • Model Size: roughly half the original BF16 checkpoint (about 33 GB of INT8 weights vs. about 66 GB in BF16); the 32.8B parameter count is unchanged
  • Precision: INT8 for both weights and activations
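
The exact recipe and calibration data used for this checkpoint are not documented here. As an illustrative sketch only, a W8A8 INT8 model is typically produced with LLM-Compressor via SmoothQuant followed by GPTQ; the calibration dataset, sample count, and output path below are assumptions:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes linear layers to INT8 weights and activations; lm_head is skipped
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="Qwen/Qwen3-32B",
    dataset="open_platypus",          # illustrative calibration set
    recipe=recipe,
    output_dir="Qwen3-32B-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)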

Performance Considerations

  • Memory Usage: roughly half the memory footprint of the original FP16/BF16 model
  • Inference Speed: typically faster, since INT8 halves weight memory traffic and can use INT8 tensor cores
  • Accuracy: minimal accuracy loss is typical for W8A8 quantization, though no benchmarks are reported for this checkpoint

Hardware Requirements

This quantized model has lower hardware requirements than the original:

  • Memory: approximately half the GPU memory of the BF16 original
  • Compute: Compatible with INT8 tensor operations
  • Recommended: GPUs with tensor core support for optimal INT8 performance
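
As a rough check of the memory claim: 32.8B parameters occupy about 66 GB in BF16 but about 33 GB at one byte per weight, before KV cache and activations. Assuming the checkpoint is in LLM-Compressor's compressed-tensors format, vLLM loads it directly; a minimal serving sketch (repo id taken from this repository):

from vllm import LLM, SamplingParams

# vLLM picks up the quantization config embedded in the checkpoint
llm = LLM(model="ramblingpolymath/Qwen3-32B-W8A8")
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, max_tokens=1024)
out = llm.generate(["Explain W8A8 quantization in one paragraph."], params)
print(out[0].outputs[0].text)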

All Original Features Preserved

This quantized model retains all the capabilities of the original Qwen3-32B:

  • Thinking Mode Support: Seamless switching between thinking and non-thinking modes
  • Enhanced Reasoning: Superior performance in mathematics, code generation, and logical reasoning
  • Multilingual Support: 100+ languages and dialects
  • Agent Capabilities: Tool calling and external integration
  • Long Context: Native 32,768 token support, extensible to 131,072 with YaRN
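
For contexts beyond 32,768 tokens, the original Qwen3 card enables YaRN with a rope_scaling factor of 4.0 (32,768 × 4 = 131,072). A sketch of the same setting in vLLM; passing rope_scaling as a Python keyword (mirroring vLLM's --rope-scaling flag) is an assumption here:

from vllm import LLM

llm = LLM(
    model="ramblingpolymath/Qwen3-32B-W8A8",
    rope_scaling={"rope_type": "yarn", "factor": 4.0,
                  "original_max_position_embeddings": 32768},
    max_model_len=131072,
)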

Switching Between Thinking and Non-Thinking Mode

The quantized model supports the same thinking mode controls as the original:

enable_thinking=True (Default)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ramblingpolymath/Qwen3-32B-W8A8")
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# Thinking mode (default): the model emits a <think>...</think> block before its answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)

enable_thinking=False

# Non-thinking mode: the model responds directly, with no <think> block
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
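
A minimal end-to-end sketch continuing from the template call above. It follows the parsing approach from the original Qwen3 model card, where token id 151668 is </think>; loading the INT8 checkpoint through transformers assumes the compressed-tensors package is installed:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ramblingpolymath/Qwen3-32B-W8A8", torch_dtype="auto", device_map="auto"
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()

# Split the reasoning from the answer on the </think> token (id 151668)
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> present (e.g. enable_thinking=False)
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip()
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip()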

Best Practices

Follow the same best practices as the original model:

  1. Sampling Parameters (shown as code after this list):

    • Thinking mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0
    • Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0
  2. Output Length: Use 32,768 tokens for most queries, 38,912 for complex problems

  3. Avoid Greedy Decoding: Do not use greedy decoding in thinking mode, as it can lead to performance degradation and endless repetition
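
The two sampling presets expressed as vLLM SamplingParams (values copied from the list above; the variable names are illustrative):

from vllm import SamplingParams

thinking = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                          max_tokens=32768)
non_thinking = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
                              max_tokens=32768)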

Original Model Information

For complete documentation, benchmarks, and detailed usage instructions, please refer to the original Qwen3-32B model card.

Key Specifications (from original model):

  • Type: Causal Language Models
  • Parameters: 32.8B total, 31.2B non-embedding
  • Layers: 64
  • Attention Heads: 64 for Q, 8 for KV (GQA)
  • Context Length: 32,768 tokens natively, 131,072 with YaRN

Citation

If you use this quantized model, please cite the original Qwen3 work.

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}, 
}

Disclaimer

This is an unofficial quantized version. For the official model and support, please refer to the original Qwen3-32B repository.
