Qwen3-32B W8A8 Quantized

This is a W8A8 quantized version of Qwen/Qwen3-32B using LLM-Compressor.

Quantization Details

  • Base Model: Qwen/Qwen3-32B
  • Quantization Method: W8A8 (8-bit weights, 8-bit activations)
  • Quantization Framework: LLM-Compressor
  • Model Size: roughly half the original BF16 checkpoint (about 33 GB of INT8 weights vs. about 66 GB in BF16); the 32.8B parameter count is unchanged
  • Precision: INT8 for both weights and activations
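
The exact recipe and calibration data used for this checkpoint are not documented here. As an illustrative sketch only, a W8A8 INT8 model is typically produced with LLM-Compressor via SmoothQuant followed by GPTQ; the calibration dataset, sample count, and output path below are assumptions:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier
from llmcompressor.modifiers.smoothquant import SmoothQuantModifier

# SmoothQuant migrates activation outliers into the weights, then GPTQ
# quantizes linear layers to INT8 weights and activations; lm_head is skipped
recipe = [
    SmoothQuantModifier(smoothing_strength=0.8),
    GPTQModifier(targets="Linear", scheme="W8A8", ignore=["lm_head"]),
]

oneshot(
    model="Qwen/Qwen3-32B",
    dataset="open_platypus",          # illustrative calibration set
    recipe=recipe,
    output_dir="Qwen3-32B-W8A8",
    max_seq_length=2048,
    num_calibration_samples=512,
)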

Performance Considerations

  • Memory Usage: roughly half the memory footprint of the original FP16/BF16 model
  • Inference Speed: typically faster, since INT8 halves weight memory traffic and can use INT8 tensor cores
  • Accuracy: minimal accuracy loss is typical for W8A8 quantization, though no benchmarks are reported for this checkpoint

Hardware Requirements

This quantized model has lower hardware requirements than the original:

  • Memory: approximately half the GPU memory of the BF16 original
  • Compute: Compatible with INT8 tensor operations
  • Recommended: GPUs with tensor core support for optimal INT8 performance
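
As a rough check of the memory claim: 32.8B parameters occupy about 66 GB in BF16 but about 33 GB at one byte per weight, before KV cache and activations. Assuming the checkpoint is in LLM-Compressor's compressed-tensors format, vLLM loads it directly; a minimal serving sketch (repo id taken from this repository):

from vllm import LLM, SamplingParams

# vLLM picks up the quantization config embedded in the checkpoint
llm = LLM(model="ramblingpolymath/Qwen3-32B-W8A8")
params = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0, max_tokens=1024)
out = llm.generate(["Explain W8A8 quantization in one paragraph."], params)
print(out[0].outputs[0].text)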

All Original Features Preserved

This quantized model retains all the capabilities of the original Qwen3-32B:

  • Thinking Mode Support: Seamless switching between thinking and non-thinking modes
  • Enhanced Reasoning: Superior performance in mathematics, code generation, and logical reasoning
  • Multilingual Support: 100+ languages and dialects
  • Agent Capabilities: Tool calling and external integration
  • Long Context: Native 32,768 token support, extensible to 131,072 with YaRN
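
For contexts beyond 32,768 tokens, the original Qwen3 card enables YaRN with a rope_scaling factor of 4.0 (32,768 × 4 = 131,072). A sketch of the same setting in vLLM; passing rope_scaling as a Python keyword (mirroring vLLM's --rope-scaling flag) is an assumption here:

from vllm import LLM

llm = LLM(
    model="ramblingpolymath/Qwen3-32B-W8A8",
    rope_scaling={"rope_type": "yarn", "factor": 4.0,
                  "original_max_position_embeddings": 32768},
    max_model_len=131072,
)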

Switching Between Thinking and Non-Thinking Mode

The quantized model supports the same thinking mode controls as the original:

enable_thinking=True (Default)

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ramblingpolymath/Qwen3-32B-W8A8")
messages = [{"role": "user", "content": "Give me a short introduction to large language models."}]

# Thinking mode (default): the model emits a <think>...</think> block before its answer
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True
)

enable_thinking=False

# Non-thinking mode: the model responds directly, with no <think> block
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False
)
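
A minimal end-to-end sketch continuing from the template call above. It follows the parsing approach from the original Qwen3 model card, where token id 151668 is </think>; loading the INT8 checkpoint through transformers assumes the compressed-tensors package is installed:

from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "ramblingpolymath/Qwen3-32B-W8A8", torch_dtype="auto", device_map="auto"
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32768)[0][len(inputs.input_ids[0]):].tolist()

# Split the reasoning from the answer on the </think> token (id 151668)
try:
    index = len(output_ids) - output_ids[::-1].index(151668)
except ValueError:
    index = 0  # no </think> present (e.g. enable_thinking=False)
thinking_content = tokenizer.decode(output_ids[:index], skip_special_tokens=True).strip()
content = tokenizer.decode(output_ids[index:], skip_special_tokens=True).strip()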

Best Practices

Follow the same best practices as the original model:

  1. Sampling Parameters (shown as code after this list):

    • Thinking mode: Temperature=0.6, TopP=0.95, TopK=20, MinP=0
    • Non-thinking mode: Temperature=0.7, TopP=0.8, TopK=20, MinP=0
  2. Output Length: Use 32,768 tokens for most queries, 38,912 for complex problems

  3. Avoid Greedy Decoding: Do not use greedy decoding in thinking mode, as it can lead to performance degradation and endless repetition
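
The two sampling presets expressed as vLLM SamplingParams (values copied from the list above; the variable names are illustrative):

from vllm import SamplingParams

thinking = SamplingParams(temperature=0.6, top_p=0.95, top_k=20, min_p=0.0,
                          max_tokens=32768)
non_thinking = SamplingParams(temperature=0.7, top_p=0.8, top_k=20, min_p=0.0,
                              max_tokens=32768)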

Original Model Information

For complete documentation, benchmarks, and detailed usage instructions, please refer to the original Qwen3-32B model card.

Key Specifications (from original model):

  • Type: Causal Language Models
  • Parameters: 32.8B total, 31.2B non-embedding
  • Layers: 64
  • Attention Heads: 64 for Q, 8 for KV (GQA)
  • Context Length: 32,768 tokens natively, 131,072 with YaRN

Citation

If you use this quantized model, please cite the original Qwen3 work.

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report}, 
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388}, 
}

Disclaimer

This is an unofficial quantized version. For the official model and support, please refer to the original Qwen3-32B repository.
