---
license: apache-2.0
base_model: Qwen/Qwen3-14B
tags:
- mlx
- 3bit
- quantized
---

# Qwen3-14B 3bit MLX

This model is a 3-bit quantized version of [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) using MLX.

## Model Details

- **Quantization**: 3-bit (see the conversion sketch below)
- **Framework**: MLX
- **Base Model**: [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B)
- **Model Size**: ~5.25 GB (3-bit quantized)
- **Original Size**: 14B parameters
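
A quantization like this can be produced from the base weights with mlx-lm's `convert` command. The exact settings used for this upload (for example the quantization group size) are not recorded here, so treat the flags below as an illustrative sketch rather than the exact recipe:

```bash
# Sketch: convert and 3-bit-quantize the base model with mlx-lm.
# Group size and other options for this particular upload may differ.
mlx_lm.convert \
  --hf-path Qwen/Qwen3-14B \
  --mlx-path ./Qwen3-14B-3bit \
  -q --q-bits 3
```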

## Usage

```python
from mlx_lm import load, generate

# Download (if needed) and load the quantized weights and tokenizer
model, tokenizer = load("mlx-community/Qwen3-14B-3bit")

# Wrap the prompt in the chat template expected by Qwen3
prompt = "Hello, how are you?"
messages = [{"role": "user", "content": prompt}]
formatted_prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=formatted_prompt, max_tokens=100)
print(response)
```

## Performance on M1/M2/M3

- **Memory Usage**: ~7-8 GB
- **Inference Speed**: ~15-25 tokens/sec (M1 Max)
- **First Token Latency**: ~1-3 seconds
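
The numbers above depend on the specific chip, prompt length, and mlx-lm version. One way to check throughput and peak memory on your own machine is the command-line generator that ships with mlx-lm, which in recent versions prints generation statistics after the response:

```bash
# Generate once from the CLI and read the reported tokens-per-sec / peak memory.
mlx_lm.generate --model mlx-community/Qwen3-14B-3bit \
  --prompt "Hello, how are you?" \
  --max-tokens 100
```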

## Requirements

- Apple Silicon Mac (M1/M2/M3)
- macOS 13.0+
- Python 3.8+
- MLX and mlx-lm packages

## Installation

```bash
pip install mlx mlx-lm
```
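
To confirm the installation picked up the Metal backend, a quick sanity check (assuming a standard Python environment) is to ask MLX for its default device, which should report the GPU on Apple Silicon:

```bash
# Should print something like Device(gpu, 0) on Apple Silicon.
python -c "import mlx.core as mx; print(mx.default_device())"
```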

## Chat Mode

```bash
mlx_lm.chat --model mlx-community/Qwen3-14B-3bit
```
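
mlx-lm can also expose the model over an OpenAI-compatible HTTP API, which is handy for pointing existing chat clients at a local model. The command below reflects recent mlx-lm versions; check `mlx_lm.server --help` for the options and default port on your install:

```bash
# Serve an OpenAI-compatible API locally (default host/port is typically 127.0.0.1:8080).
mlx_lm.server --model mlx-community/Qwen3-14B-3bit
```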