WIP, I will make sure it works and explain how to use it. I am working on Jetson devices.
You need my patched version of MLC to run this model (it needs AWQ support for Qwen3); see my patch.
If you have a Jetson Orin AGX
Use corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 from Docker Hub. Verified it works in my case.
docker run -dit --rm \
--name llm_server \
--gpus all \
-p 9000:9000 \
-e DOCKER_PULL=always --pull always \
-e HF_HUB_CACHE=/root/.cache/huggingface \
-v /mnt/nvme/cache:/root/.cache \
corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
sudonim serve \
--model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
--quantization q4f16_awq \
--max-batch-size 1 \
--host 0.0.0.0 \
--port 9000
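Once the server is up, you can test it with a standard OpenAI-style request (a sketch; the value of "model" is an assumption on my part, check GET /v1/models for the id the server actually registered):
curl http://localhost:9000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC",
"messages": [{"role": "user", "content": "Write a hello world in Python."}],
"max_tokens": 256
}'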
I also tried speculative decoding via:
docker run -dit --rm \
--name llm_server \
--gpus all \
-p 9000:9000 \
-e DOCKER_PULL=always --pull always \
-e HF_HUB_CACHE=/root/.cache/huggingface \
-v /mnt/nvme/cache:/root/.cache \
corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
sudonim serve \
--model jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0 \
--quantization q4f16_0 \
--chat-template deepseek_r1_qwen \
--max-batch-size 1 \
--host 0.0.0.0 \
--port 9000
Then fix the generated mlc config: "context_window_size": 131072, ... "stop_token_ids": [0, 1], ... "pad_token_id": 2, "bos_token_id": 0, "eos_token_id": 1
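This edit can be scripted, for example with jq (a sketch; the config path is an assumption based on the cache layout used in the commands here, and inside the container /root/.cache maps to /mnt/nvme/cache on the host):
CFG=/root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC/mlc-chat-config.json  # assumed path
jq '.context_window_size = 131072 | .stop_token_ids = [0, 1] | .pad_token_id = 2 | .bos_token_id = 0 | .eos_token_id = 1' "$CFG" > "$CFG.tmp" && mv "$CFG.tmp" "$CFG"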
docker run -it --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
mlc_llm serve --mode interactive --device cuda \
--host 0.0.0.0 --port 9000 --overrides='gpu_memory_utilization=0.90' \
--model-lib /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu128-sm87.so \
/root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
--additional-models /root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC,/root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC/aarch64-cu128-sm87.so \
--speculative-mode small_draft
But its overhead was bigger than its benefit: it generated tokens sometimes faster and sometimes slower (maybe a 20-50% hit rate on the speculative output; I didn't really record this ratio), and on average it came out at the same speed, or perhaps 1% faster.
If you have a Jetson Xavier AGX
Use corupta/mlc:0.20.0-r35.6.1-cp312-cu124-22.04 from Docker Hub.
docker run -dit --rm \
--name llm_server \
--gpus all \
-p 9000:9000 \
-e HF_HUB_CACHE=/root/.cache/huggingface \
-v /mnt/nvme/cache:/root/.cache \
corupta/mlc:0.20.0-r35.6.1-cp312-cu124-22.04 \
sudonim serve \
--model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
--quantization q4f16_awq \
--max-batch-size 1 \
--host 0.0.0.0 \
--port 9000
The JetPack 5 image is built with corupta/jetson-containers-jp5.
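A rough sketch of building it yourself, assuming the fork keeps the upstream jetson-containers CLI and package layout (the URL and the package name are assumptions on my part):
git clone https://github.com/corupta/jetson-containers-jp5  # assumed URL for the fork
cd jetson-containers-jp5
bash install.sh  # sets up the jetson-containers CLI (upstream convention)
jetson-containers build mlc  # package name assumed to match upstream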
When running the model, you might need to tweak prefill_chunk in sudonim, or prefill_chunk_size in mlc-llm, to fit the model within your memory constraints.
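For example, with mlc_llm serve this can go through the same --overrides mechanism used above (the value is illustrative, not tuned; the "..." stands for the rest of the serve arguments shown earlier):
mlc_llm serve ... --overrides='gpu_memory_utilization=0.90;prefill_chunk_size=2048'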
The model is based on Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc and was converted via the commands below (along with manual modifications to mlc-chat-config.json):
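For reference, the variables might be filled in roughly like this (the local paths and the conv template are assumptions on my part; the quantization follows the repo name):
LOCAL_MODEL_PATH=./DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc  # assumed local download of the Intel AWQ checkpoint
MLC_MODEL_PATH=./DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC  # assumed output directory
QUANTIZATION=q4f16_awq
CONV_TEMPLATE=deepseek_r1_qwen  # assumed; matches the chat template used for the draft model above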
mlc_llm gen_config $LOCAL_MODEL_PATH \
--quantization $QUANTIZATION \
--conv-template $CONV_TEMPLATE \
-o $MLC_MODEL_PATH
mlc_llm convert_weight $LOCAL_MODEL_PATH \
--quantization $QUANTIZATION \
-o $MLC_MODEL_PATH \
--source-format awq \
--source $LOCAL_MODEL_PATH/model.safetensors.index.json
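The aarch64-cu128-sm87.so model library referenced in the serve commands above comes from a separate compile step, roughly like this (a sketch; the output filename only needs to match what you later pass to --model-lib):
mlc_llm compile $MLC_MODEL_PATH/mlc-chat-config.json \
--device cuda \
-o $MLC_MODEL_PATH/aarch64-cu128-sm87.so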
This is an int4 model of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B with group_size 128 and symmetric quantization, generated by the intel/auto-round algorithm.
Please follow the license of the original model.