WIP: I will make sure it works and explain how to use it.

I am working on Jetson devices.

You need my patched version of MLC to run this model, since it needs AWQ support for Qwen3 (see my patch).

If you have a Jetson Orin AGX

Use corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 from Docker Hub. I verified it works in my case:

docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
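
Once the container is up, you can sanity-check it with a quick request. This is a minimal sketch assuming the server exposes MLC's usual OpenAI-compatible /v1/chat/completions endpoint on the mapped port; the model name may need to match whatever the server reports under /v1/models.

curl http://localhost:9000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC",
    "messages": [{"role": "user", "content": "Hello, who are you?"}],
    "max_tokens": 256
  }'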

I also tried speculative decoding via:

docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e DOCKER_PULL=always --pull always \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
    sudonim serve \
      --model jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0 \
      --quantization q4f16_0 \
      --chat-template deepseek_r1_qwen \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000

Then fix the MLC config it created (mlc-chat-config.json):
  "context_window_size": 131072,
  ...
  "stop_token_ids": [0, 1],
  ...
  "pad_token_id": 2,
  "bos_token_id": 0,
  "eos_token_id": 1

docker run -it --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
  mlc_llm serve --mode interactive --device cuda \
  --host 0.0.0.0 --port 9000 --overrides='gpu_memory_utilization=0.90' \
  --model-lib /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu128-sm87.so \
  /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
  --additional-models /root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC,/root/.cache/mlc_llm/jukofyork/DeepSeek-R1-0528-CODER-DRAFT-0.6B-v1.0-q4f16_0-MLC/aarch64-cu128-sm87.so \
  --speculative-mode small_draft
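
To check that the server came up with the combined setup, you can query the model list first (again assuming the standard OpenAI-compatible endpoints):

curl http://localhost:9000/v1/models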

But its overhead was bigger than its yield: token generation was sometimes faster and sometimes slower (maybe a 20-50% hit rate on speculative output; I didn't really record this ratio), and on average it was about the same speed, or perhaps 1% faster.

If you have a Jetson Xavier AGX

Use corupta/mlc:0.20.0-r35.6.1-cp312-cu124-22.04 from Docker Hub:

docker run -dit --rm \
  --name llm_server \
  --gpus all \
  -p 9000:9000 \
  -e HF_HUB_CACHE=/root/.cache/huggingface \
  -v /mnt/nvme/cache:/root/.cache \
  corupta/mlc:0.20.0-r35.6.1-cp312-cu124-22.04 \
    sudonim serve \
      --model corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC \
      --quantization q4f16_awq \
      --max-batch-size 1 \
      --host 0.0.0.0 \
      --port 9000
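
On Xavier especially, downloading the weights and compiling the model can take a while, so it is worth tailing the container logs (names as used in the command above) before sending any requests:

docker logs -f llm_server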

The JetPack 5 image is built with corupta/jetson-containers-jp5.

When running the model, you might need to tweak prefill_chunk in sudonim or prefill_chunk_size in mlc-llm to fit the model within your memory constraints.
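
For example, with plain mlc_llm serve the prefill chunk can be lowered through --overrides. A sketch based on the Orin command above, where 2048 is just an illustrative value (adjust the image tag and model-lib paths for your device):

docker run -it --rm --gpus all -v /mnt/nvme/cache:/root/.cache -p 9000:9000 \
  corupta/mlc:0.20.0-r36.4-cp312-cu128-24.04 \
  mlc_llm serve --mode interactive --device cuda \
  --host 0.0.0.0 --port 9000 \
  --overrides='prefill_chunk_size=2048;gpu_memory_utilization=0.90' \
  --model-lib /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC/aarch64-cu128-sm87.so \
  /root/.cache/mlc_llm/corupta/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC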

The model is based on Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc and was converted via the commands below (along with manual modifications to mlc-chat-config.json):

mlc_llm gen_config $LOCAL_MODEL_PATH \
  --quantization $QUANTIZATION \
  --conv-template $CONV_TEMPLATE \
  -o $MLC_MODEL_PATH
mlc_llm convert_weight $LOCAL_MODEL_PATH  \
  --quantization $QUANTIZATION \
  -o $MLC_MODEL_PATH \
  --source-format awq \
  --source $LOCAL_MODEL_PATH/model.safetensors.index.json
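
For reference, a plausible set of values for the variables above: the paths are purely illustrative, QUANTIZATION matches the serve commands, and CONV_TEMPLATE is my assumption based on the deepseek_r1_qwen chat template used for the draft model earlier.

LOCAL_MODEL_PATH=/mnt/nvme/models/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-awq-inc   # illustrative path to the Intel source model
QUANTIZATION=q4f16_awq
CONV_TEMPLATE=deepseek_r1_qwen   # assumption, not confirmed by the conversion notes
MLC_MODEL_PATH=/mnt/nvme/models/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound-inc-q4f16_awq-MLC   # illustrative output path
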
This is an int4 quantization of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B with group_size 128 and symmetric quantization, generated by the intel/auto-round algorithm.

Please follow the license of the original model.