MI50 can't use this model

(base) mrguo6221@mrguo6221:~$ mlc_llm serve /home/mrguo6221/gemma-3-27b-int4 --mode server --overrides "tensor_parallel_shards=4" --host 0.0.0.0 --port 8001
[2025-04-22 01:09:22] INFO auto_device.py:90: Not found device: cuda:0
[2025-04-22 01:09:23] INFO auto_device.py:79: Found device: rocm:0
[2025-04-22 01:09:23] INFO auto_device.py:79: Found device: rocm:1
[2025-04-22 01:09:23] INFO auto_device.py:79: Found device: rocm:2
[2025-04-22 01:09:23] INFO auto_device.py:79: Found device: rocm:3
[2025-04-22 01:09:25] INFO auto_device.py:90: Not found device: metal:0
[2025-04-22 01:09:28] INFO auto_device.py:90: Not found device: vulkan:0
[2025-04-22 01:09:29] INFO auto_device.py:90: Not found device: opencl:0
[2025-04-22 01:09:31] INFO auto_device.py:79: Found device: cpu:0
[2025-04-22 01:09:31] INFO auto_device.py:35: Using device: rocm:0
[2025-04-22 01:09:31] INFO jit.py:43: MLC_JIT_POLICY = ON. Can be one of: ON, OFF, REDO, READONLY
[2025-04-22 01:09:31] INFO jit.py:158: Using cached model lib: /home/mrguo6221/.cache/mlc_llm/model_lib/5d2248ae87722ebdbfc590d042e3442b.so
[2025-04-22 01:09:31] INFO engine_base.py:192: The selected engine mode is server. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2025-04-22 01:09:31] INFO engine_base.py:200: If you have low concurrent requests and want to use less GPU memory, please select mode "local".
[2025-04-22 01:09:31] INFO engine_base.py:205: If you don't have concurrent requests and only use the engine interactively, please select mode "interactive".
[01:09:37] /workspace/mlc-llm/cpp/serve/config.cc:798: Under mode "local", max batch size will be set to 4, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 8192.
[01:09:37] /workspace/mlc-llm/cpp/serve/config.cc:798: Under mode "interactive", max batch size will be set to 1, max KV cache token capacity will be set to 8192, prefill chunk size will be set to 8192.
[01:09:37] /workspace/mlc-llm/cpp/serve/config.cc:798: Under mode "server", max batch size will be set to 128, max KV cache token capacity will be set to 183611, prefill chunk size will be set to 8192.
[01:09:37] /workspace/mlc-llm/cpp/serve/config.cc:879: The actual engine mode is "server". So max batch size is 128, max KV cache token capacity is 183611, prefill chunk size is 8192.
[01:09:37] /workspace/mlc-llm/cpp/serve/config.cc:884: Estimated total single GPU memory usage: 27839.143 MB (Parameters: 4191.776 MB. KVCache: 22330.937 MB. Temporary buffer: 1316.431 MB). The actual usage might be slightly larger than the estimated number.
[01:09:37] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #0] Loading model to device: rocm:0
[01:09:37] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #1] Loading model to device: rocm:1
[01:09:37] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #2] Loading model to device: rocm:2
[01:09:37] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:150: [Worker #3] Loading model to device: rocm:3
[01:09:38] /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:175: Loading parameters...
terminate called after throwing an instance of 'tvm::runtime::Error'
what(): TVMError: Assert fail: T.Cast("int32", _shard_k_norm_var_w_shape[0]) == 512, Argument _shard_k_norm.var_w.shape[0] has an unsatisfied constraint: 512 == T.Cast("int32", _shard_k_norm_var_w_shape[0])
Stack trace:
0: TVMThrowLastError.cold
1: tvm::runtime::PackedFuncObj::Extractor<tvm::runtime::PackedFuncSubObj<tvm::runtime::WrapPackedFunc(int (*)(TVMValue*, int*, int, TVMValue*, int*, void*), tvm::runtime::ObjectPtr<tvm::runtime::Object> const&)::{lambda(tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)#1}> >::Call(tvm::runtime::PackedFuncObj const*, tvm::runtime::TVMArgs, tvm::runtime::TVMRetValue*)
2: mlc::llm::multi_gpu::PreprocessorPool::Apply(tvm::runtime::NDArray, mlc::llm::ModelMetadata::Param const&) const
at /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:77
3: mlc::llm::multi_gpu::BroadcastOrShardAndScatter(tvm::runtime::NDArray, mlc::llm::ModelMetadata::Param const&, int, mlc::llm::multi_gpu::PreprocessorPool const&)
at /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:113
4: mlc::llm::multi_gpu::LoadMultiGPU(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, tvm::runtime::Module, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)
at /workspace/mlc-llm/cpp/multi_gpu/multi_gpu_loader.cc:193
5: tvm::runtime::DiscoWorker::Impl::CallPacked(tvm::runtime::DiscoWorker*, long, tvm::runtime::PackedFunc, tvm::runtime::TVMArgs const&)
6: tvm::runtime::DiscoWorker::Impl::MainLoop(tvm::runtime::DiscoWorker*)
7: execute_native_thread_routine
at ../../../../../libstdc++-v3/src/c++11/thread.cc:104
8: start_thread
at ./nptl/pthread_create.c:442
9: 0x00007fc09cff484f
at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:81
10: 0xffffffffffffffff
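
The crash happens while presharding weights for 4-way tensor parallelism: the preshard function _shard_k_norm asserts that the k_norm weight has 512 elements, and 512 happens to be 4 shards × 128 (Gemma-3's head dim), while the weight in this checkpoint apparently has a different shape. One unconfirmed guess: the cached model lib reused at jit.py:158 was compiled for a different tensor_parallel_shards value than the one passed via --overrides. A minimal thing to try, assuming the stale cache is the culprit (the log above lists REDO as a valid MLC_JIT_POLICY value, which should force a rebuild):

# Force the JIT to rebuild the model lib instead of reusing the cached .so
MLC_JIT_POLICY=REDO mlc_llm serve /home/mrguo6221/gemma-3-27b-int4 --mode server --overrides "tensor_parallel_shards=4" --host 0.0.0.0 --port 8001

# Equivalently, delete the cached lib named in the log and rerun the original command
rm /home/mrguo6221/.cache/mlc_llm/model_lib/5d2248ae87722ebdbfc590d042e3442b.so

If a freshly built lib still trips the same assert, the shape mismatch is more likely a bug in the Gemma-3 tensor-parallel sharding spec itself than anything MI50/ROCm-specific.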
