Unable to run in ollama due to error
I ran `ollama pull hf.co/bartowski/nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF:Q4_K_M` to download and install the model in ollama. When I then try to use the model, it crashes ollama with the following error:
```
panic: interface conversion: interface {} is *ggml.array, not uint32

goroutine 27 [running]:
github.com/ollama/ollama/fs/ggml.keyValue[...](0xc00010a570, {0x7ff67b6bf1a3, 0x14}, {0xc000624548, 0x1, 0x7ff67a55e960})
	C:/a/ollama/ollama/fs/ggml/ggml.go:146 +0x2de
github.com/ollama/ollama/fs/ggml.KV.Uint(...)
	C:/a/ollama/ollama/fs/ggml/ggml.go:96
github.com/ollama/ollama/fs/ggml.KV.HeadCount(...)
	C:/a/ollama/ollama/fs/ggml/ggml.go:56
github.com/ollama/ollama/fs/ggml.GGML.GraphSize({{0x7ff67b874828?, 0xc000726000?}, {0x7ff67b8747d8?, 0xc00018d808?}}, 0x20000, 0x200, {0x0, 0x0})
	C:/a/ollama/ollama/fs/ggml/ggml.go:418 +0x137
github.com/ollama/ollama/llm.EstimateGPULayers({_, _, _}, _, {_, _, _}, {{0x20000, 0x200, 0xffffffffffffffff, ...}, ...})
	C:/a/ollama/ollama/llm/memory.go:140 +0x659
github.com/ollama/ollama/llm.PredictServerFit({0xc00004bba8?, 0x7ff67a540f2e?, 0xc00004b8c0?}, 0xc000350060, {0xc00004b908?, _, _}, {0x0, 0x0, 0x0}, ...)
	C:/a/ollama/ollama/llm/memory.go:23 +0xbd
github.com/ollama/ollama/server.pickBestFullFitByLibrary(0xc000570000, 0xc000350060, {0xc000160600?, 0x2?, 0x2?}, 0xc00004bcf8)
	C:/a/ollama/ollama/server/sched.go:714 +0x6f3
github.com/ollama/ollama/server.(*Scheduler).processPending(0xc00009a8a0, {0x7ff67b878800, 0xc000726ff0})
	C:/a/ollama/ollama/server/sched.go:226 +0xe6b
github.com/ollama/ollama/server.(*Scheduler).Run.func1()
	C:/a/ollama/ollama/server/sched.go:108 +0x1f
created by github.com/ollama/ollama/server.(*Scheduler).Run in goroutine 1
	C:/a/ollama/ollama/server/sched.go:107 +0xb1
```
There was an update to ollama, but it did not fix or change the error at all. I have also tried the other quants from DevQuasar; they have the same issue.
It also does not work in LM Studio 3.13.2 (latest).
Looking into it
The code that causes the crash is here: https://github.com/ollama/ollama/blob/main/fs/ggml/ggml.go#L55
It seems to be part of the Go code that determines how many layers should be offloaded to the GPU. The problem is that in llama.cpp, we support two possible types for `HeadCount`:
- A number, meaning all layers in the model have the same number of heads
- An array, meaning each layer in the model can have a different number of heads

Ollama only supports the first option for now, while llama.cpp supports both. I think we should open an issue on ollama.
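For illustration, here is a minimal Go sketch (not ollama's actual internals; the helper name and the sample values are made up). A bare type assertion like `v.(uint32)` panics when the metadata key holds an array, which is exactly the interface-conversion panic in the trace above, while a type switch can accept both the scalar and the per-layer form:

```go
package main

import "fmt"

// headCounts is a hypothetical helper showing how a GGUF metadata value for
// "<arch>.attention.head_count" could be read. llama.cpp accepts either a
// single integer (same head count for every layer) or an array (one entry per
// layer); a bare assertion such as v.(uint32) panics on the array form with an
// "interface conversion" error like the one in the trace above.
func headCounts(v any) ([]uint32, error) {
	switch x := v.(type) {
	case uint32:
		// Scalar form: every layer has the same number of heads.
		return []uint32{x}, nil
	case []uint32:
		// Array form: per-layer head counts (values used below are made up).
		return x, nil
	default:
		return nil, fmt.Errorf("unexpected head_count type %T", v)
	}
}

func main() {
	scalar, _ := headCounts(uint32(64))
	perLayer, _ := headCounts([]uint32{64, 64, 32, 8})
	fmt.Println(scalar, perLayer) // [64] [64 64 32 8]
}
```

This only shows the shape of the problem; a real fix in ollama would also need to carry the per-layer values through the memory estimation path (GraphSize / EstimateGPULayers in the trace) instead of a single number.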
Upstream issue: https://github.com/ollama/ollama/issues/9984