Unable to run in ollama due to error

#3
by Khawn2u - opened

I ran ollama pull hf.co/bartowski/nvidia_Llama-3_3-Nemotron-Super-49B-v1-GGUF:Q4_K_M to download and install the model in ollama. When I then try to use the model, it crashes ollama with the following error:

panic: interface conversion: interface {} is *ggml.array, not uint32

goroutine 27 [running]:
github.com/ollama/ollama/fs/ggml.keyValue[...](0xc00010a570, {0x7ff67b6bf1a3, 0x14}, {0xc000624548, 0x1, 0x7ff67a55e960})
C:/a/ollama/ollama/fs/ggml/ggml.go:146 +0x2de
github.com/ollama/ollama/fs/ggml.KV.Uint(...)
C:/a/ollama/ollama/fs/ggml/ggml.go:96
github.com/ollama/ollama/fs/ggml.KV.HeadCount(...)
C:/a/ollama/ollama/fs/ggml/ggml.go:56
github.com/ollama/ollama/fs/ggml.GGML.GraphSize({{0x7ff67b874828?, 0xc000726000?}, {0x7ff67b8747d8?, 0xc00018d808?}}, 0x20000, 0x200, {0x0, 0x0})
C:/a/ollama/ollama/fs/ggml/ggml.go:418 +0x137
github.com/ollama/ollama/llm.EstimateGPULayers({_, _, _}, , {, _, _}, {{0x20000, 0x200, 0xffffffffffffffff, ...}, ...})
C:/a/ollama/ollama/llm/memory.go:140 +0x659
github.com/ollama/ollama/llm.PredictServerFit({0xc00004bba8?, 0x7ff67a540f2e?, 0xc00004b8c0?}, 0xc000350060, {0xc00004b908?, _, _}, {0x0, 0x0, 0x0}, ...)
C:/a/ollama/ollama/llm/memory.go:23 +0xbd
github.com/ollama/ollama/server.pickBestFullFitByLibrary(0xc000570000, 0xc000350060, {0xc000160600?, 0x2?, 0x2?}, 0xc00004bcf8)
C:/a/ollama/ollama/server/sched.go:714 +0x6f3
github.com/ollama/ollama/server.(*Scheduler).processPending(0xc00009a8a0, {0x7ff67b878800, 0xc000726ff0})
C:/a/ollama/ollama/server/sched.go:226 +0xe6b
github.com/ollama/ollama/server.(*Scheduler).Run.func1()
C:/a/ollama/ollama/server/sched.go:108 +0x1f
created by github.com/ollama/ollama/server.(*Scheduler).Run in goroutine 1
C:/a/ollama/ollama/server/sched.go:107 +0xb1

There was an update to ollama, but it did not fix or change the error at all. I have also tried the other quants from DevQuasar; they have the same issue.

It also does not work in LM Studio 3.13.2 (latest).

The LM Studio bug, I think, is only because of the chat template.

For ollama that's trickier; I'd have to ping @ollama or maybe @reach-vb.

Sorry didn't see this when you initially posted!

Looking into it

The code that causes the crash is here: https://github.com/ollama/ollama/blob/main/fs/ggml/ggml.go#L55
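Roughly, the metadata lookup there asserts the stored value straight to the type it expects, which is what produces the "interface conversion" panic when the value is actually an array. A minimal sketch of that failure mode (not ollama's actual code; the key name and values are illustrative):

```go
package main

// Minimal sketch of the failure mode; not ollama's actual code.
// GGUF metadata is decoded into map[string]any, so reading a key means
// asserting the stored value back to a concrete type.
func headCount(kv map[string]any) uint32 {
	// Unchecked assertion: fine when the value really is a uint32, but it
	// panics ("interface conversion: interface {} is ..., not uint32")
	// when the model stores a per-layer array instead.
	return kv["llama.attention.head_count"].(uint32)
}

func main() {
	kv := map[string]any{
		// Hypothetical metadata where the head count is stored per layer (array form).
		"llama.attention.head_count": []uint32{64, 64, 40, 40},
	}
	_ = headCount(kv) // panics here
}
```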

It seems to be part of the Go code that determines how many layers should be offloaded to the GPU. The problem is that in llama.cpp we support two possible types for HeadCount:

  • A number, meaning all layers in the model have the same number of heads
  • An array, meaning each layer in the model can have a different number of heads

The problem is that ollama only supports the first option for now, while llama.cpp supports both. I think we should open an issue on ollama.
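To support both forms, the read would have to branch on the stored type and, in the array case, expose a per-layer value (or whatever the offload estimator actually needs). A rough sketch of the idea in Go, purely as an illustration, with []uint32 standing in for ollama's internal array type and the error handling invented for the example:

```go
package main

import "fmt"

// headCounts returns one attention-head count per layer, accepting either
// the scalar form (same count for every layer) or the per-layer array form.
// Sketch only: []uint32 stands in for ollama's internal array type.
func headCounts(v any, numLayers int) ([]uint32, error) {
	switch t := v.(type) {
	case uint32:
		// Scalar form: broadcast the same count to every layer.
		counts := make([]uint32, numLayers)
		for i := range counts {
			counts[i] = t
		}
		return counts, nil
	case []uint32:
		// Array form: one entry per layer.
		if len(t) != numLayers {
			return nil, fmt.Errorf("head_count array has %d entries, want %d", len(t), numLayers)
		}
		return t, nil
	default:
		return nil, fmt.Errorf("unsupported head_count type %T", v)
	}
}

func main() {
	// Scalar form: every layer has 64 heads.
	fmt.Println(headCounts(uint32(64), 4))
	// Array form (illustrative values): layers can differ.
	fmt.Println(headCounts([]uint32{64, 64, 40, 40}, 4))
}
```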

cc: @ollama for vis too 🤗
