Is this behavior normal?

#4
by ElvisM - opened

Seems that Gemma 3 just consumes more VRAM than other 12B models. For example, I can load Mistral Nemo at 16k context length and it fits nicely on my 16 GB Nvidia card, still leaving space for more context if needed. Meanwhile, Gemma 3 at the same precision goes slightly over.
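For reference, this is roughly how the load looks via llama-cpp-python (a minimal sketch; the model filename is a placeholder, and the constructor arguments assume a recent llama-cpp-python build):

```python
from llama_cpp import Llama

# Placeholder GGUF filename -- substitute your actual quant.
llm = Llama(
    model_path="gemma-3-12b-it-Q4_K_M.gguf",
    n_ctx=16384,      # 16k context, same as the Mistral Nemo comparison
    n_gpu_layers=-1,  # try to offload every layer to the GPU
)
print(llm.create_completion("Hello", max_tokens=8)["choices"][0]["text"])
```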

But the weird thing is that this wasn't what happened initially. It only started after I tried to load the model a second time, after turning my PC off and back on.

These quants are slightly larger ("MAX") than comparable quants of models this size.
That being said, I find there is a lot more "processing" going on in the Gemmas (Gemma 2/3) than in other models.
This might show up as using more VRAM than, say, a Llama or Mistral.

So I asked around, and some people have said that llama.cpp is still a little buggy with Gemma. Also, if you quantize the KV cache, it seems to run primarily on the CPU rather than the GPU, making it a lot slower. I can run Q4_K_M to reduce VRAM usage, but it seems we're going to have to wait until they iron out these bugs.
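For anyone trying to reproduce the cache-quantization slowdown, these are the knobs involved (again a sketch assuming a recent llama-cpp-python; the `flash_attn` flag and `GGML_TYPE_*` constants may vary across versions, so check your install):

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-12b-it-Q4_K_M.gguf",  # placeholder filename
    n_ctx=16384,
    n_gpu_layers=-1,
    flash_attn=True,                  # quantized V cache requires flash attention
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize the K cache to q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # quantize the V cache to q8_0
)
```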

Thanks for the update.

There have been several llama.cpp bug fixes for Gemma over the past few days. Make sure you are using the latest.

https://github.com/ggml-org/llama.cpp/releases

(Though, that's not to say your issue isn't still among the bugs.)

They have definitely helped, but context still takes up too much VRAM compared to other models. It almost feels like the KV cache is FP32 rather than FP16. It works better with LM Studio than on Oobabooga, which runs an older version of llama.cpp.
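As a sanity check on the "feels like FP32" hunch, here's a back-of-the-envelope cache estimate. The layer/head/dim numbers are my assumptions for Gemma 3 12B and Mistral Nemo (check each model's config.json), and this naive formula charges every layer for the full context, ignoring any sliding-window savings:

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each
# n_kv_heads * head_dim wide, one entry per context position.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

GIB = 1024 ** 3
n_ctx = 16384

# Assumed dims: Gemma 3 12B ~ 48 layers, 8 KV heads, head_dim 256;
# Mistral Nemo ~ 40 layers, 8 KV heads, head_dim 128.
for name, layers, heads, dim in [("Gemma 3 12B", 48, 8, 256),
                                 ("Mistral Nemo", 40, 8, 128)]:
    fp16 = kv_cache_bytes(layers, heads, dim, n_ctx, 2)
    fp32 = kv_cache_bytes(layers, heads, dim, n_ctx, 4)
    print(f"{name}: fp16 ~{fp16 / GIB:.1f} GiB, fp32 ~{fp32 / GIB:.1f} GiB")
# Gemma 3 12B:  fp16 ~6.0 GiB, fp32 ~12.0 GiB
# Mistral Nemo: fp16 ~2.5 GiB, fp32 ~5.0 GiB
```

If those assumed dims are right, Gemma's cache is already much wider per token than Nemo's even at FP16, before any bug enters the picture.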
