Is this behavior normal?

#4
by ElvisM - opened

Seems that Gemma 3 just consumes more VRAM than other 12B models. For example, I can load Mistral Nemo at 16k context length and it fits nicely on my 16 GB Nvidia card, still leaving space for more context if needed. Meanwhile, Gemma 3 at the same precision goes slightly over.
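For reference, this is roughly how the load looks via llama-cpp-python (a minimal sketch; the model filename is a placeholder, and the constructor arguments assume a recent llama-cpp-python build):

```python
from llama_cpp import Llama

# Placeholder GGUF filename -- substitute your actual quant.
llm = Llama(
    model_path="gemma-3-12b-it-Q4_K_M.gguf",
    n_ctx=16384,      # 16k context, same as the Mistral Nemo comparison
    n_gpu_layers=-1,  # try to offload every layer to the GPU
)
print(llm.create_completion("Hello", max_tokens=8)["choices"][0]["text"])
```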

But the weird thing is that this wasn't what happened initially. It only started after I tried to load the model a second time, after turning my PC off and back on.

These quants are slightly larger ("MAX") than comparable quants of models this size.
That being said, I find there is a lot more "processing" going on in the Gemmas (Gemma 2/3) than in other models.
This might show up as using more VRAM than, say, a Llama or Mistral.

So I asked around, and some people have said that llama.cpp is still a little buggy with Gemma. Also, if you quantize the KV cache, it seems to run primarily on the CPU rather than the GPU, making it a lot slower. I can run Q4_K_M to reduce VRAM usage, but it seems we're going to have to wait until they iron out these bugs.
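For anyone trying to reproduce the cache-quantization slowdown, these are the knobs involved (again a sketch assuming a recent llama-cpp-python; the `flash_attn` flag and `GGML_TYPE_*` constants may vary across versions, so check your install):

```python
import llama_cpp
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-12b-it-Q4_K_M.gguf",  # placeholder filename
    n_ctx=16384,
    n_gpu_layers=-1,
    flash_attn=True,                  # quantized V cache requires flash attention
    type_k=llama_cpp.GGML_TYPE_Q8_0,  # quantize the K cache to q8_0
    type_v=llama_cpp.GGML_TYPE_Q8_0,  # quantize the V cache to q8_0
)
```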

Thanks for the update.

There have been several llama.cpp bug fixes for Gemma over the past few days. Make sure you are using the latest.

https://github.com/ggml-org/llama.cpp/releases

(Though, that's not to say your issue isn't still among the bugs.)

They have definitely helped, but context still takes up too much VRAM compared to other models. It almost feels like the KV cache is FP32 rather than FP16. It works better with LM Studio than on Oobabooga, which runs an older version of llama.cpp.
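As a sanity check on the "feels like FP32" hunch, here's a back-of-the-envelope cache estimate. The layer/head/dim numbers are my assumptions for Gemma 3 12B and Mistral Nemo (check each model's config.json), and this naive formula charges every layer for the full context, ignoring any sliding-window savings:

```python
# Rough KV-cache size: 2 tensors (K and V) per layer, each
# n_kv_heads * head_dim wide, one entry per context position.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, bytes_per_elem):
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem

GIB = 1024 ** 3
n_ctx = 16384

# Assumed dims: Gemma 3 12B ~ 48 layers, 8 KV heads, head_dim 256;
# Mistral Nemo ~ 40 layers, 8 KV heads, head_dim 128.
for name, layers, heads, dim in [("Gemma 3 12B", 48, 8, 256),
                                 ("Mistral Nemo", 40, 8, 128)]:
    fp16 = kv_cache_bytes(layers, heads, dim, n_ctx, 2)
    fp32 = kv_cache_bytes(layers, heads, dim, n_ctx, 4)
    print(f"{name}: fp16 ~{fp16 / GIB:.1f} GiB, fp32 ~{fp32 / GIB:.1f} GiB")
# Gemma 3 12B:  fp16 ~6.0 GiB, fp32 ~12.0 GiB
# Mistral Nemo: fp16 ~2.5 GiB, fp32 ~5.0 GiB
```

If those assumed dims are right, Gemma's cache is already much wider per token than Nemo's even at FP16, before any bug enters the picture.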
