KV caching appears nonfunctional with Agatha and Upstream Cohere model — anyone else seeing this?

#1
by Rednero - opened

I've been testing the Agatha-111B models (including both iMatrix and non-iMatrix versions), as well as the upstream c4ai-command-a-03-2025, and I'm seeing behavior that suggests KV caching isn't working.

With other GGUF models like LLaMA and Mistral, the system reuses past tokens during generation, and performance stays consistent. With Agatha and Cohere, each new token appears to reprocess the entire prompt. Generation slows down significantly as the conversation grows. Even when continuing a previous response, it behaves as if it's starting from scratch.

One thing I noticed is that the system never accumulates past tokens like it should—n_past stays fixed at 256 regardless of prompt length.

This happens across quant types, regardless of flags, and only with this model family.

Update llama.cpp to the latest version and use --swa-full to cache long prompts. Without --swa-full, caching works only for short prompts.
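
For anyone on an older build, the update-and-relaunch I have in mind looks roughly like this (CUDA build assumed; the model path and context size are placeholders, adjust to your setup):

git pull                                  # inside an existing llama.cpp checkout
cmake -B build -DGGML_CUDA=ON             # drop the CUDA flag for CPU-only builds
cmake --build build --config Release -j
# --swa-full keeps the full KV cache for sliding-window-attention models,
# at the cost of extra VRAM/RAM
./build/bin/llama-cli -m /path/to/agatha.gguf -c 16384 -ngl 99 --swa-full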

This is a known bug with llama.cpp. I suggest you use koboldCPP, which is known to have its KV-caching implementation working with Cohere models without further configuration. Alternatively you can try updating llama.cpp and enabling --swa-full as the above comment suggests.

There was discussion about some bugs and related commits in llama.cpp recently, but the cache not working without --swa-full is not a bug. With models that use SWA (sliding-window attention), you need --swa-full for the cache to work with longer prompts. It takes more memory in exchange, and KoboldCpp is probably doing that by default.

I can confirm that latest llama.cpp works fine on my end with --swa-full.

By a bug, I meant that --swa-full was non-functional for Command A in older versions of llama.cpp, which is unexpected behaviour. This was confirmed on the BeaverAI server where a member could not get it to work even with the toggle enabled until they updated.

Okay, that makes sense! The latest version looks to be working without issues.

Thanks for the quick and accurate responses, especially on such a niche issue.

I'm running the model through Text Generation Web UI (Oobabooga) and have taken the following steps:

  1. Manually updated llama.cpp to the latest build (mid-June 2025).
  2. Reinstalled llama-cpp-python from source inside the virtual environment using --no-binary :all: after confirming Cython, Ninja, and CMake were present (exact commands below).
  3. Verified llama_cpp.__version__ returns 0.3.9.
  4. Passed --swa-full in the extra flags field and confirmed it appears in the startup log.
  5. Tested with --flash-attn as well—also properly parsed.
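
For reference, steps 2 and 3 were roughly the following (the CMAKE_ARGS value is just my guess at the right backend flag for a CUDA build; adjust or drop it for your setup):

CMAKE_ARGS="-DGGML_CUDA=on" pip install --force-reinstall --no-cache-dir --no-binary :all: llama-cpp-python
python -c "import llama_cpp; print(llama_cpp.__version__)"   # prints 0.3.9 here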

Despite all this, n_past remains locked at 256 and the prompt is fully reprocessed on each generation. No behavioral difference between iMatrix and non-iMatrix variants. Other models cache correctly. Agatha does not.

Anything else I should try before abandoning Oobabooga for Agatha?

I looked through the llama.cpp GitHub and I think https://github.com/ggml-org/llama.cpp/pull/14163 might be the update that fixed the bug. It was fixed just three days ago, but the fix was part of llama-server. I am not sure exactly how Oobabooga interfaces with llama.cpp, but it probably does not make use of the server.

You could use the llama-server binary that ships with llama.cpp directly. I run it like this:

llama-server -m TheDrummer_Agatha-111B-v1-IQ4_XS-00001-of-00002.gguf -c 32768 -fa --swa-full -ctk q8_0 -ctv q8_0 -ngl 999

That starts an OpenAI-compatible endpoint and a simple web interface.
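
If you want to point another frontend at it instead of the built-in UI, the OpenAI-style route is the usual chat completions one, something like this (default port 8080 assumed, prompt is just a placeholder):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"messages": [{"role": "user", "content": "Hello, who are you?"}], "max_tokens": 64}'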

Another option would be to use KoboldCpp (https://github.com/LostRuins/koboldcpp) like Geechan suggested, which might be more user-friendly? I am not that familiar with it.
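
I have not run it myself, but going by its README the launch should be something along these lines (flag names are my best recollection, so double-check them there):

python koboldcpp.py --model TheDrummer_Agatha-111B-v1-IQ4_XS-00001-of-00002.gguf --contextsize 32768 --gpulayers 999 --usecublas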

Thank you again!

Rednero changed discussion status to closed
