steampunque committed
Commit 5e65ae9 · verified · 1 Parent(s): 03b386a

Update README.md

Files changed (1):
  1. README.md +4 -4
README.md CHANGED
@@ -52,10 +52,10 @@ the docs in the mtmd readme in the tools directory of the source tree https://gi
  The model also uses sliding window attention. Use of llama.cpp b5554 and above is recommended for support of the SWA mode.
  If the --swa-full flag is used, the old method of keeping all KV memory and masking out everything outside the SWA window is used.
  When using SWA, prompt cache capability is lost but the available context is greatly increased (around 5.5x bigger). A KV
- cache of ~55k tokens is available on a 12G VRAM GPU with SWA. There is a problem when using q8_0 KV cache format where
- some heavy computations are being pushed to CPU and prompt processing and token gen become unusably slow. This does not happen
- with f16 kV so it is recommended to stay with f16 kv until/ if this problem gets resolved. Related discussion in
- https://github.com/ggml-org/llama.cpp/issues/13747.
+ cache of ~55k tokens is available on a 12G VRAM GPU with SWA and a Gemma 3 1B speculator loaded, or ~72k tokens with no speculator loaded.
+ There is a problem when using the q8_0 KV cache format where some heavy computations are pushed to the CPU and prompt processing and token
+ generation become unusably slow. This does not happen with f16 KV, so it is recommended to stay with f16 KV until/if this problem gets resolved.
+ Related discussion: https://github.com/ggml-org/llama.cpp/issues/13747.
 
  ## Download the file from below:
  | Link | Type | Size/e9 B | Notes |
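The SWA and KV-cache advice in the diff translates into llama.cpp server flags roughly as sketched below. The GGUF filename and the exact context size are assumptions for illustration, not values taken from this commit.

```shell
# Sketch of a llama-server launch following the README's recommendations;
# the model filename and -c value below are assumptions for illustration.
#
#   -c 55296       ~55k-token context reported to fit in 12G VRAM with SWA
#   -ngl 99        offload all layers to the GPU
#   -ctk/-ctv f16  keep the f16 KV cache; q8_0 currently pushes heavy
#                  computation to the CPU and becomes unusably slow
#
llama-server -m ./model-Q6_K.gguf -c 55296 -ngl 99 -ctk f16 -ctv f16
# Add --swa-full only if you want the old behavior of keeping all KV
# memory and masking outside the SWA window (this restores prompt
# caching, but gives up the ~5.5x larger context).
```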