Update README.md
The model also uses sliding window attention (SWA). llama.cpp b5554 or newer is recommended for SWA support.
If the --swa-full flag is passed, the old method is used instead: all KV memory is kept and everything outside the SWA window is masked out.
When using SWA, prompt cache capability is lost, but the available context is greatly increased (around 5.5x bigger). A KV
cache of ~55k tokens is available on a 12 GB VRAM GPU with SWA and a Gemma 3 1B speculator loaded, or ~72k tokens with no speculator loaded.
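
To put the "around 5.5x bigger" figure in context, the sketch below does the back-of-the-envelope KV cache arithmetic. Everything in it (the `kv_cache_bytes` helper, the layer count, KV heads, head size, window length, and the 5:1 local-to-global layer split) is an illustrative assumption rather than a value read from this model's GGUF metadata; the point is simply that sliding-window layers only ever hold a window's worth of KV, so the full context length is paid only by the global layers, and with these placeholder numbers the full-cache-to-SWA ratio at a 55k context lands close to 5.5x.

```python
# Rough KV cache sizing: --swa-full (every layer keeps the full context)
# versus SWA (sliding-window layers keep only the window).
# All hyperparameters below are illustrative placeholders, not this
# model's actual GGUF metadata.

def kv_cache_bytes(tokens_global, tokens_local, n_global, n_local,
                   n_kv_heads=8, head_dim=256, bytes_per_elem=2):  # f16 = 2 bytes/element
    """Total size in bytes of the K and V tensors across all layers."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token_per_layer * (n_global * tokens_global + n_local * tokens_local)

ctx = 55_000                 # target context length, in tokens
window = 1_024               # assumed sliding-window length
n_layers = 48                # assumed total layer count
n_global = n_layers // 6     # assumed 5:1 local:global interleave
n_local = n_layers - n_global

full = kv_cache_bytes(ctx, ctx, n_global, n_local)    # --swa-full: full context everywhere
swa = kv_cache_bytes(ctx, window, n_global, n_local)  # SWA: window-only on local layers

print(f"--swa-full: {full / 1e9:.1f} GB")  # ~21.6 GB with these placeholder numbers
print(f"SWA:        {swa / 1e9:.1f} GB")   # ~3.9 GB
print(f"ratio:      {full / swa:.1f}x")    # ~5.5x
```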

There is a problem with the q8_0 KV cache format where some heavy computations are pushed to the CPU, and prompt processing and token
generation become unusably slow. This does not happen with the f16 KV cache, so it is recommended to stay with f16 KV unless/until this problem gets resolved.
Related discussion: https://github.com/ggml-org/llama.cpp/issues/13747.

## Download the file from the links below:
| Link | Type | Size (10^9 bytes) | Notes |