Update README.md
The model also uses sliding window attention (SWA). llama.cpp b5554 or newer is recommended for SWA support.
If the --swa-full flag is passed, the old method is used instead: all KV memory is kept and everything outside the SWA window is masked out.
When using SWA, prompt cache capability is lost, but the available context is greatly increased (around 5.5x bigger). A KV
cache of ~55k tokens is available on a 12 GB VRAM GPU with SWA and a Gemma 3 1B speculator loaded, or ~72k tokens with no speculator loaded.
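
To put the "around 5.5x bigger" figure in context, the sketch below does the back-of-the-envelope KV cache arithmetic. Everything in it (the `kv_cache_bytes` helper, the layer count, KV heads, head size, window length, and the 5:1 local-to-global layer split) is an illustrative assumption rather than a value read from this model's GGUF metadata; the point is simply that sliding-window layers only ever hold a window's worth of KV, so the full context length is paid only by the global layers, and with these placeholder numbers the full-cache-to-SWA ratio at a 55k context lands close to 5.5x.

```python
# Rough KV cache sizing: --swa-full (every layer keeps the full context)
# versus SWA (sliding-window layers keep only the window).
# All hyperparameters below are illustrative placeholders, not this
# model's actual GGUF metadata.

def kv_cache_bytes(tokens_global, tokens_local, n_global, n_local,
                   n_kv_heads=8, head_dim=256, bytes_per_elem=2):  # f16 = 2 bytes/element
    """Total size in bytes of the K and V tensors across all layers."""
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_elem  # K + V
    return per_token_per_layer * (n_global * tokens_global + n_local * tokens_local)

ctx = 55_000                 # target context length, in tokens
window = 1_024               # assumed sliding-window length
n_layers = 48                # assumed total layer count
n_global = n_layers // 6     # assumed 5:1 local:global interleave
n_local = n_layers - n_global

full = kv_cache_bytes(ctx, ctx, n_global, n_local)    # --swa-full: full context everywhere
swa = kv_cache_bytes(ctx, window, n_global, n_local)  # SWA: window-only on local layers

print(f"--swa-full: {full / 1e9:.1f} GB")  # ~21.6 GB with these placeholder numbers
print(f"SWA:        {swa / 1e9:.1f} GB")   # ~3.9 GB
print(f"ratio:      {full / swa:.1f}x")    # ~5.5x
```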

There is a problem with the q8_0 KV cache format where some heavy computations are pushed to the CPU, and prompt processing and token
generation become unusably slow. This does not happen with the f16 KV cache, so it is recommended to stay with f16 KV unless/until this problem gets resolved.
Related discussion: https://github.com/ggml-org/llama.cpp/issues/13747.

## Download the file from the links below:
| Link | Type | Size (10^9 bytes) | Notes |