UD-Q3_K_XL sometimes gives gibberish when using it via API (SillyTavern). UD-Q4_K_XL works fine.
#5
by
Panchovix
- opened
Hi there, thanks for always doing these greats quants!
I was testing some small conversations via Sillytavern, but, sometimes, I get infinite GGGGGGGGGGGGGGGG or infinite "Blocky Blocky Blocky". Then, after getting the gibberish there, it maints it to the internal API server.
This is with the UD-Q3_K_XL loaded fully on VRAM (128GB total VRAM)
The model works fine on all cases with UD-Q4_K_XL, offloading about ~20GB to CPU.
What could be the reason? I did compile from source on latest commit with
cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="86;89;120"
Panchovix
changed discussion status to
closed