UD-Q3_K_XL sometimes gives gibberish when using it via API (SillyTavern). UD-Q4_K_XL works fine.

#5
by Panchovix - opened

Hi there, thanks for always doing these great quants!

I was testing some short conversations via SillyTavern, but sometimes I get an infinite run of GGGGGGGGGGGGGGGG or an infinite "Blocky Blocky Blocky". Once the gibberish appears, it persists in the internal API server as well.

image.png
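
For anyone trying to reproduce this, a rough way to flag this kind of output automatically is to look for one short unit repeated many times in a row. This is only a minimal sketch: `looks_degenerate` and its thresholds are hypothetical and not part of llama.cpp or SillyTavern, and it will also flag legitimate long runs such as indentation.

```python
import re


def looks_degenerate(text: str, min_repeats: int = 8) -> bool:
    """Return True if `text` contains one short unit repeated back to back.

    Hypothetical helper: the name and the thresholds (unit length up to
    20 characters, 8 repeats) are assumptions chosen for illustration.
    `min_repeats` is expected to be at least 2.
    """
    # Capture a unit of 1-20 characters (shortest first), then require at
    # least `min_repeats - 1` immediate copies of that same unit.
    pattern = re.compile(r"(.{1,20}?)\1{%d,}" % (min_repeats - 1), re.DOTALL)
    return pattern.search(text) is not None


if __name__ == "__main__":
    print(looks_degenerate("G" * 16))
    print(looks_degenerate(" ".join(["Blocky"] * 10)))
    print(looks_degenerate("Hello, how are you today?"))
```

A check like this could be run on each API response in a test loop to record when the output first breaks down.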

This is with UD-Q3_K_XL loaded fully in VRAM (128 GB total VRAM).

The model works fine in all cases with UD-Q4_K_XL, offloading about 20 GB to CPU.

What could be the reason? I compiled from source on the latest commit with:

cmake -B build -DGGML_CUDA=ON -DGGML_CUDA_FA_ALL_QUANTS=ON -DGGML_BLAS=OFF -DCMAKE_CUDA_ARCHITECTURES="86;89;120"

Panchovix changed discussion status to closed
