
34B Q4_K_M is broken?

#1 opened by SaisExperiments

(screenshots of the broken output)

Default settings:

(screenshot of the sampling settings)

Built from the latest commit of https://github.com/tiiuae/llama.cpp-Falcon-H1

Using:

git clone https://github.com/tiiuae/llama.cpp-Falcon-H1.git
cd llama.cpp-Falcon-H1/
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16

Ran with:

./build/bin/llama-server -c 8192 -ngl 99 -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf

It's also extremely slow (unsure if this is expected or not):

       eval time =    1060.21 ms /    12 tokens (   88.35 ms per token,    11.32 tokens per second)

On a 3090, a 32B model like Qwen runs closer to ~30 tokens per second.

edit: also tried temp 0, top-k 0, top-p 1

(screenshot)

Technology Innovation Institute org

Hey there,
Thanks for your comment.
You might want to add a system prompt like "You are a helpful assistant" for better alignment. We tested locally and it should work with this change.
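
For example, something along these lines against llama-server's OpenAI-compatible /v1/chat/completions endpoint should do it (8080 is the default port and the prompt text is just an example, adjust to your setup):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello, who are you?"}
  ]
}'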

Regarding throughput: Falcon-H1 uses Mamba2 SSM operations, which are less optimized in current Triton kernels for short sequences, so it lags behind full-attention models like Qwen on small contexts. However, it outperforms them once you scale past ~16k tokens. See the blog post for throughput details.
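
If you want to check how throughput scales with context length on your 3090, a rough sketch with llama-bench would be the following (flag names as in upstream llama.cpp's llama-bench; the fork may differ):

./build/bin/llama-bench -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf -ngl 99 -p 512,16384 -n 128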

It doesn't help 😂
(screenshot)

I also get the model repeating its own generations:
(screenshot)

Technology Innovation Institute org

What temperature are you using for generation?
It's important to know that Falcon-H1 models are sensitive to temperatures higher than 0.4.

Locally, we get good generations with T=0.1.

I suggest trying that and letting us know, please 😊
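
For llama-server, the default sampling temperature can be set with --temp (note that -t controls the CPU thread count, not temperature), so something like the following should apply T=0.1, assuming the fork keeps the same flags as upstream llama.cpp:

./build/bin/llama-server -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf -c 8192 -ngl 99 --temp 0.1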

Thanks for the info on the recommended/required temperature!

I just tried launching with

./llama-server --model /home/user/Downloads/Falcon-H1-34B-Instruct-Q4_0.gguf -c 14000 -ngl 73 -t 0.1 --port 8678 --jinja --chat-template-file /home/user/Downloads/chat_template.jinja

(jinja file) but got the same results:

(screenshot)

Two other things: I personally get really bad throughput, and sometimes the generated text isn't properly sent to the client (in those cases the GPU keeps working while nothing shows up).
Also, llama-cli doesn't work, but I guess this is intended for now?

I'm seeing the same issues with the Q8_0 also.

BF16 is completely broken too on the latest build.
@DhiyaEddine did you even test it? GGUF's completely cooked.
