
34B Q4_K_M is broken?

#1 opened by SaisExperiments

(screenshots of the broken output)

Default settings:

(screenshot of the sampling settings)

Built from the latest commit of https://github.com/tiiuae/llama.cpp-Falcon-H1

Using:

git clone https://github.com/tiiuae/llama.cpp-Falcon-H1.git
cd llama.cpp-Falcon-H1/
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16

Ran with:

./build/bin/llama-server -c 8192 -ngl 99 -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf

It's also extremely slow (unsure if this is expected or not):

       eval time =    1060.21 ms /    12 tokens (   88.35 ms per token,    11.32 tokens per second)

On a 3090, a 32B model like Qwen runs closer to ~30 tokens per second.

edit: also tried temp 0, top-k 0, top-p 1

(screenshot)

Technology Innovation Institute org

Hey there,
Thanks for your comment.
You might want to add a system prompt like "You are a helpful assistant" for better alignment. We tested locally and it should work with this change.
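
For example, something along these lines against llama-server's OpenAI-compatible /v1/chat/completions endpoint should do it (8080 is the default port and the prompt text is just an example, adjust to your setup):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant"},
    {"role": "user", "content": "Hello, who are you?"}
  ]
}'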

Regarding throughput: Falcon-H1 uses Mamba2 SSM operations, which are less optimized in current Triton kernels for short sequences, so it lags behind full-attention models like Qwen on small contexts. However, it outperforms them once you scale past ~16k tokens. See the blog post for throughput details.
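
If you want to check how throughput scales with context length on your 3090, a rough sketch with llama-bench would be the following (flag names as in upstream llama.cpp's llama-bench; the fork may differ):

./build/bin/llama-bench -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf -ngl 99 -p 512,16384 -n 128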

It doesn't help 😂
(screenshot)

I also get the model repeating its own generations:
(screenshot)

Technology Innovation Institute org

What temperature are you using for generation?
It's important to know that Falcon-H1 models are sensitive to temperatures higher than 0.4.

Locally, we get good generations with T=0.1.

I suggest trying that and letting us know, please 😊
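
For llama-server, the default sampling temperature can be set with --temp (note that -t controls the CPU thread count, not temperature), so something like the following should apply T=0.1, assuming the fork keeps the same flags as upstream llama.cpp:

./build/bin/llama-server -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf -c 8192 -ngl 99 --temp 0.1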

Thanks for the info on the recommended/required temperature!

I just tried launching with

./llama-server --model /home/user/Downloads/Falcon-H1-34B-Instruct-Q4_0.gguf -c 14000 -ngl 73 -t 0.1 --port 8678 --jinja --chat-template-file /home/user/Downloads/chat_template.jinja

(jinja file) but got the same results:

(screenshot)

Two other things: I personally get really bad throughput, and sometimes the generated text isn't properly sent to the client (in those cases the GPU keeps working while nothing shows up).
Also, llama-cli doesn't work, but I guess this is intended for now?

I'm seeing the same issues with the Q8_0 also.

BF16 is completely broken too on the latest build.
@DhiyaEddine did you even test it? GGUF's completely cooked.
