34B Q4_K_M is broken?
Default settings:
Built from the latest commit here: https://github.com/tiiuae/llama.cpp-Falcon-H1
Using:
git clone https://github.com/tiiuae/llama.cpp-Falcon-H1.git
cd llama.cpp-Falcon-H1/
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j 16
Ran with:
./build/bin/llama-server -c 8192 -ngl 99 -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf
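For reference, this minimal request reproduces the broken output (a sketch; it assumes the default port 8080 since the command above doesn't set --port, and the prompt text is just an example):
# query llama-server's built-in /completion endpoint with the same sampling settings
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Write one sentence about falcons.", "n_predict": 64, "temperature": 0, "top_k": 0, "top_p": 1}'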
It's also extremely slow (unsure if this is expected or not):
eval time = 1060.21 ms / 12 tokens ( 88.35 ms per token, 11.32 tokens per second)
On a 3090, a 32B model like Qwen runs closer to ~30 tokens per second.
edit: temp 0, topk 0, topp 1
Hey there,
Thanks for your comment.
You might want to add a system prompt like "You are a helpful assistant" for better alignment. We tested locally, and it should work after this fix.
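For example, through the server's OpenAI-compatible endpoint (a minimal sketch; adjust the host, port, and user message to your setup):
# send a chat request that includes the system prompt
curl -s http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "You are a helpful assistant"}, {"role": "user", "content": "Hello, who are you?"}]}'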
Regarding throughput: Falcon-H1 uses Mamba2 SSM operations, which are less optimized in current Triton kernels for short sequences, so it lags behind full-attention models like Qwen on small contexts. However, it outperforms them once you scale past ~16k tokens; see the blog post for throughput details.
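You can check that crossover yourself with llama-bench, which ships in the same build and accepts comma-separated prompt lengths (a sketch reusing your model path; the long prompt may not fit on a single 3090):
# compare prompt-processing and generation speed at a short vs. long context
./build/bin/llama-bench -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf -ngl 99 -p 512,16384 -n 128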
What is the temperature you use for generation?
It's important to know that Falcon-H1 models are sensitive to temperatures higher than 0.4.
Locally we get good generations with T=0.1
I suggest you try that and let us know please 😊
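For example, relaunching with llama.cpp's standard sampling flags (a sketch based on your original command):
# --temp 0.1 keeps Falcon-H1 in its stable temperature range
./build/bin/llama-server -m /home/sai/Downloads/Falcon-H1-34B-Instruct-Q4_K_M.gguf -c 8192 -ngl 99 --temp 0.1 --top-k 0 --top-p 1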
Thanks for the info on the recommended/required temperature!
I just tried launching with:
~/l/b/bin ❯❯❯ ./llama-server --model /home/user/Downloads/Falcon-H1-34B-Instruct-Q4_0.gguf -c 14000 -ngl 73 -t 0.1 --port 8678 --jinja --chat-template-file /home/user/Downloads/chat_template.jinja
(jinja file) but got the same results.
Two other things: I personally get really bad throughput, and sometimes the generated text isn't properly sent to the client (in those cases the GPU is busy while nothing shows up).
Also, llama-cli doesn't work, but I guess this is intended for now?
I'm seeing the same issues with Q8_0 as well.
BF16 is completely broken too on the latest build.
@DhiyaEddine did you even test it? The GGUF is completely cooked.