Slow `agentica-org_DeepCoder-14B-Preview-bf16` performance?

#1
by jac-jim - opened

On Apple silicon, I'm getting great performance with the integer-quantized models in this repository.

However, the floating-point .gguf runs extremely slowly, and on CPU only. What am I doing wrong? I'm on llama.cpp commit fe5b78c89670b2f37ecb216306bed3e677b49d9f.

```
time ~/src/ggerganov/llama.cpp/build/bin/llama-cli -s `date +%s` -c 0 -n -1 --multiline-input -sys "A conversation between man and machine." -m ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf
```

Meanwhile, substituting the phi-4-fp16.gguf file from the official Microsoft repository (https://huggingface.co/microsoft/phi-4-gguf/tree/main) into the same invocation results in the model running on the GPU at great speed.

Are there additional llama-cli arguments needed to get this bf16 model running on the GPU? Am I under-appreciating the difference between fp16 and bf16?
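
For reference, here is a variant of the same command that explicitly requests full GPU offload; `-ngl` (`--n-gpu-layers`) is a standard llama-cli flag, and 99 here just means "offload as many layers as possible". Whether that makes any difference for bf16 on Metal is exactly what I'm unsure about:

```
time ~/src/ggerganov/llama.cpp/build/bin/llama-cli -s `date +%s` -c 0 -n -1 -ngl 99 --multiline-input -sys "A conversation between man and machine." -m ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf
```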

Hmm, bf16 versus fp16 may be the cause. I would have thought Apple silicon bf16 support was fine, but it might not be?
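
One way to test that hypothesis, assuming your build also produced llama-quantize alongside llama-cli: re-export the bf16 file as fp16 (F16 is one of llama-quantize's target types; the output filename below is just illustrative) and see whether that copy runs on the GPU:

```
~/src/ggerganov/llama.cpp/build/bin/llama-quantize ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf ~/models/agentica-org_DeepCoder-14B-Preview-f16.gguf F16
```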

Is there a reason you prefer bf16 over Q8_0, which should be functionally identical?
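
If a Q8_0 file isn't already to hand, one can be produced locally from the bf16 file with the same tool (again, the output filename is just illustrative):

```
~/src/ggerganov/llama.cpp/build/bin/llama-quantize ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf ~/models/agentica-org_DeepCoder-14B-Preview-Q8_0.gguf Q8_0
```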
