Slow `agentica-org_DeepCoder-14B-Preview-bf16` performance?
On Apple silicon, I'm getting great performance from the integer-quantized models in this repository. However, the floating-point `.gguf` runs extremely slowly and only on the CPU. What am I doing wrong? I'm on llama.cpp commit `fe5b78c89670b2f37ecb216306bed3e677b49d9f`.
```
time ~/src/ggerganov/llama.cpp/build/bin/llama-cli -s `date +%s` -c 0 -n -1 --multiline-input \
  -sys "A conversation between man and machine." \
  -m ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf
```
Meanwhile, substituting the `phi-4-fp16.gguf` file from the official Microsoft repository (https://huggingface.co/microsoft/phi-4-gguf/tree/main) into the same invocation results in the model running on the GPU at great speed.
Are there additional `llama-cli` arguments needed to get this bf16 model running on the GPU? Am I under-appreciating the difference between fp16 and bf16?
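For what it's worth, is explicitly requesting full offload with `-ngl` the right thing to try here, or does bf16 need something else? (The layer count of 99 below is just an arbitrary "all layers" value; the rest of the command is the same as above.)

```
time ~/src/ggerganov/llama.cpp/build/bin/llama-cli -ngl 99 -s `date +%s` -c 0 -n -1 --multiline-input \
  -sys "A conversation between man and machine." \
  -m ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf
```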
Hmm, bf16 versus fp16 may be the cause. I would have thought Apple silicon bf16 support was fine, but it might not be.
Is there a reason you prefer bf16 over Q8_0, which should be functionally identical?
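If you want to try Q8_0 without downloading another file, you should be able to requantize the bf16 GGUF locally with `llama-quantize` from the same build. Something like this (the output filename is just an example):

```
~/src/ggerganov/llama.cpp/build/bin/llama-quantize \
  ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf \
  ~/models/agentica-org_DeepCoder-14B-Preview-Q8_0.gguf \
  Q8_0
```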