Slow `agentica-org_DeepCoder-14B-Preview-bf16` performance?

#1
by jac-jim - opened

On Apple silicon, I'm getting great performance with the integer-quantized models in this repository.

However, the floating-point .gguf runs extremely slowly, and on CPU only. What am I doing wrong? I'm on llama.cpp commit fe5b78c89670b2f37ecb216306bed3e677b49d9f.

```
time ~/src/ggerganov/llama.cpp/build/bin/llama-cli -s `date +%s` -c 0 -n -1 --multiline-input -sys "A conversation between man and machine." -m ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf
```

Meanwhile, substituting the phi-4-fp16.gguf file from the official Microsoft repository (https://huggingface.co/microsoft/phi-4-gguf/tree/main) into the same invocation results in the model running on the GPU at great speed.

Are there additional llama-cli arguments needed to get this bf16 model running on the GPU? Am I under-appreciating the difference between fp16 and bf16?
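
For reference, here is a variant of the same command that explicitly requests full GPU offload; `-ngl` (`--n-gpu-layers`) is a standard llama-cli flag, and 99 here just means "offload as many layers as possible". Whether that makes any difference for bf16 on Metal is exactly what I'm unsure about:

```
time ~/src/ggerganov/llama.cpp/build/bin/llama-cli -s `date +%s` -c 0 -n -1 -ngl 99 --multiline-input -sys "A conversation between man and machine." -m ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf
```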

Hmm, bf16 versus fp16 may be the cause. I would have thought Apple silicon bf16 support was fine, but it might not be?
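
One way to test that hypothesis, assuming your build also produced llama-quantize alongside llama-cli: re-export the bf16 file as fp16 (F16 is one of llama-quantize's target types; the output filename below is just illustrative) and see whether that copy runs on the GPU:

```
~/src/ggerganov/llama.cpp/build/bin/llama-quantize ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf ~/models/agentica-org_DeepCoder-14B-Preview-f16.gguf F16
```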

Is there a reason you prefer bf16 over Q8_0, which should be functionally identical?
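
If a Q8_0 file isn't already to hand, one can be produced locally from the bf16 file with the same tool (again, the output filename is just illustrative):

```
~/src/ggerganov/llama.cpp/build/bin/llama-quantize ~/models/agentica-org_DeepCoder-14B-Preview-bf16.gguf ~/models/agentica-org_DeepCoder-14B-Preview-Q8_0.gguf Q8_0
```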
