kai-os/Carnice-9b · comparison

sry
https://huggingface.co/McGill-NLP/A3-Qwen3.5-9B

Apr 18

your model never follow instructions!

is much much better

17 days ago

I have tested several finetunes of 9B on the https://benchlocal.com/ bench packs and you are right McGill-NLP/A3-Qwen3.5-9B is better on average vs this model. Except for the Hermesagent score. Carnice has the highest Hermes agent score of all. McGill-NLP/A3-Qwen3.5-9B has the highest CLI score.

15 days ago

i dont believe benchmarks that much, but okay ... what hardware you have to make this local benchmark?

yes from
https://huggingface.co/Jackrong, he some interesting tunes

15 days ago

Two 3090, but one would be enough. No thinking and no MTP was active.

15 days ago

•

@Neiko2002 how do you installed hermes ? and which github? only option via docker?
the tool check runs allready ...

15 days ago

The https://benchlocal.com/ is a benchmark tool which uses a hermes docker image for the hermes-20 bench pack.
https://github.com/stevibe/HermesAgent-20

15 days ago

so yes
downlaod git-repo
npm install
...
and it works?

15 days ago

Not sure what you mean. You can visit the https://benchlocal.com/ website download the benchmarking tool. Start it and choose HermesAgent-20 bench pack there. Than you can choose the model you want to test (local or cloud) and let it run. It automatically setups the docker container, downloads the git repo https://github.com/stevibe/HermesAgent-20 and runs the tests inside the container. The results will be displayed in the benchmark tool.

15 days ago

i would think so but

15 days ago

This means docker is not running on your machine. You need docker for this bench pack. Than it can install the docker image which contains hermes and the HermesAgent-20 tasks.

15 days ago

•

dont know if you on windows, but with only WSL (not WSL2) a minimal docker would run?
(why all is docker, WSL/2 is unsecure on windows) :..(

15 days ago

•

I'm on windows (with WSL2) as well. But as far as I know docker desktop does not work with WSL1. I'm not a docker fan myself, but I makes it quite easy for people to share complex installations across different operating systems.

15 days ago

•

ohm... thx anyway ... dont know if I will do that ;)

9 days ago

•

edited 9 days ago

@Neiko2002
seems teh test is very unstable (or the model...) seem you must run 5-10 times the same test to have a mean.
i only run the test once for now on this model
negentropy-claude-opus-4.7-9b-q6_k -> hermes -> 78 -> yours was 68. any explanation ? i mean the model temp is zero so it should not that differ every time...

9 days ago

I know what you mean with unstable. But a model never answers exactly the same given the same input text, even with temp 0. So yeah, there are a few fluctuations in the points, but not much. What I did in all my tests is disable thinking. This reduces the fluctuations even more, and it's also the reason why my points are so much lower. I feel your 78 with thinking is in line with my 68 without thinking.

9 days ago

i see, sou you changed the system prompt from the provider ...
but even with thinking all bug-find test with tuned models are worse than original ...

btw most tuned models they claim the thinking procces is tuned.
i have still start, but will see ;)

9 days ago

I was planning to run all tests again with thinking mode on. But some models just spend a stupid amount of tokens for a single tasks, that it takes forever or even fails. I'm using vLLM as a LLM inference server and disabled thinking there. For the Qwen models I could use a thinking budget and reasoning end string to force the model to stop thinking after the budget got spend, but the other models do not have it.