comparison

#2
by kalle07 - opened

your model never follow instructions!

sry
https://huggingface.co/McGill-NLP/A3-Qwen3.5-9B

is much much better

I have tested several finetunes of 9B on the https://benchlocal.com/ bench packs and you are right McGill-NLP/A3-Qwen3.5-9B is better on average vs this model. Except for the Hermesagent score. Carnice has the highest Hermes agent score of all. McGill-NLP/A3-Qwen3.5-9B has the highest CLI score.
image

i dont believe benchmarks that much, but okay ... what hardware you have to make this local benchmark?

yes from
https://huggingface.co/Jackrong, he some interesting tunes

Two 3090, but one would be enough. No thinking and no MTP was active.

@Neiko2002 how do you installed hermes ? and which github? only option via docker?
the tool check runs allready ...

The https://benchlocal.com/ is a benchmark tool which uses a hermes docker image for the hermes-20 bench pack.
https://github.com/stevibe/HermesAgent-20

so yes
downlaod git-repo
npm install
...
and it works?

Not sure what you mean. You can visit the https://benchlocal.com/ website download the benchmarking tool. Start it and choose HermesAgent-20 bench pack there. Than you can choose the model you want to test (local or cloud) and let it run. It automatically setups the docker container, downloads the git repo https://github.com/stevibe/HermesAgent-20 and runs the tests inside the container. The results will be displayed in the benchmark tool.

i would think so but

grafik

This means docker is not running on your machine. You need docker for this bench pack. Than it can install the docker image which contains hermes and the HermesAgent-20 tasks.

dont know if you on windows, but with only WSL (not WSL2) a minimal docker would run?
(why all is docker, WSL/2 is unsecure on windows) :..(

I'm on windows (with WSL2) as well. But as far as I know docker desktop does not work with WSL1. I'm not a docker fan myself, but I makes it quite easy for people to share complex installations across different operating systems.

ohm... thx anyway ... dont know if I will do that ;)

@Neiko2002
seems teh test is very unstable (or the model...) seem you must run 5-10 times the same test to have a mean.
i only run the test once for now on this model
negentropy-claude-opus-4.7-9b-q6_k -> hermes -> 78 -> yours was 68. any explanation ? i mean the model temp is zero so it should not that differ every time...

grafik

I know what you mean with unstable. But a model never answers exactly the same given the same input text, even with temp 0. So yeah, there are a few fluctuations in the points, but not much. What I did in all my tests is disable thinking. This reduces the fluctuations even more, and it's also the reason why my points are so much lower. I feel your 78 with thinking is in line with my 68 without thinking.

i see, sou you changed the system prompt from the provider ...
but even with thinking all bug-find test with tuned models are worse than original ...

btw most tuned models they claim the thinking procces is tuned.
i have still start, but will see ;)

I was planning to run all tests again with thinking mode on. But some models just spend a stupid amount of tokens for a single tasks, that it takes forever or even fails. I'm using vLLM as a LLM inference server and disabled thinking there. For the Qwen models I could use a thinking budget and reasoning end string to force the model to stop thinking after the budget got spend, but the other models do not have it.

Yes, same way … even if model pass a test, it’s more or less pointless if it takes five minutes to do...

Sign up or log in to comment