Some benchmarks
Your model scores the highest among Largestral finetunes (not merges) on UGI and my benchmark. Good work.
In my personal experience it feels a bit dumber than the official model, but less so than the other community tunes. It's also gotten hornier and better at negativity. Almost feels worth the sacrifice in intelligence.
Thanks! I've got one more trick up my sleeve that might bring Behemoth v2 closer to OG Largestral.
Using it for a few days now, this is my favorite model for writing, and it's still smart enough to keep loaded for coding/work, etc. Whatever you did with your slop-removal experiments on the smaller models is working.
@gghfez I haven't used the slop removal on anything but Nemo yet xD
I'll try it on Cydonia soon.
Ah okay. I haven't used it for "role-playing", but I'm finding it's great at "write X in the style of …" style prompts.
Prompt: "Write a story based on Battlestar Galactica in the prose of Haruki Murakami from the perspective of Gaius Baltar"
The Behemoth story is the only one that feels like a Murakami novel while also understanding the character from the sci-fi series I referenced.
Mistral-Large, on the other hand, feels like a Mistral-Large story, with its "hushed corridors".
@gghfez wow, that's actually pretty good. did you use metharme or mistral?
Mistral.
Generally I've noticed with these finetunes of Instruct models: if you use the original template, the prose/voice changes still come through.
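For anyone unsure what switching templates actually changes: it's just the wrapper text around each turn. A rough sketch of the two formats mentioned in this thread (helper names are mine; exact whitespace handling varies by backend, so treat this as an assumption-laden illustration, not the canonical tokenizer behavior):

```python
def mistral_prompt(user_msg: str) -> str:
    # Mistral Instruct wraps the user turn in [INST] ... [/INST];
    # the model continues generating after the closing tag.
    return f"[INST] {user_msg} [/INST]"

def alpaca_prompt(user_msg: str) -> str:
    # Alpaca uses labeled plain-text sections; the model completes
    # whatever follows "### Response:".
    return (
        "### Instruction:\n"
        f"{user_msg}\n\n"
        "### Response:\n"
    )

print(mistral_prompt("Write a haiku about benchmarks."))
print(alpaca_prompt("Write a haiku about benchmarks."))
```

Frontends like SillyTavern or text-generation-webui build these strings for you when you pick a template preset, which is why swapping from Mistral to Alpaca is a one-click experiment.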
This model has become my favorite multi-purpose tool. Subjectively, it is the best balance of creativity and smarts available today. It has become my current 'daily driver' ... well done
Chuck, my friend, could you take a look at my two latest Behemoths and possibly test them for your NeoEvalPlusN_benchmark?
https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2 - Largestral 2411 with reasoning training
https://huggingface.co/TheDrummer/Behemoth-X-123B-v2 - Largestral 2411 with updated training
I'm also training a Largestral 2407 version of X called Behemoth ReduX. Will upload soon.
I sadly can't test your R1 on my bench for the same reason I can't test other thinking models: at temperature 0 they loop like crazy. Your Behemoth-X has scored the highest of the 2411 tunes, congrats. I was, however, very disappointed with the quality of the outputs using the Mistral template (not sure if it's due to the spaces or some other BS; please don't ever use it again, it sucks, it's confusing, and even default Mistral sucked when I used it), but after switching to Alpaca I got much better results. R1 is meh: cucked in its thinking, didn't like it, but I guess it's nice that you've proven a non-thinker can be turned into a thinker. Please filter out every "guideline", "ethic", "moral", and "illegal" from your thinking dataset next time you tune.
Thanks Chuck! The total score looks great at 21. Not sure why it landed in 3rd place. Btw, Behemoth ReduX (2407 tune) is coming out soon and testers are saying it's much better than 1.2 or X.