Some benchmarks
Your model scores the highest among Largestral finetunes (not merges) on UGI and my benchmark. Good work.
In my personal experience it feels a bit dumber than the official model, but less so than the other community tunes. It's also gotten hornier and better at negativity. Almost feels worth the sacrifice in intelligence.
Thanks! I've got one more trick up my sleeve that might bring Behemoth v2 closer to OG Largestral.
Using it for a few days now, this is my favorite model for writing, and it's still smart enough to keep loaded for coding/work, etc. Whatever you did with your slop-removal experiments on the smaller models is working.
@gghfez I haven't used the slop removal on anything but Nemo yet xD
I'll try it on Cydonia soon.
Ah okay. I haven't used it for "role-playing", but I'm finding it's great at "write X in the style of …" style prompts.
Prompt: "Write a story based on Battlestar Galactica in the prose of Haruki Murakami from the perspective of Gaius Baltar"
The Behemoth story is the only one that feels like a Murakami novel while also understanding the character from the sci-fi series I referenced.
Mistral-Large, on the other hand, feels like a Mistral-Large story, with its "hushed corridors".
@gghfez wow, that's actually pretty good. did you use metharme or mistral?
Mistral.
Generally I've noticed with these finetunes of Instruct models: if you use the original template, the prose/voice changes still come through.
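For anyone unsure what switching templates actually changes: it's just the wrapper text around each turn. A rough sketch of the two formats mentioned in this thread (helper names are mine; exact whitespace handling varies by backend, so treat this as an assumption-laden illustration, not the canonical tokenizer behavior):

```python
def mistral_prompt(user_msg: str) -> str:
    # Mistral Instruct wraps the user turn in [INST] ... [/INST];
    # the model continues generating after the closing tag.
    return f"[INST] {user_msg} [/INST]"

def alpaca_prompt(user_msg: str) -> str:
    # Alpaca uses labeled plain-text sections; the model completes
    # whatever follows "### Response:".
    return (
        "### Instruction:\n"
        f"{user_msg}\n\n"
        "### Response:\n"
    )

print(mistral_prompt("Write a haiku about benchmarks."))
print(alpaca_prompt("Write a haiku about benchmarks."))
```

Frontends like SillyTavern or text-generation-webui build these strings for you when you pick a template preset, which is why swapping from Mistral to Alpaca is a one-click experiment.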
This model has become my favorite multi-purpose tool. Subjectively, it is the best balance of creativity and smarts available today. It has become my current 'daily driver' ... well done
Chuck, my friend, could you take a look at my two latest Behemoths and possibly test them for your NeoEvalPlusN_benchmark?
https://huggingface.co/TheDrummer/Behemoth-R1-123B-v2 - Largestral 2411 with reasoning training
https://huggingface.co/TheDrummer/Behemoth-X-123B-v2 - Largestral 2411 with updated training
I'm also training a Largestral 2407 version of X called Behemoth ReduX. Will upload soon.
I sadly can't test your R1 on my bench for the same reason I can't test other thinking models: at temperature 0 they loop like crazy. Your Behemoth-X has scored the highest of the 2411 tunes, congrats. I was, however, very disappointed with the quality of the outputs using the Mistral template (not sure if it's due to the spaces or some other BS; please don't ever use it again, it sucks, it's confusing, and even default Mistral sucked when I used it), but after switching to Alpaca I got much better results. R1 is meh: cucked in its thinking, didn't like it, but I guess it's nice that you've proven a non-thinker can be turned into a thinker. Please filter out every "guideline", "ethic", "moral", and "illegal" from your thinking dataset next time you tune.
Thanks Chuck! The total score looks great at 21. Not sure why it landed in 3rd place. Btw, Behemoth ReduX (2407 tune) is coming out soon and testers are saying it's much better than 1.2 or X.