Content diversity.

#4
by MateoTeo - opened

Tried making a custom Konosuba RPG card. Crafted and polished custom instructions for each of the main characters. And what did we get in the process?

  • Darkness strikes down her foes with 100% precision (10 retries in a row xD).
  • Aqua is depicted as wise, all-knowing, and helpful.
  • Megumin throws fireballs and myriads of other spells.
  • All kinds and types of instructions are ignored (tried positive, negative, positive and negative together, providing examples, removing everything).
  • Pure comedy... but not in the right way.

This finetune clings too hard to common clichés and character archetypes. Heroes are always good, baddies are always bad. Warriors always strike true, gods are wise, and mages are full of arcane knowledge. Adding a dataset with atypical and negative patterns would be beneficial.

Still, with its overly positive and righteous leaning, this finetune shows fun results on more vanilla, common stuff.

P.S. Here's a model that followed the instructions correctly, but it's a merge, so I don't know if it will be helpful: https://huggingface.co/Steelskull/L3.3-San-Mai-R1-70b

Just chiming in to say that I, for one, greatly prefer Vulpecula over San-Mai. San-Mai, and every model in that series, requires finagling to get it into NSFW stuff. Even then, it's clearly reluctant to engage with it, and can lose coherency. I've not had that issue with Vulpecula. In fact, out of every model I've used - San-Mai, Evathene, Electranova, Cirrus, Wayfarer, New Dawn, Nova Tempus, Nevoria, Mokume Gane, or FallenLlama... I much, much prefer Vulpecula's coherence, response length, and ability to engage in NSFW content without wrangling it with multiple samplers. Min-P, temp, and XTC are enough. The model works even when using different Llama 3 formats, such as LeCeption or Llama 3 Roleplay 2.0, which is nice. That's a sign of a solid model.

Hell, it works perfectly with the reasoning formatting engaged, and doesn't absolutely flood the tag with massive walls of text that gobble context like I've had other models do. I can actually use a reasoning prefix to direct the style and length, and it works... which is rare.

Whatever you did with Vulpecula, please keep at it. Other models have their strengths, but Vulpecula is the first model in a while that hasn't frustrated me during long-form RP. A model that just "worked" without fancy sampler shit.

Sorry for chiming in on your post OP, just wanted to toss in my own opinions on the model. Also, if you're having problems with prompt adherence, I've found that high Min-P does it no favors for a lot of models. Too much and it starts dropping details (I despise the standard 0.05 Min-P often used for many models); too little and it can have problems with obeying and coherence. I like to use around 0.018-0.025 Min-P and 0.86 on the temp. Then a tad of XTC at 0.15 exponent and 0.05-0.25 probability. DRY set to a 0.8 multiplier, 1.75 base, 4 allowed length, and a DRY penalty range of 4096. Ensure sequence breakers are properly set and validated in the JSON validator. Smoothing can be a double-edged sword as well - coherence and instruction following can take a hit, although it can help if you find a model's sweet spot. Just a suggestion. A lot of "recommended" settings out there aren't to my taste, and I'm autistic as hell about the small details in my RPs.
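
For anyone wondering what Min-P actually does to the distribution, here's a rough pure-Python sketch (my own illustration of the filter, not any backend's actual code), using the temp and Min-P ranges mentioned above:

```python
import math

def min_p_filter(logits, min_p=0.02, temperature=0.86):
    """Sketch of min-p filtering after temperature scaling.

    min-p keeps every token whose probability is at least
    min_p * (probability of the single most likely token), so the cutoff
    scales with how confident the model already is.
    """
    # temperature scaling, then a numerically stable softmax
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # zero out tokens below the scaled threshold, renormalise the rest
    threshold = min_p * max(probs)
    kept = [p if p >= threshold else 0.0 for p in probs]
    norm = sum(kept)
    return [p / norm for p in kept]
```

With a low Min-P like 0.02, only genuinely negligible tokens get pruned, which matches the "too much and it drops details" observation: a 0.05 cutoff would prune five times as aggressively relative to the top token.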

@Surprisekitty Hey, thanks for the advice on sampling stuff, will try that later. Because hell knows I tried to understand all of that but never got better results than the default ones :D
Can't say that San-Mai is a great merge or anything, because yeah, it is very squeamish about NSFW themes and will try to avoid and steer away from them.
I just tested a bunch of other models and found that San nailed the characters every single time... but it also dried up every ecchi element (insert the Thanos meme about perfect balance here).

So, this can be the case with the model's inner alignment too: respect for individuality, virtue ethics, etc.
There was a funny case where many models didn't want to portray a character as instructed... until I added an instruction that the negative traits are not disrespectful to that "person," who gave their consent and wants to be portrayed fully. And everything worked on the first try (FFS!) xD

One of the reasons I love FallenLlama's unhingedness is that it just ignores all that crap. But the model's excessive love for em-dashes and "—if only..." patterns kinda ruins the experience (Thanos meme again).

UPD. Ough... one moment you're yanking sampler settings left and right and achieving nearly zero difference, and the next, you just raise the temp from 0.8 to 0.85, lower the DRY setting by around 0.20, and NOW Darkness misses every time as instructed... I hate sampler settings... Т___т"
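
For context on why small DRY nudges matter so much: as I understand it, the penalty only kicks in once a would-be repetition reaches the allowed length, then grows exponentially with the base. This is my own rough approximation of the published DRY sampler's formula, not the reference implementation:

```python
def dry_penalty(context, candidate, multiplier=0.8, base=1.75, allowed_length=4):
    """Approximate sketch of the DRY repetition penalty.

    Looks for earlier occurrences of `candidate` in `context` and measures
    how many tokens immediately before that occurrence match the current
    tail of the context. If emitting `candidate` would extend a repeated
    run of at least `allowed_length` tokens, the candidate's logit gets
    docked by multiplier * base ** (match_len - allowed_length).
    """
    match_len = 0
    for i in range(len(context) - 1, -1, -1):
        if context[i] != candidate:
            continue
        # count matching tokens walking backwards from position i
        # against the tail of the context
        n = 0
        while n < i and n < len(context) and context[i - 1 - n] == context[-1 - n]:
            n += 1
        match_len = max(match_len, n)
    if match_len < allowed_length:
        return 0.0
    return multiplier * base ** (match_len - allowed_length)
```

Because the penalty is exponential in the match length, shaving a little off the multiplier or base can move it from "gently discourages loops" to "barely does anything," which would explain the sudden behavior flip.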

And my Min-P has been 0.015 from the start of testing. The rest were kinda close-ish to @Surprisekitty's. Maybe Top-P 0.8 is a bit off.
So, I think this discussion turned into a shitpost instead of critique. Sorry, Sao ( ̄_ ̄|||)

Not sure about Top-P; I've generally seen people recommend leaving that setting close to default - 0.95-0.99 may be fruitful.

However, if you've tried Typical-P, it may help if you're having repetition problems. A good range is generally 0.95-0.99 - although I generally don't lower Typical-P below 0.95, since it tends to hurt the longer you RP. I used to settle on 0.97 as a Typical-P value quite a bit before I moved away from it.
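
Both filters are simple to sketch. Here's a rough pure-Python illustration (my own, not any backend's code) of which token indices Top-P and Typical-P would keep from a toy distribution:

```python
import math

def top_p_keep(probs, top_p=0.95):
    """Nucleus (top-p) filtering: keep the smallest set of tokens whose
    cumulative probability reaches top_p, highest-probability first."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= top_p:
            break
    return keep

def typical_p_keep(probs, typical_p=0.97):
    """Locally typical sampling: rank tokens by how close their surprisal
    (-log p) is to the distribution's entropy, then keep tokens until
    their cumulative probability reaches typical_p."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    order = sorted(
        range(len(probs)),
        key=lambda i: abs(-math.log(probs[i]) - entropy) if probs[i] > 0 else float("inf"),
    )
    keep, cum = set(), 0.0
    for i in order:
        keep.add(i)
        cum += probs[i]
        if cum >= typical_p:
            break
    return keep
```

The key difference: Top-P always favors the most probable tokens, while Typical-P can demote an overwhelming favorite if its surprisal sits far from the entropy, which is why it can help with repetition but hurts if pushed too low.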

And I feel you on sampler settings. The smallest changes can have a big difference, haha.

Something I like to do when testing sampler settings is open up an existing long-form RP session and regenerate the response ~20 times. See how it's handling the memory and whether it starts to mention certain details present in previous messages, that sort of thing. Testing samplers on old RPs with a lot of messages can be great for stability testing too. If it completely falls apart, I edit the samplers until it seems stable. Once I get a good one, I start a new chat and see how it fares. It's honestly insane how much sampler settings can affect. Bad settings can make a model dumb as rocks; other settings can make the model spaz out - such as high DRY or rep-pen.
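
That regeneration test is easy to script against a local OpenAI-compatible backend, if you'd rather not mash the regenerate button by hand. A rough sketch - the URL and the sampler field names are assumptions about whatever server you run, not a documented API:

```python
# Hypothetical local OpenAI-compatible endpoint (assumption)
API_URL = "http://127.0.0.1:5000/v1/chat/completions"

def build_regen_requests(history, samplers, n=20):
    """Produce n identical chat-completion payloads so every regeneration
    replays the same long-form history with the same sampler settings;
    only the seed varies between attempts."""
    return [
        {
            "messages": history,
            "temperature": samplers.get("temperature", 0.86),
            "min_p": samplers.get("min_p", 0.02),
            "seed": i,
        }
        for i in range(n)
    ]

requests_batch = build_regen_requests(
    history=[{"role": "user", "content": "Continue the scene."}],
    samplers={"temperature": 0.86, "min_p": 0.02},
)
# Each payload would then be POSTed to API_URL and the replies skimmed
# for dropped details, repetition, or instability.
```

Keeping the history and samplers fixed while only the seed changes means any drift across the 20 replies points at the settings, not at the prompt.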

Instruct and context templates can make a large difference too. Llamaception with stepped thinking seems to follow what's going on quite well through long RPs, but without proper sampler settings, it's prone to a similar structure in each reply. Llama Roleplay V2.0 feels more natural but can have problems handling the nuances of what's going on. Llama 3 Instruct seems in the middle.
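
For reference, those presets all build on the stock Llama 3 Instruct turn format. A quick sketch of assembling it, based on the published Llama 3 special tokens (the function itself is just my illustration):

```python
def llama3_prompt(system, turns):
    """Assemble a prompt in the stock Llama 3 Instruct format: each turn
    is wrapped in header tokens and terminated with <|eot_id|>."""
    out = "<|begin_of_text|>"
    out += f"<|start_header_id|>system<|end_header_id|>\n\n{system}<|eot_id|>"
    for role, text in turns:
        out += f"<|start_header_id|>{role}<|end_header_id|>\n\n{text}<|eot_id|>"
    # leave the assistant header open so the model writes the next reply
    out += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return out
```

Custom presets like the ones above mostly differ in what they pack into the system turn and how they stack extra instruction blocks, not in these delimiters, which is presumably why the model tolerates switching between them.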

Also, if you're testing, just a tip for this model - if you see replies start with the name of your character, like "Anon:" and then the text... set "Include Names" to "Never" under the settings. That fixes it. It doesn't always show up, but it took me a while to figure that out. I initially blamed sampler settings, but those weren't the culprit.

Llama 3 models also despise certain symbols in character cards, especially brackets or equal signs. I would test sampler settings on cards with minimal fancy formatting - it can really make a difference. The model doesn't seem to have a problem with ":" symbols, and parentheses are well-tolerated. I found that once I started 'correcting' the formatting on character cards, performance improved and sampler testing was more consistent.
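
A quick illustration of the kind of cleanup pass I mean - my own blunt sketch, with the symbol list following what's reported above (brackets and equals out, colons and parentheses kept):

```python
import re

def clean_card_text(text):
    """Strip the symbols Llama 3 reportedly dislikes from a character
    card, while keeping colons and parentheses, which are tolerated.
    A blunt regex sketch, not a card-format parser."""
    text = re.sub(r"[\[\]{}=]", " ", text)  # drop brackets and equal signs
    text = re.sub(r"[ \t]{2,}", " ", text)  # collapse the leftover gaps
    return text.strip()
```

Running it over a W++-style fragment like `Name = [Darkness] {Crusader}: tank (masochist)` leaves the colon and parenthetical intact while flattening the bracket noise.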

Honestly, I wouldn't worry about the post being a "shit-post" as you mentioned. I feel that sampler settings and model "quirks" are really under-researched beyond official papers or graphs. Like how Llama models can sometimes show a "slightly lower BPW can be better than higher BPW" quirk when quanting. It feels esoteric as hell sometimes.

@Surprisekitty Yeah... "esoteric" is the right word here when you consider all the layers of quants, settings, quirks, and possibilities in LLM behavior. And thanks for sharing; I'm currently analyzing Llamaception. So I'll share a bit of my notes too. Maybe they'll be useful for other folks, or just to hear another opinion :D

  • Must say that I personally hate W++ and similar pseudo-code-heavy formats, if that's what you mean. I once tried an experiment by creating a card (~1.5k tokens) in 3 formats: array/dictionary list blocks, W++, and JSON. The last one still worked better than W++ and was a bit lighter on tokens xD
  • From tests on medium-large CYOAs with rules and lorebooks, I've noted that clean, concise, structured formats with block descriptions and clean separation between them behave better on smaller models and quants. Like blocks in XML tags... but that's a bit heavy on tokens, and a header-based structure with lists works nearly the same, so that's my pick for now. The KISS principle in all its glory.
  • And I found that LLMs are extremely picky about a word's tone and meaning. For example, with "Creatively impersonate {{char}}", the AI will try using what it has... creatively :D , while "Masterfully impersonate {{char}}" will allow stronger deviations from the source, based on the model's alignment and built-in criteria of 'masterfulness'.
  • The same goes for system prompts and other instructions: clinical tone -> better adherence (if not ruined by sampler settings, ofc), but "drier" output without examples. More examples = token bloat and worse adherence </3
  • Like with our biological neurons, specific words and tokens will trigger related chains of neurons (data/knowledge). The difference is that an LLM's neurons are highly "narrow-visioned", so it may be beneficial to add related tags, themes, concepts, and media to the prompt/description. Basically, the AI will be aware of the related themes, and they will have a stronger influence on the output.
  • And if we add all LLM layers on top, including that each LLM has its own quirks, we will get...the optimization BUTTHURT and EARLY BALDING for FREE! Yeeey...👨‍🦲
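
To make the format comparison in the notes above concrete, here's a toy example of the same (invented) character info expressed as JSON versus the header-plus-list layout; the card content and field names are made up for illustration:

```python
import json

# Toy card content, invented for the comparison
card = {
    "name": "Darkness",
    "class": "Crusader",
    "traits": ["sturdy", "never hits anything"],
}

# JSON form - the format that beat W++ in the experiment above
json_form = json.dumps(card, indent=2)

# Header-based structure with lists - behaves about the same
# while being lighter on tokens
header_form = (
    "## Darkness\n"
    "Class: Crusader\n"
    "Traits:\n"
    "- sturdy\n"
    "- never hits anything\n"
)
```

The header form also avoids the brackets and equal signs that Llama 3 reportedly chokes on, which may be part of why it holds up on smaller quants.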
