Feedback
Hi!
I gave your model a try after seeing you share it in r/SillyTavern; your mention of the 12B model having better prose really piqued my interest.
The model is... really dumb. It confuses characters, intentions, and context, and occasionally writes nonsense. It might be due to my settings, since I run models warm: t=1.4 and min_p=0.02. I tried going down closer to your tested settings, but as I approached 0.8 the model became much less creative in its responses, completely losing its charm. It's not a big deal; I just keep swiping until I get a reply that works, but I believe this may be a good place for improvement.
It's hard to tell whether the prose is good; different people have different ideas of what good writing is, and a few years ago I would've been impressed by the "shiver down my spine" slop that today simply makes me gag. What I can say for sure is that the model doesn't have any of the common "Mistralisms" that finetunes based on Mistral Small have, and I'm not quite sure yet if this is because of the dataset you used or because of the possibly smaller amount of synthetic data that you mentioned Nemo could have been trained with. I will soon try the 24B version and see for myself how much the base model affects the writing, so I might have better insight in a couple of days.
The model also has a slight, odd tendency toward positivity, where even in dark scenarios it tries to steer out of them and evil characters become wholesome. Once again, a few swipes are enough to get a more appropriate response, so this doesn't really limit the performance of the model; it's still capable of handling darker themes like it's nothing.
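To be concrete about what "warm" means here, this is roughly what my settings look like when sent as a request to a local llama.cpp server; the endpoint, prompt, and other fields are just placeholders, only `temperature` and `min_p` are the settings I'm talking about:

```python
import requests

# Hypothetical local llama.cpp server; only temperature and min_p reflect
# the "warm" settings described above, the rest are placeholders.
payload = {
    "prompt": "### Example RP prompt goes here",
    "temperature": 1.4,   # "warm"; the author's tested range sits around 0.7-0.8
    "min_p": 0.02,        # low min_p keeps more of the tail of the distribution
    "n_predict": 256,
}

response = requests.post("http://localhost:8080/completion", json=payload)
print(response.json()["content"])
```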
Recently - tired of the modern Mistralisms - I went from the Cydonias and the Magnum-Diamonds back to Fimbulvetr, but its context length was too short for my modern standards. So far your model gives me the same feeling of freshness while handling much more appropriate context lengths, so it feels like a perfect replacement for those of us who want something different.
Hi there!
Thanks SO much for the comprehensive testing and quants! I have no experience with making quants at all and just run BF16 myself during testing.
The temp sensitivity is something I suspected but didn't test thoroughly. Since the model was trained purely on character RP conversations (no instruct, no reasoning tasks), it doesn't have much "intelligence headroom" to begin with - it's specialized rather than general. At 0.7 it mostly stays on rails, but I can see how that higher temp range would push it past its capabilities. On top of that, Mistral-Nemo / Small actually have a surprisingly low default temp according to their model cards, but I always used 0.7 myself (though, as you noted, that does reduce creativity).
According to UGI it does have a small positivity bias, which I suppose makes sense; I didn't explicitly train it on many darker scenarios (nor did I train it on scenarios with multiple characters, only on 1-on-1 RP).
I'm really curious about how it breaks at higher temps, though. You mentioned it confuses characters, intention, and context. Is this in scenarios like:
- Multiple characters in one conversation? (The training data is purely 1-on-1)
- Lorebook integration? (Not in the dataset at all)
- Long context (50+ turns)?
- Complex plot tracking?
Basically trying to figure out if it's "model gets incoherent at high temp" or "model lacks training for this specific use case." If you're willing to share examples of where it goes off the rails, that would be super helpful for understanding the limitations and where I can improve it.
Re: the prose quality: totally fair that it's subjective. The main thing I was trying to avoid was the repetitive AI patterns ("something raw and vulnerable," constant "blue eyes + descriptor," etc.), so I'm glad that at least reads as different from typical Mistral finetunes! :)
Thanks again for testing it outside my comfort zone; this is exactly the kind of feedback that helps me understand what the model can and can't do.
PS: If you'd like, I can share one of my testing runs, a 10-15 turn RP between an AI model and Trouper / Prima (outside of the train / test set). Feel free to DM me on Reddit or Discord if that's easier. This model comparison RP was the baseline I used to compare prose, i.e.:
Here's the same emotional beat from both models (Gwenpool character, vulnerable moment):
Prima-24B:
*Gwenpool's expression softens slightly when you understand her, and there's
something raw and vulnerable in her eyes for just a moment before she quickly
recovers.*
"That's sweet of you to say, but it's complicated. See, I'm not actually a
mutant- I'm from a different reality where I read comics about Marvel and stuff."
*She looks down, fidgeting with her hands.*
Trouper (12b):
*Gwen's eyes get a little misty at your words, and she lets out a quiet laugh.*
"Oh my god, you're making me emotional and I don't even cry. Or I haven't cried
since I got here, I don't know if that's a thing in this universe."
*She looks at you with genuine warmth.*
Notice how Prima uses "something raw and vulnerable" (telling you the emotion) while Trouper uses "eyes get a little misty" (showing you the emotion). Prima's speech is more theatrical and formal, while Trouper's sounds like actual conversation ("Oh my god, you're making me emotional").
I picked Gwenpool due to her being a niche character who likely doesn't have much coverage in the base model.
Update: just tried out the quantized versions on my laptop / HF inference endpoints: the model is MUCH dumber than when I was testing it. I'm currently investigating if something went wrong during the final export (of the weights I uploaded here, not anything the quanting did).
Okay, so after re-exporting the weights to F16 instead of BF16, it seems to perform better. I also re-did the quant I had made originally (Q4_K_M). I'm not getting as many empty responses as I was with the previous quant, so that helps. Let me know if you observe any better behavior with this re-upload, and I'll move to do the same to Prima-24B, which presumably suffered the same fate. @DakkaWolf
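For anyone curious, the re-export itself was nothing exotic; a minimal sketch of the kind of dtype change involved (paths are placeholders, and this is just one way to do it with transformers, not necessarily exactly what I ran):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder paths; the point is the conversion from bfloat16 to float16
src = "path/to/trouper-12b-bf16"
dst = "path/to/trouper-12b-f16"

model = AutoModelForCausalLM.from_pretrained(src, torch_dtype=torch.bfloat16)
model = model.to(dtype=torch.float16)  # re-export the weights as FP16
model.save_pretrained(dst)

tokenizer = AutoTokenizer.from_pretrained(src)
tokenizer.save_pretrained(dst)
```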
I'm using the quant of this model generated by mradermacher (Q5_K_M to be exact) and I have not run into anything I would consider dumb. Sure, you get something kinda iffy or lacking from time to time, especially if using somewhat liberal sampling to promote creativity, but that's just a limitation of Nemo itself, which is only a 12B model and can't perform miracles.
I haven't been using Trouper too much yet, so this is just a first impression, but I think the strengths promised in the model readme hold up; it's a very solid tune that doesn't feel too "AI-slopped". Some of its responses surprised me in a positive way, using phrases that felt clever and unique instead of just coming across as "yet another Nemo RP tune". If there's one limitation I've noticed so far, it's that the outputs seem to lean somewhat toward the positive/nice. For example, if I already have a scenario going and switch to this model mid-RP, the character who was vicious up until now might get toned down into a tamer version. But overall I think it portrays characters pretty well.
The Trouper vs Prima thoughts are interesting. I always preferred Nemo to Mistral Small because I felt like Nemo, especially with finetuning, tends to be more capable of writing in a casual, authentic manner compared to Small which has more AI flourishes. Mistral Small seems to deal with longer context better without getting confused though, it is twice as big after all.
Hey there, thanks for checking it out! I fear some of the inference issues I faced may be down to my build of llama.cpp or my hardware; in koboldcpp the model behaves, but yesterday I was getting empty replies or even pure garbage output from llama.cpp, so I'm not entirely sure what the issue was. Either way, if there are still issues I'd love to hear about them.
Inference aside, I'm really glad that Trouper is indeed producing unique writing / a good experience; that was my goal, after all. I suspected that 24B would handle longer contexts better; as you say, it is literally twice the size and thus just has more headroom for these things. I don't have much experience with darker RP; are there any characters or samples I should look at to help improve this? Is the positivity bias incredibly egregious, or manageable?
Have you tried out Prima? I'm curious how that model performs in comparison; I haven't really heard any feedback about it so far.
Thanks again for your time and the kind words :)
Hi again!
After a few days of silence while testing, I am back with some interesting numbers. For starters, good catch on the FP16 quantization improvement! I converted the model to GGUF FP16, re-computed the importance matrix using that instead, re-quantized Q4_K_M and IQ4_XS, and tested both against the logits I got from the BF16 model (to compare the probability distributions of the quants made from FP16 with those of the quants made from BF16).
| Quant | BF16 KLD99 | FP16 KLD99 | Difference |
|---|---|---|---|
| Q4_K_M | 0.130179 | 0.132086 | 0.001907 |
| IQ4_XS | 0.171350 | 0.168009 | 0.003341 |
As you can see, for the Q4_K_M quant the BF16 version performs better, while for IQ4_XS the FP16 version is the better one. That said, in neither case should the difference be large enough to be noticeable, and that's also what I found in inference: both versions of the model face the same issues: empty replies, confusing quotation marks for asterisks, characters sometimes thinking they are the player (although this is likely caused by me not appending character names!), or the model writing its reply as a '-' Markdown list, as well as failing to handle nuance in some cases while nailing it in others.
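For anyone wanting to reproduce numbers like the ones in the table, here's a rough sketch of the underlying math for a KLD99 figure (the 99th-percentile per-token KL divergence between the reference model and a quant), assuming you've already dumped both models' logits for the same token positions. The array names and shapes are placeholders, and this is just an illustration of the metric, not the exact tooling I used:

```python
import numpy as np

def kld99(ref_logits: np.ndarray, quant_logits: np.ndarray) -> float:
    """99th-percentile per-token KL divergence D(ref || quant).

    Both inputs are placeholder arrays of shape (n_tokens, vocab_size)
    holding the raw logits each model produced at the same positions.
    """
    def log_softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

    ref_logp = log_softmax(ref_logits)
    quant_logp = log_softmax(quant_logits)

    # Per-token KL: sum_v p_ref(v) * (log p_ref(v) - log p_quant(v))
    kl_per_token = (np.exp(ref_logp) * (ref_logp - quant_logp)).sum(axis=-1)
    return float(np.percentile(kl_per_token, 99))
```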
You and I both having this experience with ~4-bit quants while bobp has no problems with mradermacher's Q5_K_M makes me think the problem isn't llama.cpp or the specific quant, but the size of the quant. Maybe this model simply doesn't deal well with lower bits-per-weight for some reason; I'm nowhere near smart enough to have an explanation for it.
Once again though - just to clarify - to me these problems are only a small bump in the road, and I can easily live with them as long as the model has rich, varied prose that escapes the "Mistralisms" of the base model, which this model does perfectly.
In conclusion...
- The negative: Sometimes it adds the odd out-of-place token, confuses characters, or fails to understand what characters are doing or what is happening around them. The model is generally "dumb".
- The neutral: The model leans toward positive, wholesome, and sweet experiences, which are not my cup of tea. Some seeds, however, do allow the model to engage in darker themes, so swiping a few times will often get me back on a dark path.
- The positive: This model excels at non-slop creative writing; it feels like a completely different model, which is extremely refreshing in the current landscape of creative-purpose open-weights LLMs. The model's small size makes it blazing fast and lets me swipe for new replies near-instantly (70+ t/s), mostly mitigating a lot of its negatives.
I still have yet to try the 24B model, which may perform much better due to its sheer size, so I will likely add some feedback on that one after a week or two and, if possible, publish some quantizations as well... although, seeing as mradermacher has already published his, that might be unnecessary.
Thank you for your models!
@DakkaWolf Hey there!
Thanks again for all your investigative work, I really appreciate it and it is exactly the kind of feedback that I was looking for; I'm glad to see that multiple people confirm that it is indeed a refreshing new model, if leaning positive.
You may be right that it is more sensitive to quanting; it's not a typical Nemo, after all. I may consider recommending Q5 as the default going forward.
That aside, I'm already working on some new ideas & experiments, and I'll be sure to include your feedback. If you have any features or specific scenario / character examples that I could use to jump off from (to strengthen darker themes), do let me know!
I may post an experiment in the coming weeks based on Trouper, which I'm hoping will increase the intelligence a bit without compromising its writing ability. I'm also very much looking forward to your review of Prima, since I haven't really gotten any for that model so far; it seems like 12B is more popular among the RP community (which I assume is due to the VRAM requirements?).
Looking forward to your review and thanks again!