Gemma 3 4b unslop experiment v3
An unslop finetune of google/gemma-3-4b-it
Next version is here: gemma-3-4b-it-unslop-GSPO
Changes from my previous test
- Temperature during training was set to 1.0 this time around, and the model is a lot less weird
- Rewards changed a little bit. I allowed a small number of sentences with 4+ commas instead of penalizing them all. This has cut down on the number of parenthetical phrases without completely eliminating them (a sketch of this reward is below the list).
- Lexical diversity scoring is a bit fancier this time. First I calculated MTLD for the 600+ books I have and looked at the mean score. It was almost exactly 100.0, so that's the baseline I aimed for. MTLD scores of 80-120 all receive full points (to avoid too much GRPO chaos), but deviations beyond that range are increasingly penalized (see the second sketch below).
- I've uploaded a UD-Q4_K_XL GGUF with settings that I grabbed from Unsloth's quant using my lil utility: quant_clone (the idea is sketched below).
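To make the comma rule concrete, here is a minimal sketch of what such a reward might look like. This is not the actual reward from train.py; the sentence splitter, the allowance of two comma-heavy sentences, and the 0.25 penalty step are all my own assumptions, shaped to fit TRL's reward-function calling convention (completions in, list of floats out).

```python
import re

def comma_reward(completions, max_heavy_sentences=2, **kwargs):
    """Hypothetical reward: tolerate a few sentences with 4+ commas
    per completion instead of penalizing every one of them."""
    rewards = []
    for text in completions:
        # Naive sentence split; good enough for a reward heuristic.
        sentences = re.split(r"[.!?]+", text)
        heavy = sum(1 for s in sentences if s.count(",") >= 4)
        if heavy <= max_heavy_sentences:
            rewards.append(1.0)  # within the allowance: full credit
        else:
            # Linear penalty per extra comma-heavy sentence (assumed step).
            rewards.append(max(0.0, 1.0 - 0.25 * (heavy - max_heavy_sentences)))
    return rewards
```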
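And here is a sketch of the MTLD-based reward. The MTLD function itself follows the standard definition (forward and backward factor counts at the usual 0.72 TTR threshold); the band logic matches the 80-120 full-credit range described above, but the exact falloff outside the band is an assumption.

```python
def mtld_one_direction(tokens, threshold=0.72):
    """One-directional MTLD: count 'factors', i.e. stretches of text
    whose type-token ratio stays above the threshold."""
    factors, types, count = 0.0, set(), 0
    for tok in tokens:
        count += 1
        types.add(tok)
        if len(types) / count <= threshold:
            factors += 1.0
            types, count = set(), 0
    if count > 0:  # partial factor for the leftover stretch
        ttr = len(types) / count
        factors += (1.0 - ttr) / (1.0 - threshold)
    return len(tokens) / factors if factors > 0 else float(len(tokens))

def mtld(tokens):
    # Standard MTLD averages a forward and a backward pass.
    return (mtld_one_direction(tokens) + mtld_one_direction(tokens[::-1])) / 2.0

def mtld_reward(completions, low=80.0, high=120.0, baseline=100.0, **kwargs):
    """Hypothetical reward shaping: full credit inside the 80-120 band,
    decaying credit as MTLD drifts further outside it."""
    rewards = []
    for text in completions:
        score = mtld(text.lower().split())
        if low <= score <= high:
            rewards.append(1.0)
        else:
            dist = (low - score) if score < low else (score - high)
            rewards.append(max(0.0, 1.0 - dist / baseline))
    return rewards
```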
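For the GGUF bullet, the core idea behind a tool like quant_clone is presumably to read the per-tensor quantization layout out of an existing quant so it can be reproduced. The snippet below only shows that inspection step using the `gguf` Python package from the llama.cpp project; how quant_clone actually applies the layout is not shown here and may differ.

```python
from gguf import GGUFReader  # pip install gguf

def read_tensor_quant_types(gguf_path):
    """Read the per-tensor quantization types from an existing GGUF,
    e.g. Unsloth's UD-Q4_K_XL quant, so the same layout could be
    reproduced when quantizing another model."""
    reader = GGUFReader(gguf_path)
    return {t.name: t.tensor_type.name for t in reader.tensors}

# Example output shape (values illustrative):
# {'blk.0.attn_q.weight': 'Q4_K', 'blk.0.ffn_down.weight': 'Q6_K', ...}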
Training technique:
Basically the same as last time plus the minor changes above.
Training code: train.py
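For readers who want the overall shape before opening train.py, here is a minimal sketch of a GRPO setup with custom rewards, assuming TRL's GRPOTrainer; the dataset is a placeholder from the TRL docs, and the generation counts and lengths are assumptions, not the values used here.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Placeholder prompt dataset; the real data pipeline lives in train.py.
dataset = load_dataset("trl-lib/tldr", split="train")

config = GRPOConfig(
    output_dir="gemma-3-4b-it-unslop",
    temperature=1.0,          # rollout sampling temperature (see the change list above)
    num_generations=8,        # completions per prompt for the group-relative baseline
    max_completion_length=512,
)

trainer = GRPOTrainer(
    model="google/gemma-3-4b-it",
    reward_funcs=[comma_reward, mtld_reward],  # the sketches above
    args=config,
    train_dataset=dataset,
)
trainer.train()
```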