Novel training procedure to deslopify instruct/assistant models.

No supervised fine-tuning (SFT).

Pure reinforcement learning (RL) with a good reward signal.