Novel training procedure to deslopify instruct/assistant models. No SFT. Pure RL with a good signal.