pere
GRPO model (assistant split heuristic reward)
78decbe verified