pere
GRPO model (assistant split heuristic reward)
d12c18a verified