Is the last figure caption correct?
"The above diagram makes it look like both models generate different responses for the same prompt, but what really happens is that the RL policy generates text, and that text is fed into the initial model to produce its relative probabilities for the KL penalty."
To me it makes more sense to feed the same prompt to both the reference model and the policy model, and then compare the reference model's probability distribution with the policy's.
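To make the two readings concrete, here is a rough Python sketch of how I understand them. Everything in it is an assumption for illustration: `toy_distribution`, `policy`, and `reference` are hypothetical stand-ins for real models over a three-token vocabulary, not any library's API. In both readings the two models score the same prefix (the prompt plus tokens sampled from the policy). The `sampled` list uses only the log-probability of the token the policy actually chose, while `full` compares the whole next-token distributions.

```python
import math
import random

VOCAB_SIZE = 3  # hypothetical tiny vocabulary; tokens are the ints 0, 1, 2


def toy_distribution(prefix, temperature):
    """Hypothetical stand-in for a language model's next-token distribution."""
    logits = [((len(prefix) + i) % 3) / temperature for i in range(VOCAB_SIZE)]
    z = sum(math.exp(l) for l in logits)
    return [math.exp(l) / z for l in logits]


def policy(prefix):
    # Assumed RL policy: a sharper version of the reference model.
    return toy_distribution(prefix, temperature=0.5)


def reference(prefix):
    # Assumed frozen initial (reference) model.
    return toy_distribution(prefix, temperature=1.0)


def sample_from_policy(prompt, n_tokens, rng):
    """Only the policy generates text; the reference never samples."""
    tokens = list(prompt)
    for _ in range(n_tokens):
        probs = policy(tokens)
        tokens.append(rng.choices(range(VOCAB_SIZE), weights=probs)[0])
    return tokens


def kl_terms(prompt, n_tokens, rng):
    """Return per-token (sampled estimate, full-distribution KL) lists."""
    tokens = sample_from_policy(prompt, n_tokens, rng)
    sampled, full = [], []
    for t in range(len(prompt), len(tokens)):
        prefix = tokens[:t]      # same prefix is fed to both models
        p = policy(prefix)
        q = reference(prefix)
        y = tokens[t]            # token the policy actually generated
        # Caption's reading: score the policy's token under both models.
        sampled.append(math.log(p[y]) - math.log(q[y]))
        # Distribution reading: KL(policy || reference) over the vocabulary.
        full.append(sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)))
    return sampled, full


if __name__ == "__main__":
    sampled, full = kl_terms(prompt=[0, 1], n_tokens=5, rng=random.Random(0))
    print("sampled-token terms:", sampled)
    print("full-distribution KL:", full)
```

If this sketch is right, the two readings would differ only in whether the per-token term uses the sampled token or the full distribution, but I may be missing something.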
Any insight would be appreciated.
Best,
Chiron