[marin-community/marin#1726] Fix KL loss in RL training.
A few fixes for RL training.
- We were computing tokens incorrectly: we re-derived them by re-encoding the decoded text instead of using the token IDs attached to the logprobs. The two can diverge when the output contains special tokens (see the first sketch after this list).
- Our KL loss was computing the KL divergence but not applying it as a penalty: minimizing the loss encouraged the model to diverge from the reference instead of staying close to it (see the second sketch after this list).
- We were using the "old" mesh syntax in a number of locations (see the final sketch after this list).
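
A minimal sketch of the token bug, using a Hugging Face tokenizer for illustration; the tokenizer and token IDs below are hypothetical stand-ins, not marin's actual inference path:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Token IDs as they arrive alongside the sampler's logprobs (hypothetical
# values; 50256 is GPT-2's <|endoftext|> special token).
logprob_token_ids = [50256, 15496, 995]

# Buggy path: decode to text, then re-encode. The special token is dropped
# during decoding, so the re-encoded IDs no longer line up with the logprobs.
text = tokenizer.decode(logprob_token_ids, skip_special_tokens=True)
reencoded = tokenizer.encode(text)  # [15496, 995] -- misaligned

# Fixed path: take the token IDs straight from the logprob records.
token_ids = list(logprob_token_ids)

print(reencoded == logprob_token_ids)  # False: the round trip diverged
print(token_ids == logprob_token_ids)  # True
```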
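And a sketch of the KL fix; the function, argument names, and coefficient are illustrative, not the actual loss code from this PR:

```python
import jax.numpy as jnp

def kl_penalty_term(policy_logprobs, ref_logprobs, kl_coef=0.1):
    """KL term to *add* to the RL loss (illustrative sketch)."""
    # Per-token estimator of KL(policy || reference) on the sampled tokens.
    kl = policy_logprobs - ref_logprobs

    # Before: the term entered the loss with the wrong orientation, so
    # minimizing the loss pushed the policy *away* from the reference.
    # After: adding this term to the loss penalizes divergence, as intended.
    return kl_coef * jnp.mean(kl)
```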
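Finally, the mesh change, assuming the "old" syntax refers to hand-building `jax.sharding.Mesh` from a device array and the newer one to `jax.make_mesh`; the exact call sites and axis names in the PR are not shown here:

```python
import numpy as np
import jax
from jax.sharding import Mesh

# Old style: arrange the device list by hand and wrap it in Mesh.
devices = np.array(jax.devices())
old_mesh = Mesh(devices, axis_names=("data",))

# Newer style: let JAX build the mesh from axis sizes and names.
new_mesh = jax.make_mesh((jax.device_count(),), ("data",))
```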
