
How to interpret the performance drop compared with the original llama-3.1-8b?

#1
by Bobi1234 - opened

Hi! Thank you for releasing the model checkpoint and evaluation results. I noticed that the average performance of this model is worse than that of llama-3.1-8b-instruct. Does this mean the Tulu training process hurt llama-3.1? Should I always prefer the original llama-3.1-8b-instruct over this model?

Ai2 org

Hi!
You're correct that this model underperforms Llama 3.1 Instruct. We have a newer release, Tulu 3, which outperforms Llama 3.1 Instruct on a number of benchmarks, so you could use that instead.

More generally, if you are doing research, or want to use models where knowing the post-training recipe matters, the Tulu models may be preferable: we release all of the data, code, and hyperparameters used to train them, making them fully transparent (here for Tulu 2 and here for Tulu 3). If you are only interested in using the most performant model, then Tulu 3 or Llama 3.1 Instruct might be best for you (or newer models such as Llama 4 or Qwen 2.5).

hamishivi changed discussion status to closed
