view reply Thanks for sharing!!!I might have spotted a minor mistake, should "To ensure full coverage of all domains in the non-thinking dataset" really be "To ensure full coverage of all domains in the thinking dataset"?
Running 2.87k 2.87k The Ultra-Scale Playbook 🌌 The ultimate guide to training LLM on large GPU Clusters