Hello, may I ask approximately how many tokens were used for model training?
May I ask approximately how much data was used for model training, and which datasets do you recommend using?
Hi, since it has been quite a while, I don't quite remember the specific datasets and number of tokens used for training. The token count should be <200B. As for the datasets, I vaguely recall using more than ten different Chinese and English datasets, covering general, mathematical, and code-related corpora. The ones listed in the README are the ones I mainly used.
Thank you so much! I'm curious, when you were training, did you use something like curriculum learning to adjust the dataset mix ratio? I'm really excited to give your project a try. Could you please share what your final pre-training loss value was?
I actually didn't use curriculum learning here. To be honest, I've always found it a bit of a hassle to get right in practice. From what I've seen in my other work, you can get a surprisingly solid baseline just by carefully balancing the mix of general and specialized data.
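To make "balancing the mix" concrete, here is a minimal sketch of fixed-ratio sampling across corpora. The corpus names and weights are hypothetical placeholders, not the ratios actually used in this project.

```python
import random

# Hypothetical mixture weights -- placeholders, not the ratios used here.
MIXTURE = {
    "general_zh": 0.35,
    "general_en": 0.35,
    "math": 0.15,
    "code": 0.15,
}

def sample_source(rng: random.Random) -> str:
    """Pick which corpus the next training document comes from,
    according to the fixed mixture weights."""
    names = list(MIXTURE)
    weights = [MIXTURE[n] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

rng = random.Random(0)
counts = {name: 0 for name in MIXTURE}
for _ in range(10_000):
    counts[sample_source(rng)] += 1
print(counts)  # counts come out roughly proportional to the mixture weights
```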
One interesting idea you could play with, though, is to feed in instruction data or other high-quality corpora during the decay phase of the WSD (warmup-stable-decay) schedule. This has proven to be an effective strategy.
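For reference, a minimal sketch of a WSD learning-rate schedule, with a helper flag marking the decay phase (where you would switch to a mix that includes the instruction data). The step counts and learning rates are made-up values, not the project's actual settings.

```python
# Minimal WSD (warmup-stable-decay) schedule sketch.
# Step counts and learning rates below are illustrative, not the values used here.
WARMUP_STEPS = 2_000
STABLE_STEPS = 80_000
DECAY_STEPS = 18_000
PEAK_LR = 3e-4
MIN_LR = 3e-5

def wsd_lr(step: int) -> float:
    """Learning rate at a given step: linear warmup, constant plateau, linear decay."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    if step < WARMUP_STEPS + STABLE_STEPS:
        return PEAK_LR
    decay_progress = min(1.0, (step - WARMUP_STEPS - STABLE_STEPS) / DECAY_STEPS)
    return PEAK_LR - (PEAK_LR - MIN_LR) * decay_progress

def in_decay_phase(step: int) -> bool:
    """True once the decay phase starts -- the point where instruction /
    other high-quality data could be mixed into the training stream."""
    return step >= WARMUP_STEPS + STABLE_STEPS
```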
Regarding the final loss value, apologies, but I can't seem to find those logs anymore; it's been a while. Maybe you could try calculating the model's perplexity (ppl) on the datasets I listed?
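In case it helps, here is a rough sketch of how you might compute perplexity on a held-out text file with a Hugging Face causal LM. The model name and file path are placeholders; it chunks the text rather than using a sliding window, so it slightly overestimates the true perplexity.

```python
import math
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholders -- substitute the actual checkpoint and evaluation text.
MODEL_NAME = "your-org/your-model"
TEXT_FILE = "eval.txt"
MAX_LEN = 1024

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

text = open(TEXT_FILE, encoding="utf-8").read()
ids = tokenizer(text, return_tensors="pt").input_ids[0]

total_nll, total_tokens = 0.0, 0
with torch.no_grad():
    # Evaluate in non-overlapping chunks; a sliding window would be more precise.
    for start in range(0, ids.size(0) - 1, MAX_LEN):
        chunk = ids[start : start + MAX_LEN + 1].unsqueeze(0)
        if chunk.size(1) < 2:
            break
        out = model(chunk, labels=chunk)
        n = chunk.size(1) - 1  # number of predicted tokens in this chunk
        total_nll += out.loss.item() * n
        total_tokens += n

ppl = math.exp(total_nll / total_tokens)
print(f"perplexity: {ppl:.2f}")
```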
Good luck, and have fun with it!
Thanks so much! I need to do a deeper dive into this.