Training Tokens
Assuming 1.5 tera tokens = 1.5 trillion tokens, as mentioned in the blog post, these models are aligned with the scaling law (num parameters × num tokens). I've tested them alongside many other models on a private benchmark and they roughly follow the line of best fit, which indicates there is no downside to training BitNet models: they scale just as well as their fp16 counterparts, apart perhaps from the lack of adoption.
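For illustration, here is a minimal sketch of the kind of line-of-best-fit check described above: plotting benchmark score against log(parameters × training tokens) and fitting a line. The model names, scores, and token counts are made-up placeholders, not results from the private benchmark.

```python
# Hypothetical sketch: score vs. log(N * D) with an ordinary least-squares fit.
# All numbers below are illustrative placeholders, not real benchmark results.
import numpy as np

# (name, params in billions, training tokens in trillions, benchmark score)
models = [
    ("model-0.7B", 0.7, 1.5, 38.0),
    ("model-2B",   2.0, 1.5, 47.0),
    ("model-4B",   4.0, 4.0, 58.0),
]

# x = log10(num parameters * num training tokens), y = benchmark score
x = np.log10([p * 1e9 * t * 1e12 for _, p, t, _ in models])
y = np.array([s for *_, s in models])

slope, intercept = np.polyfit(x, y, 1)  # line of best fit

for (name, _, _, score), xi in zip(models, x):
    print(f"{name}: measured={score:.1f}, fit={slope * xi + intercept:.1f}")
```

A model sitting roughly on that line is what "following the line of best fit" means here; points well below it would suggest a penalty from the 1.58-bit training.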
I'm curious to hear your experience/thoughts on scaling training up to the standards of Qwen 3 or even the new Falcon hybrid models, e.g. the same 10T-token run on a 1B or 3B BitNet model. I think it would scale well.
Although these are very good models for their size and the number of tokens they've been trained on, there's practically no reason to use them over the Qwen 3 4B equivalent, because it was trained on a humongous 30T tokens. It outclasses everything in its size bracket due to how many tokens it's seen.
Thank you very much for this take @srinivasbilla - is it possible to share more details about the private benchmark you ran? I'm mainly curious how it performs compared to other existing models (of course no need to share every detail of the benchmark, mainly the results). On our side, we also think it would be interesting to scale up the number of training tokens (the Microsoft team trained up to 4T tokens with no major issues, so it should scale well).