Training Tokens
Assuming 1.5 tera tokens = 1.5 trillion tokens, as mentioned in the blog post, these models are aligned with the scaling law (num parameters × num tokens). I've tested them alongside many other models on a private benchmark and they roughly follow the line of best fit, which indicates there is no downside to training BitNet models: they scale just as well as their fp16 counterparts, apart perhaps from the lack of adoption.
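For illustration, here is a minimal sketch of the kind of line-of-best-fit check described above: plotting benchmark score against log(parameters × training tokens) and fitting a line. The model names, scores, and token counts are made-up placeholders, not results from the private benchmark.

```python
# Hypothetical sketch: score vs. log(N * D) with an ordinary least-squares fit.
# All numbers below are illustrative placeholders, not real benchmark results.
import numpy as np

# (name, params in billions, training tokens in trillions, benchmark score)
models = [
    ("model-0.7B", 0.7, 1.5, 38.0),
    ("model-2B",   2.0, 1.5, 47.0),
    ("model-4B",   4.0, 4.0, 58.0),
]

# x = log10(num parameters * num training tokens), y = benchmark score
x = np.log10([p * 1e9 * t * 1e12 for _, p, t, _ in models])
y = np.array([s for *_, s in models])

slope, intercept = np.polyfit(x, y, 1)  # line of best fit

for (name, _, _, score), xi in zip(models, x):
    print(f"{name}: measured={score:.1f}, fit={slope * xi + intercept:.1f}")
```

A model sitting roughly on that line is what "following the line of best fit" means here; points well below it would suggest a penalty from the 1.58-bit training.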
I'm curious to hear your experience/thoughts on scaling training up to the standards of Qwen 3 or even the new Falcon hybrid models, e.g. the same 10T-token run on a 1B or 3B BitNet model. I think it would scale well.
Although these are very good models for their size and the number of tokens they've been trained on, there's practically no reason to use them over the Qwen 3 4B equivalent, because it was trained on a humongous 30T tokens. It outclasses everything in its size bracket due to how many tokens it's seen.
Thank you very much for this take @srinivasbilla - is it possible to share more details about the private benchmark you ran? I'm mainly curious how it performs compared to other existing models (of course no need to share every detail of the benchmark, mainly the results). On our side, we also think it would be interesting to scale up the number of training tokens (the Microsoft team trained up to 4T tokens with no major issues, so it should scale well).