Design choice for Q&A token length

#1 opened by lewtun (HuggingFaceM4 org)

Hi guys, you note in the blog post that:

We removed all individual turns whose combined question and answer length exceeds 8192 tokens

I was curious why you picked 8192 tokens as the threshold. Is it because the vast majority of samples fall within this window, or was there some other reason?

HuggingFaceM4 org

Hi, the main reason was that we wanted to clean out some outlier samples of ridiculous length; for example, we found one random turn with a length of 1M tokens. We decided on 8k because it was the original context length of our language backbone, and after analysing the sample lengths it turned out that the vast majority of the data (probably >99.9%) was under this threshold as well.
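For anyone wanting to reproduce this kind of filter, here's a minimal sketch of the idea, assuming a `transformers` tokenizer and a simple list of question/answer dicts. The checkpoint name and field names are placeholders, not the actual pipeline:

```python
from transformers import AutoTokenizer

MAX_TOKENS = 8192  # original context length of the language backbone

# Placeholder checkpoint: the thread doesn't name the backbone, so substitute your own.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def within_window(question: str, answer: str) -> bool:
    """True if the combined question + answer token count fits in the window."""
    n_tokens = len(tokenizer(question + " " + answer)["input_ids"])
    return n_tokens <= MAX_TOKENS

# Drop individual turns over the threshold (catches outliers like the ~1M-token turn).
turns = [{"question": "What is shown in the image?", "answer": "A cat on a sofa."}]
kept = [turn for turn in turns if within_window(turn["question"], turn["answer"])]
```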

HuggingFaceM4 org

Makes sense, thanks!
