Design choice for Q&A token length
#1 by lewtun - opened
Hi guys, you note in the blog post that:
We removed all individual turns whose combined question and answer length exceeds 8192 tokens
I was curious why you picked 8192 tokens as the threshold. Is it because the vast majority of samples fall within this window, or was there some other reason?
Hi, the main reason was that we wanted to remove outlier samples of extreme length; for example, we found one turn that had a length of 1M. We settled on 8k because it was the original context length of our language backbone, and after analysing the sample lengths it turned out that the vast majority of the data (probably >99.9%) was under this threshold anyway.
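For anyone who wants to try a similar filter on their own data, here is a rough sketch of the idea, not the actual pipeline code. The field names `question` and `answer`, the helper names, and the whitespace-based token counter are all assumptions made for illustration; in practice you would pass in a counter based on the tokenizer of your own language backbone.

```python
from typing import Callable, Dict, List

# Threshold quoted from the blog post: combined question + answer length.
MAX_TURN_TOKENS = 8192

Turn = Dict[str, str]


def whitespace_token_count(text: str) -> int:
    """Stand-in token counter (assumption): counts whitespace-separated words.

    Replace with a real tokenizer-based counter for meaningful numbers.
    """
    return len(text.split())


def filter_long_turns(
    turns: List[Turn],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TURN_TOKENS,
) -> List[Turn]:
    """Keep only turns whose combined question and answer length
    does not exceed max_tokens."""
    return [
        turn
        for turn in turns
        if count_tokens(turn["question"]) + count_tokens(turn["answer"]) <= max_tokens
    ]


def fraction_within(
    turns: List[Turn],
    count_tokens: Callable[[str], int] = whitespace_token_count,
    max_tokens: int = MAX_TURN_TOKENS,
) -> float:
    """Fraction of turns that survive the filter (0.0 for an empty list)."""
    if not turns:
        return 0.0
    kept = filter_long_turns(turns, count_tokens, max_tokens)
    return len(kept) / len(turns)
```

Running `fraction_within` before filtering is a quick way to check how much data a given threshold would discard on your own dataset.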
Makes sense, thanks!