Design choice for Q&A token length

#1 opened by lewtun (HuggingFaceM4 org)

Hi guys, you note in the blog post that:

We removed all individual turns whose combined question and answer length exceeds 8192 tokens

I was curious why you picked 8192 tokens as the threshold. Is it because the vast majority of samples fall within this window, or was there some other reason?

HuggingFaceM4 org

Hi, the main reason was that we wanted to clean out some outlier samples of ridiculous length; for example, we found one random turn with a length of 1M tokens. We decided on 8k because it was the original context length of our language backbone, and after analysing the sample lengths it turned out that the vast majority of the data (probably >99.9%) was under this threshold as well.
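For anyone wanting to reproduce this kind of filter, here's a minimal sketch of the idea, assuming a `transformers` tokenizer and a simple list of question/answer dicts. The checkpoint name and field names are placeholders, not the actual pipeline:

```python
from transformers import AutoTokenizer

MAX_TOKENS = 8192  # original context length of the language backbone

# Placeholder checkpoint: the thread doesn't name the backbone, so substitute your own.
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

def within_window(question: str, answer: str) -> bool:
    """True if the combined question + answer token count fits in the window."""
    n_tokens = len(tokenizer(question + " " + answer)["input_ids"])
    return n_tokens <= MAX_TOKENS

# Drop individual turns over the threshold (catches outliers like the ~1M-token turn).
turns = [{"question": "What is shown in the image?", "answer": "A cat on a sofa."}]
kept = [turn for turn in turns if within_window(turn["question"], turn["answer"])]
```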

HuggingFaceM4 org

Makes sense, thanks!
