Make compatible with sentence-transformers
@Shitao Can you merge this? It makes the model compatible with last-token pooling in sentence-transformers.
@michaelfeil thank you for making these changes, I believe they will bring benefits for model serving.
One question regarding `max_seq_length`: though not officially mentioned, the bge-en-icl model's context window seems to be 32,768 according to the MTEB leaderboard. Is there a reason you set `max_seq_length` to 4,096 in this change?
@starsy Fair point. 4096 would only be the default max length for sentence-transformers loading.
Loading with 32768 will lead to an OOM for some users, yet 32768 is technically correct.
Beyond that, the huggingface-transformers implementation uses a sliding window of 4096. Past 4096 tokens you need the flash-attn CUDA extension installed to get correct output; otherwise the output is silently incorrect, since torch.sdpa does not support window_size=4096 causal forward attention.
Leaving it at 32768 for now! @Shitao appreciate your review.
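For users hitting OOM at the 32768 default, a minimal sketch of capping the length after loading (the helper name and the `4096` default are illustrative, taken from the sliding-window discussion above; the checkpoint name is the public BAAI model):

```python
def load_bge_en_icl(max_seq_length: int = 4096):
    """Load BAAI/bge-en-icl with an explicit token cap.

    The full 32768-token window needs flash-attn and plenty of GPU memory;
    4096 matches the sliding window and is a safer default on most hardware.
    This helper is a sketch, not part of the model repo.
    """
    # Deferred import so the sketch stays cheap to import without the package.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("BAAI/bge-en-icl")
    model.max_seq_length = max_seq_length  # override the loaded default
    return model
```

`max_seq_length` is a plain attribute on `SentenceTransformer`, so overriding it after loading is the standard way to trade context length for memory.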
@Shitao Can you please review?
Thanks for your contribution! @michaelfeil