Finetuning snowflake-arctic-embed-l-v2.0 for a specific domain
What is the recommended process for fine-tuning the Snowflake-Arctic-Embed-L-v2.0 model in Spanish, specifically for the legal domain? What tools, legal datasets in Spanish, and training configurations are required to adapt the embeddings effectively for this domain?
Hi, I am also interested in knowing some fine-tuning strategy. I plan to use it for retrieval on a specific domain. I have some dataset with context and question pair. not sure if any proper fine-tuning strategy/tutorial available for this model.
Hello!
I'm not from Snowflake, but I do have experience finetuning embedding models. This model is fully integrated with Sentence Transformers, which can be used to further finetune embedding models. I would recommend having a read through the Training Overview documentation page. At the end is a full training script which can easily be adapted to use Snowflake/snowflake-arctic-embed-l-v2.0 instead of microsoft/mpnet-base (the model in the example).
Beyond that, this model uses prompts; see the config. In Sentence Transformers, you can train with such prompts by setting the prompts argument in the SentenceTransformerTrainingArguments. See Training with Prompts for more details.
P.S. You can also just ignore the prompts; with finetuning you can teach the model to not rely on them, and it will likely only result in a roughly 1% worse model.
Disclaimer: I maintain Sentence Transformers and try to help users of Sentence Transformer models, including this one.
- Tom Aarsen
Hi Tom,
thank you for clarifying, good to know. I have already used Sentence Transformers to fine-tune a BGE embedder, so I would guess it's more or less the same process. I just wasn't sure architecture-wise. I will check the links provided here.