Finetuning snowflake-arctic-embed-l-v2.0 for a specific domain
What is the recommended process for fine-tuning the Snowflake-Arctic-Embed-L-v2.0 model in Spanish, specifically for the legal domain? What tools, legal datasets in Spanish, and training configurations are required to adapt the embeddings effectively for this domain?
Hi, I am also interested in knowing some fine-tuning strategy. I plan to use it for retrieval on a specific domain. I have some dataset with context and question pair. not sure if any proper fine-tuning strategy/tutorial available for this model.
Hello!
I'm not from Snowflake, but I do have experience finetuning embedding models. This model is fully integrated with Sentence Transformers, which can be used to further finetune embedding models. I would recommend having a read through the Training Overview documentation page. At the end is a full training script which can easily be adapted to use Snowflake/snowflake-arctic-embed-l-v2.0 instead of microsoft/mpnet-base (the model in the example).
Beyond that, this model uses prompts; see the config. In Sentence Transformers, you can train with such prompts by setting the prompts argument in the SentenceTransformerTrainingArguments. See Training with Prompts for more details.
P.S. You can also just ignore the prompts; with finetuning you can teach the model to not rely on them, and it will likely only result in a roughly 1% worse model.
Disclaimer: I maintain Sentence Transformers and try to help users of Sentence Transformer models, including this one.
- Tom Aarsen
Hi Tom,
thank you for clarifying, good to know. I have already used Sentence Transformers to fine-tune a BGE embedder, so I would guess it's more or less the same process. I just wasn't sure architecture-wise. I will check the links provided here.