Supposed to be same or better than v1?

#44
by persijano - opened

Hi there,

Just wanted to highlight that one might want to experiment with v1 and v1.5 and compare them side-by-side before making a decision.

On my information retrieval tasks & eval data, nomic-embed-text-v1 shows MUCH better performance than v1.5.

  • nomic-v1: MRR@10 = 0.41
  • nomic-v1.5: MRR@10 = 0.27

Other metrics reported by InformationRetrievalEvaluator show a similar gap; however, MRR@10 is my primary one.
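For reference, MRR@10 averages the reciprocal rank of the first relevant document found within each query's top-10 results. A minimal sketch (illustrative names, not the evaluator's internals):

```python
def mrr_at_10(ranked_ids_per_query, relevant_ids_per_query):
    """Mean Reciprocal Rank at cutoff 10.

    ranked_ids_per_query: list of ranked doc-id lists (best first), one per query.
    relevant_ids_per_query: list of sets of relevant doc ids, one per query.
    """
    total = 0.0
    for ranked, relevant in zip(ranked_ids_per_query, relevant_ids_per_query):
        for rank, doc_id in enumerate(ranked[:10], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_ids_per_query)

# Tiny example: first query's relevant doc is at rank 2, second's at rank 1:
# (1/2 + 1/1) / 2 = 0.75
print(mrr_at_10([["d3", "d1"], ["d7"]], [{"d1"}, {"d7"}]))
```

So a drop from 0.41 to 0.27 roughly means relevant documents are surfacing noticeably deeper in the ranking, on average.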

I'm not sure what the reason might be, but I figured it's worth sharing with the community!

P.S. My documents are rather long (Markdown webpages), which is why I turned to nomic in the first place.

Hm that's interesting! Would you mind sharing how you eval'd the model and what the data looks like?

Sure, I'm using sentence-transformers to finetune / evaluate.

Here's a snippet:

```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator

evaluator = InformationRetrievalEvaluator(
    hard_queries,
    hard_corpus,
    hard_relevant_docs,
    corpus_chunk_size=BATCH_SIZE,
    batch_size=BATCH_SIZE,
    show_progress_bar=True,
    query_prompt='search_query: ',
    corpus_prompt='search_document: ',
)
```

I'm working with search queries like "b2b marketing automation platforms" and documents scraped from company websites, formatted as Markdown (median sequence length ~1800 tokens).
