Semantic similarity

#29
by ZijieAsus

I am trying to use this model for multilingual semantic search.

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer('intfloat/multilingual-e5-base')
prefix = "query: "
en_emb = model.encode(prefix + "how do i change my google profile photo?", normalize_embeddings=True)
# Chinese: "How do I change my Google profile photo?"
zh_emb = model.encode(prefix + "我如何更改我的Google個人照片?", normalize_embeddings=True)

print(cos_sim(en_emb, zh_emb))  # 0.9223

# With single-word inputs, the gap is more pronounced.
en_emb = model.encode(prefix + "Apple", normalize_embeddings=True)
jp_emb = model.encode(prefix + "リンゴ", normalize_embeddings=True)  # Japanese for "apple"

print(cos_sim(en_emb, jp_emb))  # 0.7541
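
For reference, continuing the snippet above, an unrelated cross-lingual pair can serve as a rough baseline for what these scores mean (a minimal sketch; the Chinese sentence is my own made-up example, and I haven't pinned down an exact expected value):

# Unrelated baseline: "The weather is nice today" vs. "Apple"
unrelated_emb = model.encode(prefix + "今天天气很好", normalize_embeddings=True)
print(cos_sim(en_emb, unrelated_emb))  # expected to land noticeably below the related-pair scores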

In the first case, I expected the cosine similarity to be very close to 1.0 (for example, 0.99 or 0.98), but the result was 0.9223. Is this within expectations, or is there a reason for the gap?
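
For context, here is roughly how I plan to use the scores for multilingual semantic search, where the relative ranking matters more to me than the absolute values (a minimal sketch; the passages are made up, and I'm assuming the "passage: " prefix for documents as described in the model card):

query_emb = model.encode("query: how do i change my google profile photo?", normalize_embeddings=True)
passages = [
    "passage: Open your Google Account settings and click your profile picture to upload a new photo.",
    "passage: The capital of France is Paris.",
]
passage_embs = model.encode(passages, normalize_embeddings=True)
print(cos_sim(query_emb, passage_embs))  # the relevant passage should rank first regardless of the absolute scores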

Thanks!
