fix/refactor-readme
#2 by saahil-ognawala - opened

Files changed:
- README.md (+20 -79)
- config.json (+0 -1)
- onnx/model_fp16.onnx → model-w-mean-pooling.onnx (+2 -2)
- onnx/model.onnx → model.onnx (+2 -2)
- modules.json (+0 -6)
- onnx/model_quantized.onnx (+0 -3)
README.md
CHANGED

@@ -3,10 +3,6 @@ tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
-- mteb
-- transformers
-- transformers.js
-inference: false
 license: apache-2.0
 language:
 - en
@@ -1068,7 +1064,7 @@ model-index:
 <br><br>
 
 <p align="center">
-<img src="https://huggingface.co/
+<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
 </p>
 
 
@@ -1076,9 +1072,6 @@ model-index:
 <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
 </p>
 
-## Quick Start
-
-The easiest way to starting using `jina-embeddings-v2-base-zh` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).
 
 ## Intended Usage & Model Info
 
@@ -1094,15 +1087,13 @@ Additionally, we provide the following embedding models:
 
 - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
 - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
-- [`jina-embeddings-v2-base-zh`](
-- [`jina-embeddings-v2-base-de`](
+- [`jina-embeddings-v2-base-zh`](): Chinese-English Bilingual embeddings (soon) **(you are here)**.
+- [`jina-embeddings-v2-base-de`](): German-English Bilingual embeddings (soon).
 - [`jina-embeddings-v2-base-es`](): Spanish-English Bilingual embeddings (soon).
-- [`jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code): 161 million parameters code embeddings.
 
 ## Data & Parameters
 
-
-
+Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923)
 
 ## Usage
 
@@ -1130,7 +1121,7 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['How is the weather today?', '今天天气怎么样?']
 
 tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
@@ -1144,16 +1135,14 @@ embeddings = F.normalize(embeddings, p=2, dim=1)
 </p>
 </details>
 
-You can use Jina Embedding models directly from transformers package
-
+You can use Jina Embedding models directly from the transformers package:
 ```python
 !pip install transformers
-import torch
 from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
 embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
@@ -1167,45 +1156,9 @@ embeddings = model.encode(
 )
 ```
 
-
-
-```python
-!pip install -U sentence-transformers
-from sentence_transformers import SentenceTransformer
-from numpy.linalg import norm
-
-cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
-embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
-print(cos_sim(embeddings[0], embeddings[1]))
-```
-
-Using the its latest release (v2.3.0) sentence-transformers also supports Jina embeddings (Please make sure that you are logged into huggingface as well):
+## Fully-managed Embeddings Service
 
-
-!pip install -U sentence-transformers
-from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
-
-model = SentenceTransformer(
-    "jinaai/jina-embeddings-v2-base-zh", # switch to en/zh for English or Chinese
-    trust_remote_code=True
-)
-
-# control your input sequence length up to 8192
-model.max_seq_length = 1024
-
-embeddings = model.encode([
-    'How is the weather today?',
-    '今天天气怎么样?'
-])
-print(cos_sim(embeddings[0], embeddings[1]))
-```
-
-## Alternatives to Using Transformers Package
-
-1. _Managed SaaS_: Get started with a free key on Jina AI's [Embedding API](https://jina.ai/embeddings/).
-2. _Private and high-performance deployment_: Get started by picking from our suite of models and deploy them on [AWS Sagemaker](https://aws.amazon.com/marketplace/seller-profile?id=seller-stch2ludm6vgy).
+Alternatively, you can use Jina AI's [Embedding platform](https://jina.ai/embeddings/) for fully-managed access to Jina Embeddings models.
 
 ## Use Jina Embeddings for RAG
 
@@ -1215,26 +1168,12 @@ According to the latest blog post from [LLamaIndex](https://blog.llamaindex.ai/b
 
 <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZP2RVejCZovF3FDCg-Bx3A.png" width="780px">
 
-## Trouble Shooting
-
-**Loading of Model Code failed**
-
-If you forgot to pass the `trust_remote_code=True` flag when calling `AutoModel.from_pretrained` or initializing the model via the `SentenceTransformer` class, you will receive an error that the model weights could not be initialized.
-This is caused by tranformers falling back to creating a default BERT model, instead of a jina-embedding model:
 
-
-Some weights of the model checkpoint at jinaai/jina-embeddings-v2-base-zh were not used when initializing BertModel: ['encoder.layer.2.mlp.layernorm.weight', 'encoder.layer.3.mlp.layernorm.weight', 'encoder.layer.10.mlp.wo.bias', 'encoder.layer.5.mlp.wo.bias', 'encoder.layer.2.mlp.layernorm.bias', 'encoder.layer.1.mlp.gated_layers.weight', 'encoder.layer.5.mlp.gated_layers.weight', 'encoder.layer.8.mlp.layernorm.bias', ...
-```
-
-**User is not logged into Huggingface**
+## Plans
 
-
-
-
-```bash
-OSError: jinaai/jina-embeddings-v2-base-zh is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
-If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
-```
+1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+2. Multimodal embedding models to enable multimodal RAG applications.
+3. High-performance rerankers.
 
 ## Contact
 
@@ -1245,10 +1184,12 @@ Join our [Discord community](https://discord.jina.ai) and chat with other commun
 If you find Jina Embeddings useful in your research, please cite the following paper:
 
 ```
-@
-
-
-
-
+@misc{günther2023jina,
+title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
+author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
+year={2023},
+eprint={2310.19923},
+archivePrefix={arXiv},
+primaryClass={cs.CL}
 }
 ```
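For context on the README hunks above: both the fixed `AutoModel.from_pretrained(..., trust_remote_code=True)` line and the new inline comment hinge on the model's custom `encode` method. A minimal sketch of the manual path the README itself documents (tokenize, mean-pool, L2-normalize); everything below is drawn from the README's own `mean_pooling` example rather than from code added by this PR:

```python
# Sketch based on the README's mean-pooling example; assumes `transformers`
# and `torch` are installed.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

sentences = ['How is the weather today?', '今天天气怎么样?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-normalize, as in the README
print(embeddings @ embeddings.T)  # cosine similarities, since rows are unit vectors
```

The explicit normalization step matters more after this PR, since `modules.json` drops the Normalize module (see below).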
config.json
CHANGED

@@ -24,7 +24,6 @@
   "intermediate_size": 3072,
   "layer_norm_eps": 1e-12,
   "max_position_embeddings": 8192,
-  "model_max_length": 8192,
   "model_type": "bert",
   "num_attention_heads": 12,
   "num_hidden_layers": 12,
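One note on the removed key: `model_max_length` is not a standard `BertConfig` field; in transformers it normally lives on the tokenizer, which is presumably why it is dropped here. A hedged sketch of where each limit is actually read from (the `8192` cap is taken from the config shown above):

```python
# Sketch: the model's positional limit stays in config.json as
# max_position_embeddings; the input-length cap belongs to the tokenizer.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
print(config.max_position_embeddings)  # 8192, per the config above

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
tokenizer.model_max_length = 8192  # assumption: cap inputs at the model's positional limit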
onnx/model_fp16.onnx → model-w-mean-pooling.onnx
RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:40701019a8d9d961bd5df9a99f6ac8754740b70c14077176f3235bb518d99ab6
+size 641147938
onnx/model.onnx → model.onnx
RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:214692104e1a0c7edfd9fed55701161c3a9b59c401114efaf8a51c8726d64fb7
+size 641145418
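Both ONNX exports move to the repo root, and one is renamed to advertise built-in mean pooling. A sketch of loading the renamed `model.onnx` with `onnxruntime`; the input and output names are assumptions for a typical BERT-style export, so verify them against the actual file, and the weights must be fetched via Git LFS first:

```python
# Sketch, not from the PR: run the renamed model.onnx with onnxruntime.
# Assumes the LFS file has been pulled locally; input/output names are
# assumptions, so check session.get_inputs()/get_outputs() first.
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
session = ort.InferenceSession('model.onnx')
input_names = {i.name for i in session.get_inputs()}
print(input_names)  # verify expected input names before feeding

enc = tokenizer(['How is the weather today?'], return_tensors='np')
outputs = session.run(None, {k: v for k, v in enc.items() if k in input_names})
token_embeddings = outputs[0]  # assumed: last_hidden_state of the encoder
# For model-w-mean-pooling.onnx, the pooling step would already be baked in.
print(token_embeddings.shape)
```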
modules.json
CHANGED

@@ -10,11 +10,5 @@
     "name": "1",
     "path": "1_Pooling",
     "type": "sentence_transformers.models.Pooling"
-  },
-  {
-    "idx": 2,
-    "name": "2",
-    "path": "2_Normalize",
-    "type": "sentence_transformers.models.Normalize"
   }
 ]
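Dropping the `2_Normalize` entry means a `SentenceTransformer` built from this repo no longer L2-normalizes embeddings automatically. A sketch of opting back in explicitly; `normalize_embeddings` is a standard `encode` argument, though whether callers should normalize depends on their similarity metric:

```python
# Sketch: with the Normalize module removed from modules.json, normalize
# explicitly if downstream code assumes unit-length vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(
    ['How is the weather today?', '今天天气怎么样?'],
    normalize_embeddings=True,  # explicit, now that no Normalize module runs
)
print(np.linalg.norm(embeddings, axis=1))  # ~1.0 for each row once normalized
```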
onnx/model_quantized.onnx
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:0a221ee9e6a6647ccc59cee7bdd26a7b8cf0c0cd3481a65f358d9585a23f02f4
-size 161565239
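The quantized export is removed without a replacement. If a quantized model is still needed, a sketch of regenerating one locally with onnxruntime's dynamic quantization; the output path is a placeholder and the resulting file is not part of this PR:

```python
# Hypothetical local regeneration of the deleted quantized export.
# 'model_quantized.onnx' is a placeholder path, not a file in this PR.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='model.onnx',             # fp32 export kept by this PR
    model_output='model_quantized.onnx',  # placeholder output path
    weight_type=QuantType.QInt8,          # int8 weights, fp32 activations
)
```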