README.md CHANGED
@@ -3,10 +3,6 @@ tags:
  - sentence-transformers
  - feature-extraction
  - sentence-similarity
- - mteb
- - transformers
- - transformers.js
- inference: false
  license: apache-2.0
  language:
  - en
@@ -1068,7 +1064,7 @@ model-index:
  <br><br>

  <p align="center">
- <img src="https://huggingface.co/datasets/jinaai/documentation-images/resolve/main/logo.webp" alt="Jina AI: Your Search Foundation, Supercharged!" width="150px">
+ <img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
  </p>


@@ -1076,9 +1072,6 @@ model-index:
  <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
  </p>

- ## Quick Start
-
- The easiest way to starting using `jina-embeddings-v2-base-zh` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).

  ## Intended Usage & Model Info

@@ -1094,15 +1087,13 @@ Additionally, we provide the following embedding models:

  - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
  - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
- - [`jina-embeddings-v2-base-zh`](https://huggingface.co/jinaai/jina-embeddings-v2-base-zh): 161 million parameters Chinese-English Bilingual embeddings **(you are here)**.
- - [`jina-embeddings-v2-base-de`](https://huggingface.co/jinaai/jina-embeddings-v2-base-de): 161 million parameters German-English Bilingual embeddings.
+ - [`jina-embeddings-v2-base-zh`](): Chinese-English Bilingual embeddings (soon) **(you are here)**.
+ - [`jina-embeddings-v2-base-de`](): German-English Bilingual embeddings (soon).
  - [`jina-embeddings-v2-base-es`](): Spanish-English Bilingual embeddings (soon).
- - [`jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code): 161 million parameters code embeddings.

  ## Data & Parameters

- The data and training details are described in this [technical report](https://arxiv.org/abs/2402.17016).
-
+ The data and training details are described in the Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923).

  ## Usage

@@ -1130,7 +1121,7 @@ def mean_pooling(model_output, attention_mask):
  sentences = ['How is the weather today?', '今天天气怎么样?']

  tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
- model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

  encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

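The hunk above shows only the middle of the README's mean-pooling example. The complete flow is roughly the following; this is a sketch reconstructed from the `def mean_pooling(model_output, attention_mask):` and `F.normalize(embeddings, p=2, dim=1)` lines visible in the hunk headers, not the verbatim README code:

```python
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average the token embeddings, masking out padding positions.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

encoded_input = tokenizer(['How is the weather today?', '今天天气怎么样?'],
                          padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)
```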
@@ -1144,16 +1135,14 @@ embeddings = F.normalize(embeddings, p=2, dim=1)
  </p>
  </details>

- You can use Jina Embedding models directly from transformers package.
-
+ You can use Jina Embedding models directly from the transformers package:
  ```python
  !pip install transformers
- import torch
  from transformers import AutoModel
  from numpy.linalg import norm

  cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
- model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True, torch_dtype=torch.bfloat16)
+ model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
  embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
  print(cos_sim(embeddings[0], embeddings[1]))
  ```
@@ -1167,45 +1156,9 @@ embeddings = model.encode(
  )
  ```

- If you want to use the model together with the [sentence-transformers package](https://github.com/UKPLab/sentence-transformers/), make sure that you have installed the latest release and set `trust_remote_code=True` as well:
-
- ```python
- !pip install -U sentence-transformers
- from sentence_transformers import SentenceTransformer
- from numpy.linalg import norm
-
- cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
- model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
- embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
- print(cos_sim(embeddings[0], embeddings[1]))
- ```
-
- Using the its latest release (v2.3.0) sentence-transformers also supports Jina embeddings (Please make sure that you are logged into huggingface as well):
-
- ```python
- !pip install -U sentence-transformers
- from sentence_transformers import SentenceTransformer
- from sentence_transformers.util import cos_sim
-
- model = SentenceTransformer(
- "jinaai/jina-embeddings-v2-base-zh", # switch to en/zh for English or Chinese
- trust_remote_code=True
- )
-
- # control your input sequence length up to 8192
- model.max_seq_length = 1024
-
- embeddings = model.encode([
- 'How is the weather today?',
- '今天天气怎么样?'
- ])
- print(cos_sim(embeddings[0], embeddings[1]))
- ```
-
- ## Alternatives to Using Transformers Package
-
- 1. _Managed SaaS_: Get started with a free key on Jina AI's [Embedding API](https://jina.ai/embeddings/).
- 2. _Private and high-performance deployment_: Get started by picking from our suite of models and deploy them on [AWS Sagemaker](https://aws.amazon.com/marketplace/seller-profile?id=seller-stch2ludm6vgy).
+ ## Fully-managed Embeddings Service
+
+ Alternatively, you can use Jina AI's [Embedding platform](https://jina.ai/embeddings/) for fully-managed access to Jina Embeddings models.

  ## Use Jina Embeddings for RAG

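The new "Fully-managed Embeddings Service" section links to the platform without showing a call. A minimal sketch over HTTP, assuming an OpenAI-style `/v1/embeddings` endpoint at `api.jina.ai` and a `JINA_API_KEY` environment variable; check the platform documentation for the exact request schema:

```python
import os
import requests

# Assumed endpoint and payload shape; verify against the https://jina.ai/embeddings/ docs.
response = requests.post(
    'https://api.jina.ai/v1/embeddings',
    headers={'Authorization': f"Bearer {os.environ['JINA_API_KEY']}"},
    json={
        'model': 'jina-embeddings-v2-base-zh',
        'input': ['How is the weather today?', '今天天气怎么样?'],
    },
)
response.raise_for_status()
vectors = [item['embedding'] for item in response.json()['data']]
print(len(vectors), len(vectors[0]))  # number of inputs, embedding dimension
```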
@@ -1215,26 +1168,12 @@ According to the latest blog post from [LLamaIndex](https://blog.llamaindex.ai/b

  <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZP2RVejCZovF3FDCg-Bx3A.png" width="780px">

- ## Trouble Shooting
-
- **Loading of Model Code failed**
-
- If you forgot to pass the `trust_remote_code=True` flag when calling `AutoModel.from_pretrained` or initializing the model via the `SentenceTransformer` class, you will receive an error that the model weights could not be initialized.
- This is caused by tranformers falling back to creating a default BERT model, instead of a jina-embedding model:
-
- ```bash
- Some weights of the model checkpoint at jinaai/jina-embeddings-v2-base-zh were not used when initializing BertModel: ['encoder.layer.2.mlp.layernorm.weight', 'encoder.layer.3.mlp.layernorm.weight', 'encoder.layer.10.mlp.wo.bias', 'encoder.layer.5.mlp.wo.bias', 'encoder.layer.2.mlp.layernorm.bias', 'encoder.layer.1.mlp.gated_layers.weight', 'encoder.layer.5.mlp.gated_layers.weight', 'encoder.layer.8.mlp.layernorm.bias', ...
- ```
-
- **User is not logged into Huggingface**
-
- The model is only availabe under [gated access](https://huggingface.co/docs/hub/models-gated).
- This means you need to be logged into huggingface load load it.
- If you receive the following error, you need to provide an access token, either by using the huggingface-cli or providing the token via an environment variable as described above:
- ```bash
- OSError: jinaai/jina-embeddings-v2-base-zh is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
- If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
- ```
+ ## Plans
+
+ 1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+ 2. Multimodal embedding models to enable multimodal RAG applications.
+ 3. High-performance rerankers.

  ## Contact

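The "Use Jina Embeddings for RAG" section cites the LlamaIndex benchmark but includes no code. A toy sketch of the retrieval step of a RAG pipeline, reusing the `encode` method from the transformers example above; the corpus and scoring are illustrative, not from the README:

```python
import numpy as np
from transformers import AutoModel

model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

# Toy corpus; in a real RAG pipeline these would be document chunks.
docs = ['北京今天有雨。',
        'The model supports sequence lengths of up to 8192 tokens.',
        'Jina AI is based in Berlin.']
doc_emb = np.asarray(model.encode(docs))
query_emb = np.asarray(model.encode(['今天天气怎么样?'])[0])

# Cosine similarity between the query and every chunk.
scores = doc_emb @ query_emb / (np.linalg.norm(doc_emb, axis=1) * np.linalg.norm(query_emb))
print(docs[int(np.argmax(scores))])  # the top-scoring chunk would go into the LLM prompt
```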
@@ -1245,10 +1184,12 @@ Join our [Discord community](https://discord.jina.ai) and chat with other commun
  If you find Jina Embeddings useful in your research, please cite the following paper:

  ```
- @article{mohr2024multi,
- title={Multi-Task Contrastive Learning for 8192-Token Bilingual Text Embeddings},
- author={Mohr, Isabelle and Krimmel, Markus and Sturua, Saba and Akram, Mohammad Kalim and Koukounas, Andreas and G{\"u}nther, Michael and Mastrapas, Georgios and Ravishankar, Vinit and Mart{\'\i}nez, Joan Fontanals and Wang, Feng and others},
- journal={arXiv preprint arXiv:2402.17016},
- year={2024}
+ @misc{günther2023jina,
+ title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
+ author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
+ year={2023},
+ eprint={2310.19923},
+ archivePrefix={arXiv},
+ primaryClass={cs.CL}
  }
  ```

config.json CHANGED
@@ -24,7 +24,6 @@
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-12,
  "max_position_embeddings": 8192,
- "model_max_length": 8192,
  "model_type": "bert",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
onnx/model_fp16.onnx → model-w-mean-pooling.onnx RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:e180e57cda672d10f636abd163e76b6bf9a870fe7076500b6578eaadbb7b1b45
- size 320852338
+ oid sha256:40701019a8d9d961bd5df9a99f6ac8754740b70c14077176f3235bb518d99ab6
+ size 641147938
onnx/model.onnx → model.onnx RENAMED
@@ -1,3 +1,3 @@
  version https://git-lfs.github.com/spec/v1
- oid sha256:4b0e9fa6e5c77cff56e0c9c673ba1aad61e793e592fdd4b05690b68826b7d3a2
- size 641212851
+ oid sha256:214692104e1a0c7edfd9fed55701161c3a9b59c401114efaf8a51c8726d64fb7
+ size 641145418
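The rename to `model-w-mean-pooling.onnx` suggests a graph that emits pooled embeddings directly. A sketch of driving it with onnxruntime under that assumption; input names and the output layout are not documented here, so the feed is built from `session.get_inputs()` rather than hardcoded:

```python
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
session = ort.InferenceSession('model-w-mean-pooling.onnx')

encoded = tokenizer(['How is the weather today?', '今天天气怎么样?'],
                    padding=True, truncation=True, return_tensors='np')
# Feed only the inputs the graph actually declares; BERT-style exports expect int64.
feed = {i.name: encoded[i.name].astype('int64')
        for i in session.get_inputs() if i.name in encoded}
outputs = session.run(None, feed)
print(outputs[0].shape)  # assumed: (batch, hidden) pooled embeddings
```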
modules.json CHANGED
@@ -10,11 +10,5 @@
  "name": "1",
  "path": "1_Pooling",
  "type": "sentence_transformers.models.Pooling"
- },
- {
- "idx": 2,
- "name": "2",
- "path": "2_Normalize",
- "type": "sentence_transformers.models.Normalize"
  }
  ]
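With the `2_Normalize` module removed from modules.json, `SentenceTransformer.encode` output is no longer guaranteed to be unit-length, so raw dot products stop doubling as cosine similarities. A sketch that normalizes manually before comparing; it assumes sentence-transformers can load this repo with `trust_remote_code=True`:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
emb = model.encode(['How is the weather today?', '今天天气怎么样?'])

emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # restore unit length per row
print(emb[0] @ emb[1])  # now a true cosine similarity
```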
onnx/model_quantized.onnx DELETED
@@ -1,3 +0,0 @@
- version https://git-lfs.github.com/spec/v1
- oid sha256:0a221ee9e6a6647ccc59cee7bdd26a7b8cf0c0cd3481a65f358d9585a23f02f4
- size 161565239