fix/refactor-readme
#2 by saahil-ognawala - opened

Files changed:
- README.md (+20 -79)
- config.json (+0 -1)
- onnx/model_fp16.onnx → model-w-mean-pooling.onnx (+2 -2)
- onnx/model.onnx → model.onnx (+2 -2)
- modules.json (+0 -6)
- onnx/model_quantized.onnx (+0 -3)
README.md
CHANGED

@@ -3,10 +3,6 @@ tags:
 - sentence-transformers
 - feature-extraction
 - sentence-similarity
-- mteb
-- transformers
-- transformers.js
-inference: false
 license: apache-2.0
 language:
 - en
@@ -1068,7 +1064,7 @@ model-index:
 <br><br>
 
 <p align="center">
-<img src="https://huggingface.co/
+<img src="https://aeiljuispo.cloudimg.io/v7/https://cdn-uploads.huggingface.co/production/uploads/603763514de52ff951d89793/AFoybzd5lpBQXEBrQHuTt.png?w=200&h=200&f=face" alt="Finetuner logo: Finetuner helps you to create experiments in order to improve embeddings on search tasks. It accompanies you to deliver the last mile of performance-tuning for neural search applications." width="150px">
 </p>
 
 
@@ -1076,9 +1072,6 @@ model-index:
 <b>The text embedding set trained by <a href="https://jina.ai/"><b>Jina AI</b></a>.</b>
 </p>
 
-## Quick Start
-
-The easiest way to starting using `jina-embeddings-v2-base-zh` is to use Jina AI's [Embedding API](https://jina.ai/embeddings/).
 
 ## Intended Usage & Model Info
 
@@ -1094,15 +1087,13 @@ Additionally, we provide the following embedding models:
 
 - [`jina-embeddings-v2-small-en`](https://huggingface.co/jinaai/jina-embeddings-v2-small-en): 33 million parameters.
 - [`jina-embeddings-v2-base-en`](https://huggingface.co/jinaai/jina-embeddings-v2-base-en): 137 million parameters.
-- [`jina-embeddings-v2-base-zh`](
-- [`jina-embeddings-v2-base-de`](
+- [`jina-embeddings-v2-base-zh`](): Chinese-English Bilingual embeddings (soon) **(you are here)**.
+- [`jina-embeddings-v2-base-de`](): German-English Bilingual embeddings (soon).
 - [`jina-embeddings-v2-base-es`](): Spanish-English Bilingual embeddings (soon).
-- [`jina-embeddings-v2-base-code`](https://huggingface.co/jinaai/jina-embeddings-v2-base-code): 161 million parameters code embeddings.
 
 ## Data & Parameters
 
-
-
+Jina Embeddings V2 [technical report](https://arxiv.org/abs/2310.19923)
 
 ## Usage
 
@@ -1130,7 +1121,7 @@ def mean_pooling(model_output, attention_mask):
 sentences = ['How is the weather today?', '今天天气怎么样?']
 
 tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
 
 encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
 
@@ -1144,16 +1135,14 @@ embeddings = F.normalize(embeddings, p=2, dim=1)
 </p>
 </details>
 
-You can use Jina Embedding models directly from transformers package
-
+You can use Jina Embedding models directly from the transformers package:
 ```python
 !pip install transformers
-import torch
 from transformers import AutoModel
 from numpy.linalg import norm
 
 cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True
+model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True) # trust_remote_code is needed to use the encode method
 embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
 print(cos_sim(embeddings[0], embeddings[1]))
 ```
@@ -1167,45 +1156,9 @@ embeddings = model.encode(
 )
 ```
 
-
-
-```python
-!pip install -U sentence-transformers
-from sentence_transformers import SentenceTransformer
-from numpy.linalg import norm
-
-cos_sim = lambda a,b: (a @ b.T) / (norm(a)*norm(b))
-model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
-embeddings = model.encode(['How is the weather today?', '今天天气怎么样?'])
-print(cos_sim(embeddings[0], embeddings[1]))
-```
-
-Using the its latest release (v2.3.0) sentence-transformers also supports Jina embeddings (Please make sure that you are logged into huggingface as well):
+## Fully-managed Embeddings Service
 
-
-!pip install -U sentence-transformers
-from sentence_transformers import SentenceTransformer
-from sentence_transformers.util import cos_sim
-
-model = SentenceTransformer(
-    "jinaai/jina-embeddings-v2-base-zh", # switch to en/zh for English or Chinese
-    trust_remote_code=True
-)
-
-# control your input sequence length up to 8192
-model.max_seq_length = 1024
-
-embeddings = model.encode([
-    'How is the weather today?',
-    '今天天气怎么样?'
-])
-print(cos_sim(embeddings[0], embeddings[1]))
-```
-
-## Alternatives to Using Transformers Package
-
-1. _Managed SaaS_: Get started with a free key on Jina AI's [Embedding API](https://jina.ai/embeddings/).
-2. _Private and high-performance deployment_: Get started by picking from our suite of models and deploy them on [AWS Sagemaker](https://aws.amazon.com/marketplace/seller-profile?id=seller-stch2ludm6vgy).
+Alternatively, you can use Jina AI's [Embedding platform](https://jina.ai/embeddings/) for fully-managed access to Jina Embeddings models.
 
 ## Use Jina Embeddings for RAG
 
@@ -1215,26 +1168,12 @@ According to the latest blog post from [LLamaIndex](https://blog.llamaindex.ai/b
 
 <img src="https://miro.medium.com/v2/resize:fit:4800/format:webp/1*ZP2RVejCZovF3FDCg-Bx3A.png" width="780px">
 
-## Trouble Shooting
-
-**Loading of Model Code failed**
-
-If you forgot to pass the `trust_remote_code=True` flag when calling `AutoModel.from_pretrained` or initializing the model via the `SentenceTransformer` class, you will receive an error that the model weights could not be initialized.
-This is caused by tranformers falling back to creating a default BERT model, instead of a jina-embedding model:
 
-
-Some weights of the model checkpoint at jinaai/jina-embeddings-v2-base-zh were not used when initializing BertModel: ['encoder.layer.2.mlp.layernorm.weight', 'encoder.layer.3.mlp.layernorm.weight', 'encoder.layer.10.mlp.wo.bias', 'encoder.layer.5.mlp.wo.bias', 'encoder.layer.2.mlp.layernorm.bias', 'encoder.layer.1.mlp.gated_layers.weight', 'encoder.layer.5.mlp.gated_layers.weight', 'encoder.layer.8.mlp.layernorm.bias', ...
-```
-
-**User is not logged into Huggingface**
+## Plans
 
-
-
-
-```bash
-OSError: jinaai/jina-embeddings-v2-base-zh is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
-If this is a private repository, make sure to pass a token having permission to this repo with `use_auth_token` or log in with `huggingface-cli login` and pass `use_auth_token=True`.
-```
+1. Bilingual embedding models supporting more European & Asian languages, including Spanish, French, Italian and Japanese.
+2. Multimodal embedding models to enable multimodal RAG applications.
+3. High-performance rerankers.
 
 ## Contact
 
@@ -1245,10 +1184,12 @@ Join our [Discord community](https://discord.jina.ai) and chat with other commun
 If you find Jina Embeddings useful in your research, please cite the following paper:
 
 ```
-@
-
-
-
-
+@misc{günther2023jina,
+title={Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents},
+author={Michael Günther and Jackmin Ong and Isabelle Mohr and Alaeddine Abdessalem and Tanguy Abel and Mohammad Kalim Akram and Susana Guzman and Georgios Mastrapas and Saba Sturua and Bo Wang and Maximilian Werk and Nan Wang and Han Xiao},
+year={2023},
+eprint={2310.19923},
+archivePrefix={arXiv},
+primaryClass={cs.CL}
 }
 ```
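For context on the README hunks above: both the fixed `AutoModel.from_pretrained(..., trust_remote_code=True)` line and the new inline comment hinge on the model's custom `encode` method. A minimal sketch of the manual path the README itself documents (tokenize, mean-pool, L2-normalize); everything below is drawn from the README's own `mean_pooling` example rather than from code added by this PR:

```python
# Sketch based on the README's mean-pooling example; assumes `transformers`
# and `torch` are installed.
import torch
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

def mean_pooling(model_output, attention_mask):
    # Average token embeddings, ignoring padding positions.
    token_embeddings = model_output[0]
    mask = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * mask, 1) / torch.clamp(mask.sum(1), min=1e-9)

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
model = AutoModel.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)

sentences = ['How is the weather today?', '今天天气怎么样?']
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')
with torch.no_grad():
    model_output = model(**encoded_input)

embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
embeddings = F.normalize(embeddings, p=2, dim=1)  # unit-normalize, as in the README
print(embeddings @ embeddings.T)  # cosine similarities, since rows are unit vectors
```

The explicit normalization step matters more after this PR, since `modules.json` drops the Normalize module (see below).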
config.json
CHANGED

@@ -24,7 +24,6 @@
   "intermediate_size": 3072,
   "layer_norm_eps": 1e-12,
   "max_position_embeddings": 8192,
-  "model_max_length": 8192,
   "model_type": "bert",
   "num_attention_heads": 12,
   "num_hidden_layers": 12,
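One note on the removed key: `model_max_length` is not a standard `BertConfig` field; in transformers it normally lives on the tokenizer, which is presumably why it is dropped here. A hedged sketch of where each limit is actually read from (the `8192` cap is taken from the config shown above):

```python
# Sketch: the model's positional limit stays in config.json as
# max_position_embeddings; the input-length cap belongs to the tokenizer.
from transformers import AutoConfig, AutoTokenizer

config = AutoConfig.from_pretrained('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
print(config.max_position_embeddings)  # 8192, per the config above

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
tokenizer.model_max_length = 8192  # assumption: cap inputs at the model's positional limit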
onnx/model_fp16.onnx → model-w-mean-pooling.onnx
RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:40701019a8d9d961bd5df9a99f6ac8754740b70c14077176f3235bb518d99ab6
+size 641147938
onnx/model.onnx → model.onnx
RENAMED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:
-size
+oid sha256:214692104e1a0c7edfd9fed55701161c3a9b59c401114efaf8a51c8726d64fb7
+size 641145418
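Both ONNX exports move to the repo root, and one is renamed to advertise built-in mean pooling. A sketch of loading the renamed `model.onnx` with `onnxruntime`; the input and output names are assumptions for a typical BERT-style export, so verify them against the actual file, and the weights must be fetched via Git LFS first:

```python
# Sketch, not from the PR: run the renamed model.onnx with onnxruntime.
# Assumes the LFS file has been pulled locally; input/output names are
# assumptions, so check session.get_inputs()/get_outputs() first.
import onnxruntime as ort
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('jinaai/jina-embeddings-v2-base-zh')
session = ort.InferenceSession('model.onnx')
input_names = {i.name for i in session.get_inputs()}
print(input_names)  # verify expected input names before feeding

enc = tokenizer(['How is the weather today?'], return_tensors='np')
outputs = session.run(None, {k: v for k, v in enc.items() if k in input_names})
token_embeddings = outputs[0]  # assumed: last_hidden_state of the encoder
# For model-w-mean-pooling.onnx, the pooling step would already be baked in.
print(token_embeddings.shape)
```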
modules.json
CHANGED

@@ -10,11 +10,5 @@
     "name": "1",
     "path": "1_Pooling",
     "type": "sentence_transformers.models.Pooling"
-  },
-  {
-    "idx": 2,
-    "name": "2",
-    "path": "2_Normalize",
-    "type": "sentence_transformers.models.Normalize"
   }
 ]
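Dropping the `2_Normalize` entry means a `SentenceTransformer` built from this repo no longer L2-normalizes embeddings automatically. A sketch of opting back in explicitly; `normalize_embeddings` is a standard `encode` argument, though whether callers should normalize depends on their similarity metric:

```python
# Sketch: with the Normalize module removed from modules.json, normalize
# explicitly if downstream code assumes unit-length vectors.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('jinaai/jina-embeddings-v2-base-zh', trust_remote_code=True)
embeddings = model.encode(
    ['How is the weather today?', '今天天气怎么样?'],
    normalize_embeddings=True,  # explicit, now that no Normalize module runs
)
print(np.linalg.norm(embeddings, axis=1))  # ~1.0 for each row once normalized
```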
onnx/model_quantized.onnx
DELETED

@@ -1,3 +0,0 @@
-version https://git-lfs.github.com/spec/v1
-oid sha256:0a221ee9e6a6647ccc59cee7bdd26a7b8cf0c0cd3481a65f358d9585a23f02f4
-size 161565239
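The quantized export is removed without a replacement. If a quantized model is still needed, a sketch of regenerating one locally with onnxruntime's dynamic quantization; the output path is a placeholder and the resulting file is not part of this PR:

```python
# Hypothetical local regeneration of the deleted quantized export.
# 'model_quantized.onnx' is a placeholder path, not a file in this PR.
from onnxruntime.quantization import quantize_dynamic, QuantType

quantize_dynamic(
    model_input='model.onnx',             # fp32 export kept by this PR
    model_output='model_quantized.onnx',  # placeholder output path
    weight_type=QuantType.QInt8,          # int8 weights, fp32 activations
)
```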