Update README.md #23
by MaxNomic - opened

README.md CHANGED
@@ -2612,7 +2612,7 @@ language:
 
 `nomic-embed-text-v1` is an 8192 context length text encoder that surpasses OpenAI text-embedding-ada-002 and text-embedding-3-small performance on short and long context tasks.
 
-
+# Performance Benchmarks
 
 | Name | SeqLen | MTEB | LoCo | Jina Long Context | Open Weights | Open Training Code | Open Data |
 | :-------------------------------:| :----- | :-------- | :------: | :---------------: | :-----------: | :----------------: | :---------- |
@@ -2624,43 +2624,6 @@ language:
 
 **Exciting Update!**: `nomic-embed-text-v1` is now multimodal! [nomic-embed-vision-v1](https://huggingface.co/nomic-ai/nomic-embed-vision-v1) is aligned to the embedding space of `nomic-embed-text-v1`, meaning any text embedding is multimodal!
 
-## Hosted Inference API
-
-The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
-
-Generating embeddings with the `nomic` Python client is as easy as
-
-```python
-from nomic import embed
-
-output = embed.text(
-    texts=['Nomic Embedding API', '#keepAIOpen'],
-    model='nomic-embed-text-v1',
-    task_type='search_document'
-)
-
-print(output)
-```
-
-For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
-
-## Data Visualization
-Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
-
-
-[![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
-
-## Training Details
-
-We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
-the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
-
-In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
-
-For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
-
-The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
-
 ## Usage
 
 **Important**: the text prompt *must* include a *task instruction prefix*, instructing the model which task is being performed.
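To make the prefix requirement concrete, here is a minimal sketch of prefixed encoding. It assumes the `search_document:` / `search_query:` prefixes and the sentence-transformers loading path documented in the full README's Usage section (not shown in this diff):

```python
# Minimal sketch, assuming the standard nomic-embed task prefixes and the
# sentence-transformers setup (trust_remote_code) documented in the README.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1", trust_remote_code=True)

documents = [
    "search_document: Nomic Embed is a fully open 8192-context text encoder.",
    "search_document: The Eiffel Tower is located in Paris.",
]
queries = ["search_query: Which embedding model releases its training data?"]

doc_embeddings = model.encode(documents, normalize_embeddings=True)
query_embeddings = model.encode(queries, normalize_embeddings=True)

# Embeddings are L2-normalized, so a dot product gives cosine similarity.
print(query_embeddings @ doc_embeddings.T)
```

Documents and queries take different prefixes so that asymmetric retrieval embeds them into matching regions of the space; the `task_type='search_document'` argument in the Nomic API example plays the same role on the hosted side.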
@@ -2794,6 +2757,42 @@ const embeddings = await extractor(texts, { pooling: 'mean', normalize: true });
 console.log(embeddings);
 ```
 
+## Nomic API
+
+The easiest way to get started with Nomic Embed is through the Nomic Embedding API.
+
+Generating embeddings with the `nomic` Python client is as easy as
+
+```python
+from nomic import embed
+
+output = embed.text(
+    texts=['Nomic Embedding API', '#keepAIOpen'],
+    model='nomic-embed-text-v1',
+    task_type='search_document'
+)
+
+print(output)
+```
+
+For more information, see the [API reference](https://docs.nomic.ai/reference/endpoints/nomic-embed-text).
+
+
+## Training
+Click the Nomic Atlas map below to visualize a 5M sample of our contrastive pretraining data!
+
+[![image/webp](https://cdn-uploads.huggingface.co/production/uploads/607997c83a565c15675055b3/pjhJhuNyRfPagRd_c_iUz.webp)](https://atlas.nomic.ai/map/nomic-text-embed-v1-5m-sample)
+
+We train our embedder using a multi-stage training pipeline. Starting from a long-context [BERT model](https://huggingface.co/nomic-ai/nomic-bert-2048),
+the first unsupervised contrastive stage trains on a dataset generated from weakly related text pairs, such as question-answer pairs from forums like StackExchange and Quora, title-body pairs from Amazon reviews, and summarizations from news articles.
+
+In the second finetuning stage, higher-quality labeled datasets such as search queries and answers from web searches are leveraged. Data curation and hard-example mining are crucial in this stage.
+
+For more details, see the Nomic Embed [Technical Report](https://static.nomic.ai/reports/2024_Nomic_Embed_Text_Technical_Report.pdf) and corresponding [blog post](https://blog.nomic.ai/posts/nomic-embed-text-v1).
+
+The training data is released in its entirety. For more details, see the `contrastors` [repository](https://github.com/nomic-ai/contrastors).
+
+
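Both training stages described in the Training section optimize an in-batch contrastive objective over text pairs. As a rough illustration only, and not the actual `contrastors` implementation, the core of such an objective can be sketched as:

```python
# Generic InfoNCE-style contrastive loss for text-pair training.
# The temperature and all other details are illustrative assumptions,
# not values taken from the Nomic Embed training code.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """anchor_emb, positive_emb: (batch, dim) embeddings of paired texts."""
    a = F.normalize(anchor_emb, dim=-1)
    p = F.normalize(positive_emb, dim=-1)
    logits = a @ p.T / temperature                      # score every anchor against every positive
    labels = torch.arange(a.size(0), device=a.device)   # the true pair sits on the diagonal
    return F.cross_entropy(logits, labels)              # other in-batch texts act as negatives
```

In the second finetuning stage, explicitly mined hard negatives can be scored alongside the in-batch ones (extra columns in `logits`), which is one common way the hard-example mining mentioned above enters such an objective.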
 # Join the Nomic Community
 
 - Nomic: [https://nomic.ai](https://nomic.ai)