# Multilingual E5 Large Instruct - 8-bit Quantized

This is an 8-bit quantized version of the [intfloat/multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model.
## Model Details
- Original model: intfloat/multilingual-e5-large-instruct
- Quantization: 8-bit (using bitsandbytes)
- Model architecture: XLM-RoBERTa Large with instruction tuning
- Original parameters: 560M
- Embedding dimensions: 1024
- Context length: 512 tokens
- Languages supported: 94+ languages
## Usage

This model can be used with the `transformers` library to generate embeddings:
```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

# Load the tokenizer and the 8-bit quantized model
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, load_in_8bit=True, device_map="auto")

# Mean-pool token embeddings, ignoring padding positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# E5-instruct expects queries to be prefixed with a task instruction
def get_detailed_instruct(task_description, query):
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize and generate embeddings
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt').to(model.device)
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings so dot products equal cosine similarity
embeddings = F.normalize(embeddings, p=2, dim=1)
```
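To use these query embeddings for retrieval, candidate passages are embedded the same way but without the instruction prefix (per the E5-instruct convention) and ranked by dot product, which equals cosine similarity for normalized vectors. A minimal continuation of the snippet above, using made-up example passages:

```python
# Continues the snippet above (reuses tokenizer, model, average_pool, embeddings).
documents = [
    "As a general guideline, the average protein requirement for adult women is about 46 grams per day.",
    "New York is home to many highly rated restaurants across all price ranges."
]
doc_batch = tokenizer(documents, max_length=512, padding=True, truncation=True, return_tensors='pt').to(model.device)
with torch.no_grad():
    doc_outputs = model(**doc_batch)
doc_embeddings = F.normalize(average_pool(doc_outputs.last_hidden_state, doc_batch['attention_mask']), p=2, dim=1)

# Query-document cosine similarities; higher means more relevant
scores = embeddings @ doc_embeddings.T
print(scores)
```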
## Infinity Embedding Server Usage

The model can also be served with the [Infinity](https://github.com/michaelfeil/infinity) embedding server via Docker:
```bash
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
  --dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
```
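Once the container is running, embeddings can be requested over HTTP. The sketch below assumes Infinity's OpenAI-compatible `/embeddings` endpoint on the port mapped above; adjust the base URL and payload to your deployment.

```python
import requests

# Assumes the Infinity container started above is running locally and exposes
# its OpenAI-compatible /embeddings endpoint on port 7997.
response = requests.post(
    "http://localhost:7997/embeddings",
    json={
        "model": "gopersonal/multilingual-e5-large-instruct-8bit",
        "input": [
            "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat"
        ],
    },
)
response.raise_for_status()
embedding = response.json()["data"][0]["embedding"]
print(len(embedding))  # expected embedding dimension: 1024
```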
## Benefits of 8-bit Quantization
- Approximately 50% reduction in memory usage compared to FP16 (see the rough estimate after this list)
- Faster inference, especially on GPUs with limited VRAM
- Minimal impact on embedding quality and similarity calculations
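As a rough, back-of-the-envelope check on the memory claim (weights only; bitsandbytes keeps some layers such as embeddings and layer norms in higher precision, so actual savings are somewhat lower):

```python
# Approximate weight memory for a 560M-parameter model
PARAMS = 560e6

fp16_gb = PARAMS * 2 / 1e9  # 2 bytes per parameter
int8_gb = PARAMS * 1 / 1e9  # 1 byte per parameter

print(f"FP16 weights: ~{fp16_gb:.2f} GB")  # ~1.12 GB
print(f"INT8 weights: ~{int8_gb:.2f} GB")  # ~0.56 GB
```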
## License
This model inherits the license of the original model: MIT