
Multilingual E5 Large Instruct - 8-bit Quantized

This is an 8-bit quantized version of the intfloat/multilingual-e5-large-instruct model.

Model Details

  • Original model: intfloat/multilingual-e5-large-instruct
  • Quantization: 8-bit (using bitsandbytes)
  • Model architecture: XLM-RoBERTa Large with instruction tuning
  • Original parameters: 560M
  • Embedding dimensions: 1024
  • Context length: 512 tokens
  • Languages supported: 94+ languages

Usage

This model can be used with the transformers library (plus bitsandbytes for 8-bit loading) to generate embeddings:

from transformers import AutoModel, AutoTokenizer, BitsAndBytesConfig
import torch
import torch.nn.functional as F

# Load the tokenizer and the model with 8-bit weights (requires bitsandbytes)
model_name = "gopersonal/multilingual-e5-large-instruct-8bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

# Mean-pool the last hidden states, ignoring padded positions
def average_pool(last_hidden_states, attention_mask):
    last_hidden = last_hidden_states.masked_fill(~attention_mask[..., None].bool(), 0.0)
    return last_hidden.sum(dim=1) / attention_mask.sum(dim=1)[..., None]

# Wrap a query in the instruction format the model was trained with
def get_detailed_instruct(task_description, query):
    return f'Instruct: {task_description}\nQuery: {query}'

# Prepare your texts
task = 'Given a web search query, retrieve relevant passages that answer the query'
queries = [
    get_detailed_instruct(task, 'how much protein should a female eat'),
    get_detailed_instruct(task, 'best restaurants in new york')
]

# Tokenize, move inputs to the model's device, and generate embeddings
batch_dict = tokenizer(queries, max_length=512, padding=True, truncation=True, return_tensors='pt')
batch_dict = {k: v.to(model.device) for k, v in batch_dict.items()}
with torch.no_grad():
    outputs = model(**batch_dict)
embeddings = average_pool(outputs.last_hidden_state, batch_dict['attention_mask'])

# Normalize embeddings
embeddings = F.normalize(embeddings, p=2, dim=1)
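
With normalized vectors, relevance scoring is just a dot product. A minimal sketch of query-passage retrieval, following the upstream E5 convention that passages are encoded without the instruction prefix (the passage texts here are only illustrative):

passages = [
    "As a general guideline, the CDC's average requirement of protein for women ages 19 to 70 is 46 grams per day.",
    "New York is home to some of the best restaurants in the world."
]

# Encode passages with the same pipeline, but without an instruction prefix
passage_dict = tokenizer(passages, max_length=512, padding=True, truncation=True, return_tensors='pt')
passage_dict = {k: v.to(model.device) for k, v in passage_dict.items()}
with torch.no_grad():
    passage_outputs = model(**passage_dict)
passage_embeddings = F.normalize(
    average_pool(passage_outputs.last_hidden_state, passage_dict['attention_mask']), p=2, dim=1
)

# Cosine similarity between each query and each passage
scores = embeddings @ passage_embeddings.T
print(scores)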

Infinity Embedding Server Usage

The quantized model can also be served with the Infinity embedding server, for example:
docker run --gpus all -v $PWD/models:/app/.cache -p 7997:7997 \
  michaelf34/infinity:latest \
  v2 --model-id gopersonal/multilingual-e5-large-instruct-8bit \
  --dtype int8 --batch-size 8 --engine torch --port 7997 --device auto
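
Once the container is up, Infinity exposes an OpenAI-compatible embeddings endpoint. A minimal client sketch, assuming the default /embeddings route and response schema on the port mapped above (adjust if your Infinity version differs):

import requests

# POST to Infinity's OpenAI-compatible /embeddings endpoint (assumed default route)
resp = requests.post(
    "http://localhost:7997/embeddings",
    json={
        "model": "gopersonal/multilingual-e5-large-instruct-8bit",
        "input": [
            "Instruct: Given a web search query, retrieve relevant passages that answer the query\nQuery: how much protein should a female eat"
        ],
    },
)
resp.raise_for_status()
embedding = resp.json()["data"][0]["embedding"]
print(len(embedding))  # should be 1024 for this model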

Benefits of 8-bit Quantization

  • Roughly 50% reduction in weight memory compared to FP16 (a back-of-the-envelope estimate follows this list)
  • Larger batch sizes and deployment on GPUs with limited VRAM; throughput can improve when memory is the bottleneck
  • Typically minimal impact on embedding quality and similarity calculations
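
The roughly 50% figure follows directly from the parameter count. A back-of-the-envelope estimate for the weights alone (activations, the embedding table's exact dtype, and framework overhead excluded):

params = 560e6                 # parameter count listed above
fp16_gb = params * 2 / 1e9     # 2 bytes per weight in FP16 -> ~1.12 GB
int8_gb = params * 1 / 1e9     # 1 byte per weight in int8  -> ~0.56 GB
print(f"FP16 weights: {fp16_gb:.2f} GB, int8 weights: {int8_gb:.2f} GB")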

License

This model inherits the license of the original model: MIT
