# gte-multilingual-reranker-base-onnx-op14-opt-gpu-int8-quantized

This model is an INT8-quantized ONNX version of [Alibaba-NLP/gte-multilingual-reranker-base](https://huggingface.co/Alibaba-NLP/gte-multilingual-reranker-base), exported with ONNX opset 14.

## Model Details

### Environment and Package Versions

| Package | Version |
| --- | --- |
| transformers | 4.48.3 |
| optimum | 1.24.0 |
| onnx | 1.17.0 |
| onnxruntime | 1.21.0 |
| torch | 2.5.1 |
| numpy | 1.26.4 |
| huggingface_hub | 0.28.1 |
| python | 3.12.9 |
| system | Darwin 24.3.0 |

### Applied Optimizations

| Optimization | Setting |
| --- | --- |
| Graph optimization level | Extended |
| Optimize for GPU | Yes |
| Use FP16 | No |
| Transformers-specific optimizations | Enabled |
| GELU fusion | Enabled |
| Layer norm fusion | Enabled |
| Attention fusion | Enabled |
| Skip layer norm fusion | Enabled |
| GELU approximation | Enabled |
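
These settings map onto the fields of Optimum's `OptimizationConfig`. As a point of reference, a minimal sketch of the equivalent configuration (the directory name `onnx_model` is a hypothetical path to the exported, pre-quantization model) might look like:

```python
from optimum.onnxruntime import ORTOptimizer
from optimum.onnxruntime.configuration import OptimizationConfig

# Graph optimization settings mirroring the table above; the fusion passes
# (GELU, layer norm, attention, skip layer norm) are enabled by default
optimization_config = OptimizationConfig(
    optimization_level=2,  # level 2 corresponds to "Extended"
    optimize_for_gpu=True,
    fp16=False,
    enable_transformers_specific_optimizations=True,
    enable_gelu_approximation=True,
)

# "onnx_model" is a hypothetical path; the actual export location is not recorded here
optimizer = ORTOptimizer.from_pretrained("onnx_model")
optimizer.optimize(save_dir="optimized_model", optimization_config=optimization_config)
```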

## Usage

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

# Load the quantized ONNX model and its tokenizer
model = ORTModelForSequenceClassification.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

# A reranker scores (query, document) pairs rather than single texts
pairs = [["What is the capital of France?", "Paris is the capital of France."]]
inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")

# Run inference; higher logits mean higher query-document relevance
outputs = model(**inputs)
scores = outputs.logits.view(-1)
```

## Quantization Process

This model was quantized to INT8 with ONNX Runtime, using the Hugging Face Optimum library and ONNX opset 14. Graph optimization targeting GPU execution was applied during export.
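
The exact quantization configuration is not recorded in this card, so the process cannot be reproduced precisely from it. A minimal sketch with Optimum's `ORTQuantizer`, assuming dynamic INT8 quantization and reusing the hypothetical paths from the optimization sketch above, would be:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# Load the optimized graph from the step above; ORTOptimizer saves it as
# model_optimized.onnx by default ("optimized_model" is a hypothetical path)
quantizer = ORTQuantizer.from_pretrained(
    "optimized_model", file_name="model_optimized.onnx"
)

# Dynamic INT8 quantization; avx512_vnni/dynamic is an assumption, since the
# card does not state which quantization scheme was actually used
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)
quantizer.quantize(save_dir="quantized_model", quantization_config=qconfig)
```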

## Performance Comparison

INT8 quantization stores weights in 8 bits instead of 32, which typically shrinks the model to roughly a quarter of its FP32 size and speeds up inference, at the cost of a small accuracy loss. This INT8 model should therefore be noticeably faster and smaller than the original, with only a minor reduction in ranking quality.
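
No benchmark numbers are recorded in this card, so the speedup is worth verifying on your own hardware. A minimal latency check (the batch size and sample pair are illustrative) might look like:

```python
import time

from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model = ORTModelForSequenceClassification.from_pretrained("quantized_model")
tokenizer = AutoTokenizer.from_pretrained("quantized_model")

# A small illustrative batch of (query, document) pairs
pairs = [["What is the capital of France?", "Paris is the capital of France."]] * 8
inputs = tokenizer(pairs, padding=True, truncation=True, return_tensors="pt")

# Warm up, then average the latency over repeated forward passes
for _ in range(3):
    model(**inputs)
start = time.perf_counter()
for _ in range(20):
    model(**inputs)
print(f"avg latency: {(time.perf_counter() - start) / 20 * 1000:.1f} ms")
```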
