# Jina Reranker M0 - ONNX FP16 Version

This repository contains the [jinaai/jina-reranker-m0](https://huggingface.co/jinaai/jina-reranker-m0) model converted to the ONNX format with FP16 precision.
## Model Description
Jina Reranker is designed to rerank search results or document passages based on their relevance to a given query. It takes a query and a list of documents as input and outputs relevance scores.
This version is specifically exported for use with ONNX Runtime.
Original Model Card: [jinaai/jina-reranker-m0](https://huggingface.co/jinaai/jina-reranker-m0)
## Technical Details
- Format: ONNX
- Opset: 14
- Precision: FP16 (exported using `.half()`)
- External Data: Uses the ONNX external data format due to model size. All files in this repository are required; `huggingface_hub` handles downloading them automatically.
- Export Source: Exported from the Hugging Face `transformers` library using `torch.onnx.export`.
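For reference, the export presumably followed a pattern like the sketch below. Only the `.half()` conversion, opset 14, and the use of `torch.onnx.export` come from the details above; the model class, dummy input signature, and tensor names are illustrative assumptions.

```python
# Illustrative export sketch -- NOT the exact script used for this repository.
# Assumptions: the model loads via AutoModel with trust_remote_code and accepts
# (input_ids, attention_mask); the input/output names below are made up.
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("jinaai/jina-reranker-m0", trust_remote_code=True)
model = model.half().eval()  # FP16 precision, as in this repository

dummy_input_ids = torch.ones(1, 16, dtype=torch.long)
dummy_attention_mask = torch.ones(1, 16, dtype=torch.long)

torch.onnx.export(
    model,
    (dummy_input_ids, dummy_attention_mask),
    "jina-reranker-m0.onnx",
    opset_version=14,
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    dynamic_axes={
        "input_ids": {0: "batch", 1: "sequence"},
        "attention_mask": {0: "batch", 1: "sequence"},
    },
)
# Weights beyond the 2 GB protobuf limit are stored as ONNX external data
# files alongside the .onnx graph, which is why every file in this repo is needed.
```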
## Usage
You can use this model with `onnxruntime` for inference. You will also need the `transformers` library to load the appropriate processor for input preparation, and `huggingface_hub` to download the model files.
1. Installation:
```bash
pip install onnxruntime huggingface_hub transformers torch sentencepiece
```
2. Inference Script:
```python
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoProcessor
import numpy as np
import torch  # For processor output handling

# --- Configuration ---
# Replace with your repository ID if different
repo_id = "jian-mo/jina-reranker-m0-onnx"
onnx_filename = "jina-reranker-m0.onnx"  # Main ONNX file name
# Use the original model ID to load the correct processor
original_model_id = "jinaai/jina-reranker-m0"
# --- End Configuration ---

# 1. Download ONNX model files from the Hub
# hf_hub_download automatically handles external data files linked via LFS
print(f"Downloading ONNX model from {repo_id}...")
local_onnx_path = hf_hub_download(
    repo_id=repo_id,
    filename=onnx_filename,
)
print(f"ONNX model downloaded to: {local_onnx_path}")

# 2. Load ONNX Runtime session
print("Loading ONNX Inference Session...")
# You can choose execution providers, e.g., ['CUDAExecutionProvider', 'CPUExecutionProvider'],
# if you have GPU support and the necessary onnxruntime build.
session_options = ort.SessionOptions()
# session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
providers = ["CPUExecutionProvider"]  # Default to CPU
session = ort.InferenceSession(local_onnx_path, sess_options=session_options, providers=providers)
print(f"ONNX session loaded with providers: {session.get_providers()}")

# 3. Load the Processor
print(f"Loading processor from {original_model_id}...")
processor = AutoProcessor.from_pretrained(original_model_id, trust_remote_code=True)
print("Processor loaded.")

# 4. Prepare Input Data
query = "What is deep learning?"
document = "Deep learning is a subset of machine learning based on artificial neural networks with representation learning."

# Example with multiple documents (batch processing):
# documents = [
#     "Deep learning is a subset of machine learning based on artificial neural networks with representation learning.",
#     "Artificial intelligence refers to the simulation of human intelligence in machines.",
#     "A transformer is a deep learning model used primarily in the field of natural language processing.",
# ]
# Use processor logic suitable for query + multiple documents if needed.

print("Preparing input data...")
# Process query and document together as expected by the reranker model
inputs = processor(
    text=f"{query} {document}",
    images=None,  # Assuming text-only reranking
    return_tensors="pt",  # Get PyTorch tensors first
    padding=True,
    truncation=True,
    max_length=512,  # Use a reasonable max_length
)

# Convert to NumPy for ONNX Runtime
inputs_np = {
    "input_ids": inputs["input_ids"].numpy(),
    "attention_mask": inputs["attention_mask"].numpy(),
}
print("Input data prepared.")
# print("Input shapes:", {k: v.shape for k, v in inputs_np.items()})

# 5. Run Inference
print("Running inference...")
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, inputs_np)
print("Inference complete.")

# 6. Process Output
# The exact interpretation depends on the model's output structure.
# For Jina Reranker, the output is typically a logit score.
# Higher values usually indicate higher relevance. Check the original model card.
print(f"Number of outputs: {len(outputs)}")
if len(outputs) > 0:
    logits = outputs[0]
    print(f"Output logits shape: {logits.shape}")
    # Often, the relevance score is associated with a single logit per
    # query-document pair; consult the original model card for the exact
    # output semantics.
    score = np.squeeze(logits)
    print(f"Relevance score (raw logit): {score}")
```
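To rerank several candidate documents, one straightforward (hypothetical) extension of the script above scores each query-document pair independently and sorts by score. It reuses `session`, `processor`, `query`, and `np` from the script, and assumes the model emits a single scalar logit per pair:

```python
documents = [
    "Deep learning is a subset of machine learning based on artificial neural networks with representation learning.",
    "Artificial intelligence refers to the simulation of human intelligence in machines.",
    "A transformer is a deep learning model used primarily in the field of natural language processing.",
]

scores = []
for doc in documents:
    enc = processor(
        text=f"{query} {doc}",
        images=None,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=512,
    )
    feed = {
        "input_ids": enc["input_ids"].numpy(),
        "attention_mask": enc["attention_mask"].numpy(),
    }
    logits = session.run(None, feed)[0]  # None requests all outputs; take the first
    scores.append(float(np.squeeze(logits)))  # assumes one scalar logit per pair

# Sort documents from most to least relevant
for score, doc in sorted(zip(scores, documents), reverse=True):
    print(f"{score:.4f}  {doc}")
```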