Jina Reranker M0 - ONNX FP16 Version

This repository contains the jinaai/jina-reranker-m0 model converted to the ONNX format with FP16 precision.

Model Description

Jina Reranker is designed to rerank search results or document passages based on their relevance to a given query. It takes a query and a list of documents as input and outputs relevance scores.
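Schematically, reranking takes one query plus N candidate documents and returns N scores to sort on. The snippet below is illustrative only (the rerank helper is hypothetical and the scores are made-up examples); the runnable version follows in the Usage section:

query = "What is deep learning?"
documents = [
    "Deep learning is a subset of machine learning ...",
    "Pasta is best cooked al dente.",
]
# scores = rerank(query, documents)   # hypothetical helper; e.g. [8.2, -4.1]
# ranked = sorted(zip(scores, documents), reverse=True)  # most relevant first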

This version is specifically exported for use with ONNX Runtime.

Original Model Card: jinaai/jina-reranker-m0

Technical Details

  • Format: ONNX
  • Opset: 14
  • Precision: FP16 (exported using .half()); the snippet after this list shows one way to verify both the opset and the precision.
  • External Data: Uses the ONNX external data format due to model size, so all files in this repository are required. Downloading the full repository snapshot (as in the usage example below) fetches them all.
  • Export Source: Exported from the Hugging Face transformers library using torch.onnx.export.
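
As a quick sanity check, you can confirm the opset and weight precision locally. This is a minimal sketch assuming the files have been downloaded and the graph is named jina-reranker-m0.onnx; load_external_data=False skips the large weight files so only the graph metadata is read:

import onnx
from collections import Counter

# Load only the graph structure; the external weight files are not needed
# for reading metadata.
model = onnx.load("jina-reranker-m0.onnx", load_external_data=False)
print("Opsets:", [(o.domain or "ai.onnx", o.version) for o in model.opset_import])

# FP16 initializers report data_type == onnx.TensorProto.FLOAT16.
counts = Counter(init.data_type for init in model.graph.initializer)
print({onnx.TensorProto.DataType.Name(k): v for k, v in counts.items()})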

Usage

You can use this model with onnxruntime for inference. You will also need the transformers library to load the appropriate processor for input preparation and huggingface_hub to download the model files.

1. Installation:

pip install onnxruntime huggingface_hub transformers torch sentencepiece
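
If you plan to run on GPU (see the execution-provider note in the script below), install the GPU build of ONNX Runtime instead; these are the standard PyPI package names:

pip install onnxruntime-gpu huggingface_hub transformers torch sentencepiece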

2. Inference Script:

import os
import onnxruntime as ort
from huggingface_hub import snapshot_download
from transformers import AutoProcessor
import numpy as np
import torch  # Required because the processor returns PyTorch tensors

# --- Configuration ---
# Replace with your repository ID if different
repo_id = "jian-mo/jina-reranker-m0-onnx"
onnx_filename = "jina-reranker-m0.onnx" # Main ONNX file name
# Use the original model ID to load the correct processor
original_model_id = "jinaai/jina-reranker-m0"
# --- End Configuration ---

# 1. Download the ONNX model files from the Hub
# snapshot_download fetches the whole repository, including the external
# data files the ONNX graph references. Downloading only the .onnx file
# (e.g., with hf_hub_download) would leave those files missing and the
# session would fail to load.
print(f"Downloading ONNX model from {repo_id}...")
local_dir = snapshot_download(repo_id=repo_id)
local_onnx_path = os.path.join(local_dir, onnx_filename)
print(f"ONNX model downloaded to: {local_onnx_path}")

# 2. Load ONNX Runtime session
print("Loading ONNX Inference Session...")
# You can choose execution providers, e.g., ['CUDAExecutionProvider', 'CPUExecutionProvider']
# if you have GPU support and the necessary onnxruntime build.
session_options = ort.SessionOptions()
# session_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_EXTENDED
providers = ['CPUExecutionProvider'] # Default to CPU
session = ort.InferenceSession(local_onnx_path, sess_options=session_options, providers=providers)
print(f"ONNX session loaded with provider: {session.get_providers()}")

# 3. Load the Processor
print(f"Loading processor from {original_model_id}...")
processor = AutoProcessor.from_pretrained(original_model_id, trust_remote_code=True)
print("Processor loaded.")

# 4. Prepare Input Data
query = "What is deep learning?"
document = "Deep learning is a subset of machine learning based on artificial neural networks with representation learning."
# Example with multiple documents (batch processing)
# documents = [
#     "Deep learning is a subset of machine learning based on artificial neural networks with representation learning.",
#     "Artificial intelligence refers to the simulation of human intelligence in machines.",
#     "A transformer is a deep learning model used primarily in the field of natural language processing."
# ]
# To score a query against several documents, see the rerank() sketch at
# the end of this script.

print("Preparing input data...")
# Process the query and document together as a single sequence. Note: plain
# concatenation is a simplification; check the original model card for the
# exact query/document prompt format the reranker expects.
inputs = processor(
    text=f"{query} {document}",
    images=None, # Assuming text-only reranking
    return_tensors="pt", # Get PyTorch tensors first
    padding=True,
    truncation=True,
    max_length=512 # Use a reasonable max_length
)

# Convert to NumPy for ONNX Runtime. If the session's input list (printed
# above) names additional inputs, supply those here as well.
inputs_np = {
    "input_ids": inputs["input_ids"].numpy(),
    "attention_mask": inputs["attention_mask"].numpy()
}
print("Input data prepared.")
# print("Input shapes:", {k: v.shape for k, v in inputs_np.items()})

# 5. Run Inference
print("Running inference...")
output_names = [output.name for output in session.get_outputs()]
outputs = session.run(output_names, inputs_np)
print("Inference complete.")

# 6. Process Output
# The exact interpretation depends on the model's output structure.
# For Jina Reranker, the output is typically a logit score: higher values
# usually indicate higher relevance. Check the original model card.
print(f"Number of outputs: {len(outputs)}")
if len(outputs) > 0:
    logits = outputs[0]
    print(f"Output logits shape: {logits.shape}")
    # Treating the last logit as the relevance score is an assumption here;
    # verify the output layout against the original model card.
    score = float(np.ravel(logits)[-1])
    print(f"Relevance score: {score:.4f}")
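
To rerank several documents, one straightforward approach is to score each query-document pair separately and sort. The sketch below reuses the session, processor, and output_names from the script above; the choice of logit (the last one) is the same assumption flagged in step 6 and should be verified against the original model card:

def rerank(query, documents):
    """Score each (query, document) pair and return (score, document) pairs,
    highest score first. Sketch only; verify the score extraction."""
    scored = []
    for doc in documents:
        enc = processor(
            text=f"{query} {doc}",
            images=None,
            return_tensors="pt",
            padding=True,
            truncation=True,
            max_length=512,
        )
        feed = {
            "input_ids": enc["input_ids"].numpy(),
            "attention_mask": enc["attention_mask"].numpy(),
        }
        logits = session.run(output_names, feed)[0]
        scored.append((float(np.ravel(logits)[-1]), doc))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Example:
# for score, doc in rerank(query, documents):
#     print(f"{score:.4f}  {doc}")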