CLaRa-7B-E2E (Compression-16 & 128)
The CLaRa-7B-E2E model is our fully end-to-end unified RAG model, jointly optimizing retrieval and generation with 16× and 128× document compression.
Training recipe: End-to-end fine-tuning with differentiable top-k retrieval and a unified language-modeling objective (a generic sketch of differentiable top-k follows below).
Benchmarks: Strong retrieval-augmented QA performance under aggressive compression.
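CLaRa's exact relaxation is specified in the paper; the snippet below is only a generic straight-through top-k sketch in PyTorch, illustrating how a hard top-k document selection can still pass gradients back to the retriever scores. The function name and the temperature `tau` are illustrative, not taken from the CLaRa codebase.

```python
import torch

def straight_through_top_k(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Generic straight-through top-k (illustrative, not CLaRa's exact code).

    scores: (num_docs,) retriever scores for one query.
    Returns a 0/1 selection mask that is hard in the forward pass but
    differentiable w.r.t. `scores` in the backward pass.
    """
    soft = torch.softmax(scores / tau, dim=-1)             # relaxed weights
    idx = torch.topk(scores, k).indices                    # hard top-k choice
    hard = torch.zeros_like(scores).scatter(-1, idx, 1.0)  # 0/1 selection mask
    # Forward value is `hard`; gradients flow through `soft`.
    return hard + soft - soft.detach()

# Example: 20 candidate documents, keep the top 4.
scores = torch.randn(20, requires_grad=True)
mask = straight_through_top_k(scores, k=4)
(mask * scores).sum().backward()  # gradients reach every score via the softmax path
```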
More details and usage examples:
Paper: CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
GitHub: https://github.com/apple/ml-clara
Example Usage (End-to-End Inference)
```python
from transformers import AutoModel

# Load the compression-16 checkpoint; the path below is a local checkpoint
# directory, so point it at your own download location.
unirag = AutoModel.from_pretrained(
    "/mnt/ceph_rbd/model/CLaRa-7B-E2E/compression-16",
    trust_remote_code=True,
).to("cuda")
# Example documents and question: a batch of one question with 20
# candidate documents (the same passage repeated for illustration).
documents = [[
    "Weldenia is a monotypic genus of flowering plant in the family Commelinaceae...",
] * 20]
questions = [
    "Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"
]
# End-to-end usage (retrieval + generation).
# The effective top-k is controlled by `generation_top_k` in config.json.
out = unirag.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64,
)
print("Generated answer:", out)
```
Base model: mistralai/Mistral-7B-Instruct-v0.2