CLaRa-7B-E2E (Compression-16 & 128)
The CLaRa-7B-E2E model is our fully end-to-end unified RAG model, jointly optimizing retrieval and generation with 16× and 128× document compression.
Training recipe: End-to-end fine-tuning with differentiable top-k retrieval and a unified language-modeling objective (a generic sketch of differentiable top-k follows below).
Benchmarks: Strong retrieval-augmented QA performance under aggressive compression.
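CLaRa's exact relaxation is specified in the paper; the snippet below is only a generic straight-through top-k sketch in PyTorch, illustrating how a hard top-k document selection can still pass gradients back to the retriever scores. The function name and the temperature `tau` are illustrative, not taken from the CLaRa codebase.

```python
import torch

def straight_through_top_k(scores: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Generic straight-through top-k (illustrative, not CLaRa's exact code).

    scores: (num_docs,) retriever scores for one query.
    Returns a 0/1 selection mask that is hard in the forward pass but
    differentiable w.r.t. `scores` in the backward pass.
    """
    soft = torch.softmax(scores / tau, dim=-1)             # relaxed weights
    idx = torch.topk(scores, k).indices                    # hard top-k choice
    hard = torch.zeros_like(scores).scatter(-1, idx, 1.0)  # 0/1 selection mask
    # Forward value is `hard`; gradients flow through `soft`.
    return hard + soft - soft.detach()

# Example: 20 candidate documents, keep the top 4.
scores = torch.randn(20, requires_grad=True)
mask = straight_through_top_k(scores, k=4)
(mask * scores).sum().backward()  # gradients reach every score via the softmax path
```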
More details and usage examples:
Paper: CLaRa: Bridging Retrieval and Generation with Continuous Latent Reasoning
GitHub: https://github.com/apple/ml-clara
Example Usage (End-to-End Inference)
```python
from transformers import AutoModel

# Load the compression-16 checkpoint; the path below is a local checkpoint
# directory, so point it at your own download location.
unirag = AutoModel.from_pretrained(
    "/mnt/ceph_rbd/model/CLaRa-7B-E2E/compression-16",
    trust_remote_code=True,
).to("cuda")
# Example documents and question: a batch of one question with 20
# candidate documents (the same passage repeated for illustration).
documents = [[
    "Weldenia is a monotypic genus of flowering plant in the family Commelinaceae...",
] * 20]
questions = [
    "Which genus of plant grows originally in Mexico and Guatemala, Phylica or Weldenia?"
]
# End-to-end usage (retrieval + generation).
# The effective top-k is controlled by `generation_top_k` in config.json.
out = unirag.generate_from_questions(
    questions=questions,
    documents=documents,
    max_new_tokens=64,
)
print("Generated answer:", out)
```
Base model: mistralai/Mistral-7B-Instruct-v0.2