Ground-Truth-Aware Metrics: Disambiguating Evaluation in Vector Retrieval and Embedding Systems
In modern AI systems relying on vector databases and embedding models, evaluation metrics like "Recall@k" are ubiquitous — yet deeply ambiguous. The term often conflates two incompatible concepts: agreement with a baseline model's rankings (diagnostic) versus performance against real human or business ground truth (customer-relevant). This ambiguity leads to misleading claims in benchmarks and production deployments.
This article introduces a ground-truth-aware terminology standard to resolve the issue, ensuring reproducible and meaningful evaluation.
The Problem: Ambiguous Metrics in Practice
Metrics such as "Recall@k" or "Hits@k" are routinely reported without specifying the reference:
- Baseline overlap: Measures consistency with an uncompressed or baseline embedding space (common in compression papers).
- Truth-based performance: Measures utility against explicit ground truth (GT), e.g., human labels or weighted business scores.
Optimizing for baseline overlap at best reproduces the baseline, biases included; only truth-based metrics can demonstrate genuine improvement. Examples from legal search and e-commerce in the whitepaper illustrate how a baseline's notion of similarity diverges from business relevance.
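To see the ambiguity concretely, here is a minimal Python sketch (toy data and function names are illustrative, not part of the standard) that computes both readings of "Recall@5" for the same compressed index. The two numbers differ, and they can move in opposite directions as the system changes:

```python
def recall_at_k(retrieved, relevant, k):
    """Truth-based Recall@k: fraction of ground-truth relevant items in the top k."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def baseline_overlap_at_k(retrieved, baseline_retrieved, k):
    """Diagnostic BO@k: fraction of the baseline's top k reproduced by the system."""
    return len(set(retrieved[:k]) & set(baseline_retrieved[:k])) / k

# Toy example: doc IDs ranked by a compressed index vs. the float32 baseline.
compressed_top = ["d3", "d7", "d1", "d9", "d4"]
baseline_top   = ["d3", "d7", "d1", "d2", "d5"]
gt_relevant    = ["d9", "d4", "d6"]  # human-judged relevant docs (GT-H)

k = 5
print(f"BO@{k}[baseline=original-float32] = {baseline_overlap_at_k(compressed_top, baseline_top, k):.2f}")
print(f"Recall@{k}[GT-H]                  = {recall_at_k(compressed_top, gt_relevant, k):.2f}")
# Both are routinely reported as just "Recall@5" -- here they are 0.60 and 0.67.
```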
Proposed Standard
Every metric must encode its ground-truth reference; a minimal parser sketch follows the two lists below:
- Truth-based (customer-relevant): nDCG@k[GT-H], SetOverlap@k[GT-W] (graded/weighted truth).
- Diagnostic only: BO@k[baseline=original-float32] (Baseline Overlap, alias BRecall@k).
Ground truth types:
- GT-H: Human labels
- GT-W: Weighted scores
- GT-P: Pairwise preferences
- GT-L: Log-based (with mandatory bias documentation: position/selection bias, feedback loops).
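Here is a minimal sketch of what machine-readable parsing of these identifiers could look like, assuming only the Name@k[reference] pattern shown in the examples above; the authoritative grammar is defined in the whitepaper:

```python
import re

# Assumed pattern based on the examples above, e.g. "nDCG@10[GT-H]" or
# "BO@100[baseline=original-float32]"; the standard's full grammar may differ.
METRIC_RE = re.compile(
    r"^(?P<name>[A-Za-z]+)@(?P<k>\d+)\[(?P<ref>GT-[HWPL]|baseline=[\w.-]+)\]$"
)

def parse_metric(metric_id: str) -> dict:
    """Parse a tagged metric id into its components, rejecting untagged names."""
    m = METRIC_RE.match(metric_id)
    if not m:
        raise ValueError(f"Ambiguous or malformed metric id: {metric_id!r}")
    ref = m.group("ref")
    return {
        "name": m.group("name"),
        "k": int(m.group("k")),
        "reference": ref,
        "diagnostic_only": ref.startswith("baseline="),
    }

print(parse_metric("nDCG@10[GT-H]"))
# {'name': 'nDCG', 'k': 10, 'reference': 'GT-H', 'diagnostic_only': False}
print(parse_metric("BO@100[baseline=original-float32]"))
# {'name': 'BO', 'k': 100, 'reference': 'baseline=original-float32', 'diagnostic_only': True}
```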
Key Benefits
- Eliminates "blind vs. one-eyed" effects (visualized in the Venn diagram below).
- Machine-readable schema for linting and dashboards.
- Legacy mapping for smooth transition.
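As one illustration of linting and legacy mapping (the helper and its suggestions are hypothetical, not taken from the standard), a CI check might flag untagged metric names in a results table:

```python
# Hypothetical lint pass over reported metric names: flag legacy, untagged
# forms and suggest ground-truth-aware replacements. These suggestions are
# illustrative; the standard's actual legacy mapping is in the whitepaper.
LEGACY_HINTS = {
    "Recall": "Recall@k[GT-...] or BO@k[baseline=...] (state which!)",
    "Hits": "Hits@k[GT-...]",
    "nDCG": "nDCG@k[GT-H] / nDCG@k[GT-W]",
}

def lint_metric_names(metric_ids):
    """Return a warning for every metric id that lacks a ground-truth tag."""
    warnings = []
    for mid in metric_ids:
        if "[" not in mid:  # no [GT-...] or [baseline=...] tag present
            base = mid.split("@")[0]
            hint = LEGACY_HINTS.get(base, f"{base}@k[GT-...]")
            warnings.append(f"{mid}: untagged metric, consider {hint}")
    return warnings

for w in lint_metric_names(["Recall@10", "nDCG@10[GT-H]", "Hits@5"]):
    print("WARN", w)
# WARN Recall@10: untagged metric, consider Recall@k[GT-...] or BO@k[baseline=...] (state which!)
# WARN Hits@5: untagged metric, consider Hits@k[GT-...]
```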
Resources
- Full whitepaper (12 pages): https://zenodo.org/records/18152431
- Concise guide (4 pages): https://zenodo.org/records/18152757
This standard is ready for adoption in MTEB, BEIR, and vector DB tools. Feedback and contributions welcome!
Let's make embedding evaluation more rigorous and truthful. Share your thoughts — have you encountered ambiguous metrics in your work?
#MachineLearning #VectorRetrieval #EvaluationMetrics #Embeddings #Reproducibility
