Model Card: Personalized Recipe Ranking Models
Overview
This project implements a personalized recipe recommendation system using two model categories:
- Scratch-trained baseline: A simple ranker that combines rule-based signals with embedding matching, trained on a synthetic preference dataset with no user-specific rules.
- Rule-enhanced cold-start models: Five separate XGBRanker models trained with more complex rule-based preference signals and user-specific interaction patterns (user1–user5).
The goal is to evaluate how different user profiles affect ranking behavior and recommendation diversity, even when overall NDCG scores are lower than the baseline.
Model Category 1: Scratch-trained Baseline
Purpose
Provide a simple cold-start recommendation baseline that matches ingredients and ranks recipes without personalization. It uses parent–child ingredient overlap and a few numeric features (e.g., protein, cost, cooking time).
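One way to picture the parent–child ingredient overlap is as a per-parent coverage ratio. The sketch below is illustrative only: the `PARENT_TO_CHILDREN` map, the function name, and the definition of the ratio (share of a parent's children that appear in the recipe) are assumptions, since the card does not specify the hierarchy or the exact formula.

```python
from typing import Dict, List, Set

# Hypothetical parent -> child ingredient map; the real hierarchy (~1000 parents)
# is not given in this card.
PARENT_TO_CHILDREN: Dict[str, Set[str]] = {
    "dairy": {"milk", "butter", "cheddar"},
    "poultry": {"chicken breast", "turkey"},
    "allium": {"onion", "garlic"},
}


def parent_coverage_ratios(recipe_ingredients: List[str]) -> Dict[str, float]:
    """Assumed definition: fraction of each parent's children present in the recipe."""
    present = {name.strip().lower() for name in recipe_ingredients}
    return {
        parent: len(children & present) / len(children)
        for parent, children in PARENT_TO_CHILDREN.items()
    }


# One numeric feature per parent node for a single recipe.
ratios = parent_coverage_ratios(["Milk", "butter", "garlic", "flour"])
```

Each recipe then becomes a fixed-length numeric vector with one entry per parent node, which is the kind of tabular input a tree-based ranker expects.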
Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 10,000 synthetic preference samples generated via uniform random selection
Training Details
- Model type: XGBRanker (`objective='rank:pairwise'`)
- Features: ~1000 numeric ingredient-parent ratio features + basic nutrition/time features
- Train/test split: 80/20 (by recipe ID)
- Evaluation metric: NDCG@5, NDCG@10
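For reference, NDCG@k for a single query can be computed as in the minimal sketch below. It assumes the common exponential-gain form with a log2 position discount; the card does not state which NDCG variant was used, and for binary labels the exponential and linear gains coincide. The function and variable names are illustrative.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k items (exponential gain)."""
    return sum(
        (2 ** rel - 1) / math.log2(position + 2)
        for position, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances: Sequence[float], k: int) -> float:
    """NDCG@k for one query; `relevances` are true labels in predicted rank order."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Hypothetical query: the single selected recipe was ranked third.
labels_in_ranked_order = [0, 0, 1, 0, 0, 0, 0, 0]
score_at_5 = ndcg_at_k(labels_in_ranked_order, 5)
score_at_10 = ndcg_at_k(labels_in_ranked_order, 10)
```

When a query has one relevant item and it lands inside the top 5, the two cutoffs give the same value, because nothing below rank 5 contributes to either sum.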
Evaluation
The baseline achieves very high NDCG scores (above 0.95) because training and evaluation rely on synthetic signals that align closely with the ranking structure.
Intended Use
Serve as a sanity check and upper bound for ranking performance, not for deployment.
Limitations
- Unrealistically clean preference structure
- No user differentiation
- Inflated metrics due to synthetic evaluation
Model Category 2: Rule-enhanced Cold Start Models (User1–User5)
Purpose
Capture user-specific dietary preferences and ranking heuristics using richer rule sets, leading to more diverse recommendation patterns across different users.
Data Sources
- Cleaned Food.com dataset (~180k recipes)
- 5,000 cold-start synthetic interactions per user profile
- Additional unselected (negative) samples included to simulate realistic cold-start scenarios
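A cold-start query group that mixes selected recipes with unselected negatives could be assembled roughly as follows. This is a sketch under stated assumptions: the function name, the uniform sampling of negatives, and the group size are hypothetical, as the card does not describe how negatives were drawn.

```python
import random
from typing import Iterable, List, Tuple


def build_query_group(
    selected_ids: Iterable[int],
    candidate_ids: Iterable[int],
    n_negatives: int,
    seed: int,
) -> List[Tuple[int, int]]:
    """Return (recipe_id, label) rows: label 1 for selected, 0 for sampled negatives."""
    rng = random.Random(seed)
    selected = set(selected_ids)
    # Negatives come only from recipes the simulated user did not select.
    pool = sorted(set(candidate_ids) - selected)
    negatives = rng.sample(pool, min(n_negatives, len(pool)))
    rows = [(rid, 1) for rid in sorted(selected)] + [(rid, 0) for rid in negatives]
    rng.shuffle(rows)  # avoid leaking the label through row order
    return rows


# Hypothetical interaction: 2 selected recipes shown alongside 18 unselected ones.
group = build_query_group([3, 7], range(100), n_negatives=18, seed=0)
```

With this shape, positives make up a small share of each group, which is the sparse-signal setting the rule-enhanced models are evaluated in.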
Model
- Model type: XGBRanker (scratch-trained)
- Training objective: `rank:pairwise`
- Feature space:
- Ingredient-parent coverage ratios (~1000 parent nodes)
- Nutrition features: protein, calories, cost, cooking time
- User preference weights: protein/time/cost
- Dietary tag filters and exclusion rules
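The interaction between preference weights and exclusion rules can be illustrated with a small scoring function. Everything named here (`UserProfile`, `rule_score`, the normalized feature keys, the tag strings, and the linear form of the score) is a hypothetical stand-in; the card does not publish the actual rule set.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set


@dataclass
class UserProfile:
    """Hypothetical profile: weights on protein/time/cost plus excluded tags."""
    w_protein: float
    w_time: float
    w_cost: float
    excluded_tags: Set[str] = field(default_factory=set)


def rule_score(recipe: Dict, profile: UserProfile) -> Optional[float]:
    """Return None if an exclusion rule removes the recipe, else a weighted score.

    Assumes features are pre-normalized to [0, 1]; protein is rewarded,
    cooking time and cost are penalized.
    """
    if profile.excluded_tags & set(recipe["tags"]):
        return None
    return (
        profile.w_protein * recipe["protein_norm"]
        - profile.w_time * recipe["time_norm"]
        - profile.w_cost * recipe["cost_norm"]
    )


profile = UserProfile(w_protein=0.6, w_time=0.3, w_cost=0.1, excluded_tags={"contains-meat"})
tofu = {"tags": {"vegan"}, "protein_norm": 0.5, "time_norm": 0.2, "cost_norm": 0.4}
steak = {"tags": {"contains-meat"}, "protein_norm": 0.9, "time_norm": 0.3, "cost_norm": 0.8}

tofu_score = rule_score(tofu, profile)    # weighted score for an allowed recipe
steak_score = rule_score(steak, profile)  # None: filtered out by the exclusion rule
```

A profile with many excluded tags filters out most candidates before any weighting applies, which is consistent with the card's remark that restrictive profiles leave few matching recipes.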
Training Setup
- Train/valid/test split: 70/15/15 by recipe ID per profile
- No fine-tuning between profiles; each profile trained independently
- Evaluation metric: NDCG@5 and NDCG@10
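Splitting by recipe ID means every row for a given recipe falls into the same partition. One way to get that property is a deterministic hash of the ID, sketched below. The hashing approach and the `assign_split` name are assumptions chosen for illustration, not a description of how the split was actually produced.

```python
import hashlib
from collections import Counter


def assign_split(recipe_id: int, profile: str, train: float = 0.70, valid: float = 0.15) -> str:
    """Map a (profile, recipe ID) pair to 'train', 'valid', or 'test' deterministically."""
    digest = hashlib.sha256(f"{profile}:{recipe_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"


# Approximate 70/15/15 proportions over hypothetical recipe IDs for one profile.
counts = Counter(assign_split(recipe_id, "user1") for recipe_id in range(10_000))
```

Including the profile name in the hashed key gives each profile its own independent partition, matching the per-profile training described above.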
Evaluation Results
| User Profile | NDCG@5 | NDCG@10 |
|---|---|---|
| user1 | 0.4400 | 0.4400 |
| user2 | 0.4342 | 0.4342 |
| user3 | 0.4179 | 0.4179 |
| user4 | 0.1651 | 0.1651 |
| user5 | 0.4607 | 0.4607 |
Note: User4 has very restrictive dietary preferences, resulting in very few matching recipes and inherently lower achievable NDCG.
Although these NDCG values are lower than the baseline, this is expected for several reasons:
- The cold-start datasets contain a large proportion of unselected recipes, leading to sparse positive signals.
- More complex preference rules increase variability and reduce alignment with NDCG’s single-label relevance assumptions.
- The models now produce more differentiated ranking behaviors across user profiles, which aligns with the intended personalization goals.
Model Selection Justification
- XGBRanker was chosen for all models due to its effectiveness on structured tabular data, fast training time, and compatibility with large feature spaces (1000+ ingredients).
- The baseline model acts as a clean control, providing an upper bound on achievable NDCG under idealized preferences.
- The rule-enhanced models trade some raw NDCG performance for greater personalization fidelity, which is critical in multi-user recommendation contexts.
Evaluation Methodology
- Metric: NDCG@5 and NDCG@10 on held-out cold-start samples
- Each user model evaluated independently
- Negative samples retained to approximate real-world recommendation class imbalance
Intended Uses and Limitations
Intended Uses
- Multi-profile recipe recommendation
- Studying personalization behaviors under sparse feedback
- Cold-start scenarios for new users
Limitations
- Synthetic user interactions do not perfectly reflect real-world feedback
- NDCG is not well aligned with multi-rule personalization behavior
- User4 performance is limited by scarcity of relevant recipes
Risks and Bias
The models are trained on the Food.com dataset, which has known biases:
- Regional bias: Western and American cuisines dominate the dataset, leading to potential under-representation of other regions.
- Popularity bias: Highly rated or frequently interacted recipes are over-represented.
- Cold-start leakage risk: Although user interactions are synthetic, overlapping ingredient-parent structures between train/test may create mild information leakage, potentially inflating baseline metrics.
These biases may affect recommendation diversity and fairness across different cuisines or dietary groups.
Cost and Latency
All models are based on XGBRanker, which runs efficiently on CPU:
- Inference latency: Approximately 1–5 ms per recipe for ranking (measured on a laptop CPU, single thread).
- Training cost: Training each user profile model on 5,000 interactions takes less than 2 minutes on CPU.
The approach is designed for real-time personalization in lightweight interfaces (e.g., Hugging Face Spaces).
Usage Disclosure
Intended Uses
- Academic and educational research on personalized recommendation
- Cold-start personalization experiments
- Recipe recommendation for diverse dietary profiles
Not Intended For
- Medical or dietary decision-making
- Real-world deployment without additional bias mitigation
- High-stakes personalization where fairness across demographic groups is critical
Citation
Tang, Xinxuan. Personalized Recipe Ranking Models. 2025.
Evaluation results
- NDCG@5 on Food.com (Cleaned), self-reported: 0.440
- NDCG@10 on Food.com (Cleaned), self-reported: 0.440