Model Card: Personalized Recipe Ranking Models

Overview

This project implements a personalized recipe recommendation system using two model categories:

Scratch-trained baseline: A simple ranker that combines rule-based signals with embedding matching, trained on a synthetic preference dataset with no user-specific rules.
Rule-enhanced cold-start models: Five separate XGBRanker models trained with more complex rule-based preference signals and user-specific interaction patterns (user1–user5).

The goal is to evaluate how different user profiles affect ranking behavior and recommendation diversity, even when overall NDCG scores are lower than the baseline.

Model Category 1: Scratch-trained Baseline

Purpose

Provide a simple cold-start recommendation baseline that matches ingredients and ranks recipes without personalization. It uses parent–child ingredient overlap and a few numeric features (e.g., protein, cost, cooking time).

Data Sources

Cleaned Food.com dataset (~180k recipes)
10,000 synthetic preference samples generated via uniform random selection

Training Details

Model type: XGBRanker (objective='rank:pairwise')
Features: ~1000 numeric ingredient-parent ratio features + basic nutrition/time features
Train/test split: 80/20 (by recipe ID)
Evaluation metrics: NDCG@5 and NDCG@10
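
As a rough illustration of the ingredient-parent ratio features listed above, the sketch below assumes each ratio is the share of a recipe's ingredients that fall under a given parent node. That reading, the `PARENT_OF` mapping, and the function name are assumptions made for this example, not the project's actual code; the real hierarchy has about 1,000 parent nodes.

```python
from collections import Counter

# Hypothetical child -> parent mapping (illustrative only).
PARENT_OF = {
    "cheddar": "cheese",
    "mozzarella": "cheese",
    "chicken breast": "poultry",
    "spaghetti": "pasta",
}
PARENTS = sorted(set(PARENT_OF.values()))  # fixed column order for the feature vector

def parent_ratio_features(ingredients):
    """Return one ratio per parent node: the share of the recipe's ingredients under it."""
    if not ingredients:
        return [0.0] * len(PARENTS)
    # Ingredients without a known parent count toward the denominator only.
    counts = Counter(PARENT_OF[i] for i in ingredients if i in PARENT_OF)
    return [counts[p] / len(ingredients) for p in PARENTS]

# Columns are ordered as PARENTS: cheese, pasta, poultry.
features = parent_ratio_features(["cheddar", "mozzarella", "spaghetti", "salt"])
```

With this definition every feature lies in [0, 1], so the ratios can sit alongside scaled nutrition and time features without further normalization.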

Evaluation

The baseline achieves very high NDCG scores (above 0.95) because training and evaluation both rely on synthetic signals that align perfectly with the ranking structure.

Intended Use

Serve as a sanity check and upper bound for ranking performance, not for deployment.

Limitations

Unrealistically clean preference structure
No user differentiation
Inflated metrics due to synthetic evaluation

Model Category 2: Rule-enhanced Cold Start Models (User1–User5)

Purpose

Capture user-specific dietary preferences and ranking heuristics using richer rule sets, leading to more diverse recommendation patterns across different users.

Data Sources

Cleaned Food.com dataset (~180k recipes)
5,000 cold-start synthetic interactions per user profile
Additional unselected (negative) samples included to simulate realistic cold-start scenarios

Model

Model type: XGBRanker (scratch-trained)
Training objective: rank:pairwise
Feature space:
- Ingredient-parent coverage ratios (~1000 parent nodes)
- Nutrition features: protein, calories, cost, cooking time
- User preference weights: protein/time/cost
- Dietary tag filters and exclusion rules
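
One way the preference weights and exclusion rules above could combine into a rule-based relevance score is sketched below. The `Profile` class, the weight values, the tag names, and the assumption that inputs are pre-scaled to [0, 1] are all hypothetical; the card names only the three weighted dimensions and the existence of exclusion rules.

```python
from dataclasses import dataclass, field

@dataclass
class Profile:
    # Hypothetical weights; the card only names the protein/time/cost dimensions.
    w_protein: float
    w_time: float
    w_cost: float
    excluded_tags: set = field(default_factory=set)

def rule_score(recipe, profile):
    """Score a recipe for a profile; recipes hitting an exclusion rule get 0.0."""
    if profile.excluded_tags & set(recipe["tags"]):
        return 0.0
    # Inputs are assumed pre-scaled to [0, 1]; longer time and higher cost lower the score.
    return (profile.w_protein * recipe["protein"]
            + profile.w_time * (1.0 - recipe["time"])
            + profile.w_cost * (1.0 - recipe["cost"]))

profile = Profile(w_protein=0.5, w_time=0.3, w_cost=0.2, excluded_tags={"pork"})
recipes = [
    {"id": 1, "tags": ["chicken"], "protein": 0.8, "time": 0.5, "cost": 0.5},
    {"id": 2, "tags": ["pork"], "protein": 0.9, "time": 0.1, "cost": 0.1},
    {"id": 3, "tags": ["vegan"], "protein": 0.2, "time": 0.2, "cost": 0.3},
]
ranked = sorted(recipes, key=lambda r: rule_score(r, profile), reverse=True)
ranked_ids = [r["id"] for r in ranked]
```

In this toy setup the excluded recipe drops to the bottom regardless of its nutrition values, which is the kind of behavior a restrictive profile would show across most of the catalog.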

Training Setup

Train/valid/test split: 70/15/15 by recipe ID per profile
No fine-tuning between profiles; each profile trained independently
Evaluation metrics: NDCG@5 and NDCG@10
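
Splitting by recipe ID means every row for a given recipe lands in the same partition. A minimal sketch of one way to do this is shown below, using a hash of the ID so the assignment is deterministic. The hashing approach, the function name, and the toy rows are assumptions for illustration; the card does not describe how the split was implemented.

```python
import hashlib

def split_for(recipe_id, train=0.70, valid=0.15):
    """Assign a recipe ID to 'train', 'valid', or 'test' deterministically."""
    digest = hashlib.md5(str(recipe_id).encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 10_000 / 10_000  # pseudo-uniform value in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + valid:
        return "valid"
    return "test"

# Hypothetical interaction rows (recipe_id, label); each recipe appears twice.
rows = [(rid, rid % 2) for rid in range(1000)] * 2
splits = {"train": [], "valid": [], "test": []}
for recipe_id, label in rows:
    splits[split_for(recipe_id)].append((recipe_id, label))
```

Because the partition depends only on the ID, repeated rows for a recipe can never straddle two splits, and the proportions approach 70/15/15 as the number of recipes grows.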

Evaluation Results

User Profile	NDCG@5	NDCG@10
user1	0.4400	0.4400
user2	0.4342	0.4342
user3	0.4179	0.4179
user4	0.1651	0.1651
user5	0.4607	0.4607

Note: user4 has highly restrictive dietary preferences, so few recipes match and the achievable NDCG is inherently lower.


These NDCG values are lower than the baseline's, which is expected for several reasons:

The cold-start datasets contain a large proportion of unselected recipes, leading to sparse positive signals.
More complex preference rules increase variability and reduce alignment with NDCG’s single-label relevance assumptions.
The models produce more differentiated rankings across user profiles, which matches the intended personalization goals but is not something NDCG rewards.

Model Selection Justification

XGBRanker was chosen for all models because it is effective on structured tabular data, trains quickly, and handles large feature spaces (roughly 1,000 ingredient-parent features).
The baseline model acts as a clean control, providing an upper bound on achievable NDCG under idealized preferences.
The rule-enhanced models trade some raw NDCG performance for greater personalization fidelity, which is critical in multi-user recommendation contexts.

Evaluation Methodology

Metric: NDCG@5 and NDCG@10 on held-out cold-start samples
Each user model evaluated independently
Negative samples retained to approximate real-world recommendation class imbalance
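
For reference, NDCG@k can be computed as in the sketch below. It uses the linear-gain form of DCG, which coincides with the exponential-gain form when labels are binary; which variant this project's evaluation used is not stated, so treat the choice as an assumption.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items, with linear gain."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """relevances: true labels listed in the order the model ranked the items."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Two positives ranked first: the ordering is already ideal.
perfect = ndcg_at_k([1, 1, 0, 0, 0, 0], k=5)
# A single positive ranked third among negatives.
sparse = ndcg_at_k([0, 0, 1, 0, 0, 0], k=5)
```

Here `perfect` is 1.0, while `sparse` is 0.5: with only one positive in the list, placing it third instead of first already halves the score. This is one reason sparse positives among many retained negatives tend to pull NDCG down.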

Intended Uses and Limitations

Intended Uses

Multi-profile recipe recommendation
Studying personalization behaviors under sparse feedback
Cold-start scenarios for new users

Limitations

Synthetic user interactions do not perfectly reflect real-world feedback
NDCG is not well aligned with multi-rule personalization behavior
User4 performance is limited by scarcity of relevant recipes

Risks and Bias

The models are trained on the Food.com dataset, which carries known biases and evaluation risks:

Regional bias: Western and American cuisines dominate the dataset, leading to potential under-representation of other regions.
Popularity bias: Highly rated or frequently interacted recipes are over-represented.
Cold-start leakage risk: Although user interactions are synthetic, overlapping ingredient-parent structures between train/test may create mild information leakage, potentially inflating baseline metrics.

These biases may affect recommendation diversity and fairness across different cuisines or dietary groups.

Cost and Latency

All models are based on XGBRanker, which runs efficiently on CPU:

Inference latency: Approximately 1–5 ms per recipe for ranking (measured on a laptop CPU, single thread).
Training cost: Training each user profile model on 5,000 interactions takes less than 2 minutes on CPU.
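
A per-recipe latency figure like the one above can be estimated with a simple timing loop. The sketch below is a generic harness, not the measurement script used for this card; `score_fn` is a stand-in for a call into the trained ranker, and the dummy rows only mimic a 1,000-feature input.

```python
import time

def mean_latency_ms(score_fn, feature_rows, repeats=5):
    """Average wall-clock milliseconds per row over several passes."""
    start = time.perf_counter()
    for _ in range(repeats):
        for row in feature_rows:
            score_fn(row)
    elapsed = time.perf_counter() - start
    return elapsed * 1000.0 / (repeats * len(feature_rows))

# Stand-in scorer (built-in sum); replace with the model's prediction call.
dummy_rows = [[0.1] * 1000 for _ in range(200)]
latency = mean_latency_ms(sum, dummy_rows)
```

Scoring rows one at a time, as here, gives a conservative number; batching many recipes into a single prediction call usually lowers the per-recipe cost.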

The approach is designed for real-time personalization in lightweight interfaces (e.g., Hugging Face Spaces).

Usage Disclosure

Intended Uses

Academic and educational research on personalized recommendation
Cold-start personalization experiments
Recipe recommendation for diverse dietary profiles

Not Intended For

Medical or dietary decision-making
Real-world deployment without additional bias mitigation
High-stakes personalization where fairness across demographic groups is critical

Citation

Tang, Xinxuan. Personalized Recipe Ranking Models. 2025.

