Built for Tasty Bytes — a subscription recipe platform — to automatically identify which recipes will drive high homepage traffic, replacing manual curation with data-driven decisions.
Tasty Bytes' Product Manager selects homepage recipes manually. A popular recipe drives up to 40% more sitewide traffic — but there's no systematic way to predict which recipes will be popular.
Automatically predict which recipes will generate high homepage traffic, correctly identifying popular recipes at least 80% of the time — and minimising the chance of featuring unpopular recipes.
A popular homepage recipe drives up to 40% more traffic to the rest of the website. More traffic means more subscriptions — making accurate recipe selection directly linked to revenue growth.
The Product Manager manually selects their favourite recipe from a pool and features it on the homepage each day. This takes ~2 hours daily and produces inconsistent results with no data backing.
Train a supervised binary classification model on 895 historical recipes — using nutritional features, recipe category, and serving size — to predict High vs Low traffic with ≥80% recall.
"We have noticed that traffic to the rest of the website goes up by as much as 40% if I pick a popular recipe. But I don't know how to decide if a recipe will be popular. More traffic means more subscriptions so this is really important to the company."
895 recipes with nutritional data, category labels, and historical traffic outcomes. Every column was validated, cleaned, and documented before modelling.
| Column | Description | Type |
|---|---|---|
| recipe | Unique identifier (ID only — dropped before modelling) | int |
| calories | Calories per serving — 52 missing, imputed with median (306.2 kcal) | float |
| carbohydrate | Carbohydrates in grams — 52 missing, imputed with median (37.1g) | float |
| sugar | Sugar in grams — 52 missing, imputed with median (8.8g) | float |
| protein | Protein in grams — 52 missing, imputed with median (24.5g) | float |
| category | Recipe type — 11 categories, no missing values | str |
| servings | Number of servings — mixed types, converted to numeric | str→int |
| high_traffic | Target — "High" if popular; 373 missing → treated as "Low" | binary |
| Column | Issue | Resolution |
|---|---|---|
| recipe | ✅ No issues | Used as ID tracker, dropped before modelling |
| calories | ⚠️ 52 missing | Rows with ALL 4 nutrition fields missing removed (52); remaining imputed with median |
| carbohydrate | ⚠️ 52 missing | Same rows as calories — removed with all-missing rows |
| sugar | ⚠️ 52 missing | Same as above — median imputation after row removal |
| protein | ⚠️ 52 missing | Same as above — median imputation after row removal |
| category | ✅ No missing | 11 unique values confirmed; one-hot encoded for modelling |
| servings | ⚠️ Mixed types | "4 as a snack" → coerced to numeric; 1 NaN imputed with median |
| high_traffic | ⚠️ 373 missing | Conservative: filled as "Low" (unexplored = likely not high) |
Three key patterns emerged from exploratory analysis — category is by far the strongest predictor, followed by nutritional profile.
This is a binary classification problem — predict High vs Low traffic. We trained a baseline Logistic Regression and a comparison Random Forest, then selected the best performer.
At tuned threshold = 0.46 · Test set (179 recipes)
Default threshold (0.5) achieved 78.5% recall. By tuning the decision threshold to 0.46, we achieve exactly 80% recall while keeping precision at 82% — meeting the business target without sacrificing much precision.
Logistic Regression wins on every metric while being simpler, faster, and more interpretable for stakeholders.
The tuned Logistic Regression meets the ≥80% recall target and delivers a measurable improvement over both random and manual selection.
210 homepage slots (7 recipes/day × 30 days)
Based on 10,000 monthly visitors at 5% conversion · $10/month subscription
The trained model was applied to all 895 recipes. The 30 highest-probability predictions are shown below — all Vegetable category with 97.6–98.0% confidence.
| # | Recipe ID | Category | Calories | Protein (g) | Confidence | Recommendation |
|---|
End-to-end Python data science pipeline — from raw CSV ingestion to trained classifier, threshold tuning, and top-recipe prediction.
pd.read_csv
895 rows × 8 cols
Inspect types,
missing values
Drop 52 rows,
impute, encode
Univariate +
multivariate plots
80/20 train-test
stratified split
LR + RF
on scaled features
Recall, precision,
AUC, confusion matrix
Threshold 0.46
→ top 30 predictions
Defined metrics, alert thresholds, and a phased deployment roadmap to ensure the model delivers sustained value beyond initial deployment.
Approve model → deploy to staging → set threshold to 0.46 → brief editorial team
10% traffic on model recommendations vs 90% control (current manual). Establish baseline metrics.
30-day A/B test running. Collect recall, precision, CTR, subscription conversion. Decision point: roll out or iterate.
Deploy to 100% homepage → automate curation → free 1.5 hrs/day editorial time → expect +35–40% traffic lift
Retrain on new recipe data · monitor for seasonal drift · explore new features (prep time, ratings, difficulty)