🍴 Machine Learning · Binary Classification

A predictive model
for a food chain
to predict traffic
on their recipe menu

Built for Tasty Bytes — a subscription recipe platform — to automatically identify which recipes will drive high homepage traffic, replacing manual curation with data-driven decisions.

Python scikit-learn Logistic Regression Random Forest pandas seaborn matplotlib
80% Recall
High-traffic recipes correctly identified
82% Precision
Recommendations that are actually popular
+32.8% vs Random
Improvement over random recipe selection
01 · Project Brief

The Business Problem

Tasty Bytes' Product Manager selects homepage recipes manually. A popular recipe drives up to 40% more sitewide traffic — but there's no systematic way to predict which recipes will be popular.

🎯

Business Goal

Automatically predict which recipes will generate high homepage traffic, correctly identifying popular recipes at least 80% of the time — and minimising the chance of featuring unpopular recipes.

📈

Expected Impact

A popular homepage recipe drives up to 40% more traffic to the rest of the website. More traffic means more subscriptions — making accurate recipe selection directly linked to revenue growth.

⏱️

Current Process

The Product Manager manually selects their favourite recipe from a pool and features it on the homepage each day. This takes ~2 hours daily and produces inconsistent results with no data backing.

🤖

Our Solution

Train a supervised binary classification model on 895 historical recipes — using nutritional features, recipe category, and serving size — to predict High vs Low traffic with ≥80% recall.

02 · Data Validation

Understanding the Dataset

895 recipes with nutritional data, category labels, and historical traffic outcomes. Every column was validated, cleaned, and documented before modelling.

895
Raw Recipes
843
Clean Recipes
52
Rows Removed
11
Categories
Data Dictionary 8 columns
ColumnDescriptionType
recipeUnique identifier (ID only — dropped before modelling)int
caloriesCalories per serving — 52 missing, imputed with median (306.2 kcal)float
carbohydrateCarbohydrates in grams — 52 missing, imputed with median (37.1g)float
sugarSugar in grams — 52 missing, imputed with median (8.8g)float
proteinProtein in grams — 52 missing, imputed with median (24.5g)float
categoryRecipe type — 11 categories, no missing valuesstr
servingsNumber of servings — mixed types, converted to numericstr→int
high_trafficTarget — "High" if popular; 373 missing → treated as "Low"binary
Cleaning Steps column by column
ColumnIssueResolution
recipe✅ No issuesUsed as ID tracker, dropped before modelling
calories⚠️ 52 missingRows with ALL 4 nutrition fields missing removed (52); remaining imputed with median
carbohydrate⚠️ 52 missingSame rows as calories — removed with all-missing rows
sugar⚠️ 52 missingSame as above — median imputation after row removal
protein⚠️ 52 missingSame as above — median imputation after row removal
category✅ No missing11 unique values confirmed; one-hot encoded for modelling
servings⚠️ Mixed types"4 as a snack" → coerced to numeric; 1 NaN imputed with median
high_traffic⚠️ 373 missingConservative: filled as "Low" (unexplored = likely not high)
03 · Exploratory Analysis

What Makes a Recipe Popular?

Three key patterns emerged from exploratory analysis — category is by far the strongest predictor, followed by nutritional profile.

Single Variable · Bar Chart
Recipes per Category
Distribution of 895 recipes across 11 categories
Finding: Breakfast dominates (106 recipes, 11.2%) while One Dish Meal has fewest (71). Good diversity — no single category overwhelms the dataset.
Single Variable · Histogram
Distribution of Calories
Right-skewed — most recipes fall in 0–600 kcal range
Finding: Strongly right-skewed. Median (306 kcal) is lower than mean (436 kcal). Most recipes are moderate calorie; extreme outliers reach 3,633 kcal.
Multi-Variable · Category vs Traffic Rate
High-Traffic Rate by Recipe Category
% of recipes in each category that generate high homepage traffic — the single most powerful predictor
Key Finding: Vegetable (98.8%), Potato (94.3%) and Pork (91.7%) almost always generate high traffic. Beverages (5.4%) almost never do. Category alone is a near-perfect decision rule for top and bottom performers.
Multi-Variable · Nutritional Profile Comparison

High-Traffic vs Low-Traffic Recipes: Nutritional Differences

Calories
463.6 kcal
High Traffic
394.9 kcal
Low Traffic
+17% more calories
Protein
25.5g
High Traffic
22.2g
Low Traffic
+15% more protein
Carbohydrate
38.0g
High Traffic
30.7g
Low Traffic
+24% more carbs
Sugar
8.1g
High Traffic
10.4g
Low Traffic
-22% less sugar
Finding: Popular recipes are heartier and more substantial — higher calories, protein and carbs — but lower in sugar. Users prefer satisfying meals over light or sweet options.
04 · Model Development

Two Models Tested

This is a binary classification problem — predict High vs Low traffic. We trained a baseline Logistic Regression and a comparison Random Forest, then selected the best performer.

✓ Selected
Logistic Regression
Baseline model · sklearn.linear_model
Accuracy78.8%
Recall (High class)80.0%
Precision (High class)82.0%
AUC-ROC0.835
F1-Score0.816
Recall80%
Precision82%
AUC83.5%
Random Forest
Comparison model · sklearn.ensemble
Accuracy73.7%
Recall (High class)76.0%
Precision (High class)78.9%
AUC-ROC0.790
F1-Score0.774
Recall76%
Precision78.9%
AUC79.0%
Confusion Matrix — Logistic Regression

At tuned threshold = 0.46 · Test set (179 recipes)

Pred: Low
Pred: High
Actually Low   Actually High
57
True Negative
Correctly rejected
15
False Positive
False alarms
21
False Negative
Missed hits
86
True Positive
Correctly caught
Out of 107 truly popular recipes → correctly identified 86 (80%)
Out of 72 truly low-traffic recipes → correctly rejected 57 (79%)

⚙ Threshold Tuning

Default threshold (0.5) achieved 78.5% recall. By tuning the decision threshold to 0.46, we achieve exactly 80% recall while keeping precision at 82% — meeting the business target without sacrificing much precision.

Why Not Random Forest?
LR Recall vs RF Recall+4pp advantage
LR Precision vs RF Precision+3pp advantage
LR AUC vs RF AUC+4.5pp advantage
InterpretabilityLR wins (explainable)
Overfitting riskLR lower risk

Logistic Regression wins on every metric while being simpler, faster, and more interpretable for stakeholders.

05 · Model Results

Performance & Business Impact

The tuned Logistic Regression meets the ≥80% recall target and delivers a measurable improvement over both random and manual selection.

80%
Recall — High Traffic
82%
Precision
79%
Overall Accuracy
0.835
AUC-ROC Score

vs Random Selection

210 homepage slots (7 recipes/day × 30 days)

Manual Selection (current)
~60%
Random Selection
50%
Our Model
80%
+32.8% improvement over random: 165 high-traffic recommendations vs 105 expected from random selection across 210 slots.

Estimated Revenue Impact

Based on 10,000 monthly visitors at 5% conversion · $10/month subscription

Random — new subscribers/month~63
Our Model — new subscribers/month~255
Monthly revenue uplift+$1,920
Curation time saved-80% (1.5 hrs/day)
Time to deployWeek 1
Expected +35–40% homepage traffic uplift based on the Product Manager's own observation that popular recipes drive 40% more sitewide traffic.
06 · Model Predictions

Top 30 Predicted High-Traffic Recipes

The trained model was applied to all 895 recipes. The 30 highest-probability predictions are shown below — all Vegetable category with 97.6–98.0% confidence.

# Recipe ID Category Calories Protein (g) Confidence Recommendation
Observation: All top 30 predictions are Vegetable category recipes — consistent with the EDA finding that 98.8% of Vegetable recipes generate high traffic. This validates the model's learned behaviour and confirms category is the dominant signal. The confidence level (97.6–98.0%) reflects the model's high certainty for this category.
07 · Technical Stack

Libraries, Models & Frameworks

End-to-end Python data science pipeline — from raw CSV ingestion to trained classifier, threshold tuning, and top-recipe prediction.

🐼
pandas
Data loading, cleaning, imputation, feature engineering and manipulation
🔢
numpy
Numerical operations, array handling and statistical computations
📊
matplotlib
Static visualisations — histograms, distribution plots, bar charts
🎨
seaborn
Statistical plots — countplot, barplot, histplot with KDE overlay
⚙️
scikit-learn
ML pipeline: train_test_split, StandardScaler, LogisticRegression, RandomForest, metrics
📐
LogisticRegression
Baseline binary classifier — max_iter=1000, random_state=42, scaled features
🌲
RandomForest
Comparison model — 100 estimators, max_depth=10, handles non-linear interactions
🎯
Threshold Tuning
precision_recall_curve used to find optimal threshold (0.46) achieving ≥80% recall

Analysis Pipeline

1

Load Data

pd.read_csv
895 rows × 8 cols

2

Validate

Inspect types,
missing values

3

Clean

Drop 52 rows,
impute, encode

4

EDA

Univariate +
multivariate plots

5

Split

80/20 train-test
stratified split

6

Train

LR + RF
on scaled features

7

Evaluate

Recall, precision,
AUC, confusion matrix

8

Tune & Deploy

Threshold 0.46
→ top 30 predictions

08 · Monitoring & Roadmap

How to Measure Success

Defined metrics, alert thresholds, and a phased deployment roadmap to ensure the model delivers sustained value beyond initial deployment.

Monitoring Dashboard

High-Traffic Recall80% ✓ Target Met
Precision on Recommendations82% ✓ Exceeds 75% target
Overall Accuracy79% ✓ Above 78% target
AUC-ROC0.835 ✓ Strong
Homepage CTR (expected)+35% target
Curation time-80% saved

Alert Thresholds

HealthyRecall ≥80% · Precision ≥75% — continue monitoring
WarningRecall drops below 80% — investigate data drift
CriticalRecall drops below 75% — immediate retraining

Deployment Roadmap

Week 1–2

Deploy & Tune

Approve model → deploy to staging → set threshold to 0.46 → brief editorial team

Week 1–2

A/B Test Setup

10% traffic on model recommendations vs 90% control (current manual). Establish baseline metrics.

Week 3–4

Measure Impact

30-day A/B test running. Collect recall, precision, CTR, subscription conversion. Decision point: roll out or iterate.

Week 5–6

Full Rollout

Deploy to 100% homepage → automate curation → free 1.5 hrs/day editorial time → expect +35–40% traffic lift

Monthly

Retrain & Improve

Retrain on new recipe data · monitor for seasonal drift · explore new features (prep time, ratings, difficulty)