Author: Juan Diego Benavides
Purpose: Educational capstone demonstrating an end-to-end data-driven retail analytics pipeline and communicating findings to non-technical stakeholders.
Executive questions
1) What are the key sales patterns?
2) Which customer segments drive revenue?
3) How can short-term sales be forecast to inform inventory decisions?
Deliverables
Headline results
Note: The dataset is synthetic but realistic. The focus is on methodology, reproducibility, and clear communication.
Retailers often rely on intuition for pricing, inventory management, and promotions. With multi-table transactional data, guesswork can be replaced by descriptive analytics, customer segmentation, and predictive modeling to reduce stockouts/overstocks and target campaigns more effectively.
Why this approach
How the story unfolds
The pipeline first ensures trustworthy data (types, joins, integrity), then explores sales patterns (KPIs, seasonality, category mix), applies RFM + K-Means to segment customers, and builds a short-term sales forecast to support inventory planning. The results are published in an Executive dashboard for stakeholder exploration.
This section covers data sources, cleaning, features, models, validation, and reproducibility.
sales
, products
, customers
, employees
, cities
, countries
(Kaggle: Grocery Sales Dataset — andrexibiza).products_with_category_v7_clean.csv
— curated by joining the original products
and categories
tables and standardizing labels (merging synonyms, fixing typos, consolidating classes). It improves category granularity but may contain classification errors, so it should be treated as a best-effort educational artifact.NB01 snapshot (row counts & range)
sales : 6,758,125 |
customers : 98,759 |
products : 452 |
employees : 23 |
cities : 96 |
countries : 206 |
UnitNetPrice
& TotalPrice
from products × (1 − Discount) × Quantity.clean/
(Parquet) and audits under clean/report/
.Repository note — full vs mini feed
NB01 materializes the full history as sales_enriched.csv
. Because that file is too large for GitHub, the repo ships a size-capped sales_enriched_mini.csv
.
All KPIs and visuals in this README were computed on the full
sales_enriched.csv
. If only the mini feed is used, minor discrepancies may appear. To reproduce exact figures, run NB01 locally to regenerate the full file.
SalesID
. RFM window: 2018-01-01 → 2018-05-09, anchor t₀ = max(valid SalesDate)
.recency_days
, frequency
, monetary
.avg_ticket
, pct_discounted
, n_categories
. Geography modes (top_city
, top_country
) are reporting-only.random_state=42
); scaler parameters persisted.SalesDate
are excluded from time series but included in global KPIs.K-Means fitted on two variants (from NB03):
recency_days
, frequency
, monetary
avg_ticket
, pct_discounted
, n_categories
K selection
1) Scan K∈[2..10] with Silhouette, Davies–Bouldin, Calinski–Harabasz, and Elbow.
2) Confirm on full data with high n_init
around the best K.
3) Bootstrap stability (30 runs, 80% subsamples): agreement/ARI/NMI.
Labeling: the High-Value cluster maximizes frequency + monetary − recency
. A binary label hv01
(0=High-Value, 1=Mid/Low) was also published.
TotalRevenue
. Champion: RandomForestRegressor
.42
).clean/*.parquet
; model artifacts in clean/forecast/*
.Data lineage. EDA visuals were computed on full history from
sales_enriched.csv
.
The trend & forecast useclean/tableau_feeds/ts_forecast_feed_long.csv
and are the only visuals affected by the Window Days parameter (30/60/90; default 90).
Repo note: The repository includessales_enriched_mini.csv
due to size constraints; slight numeric differences may occur if only the mini feed is used.
Metric | Mean | Median | P90 |
---|---|---|---|
ATV (Average Transaction Value) | 641.07 | 490.77 | 1,472.30 |
UPT (Units per Transaction) | 13.00 | 13.00 | 23.00 |
Source: sales_enriched.csv
(joined with customers_segments_tableau.csv
for segment filters).
SalesDate
are excluded from the time series (≈1.0%), but they remain in global KPIs. Source: sales_enriched.csv
.sales_enriched.csv
.Note: Categories use the curated mapping products_with_category_v7_clean.csv
(educational; may include errors).
Feature | Median | P75 |
---|---|---|
Recency (days) | 1 | 2 |
Frequency (# orders) | 68 | 73 |
Monetary (total spend) | 42,557 | 63,148 |
sales_enriched.csv
by SalesID
.sales_enriched.csv
.
Assortment and perishables composition — inventory & freshness policy. Source: product joins → sales_enriched.csv
.
Note: Category labels come from products_with_category_v7_clean.csv
(educational; may include errors).
Customers: 98,759 | Daily rows: 129 | Transactions: 6,690,599.
Published matrices: FULL (6 features) & RFM3 (3 features). All scaled to [0,1] with 0 NaNs.
Source: clean/model_input/*.parquet
(summaries exported to CSV for the report).
Model quality
FULL: K = 2 | Silhouette ≈ 0.3819 | DB ≈ 1.0071 | CH ≈ 87,885.7 |
RFM3: K = 2 | Silhouette ≈ 0.3968 | DB ≈ 0.9761 | CH ≈ 85,773.5 |
Segment size & contribution (FULL, harmonized HV vs Mid/Low)
Segment | Share of customers | Share of revenue | Lift vs avg $/customer |
---|---|---|---|
High-Value | 49.2% | 73.7% | +49.7% |
Mid/Low-Value | 50.8% | 26.3% | −48.1% |
Sources: clean/tableau_feeds/cluster_profiles_full_executive.csv
, clean/tableau_feeds/revenue_by_cluster_full.csv
. Tableau uses customers_segments_tableau.csv
for interactive filtering.
Interpretation. The High-Value segment concentrates revenue with a similar share of customers, driven by higher Frequency/Monetary and lower Recency (more active). Priorities: retention, stockout prevention, premium bundles. The Mid/Low group is a candidate for activation and cross-sell.
Stability. Bootstrap agreement/ARI/NMI is ~0.99. GMM, Agglomerative, and DBSCAN were benchmarked; K-Means was retained for clarity and ease of use.
Cross-validation (mean across folds)
RMSE | MAE | R² |
---|---|---|
187,973.59 | 158,438.55 | −0.34 |
Note: MAPE is unstable with zero/near-zero sales days, and WAPE per fold is less interpretable. CV focuses on RMSE/MAE and R².
Holdout (14 days)
RMSE | MAE | MAPE | WAPE | R² | Uplift vs seasonal naïve |
---|---|---|---|---|---|
199,746.35 | 161,236.64 | 48.22% | 48.0% | −0.11 | +33.55% |
Prediction intervals (80%)
Holdout coverage ≈ 85.7%; mean/median widths tracked in clean/tableau_feeds/pi_coverage_holdout.json
.
Public link:
🔗 Open the Tableau dashboard
Executive view
clean/tableau_feeds/ts_forecast_feed_long.csv
(honors Window Days). KPI cards (e.g., Sales Last N days, Forecast Next 14 days) are Tableau calculations over that feed.sales_enriched.csv
joined to customers_segments_tableau.csv
(these do not change with Window Days; they do respond to EDA filters).EDA Explorer view
Only the trend/forecast honors Window Days. The repo includes
sales_enriched_mini.csv
for convenience; the published dashboard and this report were computed on the full feed.
The project established a reproducible pipeline from raw multi-table data (NB01) through EDA (NB02), feature engineering (NB03), customer segmentation (NB04), and a short-term forecaster (NB05). Artifacts are stored under clean/*
, and the dashboard enables interactive exploration of insights.
Key takeaways
Recommendations
1) Inventory: set min/max with the forecast and PIs; prioritize perishables and High-Value demand.
2) Promotions: activate Mid/Low customers with adjacent-category bundles; monitor weekly lift.
3) Pricing: test dynamic markdowns on overstock (wide PIs).
4) Modeling next: add exogenous signals (holidays, weather), benchmark XGBoost, keep time-aware CV.
5) Monitoring: track WAPE% and PI coverage; flag drift >10% vs CV.