Capstone-Retail-Analytics

Optimizing Retail Decisions through Customer Segmentation and Predictive Sales Modeling

🔗 View this project on GitHub

Author: Juan Diego Benavides
Purpose: Educational capstone demonstrating an end-to-end data-driven retail analytics pipeline and communicating findings to non-technical stakeholders.


Table of Contents


Executive Summary

Executive questions
1) What are the key sales patterns?
2) Which customer segments drive revenue?
3) How can short-term sales be forecast to inform inventory decisions?

Deliverables

Headline results

Note: The dataset is synthetic but realistic. The focus is on methodology, reproducibility, and clear communication.


Introduction

Retailers often rely on intuition for pricing, inventory management, and promotions. With multi-table transactional data, guesswork can be replaced by descriptive analytics, customer segmentation, and predictive modeling to reduce stockouts/overstocks and target campaigns more effectively.

Why this approach

How the story unfolds
The pipeline first ensures trustworthy data (types, joins, integrity), then explores sales patterns (KPIs, seasonality, category mix), applies RFM + K-Means to segment customers, and builds a short-term sales forecast to support inventory planning. The results are published in an Executive dashboard for stakeholder exploration.


Methods

This section covers data sources, cleaning, features, models, validation, and reproducibility.

Data sources (≥2 sources, ≥1,000 rows, ≥5 columns — met)

NB01 snapshot (row counts & range)

Data cleaning & integration (NB01)

Repository note — full vs mini feed
NB01 materializes the full history as sales_enriched.csv. Because that file is too large for GitHub, the repo ships a size-capped sales_enriched_mini.csv.

All KPIs and visuals in this README were computed on the full sales_enriched.csv. If only the mini feed is used, minor discrepancies may appear. To reproduce exact figures, run NB01 locally to regenerate the full file.

Feature engineering (NB03)

Customer clustering (NB04)

K-Means fitted on two variants (from NB03):

K selection 1) Scan K∈[2..10] with Silhouette, Davies–Bouldin, Calinski–Harabasz, and Elbow.
2) Confirm on full data with high n_init around the best K.
3) Bootstrap stability (30 runs, 80% subsamples): agreement/ARI/NMI.
Labeling: the High-Value cluster maximizes frequency + monetary − recency. A binary label hv01 (0=High-Value, 1=Mid/Low) was also published.

Forecasting (NB05)

Reproducibility

Data provenance & credits


Results

Data lineage. EDA visuals were computed on full history from sales_enriched.csv.
The trend & forecast use clean/tableau_feeds/ts_forecast_feed_long.csv and are the only visuals affected by the Window Days parameter (30/60/90; default 90).
Repo note: The repository includes sales_enriched_mini.csv due to size constraints; slight numeric differences may occur if only the mini feed is used.

1) Exploratory Data Analysis (NB02)

Core KPIs (full history)

Metric Mean Median P90
ATV (Average Transaction Value) 641.07 490.77 1,472.30
UPT (Units per Transaction) 13.00 13.00 23.00

Source: sales_enriched.csv (joined with customers_segments_tableau.csv for segment filters).


Monthly Sales Trend (illustrative)

Monthly total sales
Rows with missing SalesDate are excluded from the time series (≈1.0%), but they remain in global KPIs. Source: sales_enriched.csv.
Rows included vs excluded (time-series policy)

Category Mix

Top categories by revenue
Revenue concentrates in a few categories (assortment & promo focus). Source: sales_enriched.csv.

Note: Categories use the curated mapping products_with_category_v7_clean.csv (educational; may include errors).


RFM Distributions (customer-level, NB03 input)

Feature Median P75
Recency (days) 1 2
Frequency (# orders) 68 73
Monetary (total spend) 42,557 63,148
RFM correlation
Recency is negatively related to Frequency/Monetary (recent buyers purchase/spend more). Aggregated from sales_enriched.csv by SalesID.

Customer Concentration (Pareto)

Pareto customers vs revenue
~55.9% of customers generate 80% of revenue → segmentation is warranted. Source: sales_enriched.csv.

Optional snapshots

Products by class Perishable vs Non-Perishable

Assortment and perishables composition — inventory & freshness policy. Source: product joins → sales_enriched.csv.
Note: Category labels come from products_with_category_v7_clean.csv (educational; may include errors).


2) Feature insights (NB03)

Customers: 98,759 | Daily rows: 129 | Transactions: 6,690,599.
Published matrices: FULL (6 features) & RFM3 (3 features). All scaled to [0,1] with 0 NaNs.
Source: clean/model_input/*.parquet (summaries exported to CSV for the report).


3) Customer Segmentation (NB04)

Model quality

Segment size & contribution (FULL, harmonized HV vs Mid/Low)

Segment Share of customers Share of revenue Lift vs avg $/customer
High-Value 49.2% 73.7% +49.7%
Mid/Low-Value 50.8% 26.3% −48.1%

Sources: clean/tableau_feeds/cluster_profiles_full_executive.csv, clean/tableau_feeds/revenue_by_cluster_full.csv. Tableau uses customers_segments_tableau.csv for interactive filtering.

Interpretation. The High-Value segment concentrates revenue with a similar share of customers, driven by higher Frequency/Monetary and lower Recency (more active). Priorities: retention, stockout prevention, premium bundles. The Mid/Low group is a candidate for activation and cross-sell.

Stability. Bootstrap agreement/ARI/NMI is ~0.99. GMM, Agglomerative, and DBSCAN were benchmarked; K-Means was retained for clarity and ease of use.


4) Forecasting (NB05)

Cross-validation (mean across folds)

RMSE MAE
187,973.59 158,438.55 −0.34

Note: MAPE is unstable with zero/near-zero sales days, and WAPE per fold is less interpretable. CV focuses on RMSE/MAE and .

Holdout (14 days)

RMSE MAE MAPE WAPE Uplift vs seasonal naïve
199,746.35 161,236.64 48.22% 48.0% −0.11 +33.55%
Timeline — Actual vs RF (Backtest/Holdout/Future)
Backtest, 14-day holdout, and 14-day future forecast from the champion RF model.

Prediction intervals (80%)
Holdout coverage ≈ 85.7%; mean/median widths tracked in clean/tableau_feeds/pi_coverage_holdout.json.


5) Dashboard (Tableau)

Public link:
🔗 Open the Tableau dashboard

Executive view

EDA Explorer view

Executive overview (trend & forecast with Window Days, plus segmentation cards) EDA Explorer (filters by Segment, Month, Weekday)

Only the trend/forecast honors Window Days. The repo includes sales_enriched_mini.csv for convenience; the published dashboard and this report were computed on the full feed.


Conclusion

The project established a reproducible pipeline from raw multi-table data (NB01) through EDA (NB02), feature engineering (NB03), customer segmentation (NB04), and a short-term forecaster (NB05). Artifacts are stored under clean/*, and the dashboard enables interactive exploration of insights.

Key takeaways

Recommendations 1) Inventory: set min/max with the forecast and PIs; prioritize perishables and High-Value demand.
2) Promotions: activate Mid/Low customers with adjacent-category bundles; monitor weekly lift.
3) Pricing: test dynamic markdowns on overstock (wide PIs).
4) Modeling next: add exogenous signals (holidays, weather), benchmark XGBoost, keep time-aware CV.
5) Monitoring: track WAPE% and PI coverage; flag drift >10% vs CV.

Authorship & data credits