- Problem: Reproducible Spark ML pipeline (ingest → features → train → evaluate → dashboard artifacts) with local ↔ GCP parity.
- Role: Data + platform engineering demo
- Timeframe: Prototype build
- Stack: Python • PySpark • Spark MLlib • GCS
- Focus: Spark • PySpark • MLlib
- Results: A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.
Problem
Product analytics pipelines often drift between laptop and cloud runs: paths change, configs diverge, and feature logic becomes inconsistent. The result is untrusted dashboards, slow iteration, and models that can’t be audited.
Why this matters: consistent outputs across environments, faster reruns, and traceable model decisions that can be explained to stakeholders.
- Config-driven Spark pipeline: ingest → session features → train/evaluate → dashboard exports
- Local ↔ GCP Dataproc Serverless parity via Makefile + config.yaml / config.gcp.yaml (see the config sketch after this list)
- Deterministic artifacts in reports/ (metrics, figures, parquet aggregates)
- Data is externalized (public GCS or BYO) to keep the repo clean and reproducible
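As a rough illustration of the config-driven setup, a minimal entrypoint might select between the two config files like this. It is a sketch only: the key names (`app_name`, `paths.raw`, `paths.processed`) are assumptions, not the repo's actual schema.

```python
# run_step.py -- hypothetical entrypoint; the config keys shown are illustrative, not the repo's schema
import sys

import yaml  # pip install pyyaml
from pyspark.sql import SparkSession


def load_config(path: str) -> dict:
    """Read a YAML config; config.yaml for local runs, config.gcp.yaml for Dataproc Serverless."""
    with open(path) as fh:
        return yaml.safe_load(fh)


if __name__ == "__main__":
    # `make` would pass config.yaml locally and config.gcp.yaml for cloud runs.
    cfg = load_config(sys.argv[1] if len(sys.argv) > 1 else "config.yaml")
    spark = SparkSession.builder.appName(cfg.get("app_name", "analytics-pipeline")).getOrCreate()
    # The same job code then reads/writes whatever the config points at: local paths or gs:// URIs.
    print("raw:", cfg["paths"]["raw"], "| processed:", cfg["paths"]["processed"])
```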
Context
Architecture at a glance — Raw clickstream CSVs are ingested into parquet (schema enforced + partitioned), session features are built, a baseline logistic regression model is trained/evaluated, and dashboard-ready aggregates are exported for Streamlit.
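A minimal sketch of what the schema-enforced ingest could look like; the column names, paths, and partition column are assumptions rather than the repo's actual layout.

```python
# ingest.py -- illustrative sketch; column names and paths are assumptions, not the repo's actual schema
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("ingest").getOrCreate()

# Enforce the schema up front so malformed rows fail fast instead of being silently type-inferred.
schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("session_id", StringType(), False),
    StructField("event_type", StringType(), True),
    StructField("event_ts", TimestampType(), True),
])

raw = spark.read.csv("data/raw/events/*.csv", header=True, schema=schema)

# Partition by event date so downstream jobs can prune partitions and reruns stay deterministic.
(raw.withColumn("event_date", F.to_date("event_ts"))
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("data/processed/events"))
```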
How to read the architecture — Ingestion produces immutable parquet partitions, feature jobs aggregate session-level intent, training consumes versioned feature sets, evaluation writes metrics/plots, and the dashboard reads only exported aggregates (no live Spark).
Real-world framing — This models product analytics: funnel conversion and propensity scoring. Sessionization captures intent, parquet partitioning keeps jobs repeatable, leakage filtering protects metrics, and time-based splits reflect deployment drift.
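A sketch of how such a time-based split can be expressed; the timestamp column name and the 80/20 cutoff are assumptions, and the pipeline falls back to a random split when no usable timestamp is available.

```python
# split.py -- illustrative time-based split; the timestamp column and 80/20 fraction are assumptions
from pyspark.sql import DataFrame, functions as F


def time_based_split(features: DataFrame, ts_col: str = "session_start_ts", train_frac: float = 0.8):
    """Train on the earliest sessions and test on the most recent ones, mimicking deployment drift."""
    # approxQuantile needs a numeric column, so split on the epoch seconds of the timestamp.
    with_epoch = features.withColumn("_epoch", F.unix_timestamp(F.col(ts_col)))
    cutoff = with_epoch.approxQuantile("_epoch", [train_frac], 0.01)[0]
    train = with_epoch.filter(F.col("_epoch") <= cutoff).drop("_epoch")
    test = with_epoch.filter(F.col("_epoch") > cutoff).drop("_epoch")
    return train, test
```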
Cloud migration path — Same pipeline runs locally or on Dataproc Serverless by switching configs: local paths → GCS URIs, Spark master → serverless, and storage helpers route reads/writes without code changes.
The dataset is externalized (public GCS or BYO), enabling clean local ↔ cloud runs with the same configuration model.
- Quickstart (local)

```bash
make setup
./scripts/download_data.sh
make rerun_all_force
make dashboard
```

- Run on Dataproc Serverless

```bash
bash scripts/gcp_run_all.sh
# or submit a single job
bash scripts/gcp_submit.sh
```
Architecture
- Pipeline flow (ingest → model → dashboard)
  - Ingest: schema-enforced parquet partitioning for reliable downstream processing.
  - Feature engineering: session-level aggregates + labels with leakage-aware filtering (sketched below).
  - Modeling: Spark MLlib logistic regression with time-based split when available (random fallback otherwise).
  - Exports: metrics, plots, and dashboard aggregates saved to reports/.
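A sketch of the session-level feature build with a leakage-aware label; the event name (`purchase`) and column layout are assumptions, and edge cases (e.g. sessions that convert on their very first event) are elided.

```python
# features.py -- illustrative session features; event names and columns are assumptions
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("features").getOrCreate()
events = spark.read.parquet("data/processed/events")

# Timestamp of the first conversion event in each session (null if the session never converts).
w = Window.partitionBy("session_id")
events = events.withColumn(
    "first_purchase_ts",
    F.min(F.when(F.col("event_type") == "purchase", F.col("event_ts"))).over(w),
)

# Leakage-aware filtering: feature aggregates only see events that happened before the conversion.
pre_conversion = events.filter(
    F.col("first_purchase_ts").isNull() | (F.col("event_ts") < F.col("first_purchase_ts"))
)

# One row per session: behavioural aggregates plus a binary conversion label.
features = pre_conversion.groupBy("session_id", "user_id").agg(
    F.count("*").alias("n_events"),
    F.countDistinct("event_type").alias("n_event_types"),
    F.min("event_ts").alias("session_start_ts"),
    F.max(F.col("first_purchase_ts").isNotNull().cast("int")).alias("label"),
)

features.write.mode("overwrite").parquet("data/processed/session_features")
```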
- Architecture map (data movement)
  - Raw → processed: CSVs are normalized into partitioned parquet for stable downstream reads.
  - Processed → features: session-level joins/aggregations create the modeling table.
  - Features → model: MLlib pipeline persists model artifacts to models/ (see the training sketch below).
  - Model → reports: metrics.json/csv + figures for audit and QA.
  - Reports → dashboard: Streamlit reads only reports/dashboard aggregates.
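The training and export step could look roughly like this with MLlib; the feature columns, model path, and metric keys are assumptions, and the real job exports richer metrics and figures than this sketch shows.

```python
# train.py -- illustrative MLlib baseline; feature columns, paths, and metric keys are assumptions
import json

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("train").getOrCreate()
features = spark.read.parquet("data/processed/session_features")

# Random fallback split shown here; the pipeline prefers a time-based split when a timestamp exists.
train_df, test_df = features.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=["n_events", "n_event_types"], outputCol="features_vec")
lr = LogisticRegression(featuresCol="features_vec", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train_df)

# Persist the fitted pipeline under models/ and export a headline metric for reports/.
model.write().overwrite().save("models/baseline_lr")
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(
    model.transform(test_df)
)
with open("reports/metrics.json", "w") as fh:
    json.dump({"auc": auc}, fh, indent=2)
```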
- Reproducibility + environment parity
  - Makefile orchestrates the same steps locally or on Dataproc Serverless.
  - config.yaml / config.gcp.yaml keep paths and Spark settings explicit.
  - Storage utilities abstract local filesystem vs GCS without code changes.
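A hypothetical storage helper illustrating that abstraction; the config keys (`storage.base_uri`, `paths.*`) are invented for this sketch and may not match the repo's utilities.

```python
# storage.py -- hypothetical helper showing the idea; the repo's actual utilities may differ
def resolve_path(cfg: dict, key: str) -> str:
    """Return either a local path or a gs:// URI depending on which config file was loaded.

    Spark reads and writes gs:// URIs natively on Dataproc (and locally when the GCS connector
    is configured), so the job code stays identical across environments.
    """
    base = cfg["storage"]["base_uri"]  # e.g. "data" locally, "gs://my-bucket" on GCP (assumed keys)
    return f"{base}/{cfg['paths'][key]}"


# Usage: the same call yields "data/processed/events" locally
# or "gs://my-bucket/processed/events" on Dataproc Serverless.
```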
- Cloud migration (local → Dataproc Serverless)
  - Switch configs: local paths → gs:// buckets, Spark master → serverless batch.
  - Package source once and submit batches per pipeline step (see the submission sketch below).
  - Artifacts land in the GCS reports/ and models/ buckets; the dashboard can read GCS directly.
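A sketch of what submitting a single step as a Dataproc Serverless batch can look like with the google-cloud-dataproc Python client; the project, region, bucket, and file URIs are placeholders, and the repo's scripts may drive the same thing through gcloud instead.

```python
# submit_batch.py -- illustrative Dataproc Serverless submission; project, region, and URIs are placeholders
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/src/run_step.py",
        file_uris=["gs://my-bucket/config.gcp.yaml"],  # staged into the batch working directory
        args=["config.gcp.yaml"],
    )
)

# create_batch returns a long-running operation; result() blocks until the batch finishes.
operation = client.create_batch(
    parent=f"projects/{project}/locations/{region}",
    batch=batch,
    batch_id="features-step-001",
)
print(operation.result().state)
```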
- Dashboard artifacts
  - Streamlit app reads exported aggregates, so the dashboard is decoupled from training (see the sketch below).
  - Reports contain figures + metrics for quick review and QA checks.
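A minimal sketch of the decoupled dashboard; the aggregate file name, columns, and metric keys under reports/ are assumptions.

```python
# dashboard.py -- illustrative Streamlit page; file names under reports/ are assumptions
import json

import pandas as pd
import streamlit as st

st.title("Funnel & propensity overview")

# Only precomputed exports are read here, so no Spark session (and no raw data) is needed at view time.
with open("reports/metrics.json") as fh:
    metrics = json.load(fh)
st.metric("Test AUC", f"{metrics['auc']:.3f}")

daily = pd.read_parquet("reports/dashboard/daily_conversion.parquet")
st.line_chart(daily.set_index("event_date")["conversion_rate"])
```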
- Outputs and how to interpret them
  - reports/ingestion_summary.json: row counts + time range for raw ingest sanity checks.
  - reports/features_summary.json: feature row count + positive rate.
  - reports/metrics.json: AUC plus threshold metrics for evaluation at a chosen cutoff.
  - reports/figures/metrics.png: visual sanity check for model performance.
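These outputs also lend themselves to automated checks. For example, a small QA gate over reports/metrics.json might look like this; the `auc` key and the 0.70 floor are assumptions for the sketch.

```python
# check_metrics.py -- hypothetical QA gate over exported metrics; key names and threshold are assumptions
import json
import sys

MIN_AUC = 0.70  # arbitrary floor for this sketch

with open("reports/metrics.json") as fh:
    metrics = json.load(fh)

auc = metrics["auc"]
print(f"AUC={auc:.3f} (threshold metrics are evaluated at the configured cutoff)")
if auc < MIN_AUC:
    # Non-zero exit fails the rerun before dashboard aggregates are published.
    sys.exit(f"AUC {auc:.3f} below {MIN_AUC}; investigate before publishing dashboard aggregates")
```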
Security / Threat Model
- Config-driven storage paths keep raw/processed/outputs deterministic across local and GCS.
- No credentials are committed; GCP auth is handled by gcloud/gsutil locally.
- Dataset is public and externalized to avoid bundling large data in the repo.
- Dashboard reads precomputed aggregates only (reduces data exposure and compute costs).
Tradeoffs & Lessons
Key decisions — Spark over pandas for scale, Makefile as a single orchestrator, and config-only portability.
Trade-offs — serverless batches reduce ops but add cold-start overhead; precomputed dashboards trade freshness for speed.
Limitations — baseline model only; no experiment tracking server or model registry yet.
Next steps — MLflow tracking, CI validation of outputs, scheduled Dataproc runs, and data quality checks.
Results
A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.