- Problem: Reproducible Spark ML pipeline (ingest → features → train → evaluate → dashboard artifacts) with local ↔ GCP parity.
- Role: Data + platform engineering demo
- Timeframe: Prototype build
- Stack: Python • PySpark • Spark MLlib • GCS
- Focus: Spark • PySpark • MLlib
- Results: A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.
Problem
Product analytics pipelines often drift between laptop and cloud runs: paths change, configs diverge, and feature logic becomes inconsistent. The result is untrusted dashboards, slow iteration, and models that can’t be audited.
Why this matters: consistent outputs across environments, faster reruns, and traceable model decisions that can be explained to stakeholders.
- Config-driven Spark pipeline: ingest → session features → train/evaluate → dashboard exports
- Local ↔ GCP Dataproc Serverless parity via Makefile + config.yaml / config.gcp.yaml (see the config sketch after this list)
- Deterministic artifacts in reports/ (metrics, figures, parquet aggregates)
- Data is externalized (public GCS or BYO) to keep the repo clean and reproducible
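As a rough illustration of the config-driven setup, a minimal entrypoint might select between the two config files like this. It is a sketch only: the key names (`app_name`, `paths.raw`, `paths.processed`) are assumptions, not the repo's actual schema.

```python
# run_step.py -- hypothetical entrypoint; the config keys shown are illustrative, not the repo's schema
import sys

import yaml  # pip install pyyaml
from pyspark.sql import SparkSession


def load_config(path: str) -> dict:
    """Read a YAML config; config.yaml for local runs, config.gcp.yaml for Dataproc Serverless."""
    with open(path) as fh:
        return yaml.safe_load(fh)


if __name__ == "__main__":
    # `make` would pass config.yaml locally and config.gcp.yaml for cloud runs.
    cfg = load_config(sys.argv[1] if len(sys.argv) > 1 else "config.yaml")
    spark = SparkSession.builder.appName(cfg.get("app_name", "analytics-pipeline")).getOrCreate()
    # The same job code then reads/writes whatever the config points at: local paths or gs:// URIs.
    print("raw:", cfg["paths"]["raw"], "| processed:", cfg["paths"]["processed"])
```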
Context
Architecture at a glance — Raw clickstream CSVs are ingested into parquet (schema enforced + partitioned), session features are built, a baseline logistic regression model is trained/evaluated, and dashboard-ready aggregates are exported for Streamlit.
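A minimal sketch of what the schema-enforced ingest could look like; the column names, paths, and partition column are assumptions rather than the repo's actual layout.

```python
# ingest.py -- illustrative sketch; column names and paths are assumptions, not the repo's actual schema
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("ingest").getOrCreate()

# Enforce the schema up front so malformed rows fail fast instead of being silently type-inferred.
schema = StructType([
    StructField("user_id", StringType(), False),
    StructField("session_id", StringType(), False),
    StructField("event_type", StringType(), True),
    StructField("event_ts", TimestampType(), True),
])

raw = spark.read.csv("data/raw/events/*.csv", header=True, schema=schema)

# Partition by event date so downstream jobs can prune partitions and reruns stay deterministic.
(raw.withColumn("event_date", F.to_date("event_ts"))
    .write.mode("overwrite")
    .partitionBy("event_date")
    .parquet("data/processed/events"))
```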
How to read the architecture — Ingestion produces immutable parquet partitions, feature jobs aggregate session-level intent, training consumes versioned feature sets, evaluation writes metrics/plots, and the dashboard reads only exported aggregates (no live Spark).
Real-world framing — This models product analytics: funnel conversion and propensity scoring. Sessionization captures intent, parquet partitioning keeps jobs repeatable, leakage filtering protects metrics, and time-based splits reflect deployment drift.
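A sketch of how such a time-based split can be expressed; the timestamp column name and the 80/20 cutoff are assumptions, and the pipeline falls back to a random split when no usable timestamp is available.

```python
# split.py -- illustrative time-based split; the timestamp column and 80/20 fraction are assumptions
from pyspark.sql import DataFrame, functions as F


def time_based_split(features: DataFrame, ts_col: str = "session_start_ts", train_frac: float = 0.8):
    """Train on the earliest sessions and test on the most recent ones, mimicking deployment drift."""
    # approxQuantile needs a numeric column, so split on the epoch seconds of the timestamp.
    with_epoch = features.withColumn("_epoch", F.unix_timestamp(F.col(ts_col)))
    cutoff = with_epoch.approxQuantile("_epoch", [train_frac], 0.01)[0]
    train = with_epoch.filter(F.col("_epoch") <= cutoff).drop("_epoch")
    test = with_epoch.filter(F.col("_epoch") > cutoff).drop("_epoch")
    return train, test
```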
Cloud migration path — Same pipeline runs locally or on Dataproc Serverless by switching configs: local paths → GCS URIs, Spark master → serverless, and storage helpers route reads/writes without code changes.
The dataset is externalized (public GCS or BYO), enabling clean local ↔ cloud runs with the same configuration model.
- Quickstart (local)

```bash
make setup
./scripts/download_data.sh
make rerun_all_force
make dashboard
```

- Run on Dataproc Serverless

```bash
bash scripts/gcp_run_all.sh
# or submit a single job
bash scripts/gcp_submit.sh
```
Architecture
- Pipeline flow (ingest → model → dashboard)
  - Ingest: schema-enforced parquet partitioning for reliable downstream processing.
  - Feature engineering: session-level aggregates + labels with leakage-aware filtering (sketched below).
  - Modeling: Spark MLlib logistic regression with time-based split when available (random fallback otherwise).
  - Exports: metrics, plots, and dashboard aggregates saved to reports/.
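A sketch of the session-level feature build with a leakage-aware label; the event name (`purchase`) and column layout are assumptions, and edge cases (e.g. sessions that convert on their very first event) are elided.

```python
# features.py -- illustrative session features; event names and columns are assumptions
from pyspark.sql import SparkSession, Window, functions as F

spark = SparkSession.builder.appName("features").getOrCreate()
events = spark.read.parquet("data/processed/events")

# Timestamp of the first conversion event in each session (null if the session never converts).
w = Window.partitionBy("session_id")
events = events.withColumn(
    "first_purchase_ts",
    F.min(F.when(F.col("event_type") == "purchase", F.col("event_ts"))).over(w),
)

# Leakage-aware filtering: feature aggregates only see events that happened before the conversion.
pre_conversion = events.filter(
    F.col("first_purchase_ts").isNull() | (F.col("event_ts") < F.col("first_purchase_ts"))
)

# One row per session: behavioural aggregates plus a binary conversion label.
features = pre_conversion.groupBy("session_id", "user_id").agg(
    F.count("*").alias("n_events"),
    F.countDistinct("event_type").alias("n_event_types"),
    F.min("event_ts").alias("session_start_ts"),
    F.max(F.col("first_purchase_ts").isNotNull().cast("int")).alias("label"),
)

features.write.mode("overwrite").parquet("data/processed/session_features")
```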
- Architecture map (data movement)
  - Raw → processed: CSVs are normalized into partitioned parquet for stable downstream reads.
  - Processed → features: session-level joins/aggregations create the modeling table.
  - Features → model: MLlib pipeline persists model artifacts to models/ (see the training sketch below).
  - Model → reports: metrics.json/csv + figures for audit and QA.
  - Reports → dashboard: Streamlit reads only reports/dashboard aggregates.
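The training and export step could look roughly like this with MLlib; the feature columns, model path, and metric keys are assumptions, and the real job exports richer metrics and figures than this sketch shows.

```python
# train.py -- illustrative MLlib baseline; feature columns, paths, and metric keys are assumptions
import json

from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

spark = SparkSession.builder.appName("train").getOrCreate()
features = spark.read.parquet("data/processed/session_features")

# Random fallback split shown here; the pipeline prefers a time-based split when a timestamp exists.
train_df, test_df = features.randomSplit([0.8, 0.2], seed=42)

assembler = VectorAssembler(inputCols=["n_events", "n_event_types"], outputCol="features_vec")
lr = LogisticRegression(featuresCol="features_vec", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train_df)

# Persist the fitted pipeline under models/ and export a headline metric for reports/.
model.write().overwrite().save("models/baseline_lr")
auc = BinaryClassificationEvaluator(labelCol="label", metricName="areaUnderROC").evaluate(
    model.transform(test_df)
)
with open("reports/metrics.json", "w") as fh:
    json.dump({"auc": auc}, fh, indent=2)
```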
- Reproducibility + environment parity
  - Makefile orchestrates the same steps locally or on Dataproc Serverless.
  - config.yaml / config.gcp.yaml keep paths and Spark settings explicit.
  - Storage utilities abstract local filesystem vs GCS without code changes.
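A hypothetical storage helper illustrating that abstraction; the config keys (`storage.base_uri`, `paths.*`) are invented for this sketch and may not match the repo's utilities.

```python
# storage.py -- hypothetical helper showing the idea; the repo's actual utilities may differ
def resolve_path(cfg: dict, key: str) -> str:
    """Return either a local path or a gs:// URI depending on which config file was loaded.

    Spark reads and writes gs:// URIs natively on Dataproc (and locally when the GCS connector
    is configured), so the job code stays identical across environments.
    """
    base = cfg["storage"]["base_uri"]  # e.g. "data" locally, "gs://my-bucket" on GCP (assumed keys)
    return f"{base}/{cfg['paths'][key]}"


# Usage: the same call yields "data/processed/events" locally
# or "gs://my-bucket/processed/events" on Dataproc Serverless.
```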
- Cloud migration (local → Dataproc Serverless)
  - Switch configs: local paths → gs:// buckets, Spark master → serverless batch.
  - Package source once and submit batches per pipeline step (see the submission sketch below).
  - Artifacts land in the GCS reports/ and models/ buckets; the dashboard can read GCS directly.
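A sketch of what submitting a single step as a Dataproc Serverless batch can look like with the google-cloud-dataproc Python client; the project, region, bucket, and file URIs are placeholders, and the repo's scripts may drive the same thing through gcloud instead.

```python
# submit_batch.py -- illustrative Dataproc Serverless submission; project, region, and URIs are placeholders
from google.cloud import dataproc_v1

project, region = "my-project", "us-central1"

client = dataproc_v1.BatchControllerClient(
    client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
)

batch = dataproc_v1.Batch(
    pyspark_batch=dataproc_v1.PySparkBatch(
        main_python_file_uri="gs://my-bucket/src/run_step.py",
        file_uris=["gs://my-bucket/config.gcp.yaml"],  # staged into the batch working directory
        args=["config.gcp.yaml"],
    )
)

# create_batch returns a long-running operation; result() blocks until the batch finishes.
operation = client.create_batch(
    parent=f"projects/{project}/locations/{region}",
    batch=batch,
    batch_id="features-step-001",
)
print(operation.result().state)
```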
- Dashboard artifacts
  - Streamlit app reads exported aggregates, so the dashboard is decoupled from training (see the sketch below).
  - Reports contain figures + metrics for quick review and QA checks.
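A minimal sketch of the decoupled dashboard; the aggregate file name, columns, and metric keys under reports/ are assumptions.

```python
# dashboard.py -- illustrative Streamlit page; file names under reports/ are assumptions
import json

import pandas as pd
import streamlit as st

st.title("Funnel & propensity overview")

# Only precomputed exports are read here, so no Spark session (and no raw data) is needed at view time.
with open("reports/metrics.json") as fh:
    metrics = json.load(fh)
st.metric("Test AUC", f"{metrics['auc']:.3f}")

daily = pd.read_parquet("reports/dashboard/daily_conversion.parquet")
st.line_chart(daily.set_index("event_date")["conversion_rate"])
```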
- Outputs and how to interpret them
  - reports/ingestion_summary.json: row counts + time range for raw ingest sanity checks.
  - reports/features_summary.json: feature row count + positive rate.
  - reports/metrics.json: AUC plus threshold metrics for evaluation at a chosen cutoff.
  - reports/figures/metrics.png: visual sanity check for model performance.
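These outputs also lend themselves to automated checks. For example, a small QA gate over reports/metrics.json might look like this; the `auc` key and the 0.70 floor are assumptions for the sketch.

```python
# check_metrics.py -- hypothetical QA gate over exported metrics; key names and threshold are assumptions
import json
import sys

MIN_AUC = 0.70  # arbitrary floor for this sketch

with open("reports/metrics.json") as fh:
    metrics = json.load(fh)

auc = metrics["auc"]
print(f"AUC={auc:.3f} (threshold metrics are evaluated at the configured cutoff)")
if auc < MIN_AUC:
    # Non-zero exit fails the rerun before dashboard aggregates are published.
    sys.exit(f"AUC {auc:.3f} below {MIN_AUC}; investigate before publishing dashboard aggregates")
```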
Security / Threat Model
- Config-driven storage paths keep raw/processed/outputs deterministic across local and GCS.
- No credentials are committed; GCP auth is handled by gcloud/gsutil locally.
- Dataset is public and externalized to avoid bundling large data in the repo.
- Dashboard reads precomputed aggregates only (reduces data exposure and compute costs).
Tradeoffs & Lessons
Key decisions — Spark over pandas for scale, Makefile as a single orchestrator, and config-only portability.
Trade-offs — serverless batches reduce ops but add cold-start overhead; precomputed dashboards trade freshness for speed.
Limitations — baseline model only; no experiment tracking server or model registry yet.
Next steps — MLflow tracking, CI validation of outputs, scheduled Dataproc runs, and data quality checks.
Results
A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.