Cloud Analytics ML Pipeline

Reproducible Spark ML pipeline (ingest → features → train → evaluate → dashboard artifacts) with local ↔ GCP parity.

Role: Data + platform engineering demo · Timeframe: Prototype build · Stack: Python • PySpark • Spark MLlib • GCS • Makefile • Streamlit
Tags: Spark • PySpark • MLlib • GCP Dataproc • Reproducibility • Data Engineering • Observability
At a glance
  • Problem
    Reproducible Spark ML pipeline (ingest → features → train → evaluate → dashboard artifacts) with local ↔ GCP parity.
  • Role
    Data + platform engineering demo
  • Timeframe
    Prototype build
  • Stack
    Python • PySpark • Spark MLlib • GCS
  • Focus
    Spark • PySpark • MLlib
  • Results
    A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.

Problem

Product analytics pipelines often drift between laptop and cloud runs: paths change, configs diverge, and feature logic becomes inconsistent. The result is untrusted dashboards, slow iteration, and models that can’t be audited.

Why this matters: consistent outputs across environments, faster reruns, and traceable model decisions that can be explained to stakeholders.

Executive summary
  • Config-driven Spark pipeline: ingest → session features → train/evaluate → dashboard exports
  • Local ↔ GCP Dataproc Serverless parity via Makefile + config.yaml / config.gcp.yaml
  • Deterministic artifacts in reports/ (metrics, figures, parquet aggregates)
  • Data is externalized (public GCS or BYO) to keep the repo clean and reproducible

Context

Architecture at a glance
[Diagram] Cloud analytics ML pipeline — reproducible Spark flow: Ingest (GCS dataset) → Features (transforms) → Train + Eval (MLlib) → Exports (dashboard), executed locally or on Dataproc for local/cloud parity and reproducible runs.

Architecture at a glance — Raw clickstream CSVs are ingested into schema-enforced, partitioned parquet; session features are built; a baseline logistic regression model is trained and evaluated; and dashboard-ready aggregates are exported for Streamlit.
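
A minimal ingest sketch, assuming illustrative column names and a local path layout (the real schema and paths live in the repo's configs): read the raw CSVs with an explicit schema and write date-partitioned parquet.

    # Hypothetical columns and paths; shown only to illustrate the ingest step.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("ingest").getOrCreate()

    schema = StructType([
        StructField("user_id", StringType(), True),
        StructField("session_id", StringType(), True),
        StructField("event_type", StringType(), True),
        StructField("event_time", TimestampType(), True),
    ])

    raw = (
        spark.read
        .option("header", "true")
        .schema(schema)                       # enforce the schema instead of inferring it
        .csv("data/raw/clickstream/*.csv")    # becomes a gs:// URI under config.gcp.yaml
    )

    (
        raw.withColumn("event_date", F.to_date("event_time"))
        .write.mode("overwrite")
        .partitionBy("event_date")            # immutable, date-partitioned parquet
        .parquet("data/processed/events")
    )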

How to read the architecture — Ingestion produces immutable parquet partitions, feature jobs aggregate session-level intent, training consumes versioned feature sets, evaluation writes metrics/plots, and the dashboard reads only exported aggregates (no live Spark).
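
To make the decoupling concrete, a dashboard-side sketch: the app loads only precomputed exports, so no Spark session runs at view time. reports/metrics.json matches the outputs listed below; the daily aggregate file, its columns, and the "auc" key name are assumptions.

    import json
    import pandas as pd
    import streamlit as st

    # Everything here is a file read; training and the dashboard never share a process.
    metrics = json.load(open("reports/metrics.json"))                # written by evaluation
    daily = pd.read_parquet("reports/dashboard/daily_agg.parquet")   # hypothetical aggregate export

    st.title("Cloud Analytics ML Pipeline")
    st.metric("AUC", round(metrics["auc"], 3))                       # "auc" key is an assumption
    st.line_chart(daily.set_index("event_date")["conversion_rate"])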

Real-world framing — This models product analytics: funnel conversion and propensity scoring. Sessionization captures intent, parquet partitioning keeps jobs repeatable, leakage filtering protects metrics, and time-based splits reflect deployment drift.
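
A sessionization sketch under assumed event names (treating a "purchase" event as the conversion is illustrative): keep only pre-conversion events, aggregate to session level, attach the label, and split by time rather than at random.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("features").getOrCreate()
    events = spark.read.parquet("data/processed/events")

    # First conversion time per session (sessions without one stay null).
    first_purchase = (
        events.filter(F.col("event_type") == "purchase")
        .groupBy("session_id")
        .agg(F.min("event_time").alias("first_purchase_time"))
    )

    # Leakage guard: features may only use events strictly before the conversion.
    pre_conversion = (
        events.join(first_purchase, "session_id", "left")
        .filter(
            F.col("first_purchase_time").isNull()
            | (F.col("event_time") < F.col("first_purchase_time"))
        )
    )

    sessions = (
        pre_conversion.groupBy("session_id")
        .agg(
            F.count("*").alias("n_events"),
            F.countDistinct("event_type").alias("n_event_types"),
            F.min("event_time").alias("session_start"),
        )
        .join(first_purchase.select("session_id").withColumn("label", F.lit(1)), "session_id", "left")
        .fillna({"label": 0})
    )

    # Time-based split mirrors deployment drift better than a random split.
    cutoff = "2021-04-01"  # hypothetical cutoff date
    train = sessions.filter(F.col("session_start") < F.lit(cutoff))
    test = sessions.filter(F.col("session_start") >= F.lit(cutoff))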

Cloud migration path — Same pipeline runs locally or on Dataproc Serverless by switching configs: local paths → GCS URIs, Spark master → serverless, and storage helpers route reads/writes without code changes.
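
A sketch of what config-only routing can look like, with hypothetical keys (config.yaml and config.gcp.yaml in the repo define the real layout): the same helper resolves either a local path or a gs:// URI, so no pipeline code changes between environments.

    import yaml

    def load_config(path: str = "config.yaml") -> dict:
        # Pass "config.gcp.yaml" when submitting to Dataproc Serverless.
        with open(path) as f:
            return yaml.safe_load(f)

    def resolve(cfg: dict, key: str) -> str:
        """Join a dataset key onto the configured root (local dir or GCS bucket)."""
        root = cfg["storage"]["root"]      # e.g. "data" locally, "gs://my-bucket" on GCP
        return f"{root}/{cfg['storage'][key]}"

    cfg = load_config()
    events_path = resolve(cfg, "processed_events")
    # Spark reads local paths and gs:// URIs identically (given the GCS connector):
    # spark.read.parquet(events_path)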

The dataset is externalized (public GCS or BYO), enabling clean local ↔ cloud runs with the same configuration model.

How to verify
  • Quickstart (local)
    make setup
    ./scripts/download_data.sh
    make rerun_all_force
    make dashboard
  • Run on Dataproc Serverless
    bash scripts/gcp_run_all.sh
    # or submit a single job
    bash scripts/gcp_submit.sh
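
After either run completes, a quick check of the exported artifacts can confirm the pipeline produced what the dashboard expects (file names are from the outputs listed under Architecture; the "auc" key name is an assumption):

    import json
    from pathlib import Path

    # Presence check for the deterministic artifacts in reports/.
    for name in ["ingestion_summary.json", "features_summary.json", "metrics.json"]:
        assert (Path("reports") / name).exists(), f"missing reports/{name}"

    # Rough sanity check on the headline metric.
    metrics = json.loads(Path("reports/metrics.json").read_text())
    assert 0.0 < metrics.get("auc", 0.0) <= 1.0, "AUC outside the expected range"
    print("artifacts present, AUC =", metrics.get("auc"))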

Architecture

  1. Pipeline flow (ingest → model → dashboard)
    • Ingest: schema-enforced parquet partitioning for reliable downstream processing.
    • Feature engineering: session-level aggregates + labels with leakage-aware filtering.
    • Modeling: Spark MLlib logistic regression with time-based split when available (random fallback otherwise); a minimal training sketch follows this list.
    • Exports: metrics, plots, and dashboard aggregates saved to reports/.
  2. Architecture map (data movement)
    • Raw → processed: CSVs are normalized into partitioned parquet for stable downstream reads.
    • Processed → features: session-level joins/aggregations create the modeling table.
    • Features → model: MLlib pipeline persists model artifacts to models/.
    • Model → reports: metrics.json/csv + figures for audit and QA.
    • Reports → dashboard: Streamlit reads only reports/dashboard aggregates.
  3. Reproducibility + environment parity
    • Makefile orchestrates the same steps locally or on Dataproc Serverless.
    • config.yaml / config.gcp.yaml keep paths and Spark settings explicit.
    • Storage utilities abstract local filesystem vs GCS without code changes.
  4. Cloud migration (local → Dataproc Serverless)
    • Switch configs: local paths → gs:// buckets, Spark master → serverless batch.
    • Package source once and submit batches per pipeline step.
    • Artifacts land in GCS reports/models buckets; dashboard can read GCS directly.
  5. Dashboard artifacts
    • Streamlit app reads exported aggregates so the dashboard is decoupled from training.
    • Reports contain figures + metrics for quick review and QA checks.
  6. Outputs and how to interpret them
    • reports/ingestion_summary.json: row counts + time range for raw ingest sanity checks.
    • reports/features_summary.json: feature row count + positive rate.
    • reports/metrics.json: AUC plus threshold metrics for evaluation at a chosen cutoff.
    • reports/figures/metrics.png: visual sanity check for model performance.
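
A minimal training-and-evaluation sketch for the baseline in step 1, assuming a hypothetical session-feature table and illustrative feature names: assemble features, fit MLlib logistic regression on a time-based split, persist the model to models/, and write AUC to reports/metrics.json.

    import json
    from pyspark.sql import SparkSession, functions as F
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("train").getOrCreate()
    sessions = spark.read.parquet("data/processed/session_features")  # hypothetical feature table

    # Time-based split when a timestamp is available (random split otherwise).
    cutoff = "2021-04-01"  # hypothetical cutoff date
    train = sessions.filter(F.col("session_start") < F.lit(cutoff))
    test = sessions.filter(F.col("session_start") >= F.lit(cutoff))

    assembler = VectorAssembler(
        inputCols=["n_events", "n_event_types"],  # illustrative session features
        outputCol="features",
    )
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    model = Pipeline(stages=[assembler, lr]).fit(train)
    model.write().overwrite().save("models/logreg")   # persisted model artifact

    scored = model.transform(test)
    auc = BinaryClassificationEvaluator(
        labelCol="label", metricName="areaUnderROC"
    ).evaluate(scored)

    with open("reports/metrics.json", "w") as f:
        json.dump({"auc": auc}, f, indent=2)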

Security / Threat Model

  • Config-driven storage paths keep raw/processed/outputs deterministic across local and GCS.
  • No credentials are committed; GCP auth is handled by gcloud/gsutil locally.
  • Dataset is public and externalized to avoid bundling large data in the repo.
  • Dashboard reads precomputed aggregates only (reduces data exposure and compute costs).

Tradeoffs & Lessons

Key decisions — Spark over pandas for scale, Makefile as a single orchestrator, and config-only portability.

Trade-offs — serverless batches reduce ops but add cold-start overhead; precomputed dashboards trade freshness for speed.

Limitations — baseline model only; no experiment tracking server or model registry yet.

Next steps — MLflow tracking, CI validation of outputs, scheduled Dataproc runs, and data quality checks.

Results

A repeatable analytics ML pipeline that keeps feature builds, training results, and dashboards consistent across local and cloud environments.

Stack

Python • PySpark • Spark MLlib • GCS • Makefile • Streamlit