End-to-End CI/CD and GitOps for a Multi-Tenant Kubernetes SaaS

From Commit to Reliable Release

Role: Platform engineering internship simulation
Stack: GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes • Helm • Argo CD (GitOps) • Prometheus • Grafana • Loki • RBAC + Namespaces • Python (pytest, ruff)
Tags: CI/CD • GitOps • Kubernetes • Argo CD • Helm • Prometheus • Grafana • Loki • RBAC • FastAPI • Docker • GHCR • DevSecOps
At a glance
  • Problem
    Manual, drift-prone releases with weak traceability and isolation in a multi-tenant cluster
  • Role
    Platform engineering internship simulation
  • Stack
    GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes
  • Focus
    CI/CD • GitOps • Kubernetes
  • Results
    This project demonstrates a production-style delivery loop with auditability, tenant isolation, and observable runtime signals.

Problem

  • Release risk: manual deployments are error-prone, inconsistent, and hard to roll back under pressure.
  • Traceability gaps make it difficult to prove which version is running where and why.
  • Drift between Git and the cluster creates “works on my machine” incidents that are hard to reproduce.
  • Multi-tenant risk introduces noisy-neighbor effects and expands the blast radius without clear isolation boundaries.
  • Observability gaps mean metrics and logs aren’t tied to deploys, so debugging stays slow and reactive.
  • Security baselines require keeping secrets out of Git and enforcing least-privilege access with auditability.

Why this matters: faster and safer releases, shorter MTTR, controlled blast radius per tenant, and auditability/compliance readiness.

Executive summary
  • Repeatable releases: commit → tested image → GitOps PR → Argo CD reconcile
  • Traceable runtime: dashboards + logs tied to the exact image tag deployed
  • Safer multi-tenant setup: per-tenant values, namespaces, RBAC boundaries

Context

Architecture at a glance
Figure: Multi-tenant CI/CD + GitOps architecture. Swimlane diagram showing a developer push to GitHub Actions (lint, test, build, push to GHCR with sha-<short> and latest tags), a GitOps PR gated by human approval, Argo CD reconciling the GitOps repo into the tenant-a and tenant-b namespaces (RBAC boundaries, Kubernetes Secrets, NetworkPolicy as an extension), and telemetry flowing through Prometheus, Promtail/Loki, and Grafana. Legend: solid lines = deploy/reconcile, dashed lines = telemetry.

Architecture at a glance — A developer push triggers GitHub Actions; CI runs ruff/pytest, builds the FastAPI image, and pushes it to GHCR. A GitOps PR updates tenant values; Argo CD reconciles the desired state into the cluster. Metrics/logs flow to Prometheus/Loki and are visualized in Grafana.
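
The GitOps PR in this flow edits a small per-tenant values file. A minimal sketch of what that file might contain, assuming conventional Helm image/replica keys (the actual chart schema may differ):

# gitops/tenants/tenant-a-values.yaml (illustrative keys)
image:
  repository: ghcr.io/<org>/demo-api   # placeholder registry path
  tag: sha-1a2b3c4                     # immutable tag written by the CI-generated PR
replicaCount: 2
resources:
  requests:
    cpu: 100m
    memory: 128Mi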

This case study focuses on delivery and operational controls; the application is intentionally minimal so the pipeline and platform behavior are easy to validate.

Implementation note: this runs on a local k3d cluster; cloud-ready next steps include an EKS migration path, HPA/VPA for autoscaling, and managed observability options (CloudWatch or Amazon Managed Prometheus/Grafana).

How to verify
  • Tenant workloads
    kubectl get pods -n tenant-a && kubectl get pods -n tenant-b
  • Argo CD sync/health
    kubectl -n argocd get applications
  • API + /metrics (tenant-a)
    kubectl -n tenant-a port-forward svc/demo-api 8080:8000
    curl http://localhost:8080/metrics
  • Prometheus targets
    kubectl -n observability port-forward svc/prometheus 9090:9090
    curl http://localhost:9090/api/v1/targets
  • Loki + Promtail
    kubectl -n observability get svc loki
    kubectl -n observability get pods -l app=promtail
  • Grafana datasource list
    kubectl -n observability port-forward svc/grafana 3000:3000
    curl -u <user>:<pass> http://localhost:3000/api/datasources

Multi-tenant GitOps CI/CD for Kubernetes SaaS

This case study models a SaaS delivery system where each tenant is isolated by namespace and deployed via GitOps pull requests.

Keeping desired state in Git makes releases auditable and rollbacks deterministic.

Argo CD + Helm delivery with Prometheus/Grafana/Loki observability

Argo CD reconciles Helm releases while Prometheus, Grafana, and Loki provide metrics and logs for incident triage.

The result is a delivery path that is observable and explainable end-to-end.

Architecture

  1. CI pipeline deep dive (GitHub Actions)
    • Triggers on push to main and on pull requests; the test job runs `ruff check .` (E/F/I rules only, no auto-formatting) and `pytest -q` (auto-discovers tests; a non-zero exit fails the job).
    • Ruff enforces linting; Black handles formatting and is optional, so formatting changes stay explicit.
    • Build job creates the FastAPI image, tags `sha-<short>` and `latest`, and pushes to GHCR (a container registry, not a file store).
    • SHA tags are immutable for rollback safety; `latest` is used only for quick smoke checks.
    • A GitOps PR updates `image.tag` in the GitOps values (e.g., `gitops/tenants/*-values.yaml`) for auditability; a minimal workflow sketch follows this list.
  2. GitOps + Argo CD (CD model)
    • Desired state lives in Git; Argo CD reconciles cluster state and surfaces sync/health (an Application manifest sketch follows this list).
    • This is Continuous Delivery by default: a human approves the GitOps PR.
    • It becomes Continuous Deployment if PR merges are automated or Argo watches an auto-updated branch.
    • Rollback = revert the Git commit that bumped the image tag and let Argo CD sync back to the previous version.
  3. Multi-tenancy & isolation model
    • One Helm chart, per-tenant values, and namespace-scoped releases (`tenant-a`, `tenant-b`).
    • Namespaces provide resource scoping, RBAC boundaries, and a place to attach NetworkPolicies.
    • Not hard isolation: without network policies or node separation, tenants still share the control plane.
    • Tenants map to customers; you do not create a namespace per end-user.
  4. Observability (Prometheus + Grafana + Loki)
    • Prometheus scrapes `/metrics` via a ServiceMonitor/PodMonitor or static scrape config; primary signals are latency, error rate, and saturation.
    • Grafana dashboards + alerts; datasource provisioning avoids the "default datasource" conflict in multi-stack clusters.
    • Promtail ships pod logs into Loki; LogQL example: `{namespace="tenant-a"} |= "error"`.
    • Verification commands are listed under Architecture at a glance.
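
For the CI pipeline above, a minimal GitHub Actions workflow sketch; action versions, job names, and the image path are assumptions rather than the repo's actual workflow file (which also shortens the SHA for the sha-<short> tag):

name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff pytest
      - run: ruff check .              # lint (E/F/I rules)
      - run: pytest -q                 # unit tests
  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write                  # lets the ephemeral GITHUB_TOKEN push to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: |                      # full SHA shown here; the real pipeline uses a short form
            ghcr.io/<org>/demo-api:sha-${{ github.sha }}
            ghcr.io/<org>/demo-api:latest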
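
For the GitOps/Argo CD model, a hedged sketch of a per-tenant Argo CD Application; the repo URL, chart path, and sync policy are assumptions, not copied from the actual GitOps repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-api-tenant-a                             # one Application per tenant (illustrative name)
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>   # placeholder GitOps repo
    targetRevision: main
    path: charts/demo-api                             # assumed chart location
    helm:
      valueFiles:
        - ../../gitops/tenants/tenant-a-values.yaml   # per-tenant values, relative to the chart path (assumed layout)
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-a
  syncPolicy:
    automated:
      prune: true
      selfHeal: true                                  # Argo CD corrects drift back to Git
    syncOptions:
      - CreateNamespace=true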

Security / Threat Model

  • Secrets never live in Git; CI uses the ephemeral GITHUB_TOKEN scoped to the repo.
  • Registry auth uses docker/login-action with that token — no long-lived credentials.
  • Runtime credentials live in Kubernetes Secrets (Argo CD, Grafana, app env) and are mounted at deploy time.
  • Argo CD admin secret is generated at install, stored as a Kubernetes Secret, and can be rotated safely.
  • RBAC enforces least privilege across namespaces and service accounts; a namespace-scoped Role/RoleBinding sketch follows.
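
A namespace-scoped Role/RoleBinding sketch showing the least-privilege intent; the role name and service account are hypothetical, not taken from the repo:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-deployer            # hypothetical role name
  namespace: tenant-a
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-deployer
  namespace: tenant-a
subjects:
  - kind: ServiceAccount
    name: tenant-a-ci                # hypothetical service account
    namespace: tenant-a
roleRef:
  kind: Role
  name: tenant-a-deployer
  apiGroup: rbac.authorization.k8s.io
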
DevSecOps controls
Control             | Tool                      | Status
Lint + unit tests   | ruff (E/F/I) + pytest -q  | Implemented
SAST                | CodeQL                    | Extension
Dependency scanning | Dependabot + pip-audit    | Extension
Secret scanning     | gitleaks                  | Extension
Container scanning  | Trivy/Grype               | Extension
IaC scanning        | Checkov                   | Extension
DAST                | OWASP ZAP baseline        | Extension

Tradeoffs & Lessons

Scaling (HPA/VPA) & reliability — HPA scales replicas based on CPU/memory/custom metrics; VPA adjusts requests/limits. These are intentionally not enabled in this demo to keep resource pressure visible; in production I would add Metrics Server + Prometheus Adapter and pair with a cloud Cluster Autoscaler.

Cloud deployment design (AWS mapping) — Control plane → EKS, workers → managed node groups (or Fargate), ingress → ALB/NLB controller, registry → GHCR or ECR, secrets → AWS Secrets Manager + External Secrets Operator, observability → CloudWatch or Amazon Managed Prometheus/Grafana. Cost drivers: EKS control plane fee, EC2 nodes, EBS, NAT gateway, data egress, and Loki retention; levers: right-size requests/limits, autoscaling, spot instances, log sampling, and retention.

Trade-offs & limitations — GitOps PR gates trade speed for auditability; namespace isolation reduces blast radius but is not hard multi-tenant isolation; OSS observability adds tuning overhead; more moving parts increase operational complexity.

Extensions I can implement next — automated PR merges with policy gates, external secrets (Vault/ASM), admission control (OPA/Gatekeeper), and multi-cluster isolation for higher-risk tenants.

Extension snippets
HPA example (extension)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-api
  namespace: tenant-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-api
  minReplicas: 2                     # keep at least two replicas for availability
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # add replicas when average CPU exceeds 70%
VPA example (extension)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-api
  namespace: tenant-a
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: demo-api
  updatePolicy:
    updateMode: "Auto"               # VPA evicts pods and recreates them with updated requests/limits
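External Secrets example (extension)
A hedged sketch of syncing an AWS Secrets Manager entry into a Kubernetes Secret via External Secrets Operator, matching the AWS mapping above; the store name, remote key, and property are illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: demo-api-secrets
  namespace: tenant-a
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager        # assumes a ClusterSecretStore configured for ASM
    kind: ClusterSecretStore
  target:
    name: demo-api-secrets           # Kubernetes Secret created/updated by the operator
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: tenant-a/demo-api       # illustrative ASM secret name
        property: database_url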

Results

This project demonstrates a production-style delivery loop with auditability, tenant isolation, and observable runtime signals. It validates cloud readiness and platform engineering fundamentals across CI, GitOps reconciliation, multi-tenant deployment, and incident-friendly telemetry.

How to demo in 2 minutes
  • Show Argo CD apps are Synced/Healthy
    kubectl -n argocd get applications
  • Prove tenant separation
    kubectl get pods -n tenant-a && kubectl get pods -n tenant-b
  • Hit /metrics for a live signal
    kubectl -n tenant-a port-forward svc/demo-api 8080:8000
    curl http://localhost:8080/metrics
  • Grafana + Loki quick check
    kubectl -n observability port-forward svc/grafana 3000:3000
    # Explore > Loki: {namespace="tenant-a"} |= "error"

Stack

GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes • Helm • Argo CD (GitOps) • Prometheus • Grafana • Loki • RBAC + Namespaces • Python (pytest, ruff)

FAQ

Is this continuous deployment or continuous delivery?

It is continuous delivery by default because GitOps PRs require human approval. It becomes continuous deployment only if PR merges are automated.

Why namespaces instead of separate clusters per tenant?

Namespaces reduce cost and operational overhead while still providing RBAC scoping and blast-radius reduction. Stronger isolation can be added with network policies or dedicated clusters later.
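
A minimal NetworkPolicy sketch for that stronger isolation, restricting tenant-a ingress to pods in the same namespace (the policy name is illustrative; in practice you would also allow the observability namespace so Prometheus can scrape /metrics):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only          # illustrative name
  namespace: tenant-a
spec:
  podSelector: {}                    # applies to every pod in tenant-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}            # allow traffic only from pods in the same namespace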

How do you verify a deployment?

Check Argo CD sync/health, validate pods in tenant namespaces, and confirm /metrics and logs are visible in Grafana and Loki.
