End-to-End CI/CD and GitOps for a Multi-Tenant Kubernetes SaaS

From Commit to Reliable Release

Role: Platform engineering internship simulation
Stack: GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes • Helm • Argo CD (GitOps) • Prometheus • Grafana • Loki • RBAC + Namespaces • Python (pytest, ruff)
Tags: CI/CD • GitOps • Kubernetes • Argo CD • Helm • Prometheus • Grafana • Loki • RBAC • FastAPI • Docker • GHCR • DevSecOps
At a glance
  • Problem
    Manual, drift-prone releases with weak traceability and isolation in a multi-tenant cluster
  • Role
    Platform engineering internship simulation
  • Stack
    GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes
  • Focus
    CI/CD • GitOps • Kubernetes
  • Results
    This project demonstrates a production-style delivery loop with auditability, tenant isolation, and observable runtime signals.

Problem

  • Release risk: manual deployments are error-prone, inconsistent, and hard to roll back under pressure.
  • Traceability gaps make it difficult to prove which version is running where and why.
  • Drift between Git and the cluster creates “works on my machine” incidents that are hard to reproduce.
  • Multi-tenant risk introduces noisy-neighbor effects and expands the blast radius without clear isolation boundaries.
  • Observability gaps mean metrics and logs aren’t tied to deploys, so debugging stays slow and reactive.
  • Security baselines require keeping secrets out of Git and enforcing least-privilege access with auditability.

Why this matters: faster and safer releases, shorter MTTR, controlled blast radius per tenant, and auditability/compliance readiness.

Executive summary
  • Repeatable releases: commit → tested image → GitOps PR → Argo CD reconcile
  • Traceable runtime: dashboards + logs tied to the exact image tag deployed
  • Safer multi-tenant setup: per-tenant values, namespaces, RBAC boundaries

Context

Architecture at a glance
Figure: Multi-tenant CI/CD + GitOps architecture. Swimlane diagram showing a developer push to GitHub Actions (lint, test, build, push to GHCR with sha-<short> and latest tags), a GitOps PR gated by human approval, Argo CD reconciling the GitOps repo into the tenant-a and tenant-b namespaces (RBAC boundaries, Kubernetes Secrets, NetworkPolicy as an extension), and telemetry flowing through Prometheus, Promtail/Loki, and Grafana. Legend: solid lines = deploy/reconcile, dashed lines = telemetry.

Architecture at a glance — A developer push triggers GitHub Actions; CI runs ruff/pytest, builds the FastAPI image, and pushes it to GHCR. A GitOps PR updates tenant values; Argo CD reconciles the desired state into the cluster. Metrics/logs flow to Prometheus/Loki and are visualized in Grafana.
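
The GitOps PR in this flow edits a small per-tenant values file. A minimal sketch of what that file might contain, assuming conventional Helm image/replica keys (the actual chart schema may differ):

# gitops/tenants/tenant-a-values.yaml (illustrative keys)
image:
  repository: ghcr.io/<org>/demo-api   # placeholder registry path
  tag: sha-1a2b3c4                     # immutable tag written by the CI-generated PR
replicaCount: 2
resources:
  requests:
    cpu: 100m
    memory: 128Mi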

This case study focuses on delivery and operational controls; the application is intentionally minimal so the pipeline and platform behavior are easy to validate.

Implementation note: this runs on a local k3d cluster; cloud-ready next steps include an EKS migration path, HPA/VPA for autoscaling, and managed observability options (CloudWatch or Amazon Managed Prometheus/Grafana).

How to verify
  • Tenant workloads
    kubectl get pods -n tenant-a && kubectl get pods -n tenant-b
  • Argo CD sync/health
    kubectl -n argocd get applications
  • API + /metrics (tenant-a)
    kubectl -n tenant-a port-forward svc/demo-api 8080:8000
    curl http://localhost:8080/metrics
  • Prometheus targets
    kubectl -n observability port-forward svc/prometheus 9090:9090
    curl http://localhost:9090/api/v1/targets
  • Loki + Promtail
    kubectl -n observability get svc loki
    kubectl -n observability get pods -l app=promtail
  • Grafana datasource list
    kubectl -n observability port-forward svc/grafana 3000:3000
    curl -u <user>:<pass> http://localhost:3000/api/datasources

Multi-tenant GitOps CI/CD for Kubernetes SaaS

This case study models a SaaS delivery system where each tenant is isolated by namespace and deployed via GitOps pull requests.

Keeping desired state in Git makes releases auditable and rollbacks deterministic.

Argo CD + Helm delivery with Prometheus/Grafana/Loki observability

Argo CD reconciles Helm releases while Prometheus, Grafana, and Loki provide metrics and logs for incident triage.

The result is a delivery path that is observable and explainable end-to-end.

Architecture

  1. CI pipeline deep dive (GitHub Actions)
    • Triggers on push to main and on pull requests; the test job runs `ruff check .` (E/F/I rules only, no auto-formatting) and `pytest -q` (auto-discovers tests; a non-zero exit fails the job).
    • Ruff enforces linting; Black handles formatting and is optional, so formatting changes stay explicit.
    • Build job creates the FastAPI image, tags `sha-<short>` and `latest`, and pushes to GHCR (a container registry, not a file store).
    • SHA tags are immutable for rollback safety; `latest` is used only for quick smoke checks.
    • A GitOps PR updates `image.tag` in the GitOps values (e.g., `gitops/tenants/*-values.yaml`) for auditability; a minimal workflow sketch follows this list.
  2. GitOps + Argo CD (CD model)
    • Desired state lives in Git; Argo CD reconciles cluster state and surfaces sync/health (an Application manifest sketch follows this list).
    • This is Continuous Delivery by default: a human approves the GitOps PR.
    • It becomes Continuous Deployment if PR merges are automated or Argo watches an auto-updated branch.
    • Rollback = revert the Git commit that bumped the image tag and let Argo CD sync back to the previous version.
  3. Multi-tenancy & isolation model
    • One Helm chart, per-tenant values, and namespace-scoped releases (`tenant-a`, `tenant-b`).
    • Namespaces provide resource scoping, RBAC boundaries, and a place to attach NetworkPolicies.
    • Not hard isolation: without network policies or node separation, tenants still share the control plane.
    • Tenants map to customers; you do not create a namespace per end-user.
  4. Observability (Prometheus + Grafana + Loki)
    • Prometheus scrapes `/metrics` via a ServiceMonitor/PodMonitor or static scrape config; primary signals are latency, error rate, and saturation.
    • Grafana dashboards + alerts; datasource provisioning avoids the "default datasource" conflict in multi-stack clusters.
    • Promtail ships pod logs into Loki; LogQL example: `{namespace="tenant-a"} |= "error"`.
    • Verification commands are listed under Architecture at a glance.
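
For the CI pipeline above, a minimal GitHub Actions workflow sketch; action versions, job names, and the image path are assumptions rather than the repo's actual workflow file (which also shortens the SHA for the sha-<short> tag):

name: ci
on:
  push:
    branches: [main]
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install ruff pytest
      - run: ruff check .              # lint (E/F/I rules)
      - run: pytest -q                 # unit tests
  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write                  # lets the ephemeral GITHUB_TOKEN push to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v6
        with:
          push: true
          tags: |                      # full SHA shown here; the real pipeline uses a short form
            ghcr.io/<org>/demo-api:sha-${{ github.sha }}
            ghcr.io/<org>/demo-api:latest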
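
For the GitOps/Argo CD model, a hedged sketch of a per-tenant Argo CD Application; the repo URL, chart path, and sync policy are assumptions, not copied from the actual GitOps repo:

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-api-tenant-a                             # one Application per tenant (illustrative name)
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>   # placeholder GitOps repo
    targetRevision: main
    path: charts/demo-api                             # assumed chart location
    helm:
      valueFiles:
        - ../../gitops/tenants/tenant-a-values.yaml   # per-tenant values, relative to the chart path (assumed layout)
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-a
  syncPolicy:
    automated:
      prune: true
      selfHeal: true                                  # Argo CD corrects drift back to Git
    syncOptions:
      - CreateNamespace=true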

Security / Threat Model

  • Secrets never live in Git; CI uses the ephemeral GITHUB_TOKEN scoped to the repo.
  • Registry auth uses docker/login-action with that token — no long-lived credentials.
  • Runtime credentials live in Kubernetes Secrets (Argo CD, Grafana, app env) and are mounted at deploy time.
  • Argo CD admin secret is generated at install, stored as a Kubernetes Secret, and can be rotated safely.
  • RBAC enforces least privilege across namespaces and service accounts; a namespace-scoped Role/RoleBinding sketch follows.
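
A namespace-scoped Role/RoleBinding sketch showing the least-privilege intent; the role name and service account are hypothetical, not taken from the repo:

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: tenant-a-deployer            # hypothetical role name
  namespace: tenant-a
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-deployer
  namespace: tenant-a
subjects:
  - kind: ServiceAccount
    name: tenant-a-ci                # hypothetical service account
    namespace: tenant-a
roleRef:
  kind: Role
  name: tenant-a-deployer
  apiGroup: rbac.authorization.k8s.io
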
DevSecOps controls
Control             | Tool                      | Status
Lint + unit tests   | ruff (E/F/I) + pytest -q  | Implemented
SAST                | CodeQL                    | Extension
Dependency scanning | Dependabot + pip-audit    | Extension
Secret scanning     | gitleaks                  | Extension
Container scanning  | Trivy/Grype               | Extension
IaC scanning        | Checkov                   | Extension
DAST                | OWASP ZAP baseline        | Extension

Tradeoffs & Lessons

Scaling (HPA/VPA) & reliability — HPA scales replicas based on CPU/memory/custom metrics; VPA adjusts requests/limits. These are intentionally not enabled in this demo to keep resource pressure visible; in production I would add Metrics Server + Prometheus Adapter and pair with a cloud Cluster Autoscaler.

Cloud deployment design (AWS mapping) — Control plane → EKS, workers → managed node groups (or Fargate), ingress → ALB/NLB controller, registry → GHCR or ECR, secrets → AWS Secrets Manager + External Secrets Operator, observability → CloudWatch or Amazon Managed Prometheus/Grafana. Cost drivers: EKS control plane fee, EC2 nodes, EBS, NAT gateway, data egress, and Loki retention; levers: right-size requests/limits, autoscaling, spot instances, log sampling, and retention.

Trade-offs & limitations — GitOps PR gates trade speed for auditability; namespace isolation reduces blast radius but is not hard multi-tenant isolation; OSS observability adds tuning overhead; more moving parts increase operational complexity.

Extensions I can implement next — automated PR merges with policy gates, external secrets (Vault/ASM), admission control (OPA/Gatekeeper), and multi-cluster isolation for higher-risk tenants.

Extension snippets
HPA example (extension)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-api
  namespace: tenant-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-api
  minReplicas: 2                     # keep at least two replicas for availability
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70     # add replicas when average CPU exceeds 70%
VPA example (extension)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-api
  namespace: tenant-a
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: demo-api
  updatePolicy:
    updateMode: "Auto"               # VPA evicts pods and recreates them with updated requests/limits
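External Secrets example (extension)
A hedged sketch of syncing an AWS Secrets Manager entry into a Kubernetes Secret via External Secrets Operator, matching the AWS mapping above; the store name, remote key, and property are illustrative.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: demo-api-secrets
  namespace: tenant-a
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager        # assumes a ClusterSecretStore configured for ASM
    kind: ClusterSecretStore
  target:
    name: demo-api-secrets           # Kubernetes Secret created/updated by the operator
    creationPolicy: Owner
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: tenant-a/demo-api       # illustrative ASM secret name
        property: database_url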

Results

This project demonstrates a production-style delivery loop with auditability, tenant isolation, and observable runtime signals. It validates cloud readiness and platform engineering fundamentals across CI, GitOps reconciliation, multi-tenant deployment, and incident-friendly telemetry.

How to demo in 2 minutes
  • Show Argo CD apps are Synced/Healthy
    kubectl -n argocd get applications
  • Prove tenant separation
    kubectl get pods -n tenant-a && kubectl get pods -n tenant-b
  • Hit /metrics for a live signal
    kubectl -n tenant-a port-forward svc/demo-api 8080:8000
    curl http://localhost:8080/metrics
  • Grafana + Loki quick check
    kubectl -n observability port-forward svc/grafana 3000:3000
    # Explore > Loki: {namespace="tenant-a"} |= "error"

Stack

GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes • Helm • Argo CD (GitOps) • Prometheus • Grafana • Loki • RBAC + Namespaces • Python (pytest, ruff)

FAQ

Is this continuous deployment or continuous delivery?

It is continuous delivery by default because GitOps PRs require human approval. It becomes continuous deployment only if PR merges are automated.

Why namespaces instead of separate clusters per tenant?

Namespaces reduce cost and operational overhead while still providing RBAC scoping and blast-radius reduction. Stronger isolation can be added with network policies or dedicated clusters later.
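
A minimal NetworkPolicy sketch for that stronger isolation, restricting tenant-a ingress to pods in the same namespace (the policy name is illustrative; in practice you would also allow the observability namespace so Prometheus can scrape /metrics):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only          # illustrative name
  namespace: tenant-a
spec:
  podSelector: {}                    # applies to every pod in tenant-a
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}            # allow traffic only from pods in the same namespace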

How do you verify a deployment?

Check Argo CD sync/health, validate pods in tenant namespaces, and confirm /metrics and logs are visible in Grafana and Loki.
