- Problem: From Commit to Reliable Release
- Role: Platform engineering internship simulation
- Stack: GitHub Actions (CI) • Docker + GHCR • FastAPI • Kubernetes
- Focus: CI/CD • GitOps • Kubernetes
- Results: This project demonstrates a production-style delivery loop with auditability, tenant isolation, and observable runtime signals.
Problem
- Release risk: manual deployments are error-prone, inconsistent, and hard to roll back under pressure.
- Traceability gaps: it is difficult to prove which version is running where, and why.
- Drift: divergence between Git and the cluster creates “works on my machine” incidents that are hard to reproduce.
- Multi-tenant risk: noisy-neighbor effects and a wider blast radius without clear isolation boundaries.
- Observability gaps: metrics and logs aren’t tied to deploys, so debugging stays slow and reactive.
- Security baseline: secrets must stay out of Git, and access must be least-privilege and auditable.
Why this matters: faster and safer releases, shorter MTTR, controlled blast radius per tenant, and auditability/compliance readiness.
- Repeatable releases: commit → tested image → GitOps PR → Argo CD reconcile
- Traceable runtime: dashboards + logs tied to the exact image tag deployed
- Safer multi-tenant setup: per-tenant values, namespaces, RBAC boundaries
Context
Architecture at a glance — A developer push triggers GitHub Actions; CI runs ruff/pytest, builds the FastAPI image, and pushes it to GHCR. A GitOps PR updates tenant values; Argo CD reconciles the desired state into the cluster. Metrics/logs flow to Prometheus/Loki and are visualized in Grafana.
This case study focuses on delivery and operational controls; the application is intentionally minimal so the pipeline and platform behavior are easy to validate.
Implementation note: this runs on a local k3d cluster; cloud-ready next steps include an EKS migration path, HPA/VPA for autoscaling, and managed observability options (CloudWatch or Amazon Managed Prometheus/Grafana).
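A minimal sketch of that CI workflow, assuming a conventional repo layout (the image name `demo-api`, the requirements file, and the action versions are illustrative, not the exact pipeline):

```yaml
# .github/workflows/ci.yml — illustrative sketch of the CI flow described above
name: ci
on:
  push:
    branches: [main]
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install -r requirements.txt ruff pytest   # requirements file is an assumption
      - run: ruff check .                                   # lint (E/F/I rules)
      - run: pytest -q                                      # unit tests

  build:
    needs: test
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write          # allows the ephemeral GITHUB_TOKEN to push to GHCR
    steps:
      - uses: actions/checkout@v4
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: |
            ghcr.io/${{ github.repository }}/demo-api:sha-${{ github.sha }}
            ghcr.io/${{ github.repository }}/demo-api:latest
          # full SHA shown for brevity; the pipeline described above uses a short SHA
```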
- Tenant workloads: `kubectl get pods -n tenant-a && kubectl get pods -n tenant-b`
- Argo CD sync/health: `kubectl -n argocd get applications`
- API + /metrics (tenant-a): `kubectl -n tenant-a port-forward svc/demo-api 8080:8000`, then `curl http://localhost:8080/metrics`
- Prometheus targets: `kubectl -n observability port-forward svc/prometheus 9090:9090`, then `curl http://localhost:9090/api/v1/targets`
- Loki + Promtail: `kubectl -n observability get svc loki` and `kubectl -n observability get pods -l app=promtail`
- Grafana datasource list: `kubectl -n observability port-forward svc/grafana 3000:3000`, then `curl -u <user>:<pass> http://localhost:3000/api/datasources`
Multi-tenant GitOps CI/CD for Kubernetes SaaS
This case study models a SaaS delivery system where each tenant is isolated by namespace and deployed via GitOps pull requests.
Keeping desired state in Git makes releases auditable and rollbacks deterministic.
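For illustration, assuming the per-tenant values layout referenced later (`gitops/tenants/*-values.yaml`), a release PR is typically a one-line change:

```yaml
# gitops/tenants/tenant-a-values.yaml — illustrative keys and tag
image:
  repository: ghcr.io/<org>/demo-api
  tag: sha-3f2c1ab   # bumped by the CI-opened GitOps PR; revert this commit to roll back
replicaCount: 2
```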
Argo CD + Helm delivery with Prometheus/Grafana/Loki observability
Argo CD reconciles Helm releases while Prometheus, Grafana, and Loki provide metrics and logs for incident triage.
The result is a delivery path that is observable and explainable end-to-end.
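A sketch of what each tenant's Argo CD Application could look like under this model (the repo URL, chart path, and values file name are assumptions):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: demo-api-tenant-a
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/<org>/<gitops-repo>
    targetRevision: main
    path: charts/demo-api              # one shared Helm chart
    helm:
      valueFiles:
        - values-tenant-a.yaml         # per-tenant overrides
  destination:
    server: https://kubernetes.default.svc
    namespace: tenant-a                # namespace-scoped release
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
```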
Architecture
- CI pipeline deep dive (GitHub Actions)
- Triggers on push to main and on pull requests; the test job runs `ruff check .` (E/F/I rules only, no auto-format) and `pytest -q` (auto-discovers tests and fails on non-zero exit).
  - Ruff enforces linting; Black handles formatting and is optional, so formatting changes stay explicit.
- Build job creates the FastAPI image, tags `sha-<short>` and `latest`, and pushes to GHCR (a container registry, not a file store).
- SHA tags are immutable for rollback safety; `latest` is used only for quick smoke checks.
- A GitOps PR updates `image.tag` in GitOps values (e.g., `gitops/tenants/*-values.yaml`) for auditability.
- GitOps + Argo CD (CD model)
- Desired state lives in Git; Argo CD reconciles cluster state and surfaces sync/health.
- This is Continuous Delivery by default: a human approves the GitOps PR.
- It becomes Continuous Deployment if PR merges are automated or Argo watches an auto-updated branch.
  - Rollback = revert the GitOps commit that bumped the tag and let Argo CD sync back to the previous image.
- Multi-tenancy & isolation model
- One Helm chart, per-tenant values, and namespace-scoped releases (`tenant-a`, `tenant-b`).
- Namespaces provide resource scoping, RBAC boundaries, and a place to attach NetworkPolicies.
  - Not hard isolation: without network policies or node separation, tenants still share the control plane (a NetworkPolicy sketch follows this section).
- Tenants map to customers; you do not create a namespace per end-user.
- Observability (Prometheus + Grafana + Loki)
- Prometheus scrapes `/metrics` via Service/PodMonitor or scrape config; primary signals are latency, error rate, and saturation.
- Grafana dashboards + alerts; datasource provisioning avoids the "default datasource" conflict in multi-stack clusters.
- Promtail ships pod logs into Loki; LogQL example: `{namespace="tenant-a"} |= "error"`.
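If the Prometheus Operator route is used, the scrape wiring is a small ServiceMonitor; a sketch, where the `release` label and the port name are assumptions that must match the actual Prometheus selector and Service:

```yaml
# Illustrative ServiceMonitor so Prometheus scrapes the demo API's /metrics endpoint
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: demo-api
  namespace: tenant-a
  labels:
    release: prometheus          # must match the Prometheus Operator's serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: demo-api
  endpoints:
    - port: http                 # named port on the demo-api Service
      path: /metrics
      interval: 30s
```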
- Verification commands are listed under Architecture at a glance.
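As referenced in the isolation model above, tightening tenant boundaries would start with a default-deny NetworkPolicy per namespace; a minimal sketch for `tenant-a`:

```yaml
# Deny all ingress to tenant-a pods, then explicitly allow same-namespace traffic.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: tenant-a
spec:
  podSelector: {}            # applies to every pod in the namespace
  policyTypes:
    - Ingress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace
  namespace: tenant-a
spec:
  podSelector: {}
  ingress:
    - from:
        - podSelector: {}    # only pods already in tenant-a may connect
  policyTypes:
    - Ingress
```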
Security / Threat Model
- Secrets never live in Git; CI uses the ephemeral GITHUB_TOKEN scoped to the repo.
- Registry auth uses docker/login-action with that token — no long-lived credentials.
- Runtime credentials live in Kubernetes Secrets (Argo CD, Grafana, app env) and are mounted at deploy time.
- Argo CD admin secret is generated at install, stored as a Kubernetes Secret, and can be rotated safely.
- RBAC enforces least privilege across namespaces and service accounts.
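A namespace-scoped Role/RoleBinding sketch of what least privilege looks like here (the service account name and verb list are assumptions, not the exact manifests in the repo):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: demo-api-deployer
  namespace: tenant-a
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list", "watch", "update", "patch"]
  - apiGroups: [""]
    resources: ["pods", "services", "configmaps"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: demo-api-deployer
  namespace: tenant-a
subjects:
  - kind: ServiceAccount
    name: demo-api-deployer
    namespace: tenant-a
roleRef:
  kind: Role
  name: demo-api-deployer
  apiGroup: rbac.authorization.k8s.io
```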
| Control | Tool | Status |
|---|---|---|
| Lint + unit tests | ruff (E/F/I) + pytest -q | Implemented |
| SAST | CodeQL | Extension |
| Dependency scanning | Dependabot + pip-audit | Extension |
| Secret scanning | gitleaks | Extension |
| Container scanning | Trivy/Grype | Extension |
| IaC scanning | Checkov | Extension |
| DAST | OWASP ZAP baseline | Extension |
Tradeoffs & Lessons
Scaling (HPA/VPA) & reliability — HPA scales replicas based on CPU/memory/custom metrics; VPA adjusts requests/limits. These are intentionally not enabled in this demo to keep resource pressure visible; in production I would add Metrics Server + Prometheus Adapter and pair with a cloud Cluster Autoscaler.
Cloud deployment design (AWS mapping) — Control plane → EKS, workers → managed node groups (or Fargate), ingress → ALB/NLB controller, registry → GHCR or ECR, secrets → AWS Secrets Manager + External Secrets Operator, observability → CloudWatch or Amazon Managed Prometheus/Grafana. Cost drivers: EKS control plane fee, EC2 nodes, EBS, NAT gateway, data egress, and Loki retention; levers: right-size requests/limits, autoscaling, spot instances, log sampling, and retention.
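For the secrets piece of that mapping, an ExternalSecret sketch (the store name and secret path are assumptions) shows how a Kubernetes Secret would be synced from AWS Secrets Manager instead of living in Git:

```yaml
# Illustrative ExternalSecret pulling the Grafana admin password from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: grafana-admin
  namespace: observability
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager      # ClusterSecretStore configured separately with IAM access
    kind: ClusterSecretStore
  target:
    name: grafana-admin            # Kubernetes Secret created and kept in sync by the operator
  data:
    - secretKey: admin-password
      remoteRef:
        key: prod/grafana/admin-password
```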
Trade-offs & limitations — GitOps PR gates trade speed for auditability; namespace isolation reduces blast radius but is not hard multi-tenant isolation; OSS observability adds tuning overhead; more moving parts increase operational complexity.
Extensions I can implement next — automated PR merges with policy gates, external secrets (Vault/ASM), admission control (OPA/Gatekeeper), and multi-cluster isolation for higher-risk tenants.
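The manifests below show the HPA and VPA configuration discussed above for the demo API (not applied in this demo):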
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: demo-api
  namespace: tenant-a
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: demo-api
  minReplicas: 2
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: demo-api
  namespace: tenant-a
spec:
  targetRef:
    apiVersion: "apps/v1"
    kind: Deployment
    name: demo-api
  updatePolicy:
    updateMode: "Auto"
```

Results
This project demonstrates a production-style delivery loop with auditability, tenant isolation, and observable runtime signals. It validates cloud readiness and platform engineering fundamentals across CI, GitOps reconciliation, multi-tenant deployment, and incident-friendly telemetry.
- Show Argo CD apps are Synced/Healthy: `kubectl -n argocd get applications`
- Prove tenant separation: `kubectl get pods -n tenant-a && kubectl get pods -n tenant-b`
- Hit /metrics for a live signal: `kubectl -n tenant-a port-forward svc/demo-api 8080:8000`, then `curl http://localhost:8080/metrics`
- Grafana + Loki quick check: `kubectl -n observability port-forward svc/grafana 3000:3000`, then in Explore > Loki run `{namespace="tenant-a"} |= "error"`
Stack
GitHub Actions • Docker + GHCR • FastAPI • Kubernetes (k3d) • Helm • Argo CD • Prometheus • Grafana • Loki
FAQ
Is this continuous deployment or continuous delivery?
It is continuous delivery by default because GitOps PRs require human approval. It becomes continuous deployment only if PR merges are automated.
Why namespaces instead of separate clusters per tenant?
Namespaces reduce cost and operational overhead while still providing RBAC scoping and blast-radius reduction. Stronger isolation can be added with network policies or dedicated clusters later.
How do you verify a deployment?
Check Argo CD sync/health, validate pods in tenant namespaces, and confirm /metrics and logs are visible in Grafana and Loki.