
Kubernetes Monitoring & Autoscaling System

A production-grade monitoring and autoscaling stack built entirely from scratch: no Helm charts, no shortcuts. The system collects custom application metrics from a Go service, stores them in Prometheus, and uses a custom-written autoscaler controller to make intelligent scaling decisions based on real application behavior rather than just CPU and memory.

The architecture diagram above covers the full flow: Terraform provisions everything, WatcherBot generates the signals, Prometheus scrapes them, and the autoscaler controller closes the loop by hitting the Kubernetes API to adjust replicas.


What This Actually Does

Most autoscaling setups scale on CPU. That's a bad proxy for real load — a service waiting on a DB query has low CPU but is completely backed up. This system instead scales on watcherBot_active_tasks, a custom metric that reflects what the application is actually doing.

The autoscaler runs a decision loop every 15 seconds:

  1. Queries Prometheus for sum(watcherBot_active_tasks)
  2. Calculates desired replicas: ceil(active_tasks / 10) — 10 tasks per replica is the target
  3. Clamps between 1 and 5 replicas
  4. Hits the Kubernetes Deployments API to apply the change if anything shifted

Real example: 25 active tasks → 3 replicas. Drop to 5 tasks → back to 1. The full loop — metric generation, Prometheus scrape, controller query, Kubernetes API update — works end to end.
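
For reference, here is a minimal sketch of step 1 against Prometheus's standard /api/v1/query HTTP API. The in-cluster Prometheus URL is an assumption, and error handling is trimmed compared to the real controller:

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
    "net/url"
    "strconv"
)

// promResponse models only the fields of the /api/v1/query response
// that the controller needs.
type promResponse struct {
    Data struct {
        Result []struct {
            Value [2]interface{} `json:"value"` // [timestamp, "value-as-string"]
        } `json:"result"`
    } `json:"data"`
}

// activeTasks runs sum(watcherBot_active_tasks) against Prometheus and
// returns the current total.
func activeTasks(promURL string) (float64, error) {
    q := url.QueryEscape("sum(watcherBot_active_tasks)")
    resp, err := http.Get(promURL + "/api/v1/query?query=" + q)
    if err != nil {
        return 0, err
    }
    defer resp.Body.Close()

    var pr promResponse
    if err := json.NewDecoder(resp.Body).Decode(&pr); err != nil {
        return 0, err
    }
    if len(pr.Data.Result) == 0 {
        return 0, nil // no series yet: treat as zero load
    }
    val, ok := pr.Data.Result[0].Value[1].(string)
    if !ok {
        return 0, fmt.Errorf("unexpected value type in Prometheus response")
    }
    return strconv.ParseFloat(val, 64)
}

func main() {
    tasks, err := activeTasks("http://prometheus.monitoring.svc:9090")
    if err != nil {
        fmt.Println("query failed:", err)
        return
    }
    fmt.Printf("active_tasks=%v\n", tasks)
}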


Architecture


(See diagram above)

  • Terraform + K8s Manifests — provisions the entire stack as IaC. One terraform apply, full stack up. terraform destroy, gone.
  • Monitoring Namespace — isolated namespace containing all monitoring components
  • Node Exporter (DaemonSet) — hardware/OS metrics from every node, automatically
  • Kube-State-Metrics — Kubernetes object state (pod count, deployment status, etc.)
  • WatcherBot Exporter — custom Go service exposing watcherBot_active_tasks, watcherBot_requests_total, and request latency
  • Prometheus — scrapes all targets every 15s, stores time-series data with persistent volumes
  • Grafana — dashboards fed by Prometheus; replica count correlated against active tasks is the money shot
  • Custom Autoscaler Controller — the actual brain; queries Prometheus, calculates desired replicas, updates deployments

Components

WatcherBot (Custom Go Exporter)

Three endpoints:

  • /metrics — Prometheus scrapes this
  • /start_task — increments watcherBot_active_tasks gauge
  • /finish_task — decrements it
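
A minimal sketch of what such an exporter looks like with prometheus/client_golang. The metric names and endpoint paths match the ones above; the port is an assumption and the latency histogram is omitted for brevity:

package main

import (
    "net/http"

    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
    "github.com/prometheus/client_golang/prometheus/promhttp"
)

var (
    // Gauge the autoscaler scales on: current number of in-flight tasks.
    activeTasks = promauto.NewGauge(prometheus.GaugeOpts{
        Name: "watcherBot_active_tasks",
        Help: "Number of tasks currently in flight.",
    })
    // Counter of all requests handled by the exporter.
    requestsTotal = promauto.NewCounter(prometheus.CounterOpts{
        Name: "watcherBot_requests_total",
        Help: "Total requests handled.",
    })
)

func main() {
    http.Handle("/metrics", promhttp.Handler())
    http.HandleFunc("/start_task", func(w http.ResponseWriter, r *http.Request) {
        activeTasks.Inc()
        requestsTotal.Inc()
        w.WriteHeader(http.StatusOK)
    })
    http.HandleFunc("/finish_task", func(w http.ResponseWriter, r *http.Request) {
        activeTasks.Dec()
        requestsTotal.Inc()
        w.WriteHeader(http.StatusOK)
    })
    http.ListenAndServe(":8088", nil)
}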

Dockerized with a multi-stage build. Image went from 1.2GB (naive) to 20MB (Alpine base, compiled binary only).

Autoscaler Controller (Custom Go Controller)

  • Uses in-cluster config when running inside K8s, falls back to ~/.kube/config locally (see the sketch after this list)
  • Exponential backoff on Prometheus query failures — doesn't panic and make bad decisions when Prometheus is temporarily unreachable
  • Needs only get + update on Deployments — minimal RBAC, nothing more
  • Logs every scaling decision for audit: active_tasks=25 current_replicas=1 desired_replicas=3 → scaling up
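
A minimal sketch of the config fallback and the get + update path with client-go. The watcher-bot Deployment name and monitoring namespace match the commands shown later; the fixed replica count and the omitted retry/backoff logic are simplifications, not the controller's actual code:

package main

import (
    "context"
    "log"
    "os"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/rest"
    "k8s.io/client-go/tools/clientcmd"
)

// newClient prefers the in-cluster ServiceAccount config and falls back
// to the local kubeconfig when the controller runs outside the cluster.
func newClient() (*kubernetes.Clientset, error) {
    cfg, err := rest.InClusterConfig()
    if err != nil {
        kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
        cfg, err = clientcmd.BuildConfigFromFlags("", kubeconfig)
        if err != nil {
            return nil, err
        }
    }
    return kubernetes.NewForConfig(cfg)
}

// scale gets the Deployment and only issues an update when the desired
// replica count actually differs from the current spec.
func scale(cs *kubernetes.Clientset, replicas int32) error {
    deployments := cs.AppsV1().Deployments("monitoring")
    dep, err := deployments.Get(context.TODO(), "watcher-bot", metav1.GetOptions{})
    if err != nil {
        return err
    }
    if dep.Spec.Replicas != nil && *dep.Spec.Replicas == replicas {
        return nil // nothing shifted, nothing to update
    }
    dep.Spec.Replicas = &replicas
    _, err = deployments.Update(context.TODO(), dep, metav1.UpdateOptions{})
    return err
}

func main() {
    cs, err := newClient()
    if err != nil {
        log.Fatal(err)
    }
    // Fixed value for illustration; the real controller feeds in the result
    // of the Prometheus query and the ceil/clamp calculation.
    if err := scale(cs, 3); err != nil {
        log.Fatal(err)
    }
    log.Println("desired_replicas=3 applied")
}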

Terraform (IaC)

Converts the entire stack — namespace, RBAC, PVCs, Deployments, Services, ConfigMaps — into reproducible HCL. Lesson learned the hard way: renaming a Terraform resource mid-project makes Terraform want to destroy and recreate it. Run terraform plan before every apply.

Prometheus RBAC

Prometheus needs cluster-wide read permissions to discover pods, nodes, and services. Standard setup:

  1. ServiceAccount — identity for the Prometheus pod
  2. ClusterRole — get, list, watch on nodes, endpoints, pods, configmaps
  3. ClusterRoleBinding — ties them together

Spent 2 hours debugging why Prometheus couldn't discover pods. Missing list permission on endpoints. RBAC errors are silent in the worst way.


Autoscaler Algorithm

TARGET_TTR = 10.0  // tasks per replica

desiredReplicas = ceil(activeTasks / TARGET_TTR)
desiredReplicas = clamp(desiredReplicas, MIN=1, MAX=5)

Scaling table:

Active Tasks   Replicas   Reason
0–10           1          minimum
11–20          2          ceil(15/10) = 2
21–30          3          ceil(25/10) = 3
31–40          4          ceil(35/10) = 4
41+            5          maximum
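
The same formula as a direct Go translation (illustrative, not the controller's exact code); the loop at the end reproduces the table rows:

package main

import (
    "fmt"
    "math"
)

const (
    targetTasksPerReplica = 10.0
    minReplicas           = 1
    maxReplicas           = 5
)

// desiredReplicas implements ceil(activeTasks / target) clamped to [min, max].
func desiredReplicas(activeTasks float64) int {
    n := int(math.Ceil(activeTasks / targetTasksPerReplica))
    if n < minReplicas {
        return minReplicas
    }
    if n > maxReplicas {
        return maxReplicas
    }
    return n
}

func main() {
    // One sample value per row of the table above.
    for _, tasks := range []float64{0, 15, 25, 35, 50} {
        fmt.Printf("active_tasks=%v -> replicas=%d\n", tasks, desiredReplicas(tasks))
    }
}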

Tech Stack

  • Language: Go
  • Monitoring: Prometheus, Grafana, Node Exporter, Kube-State-Metrics
  • Orchestration: Kubernetes (MicroK8s)
  • IaC: Terraform
  • Containerization: Docker (multi-stage builds)
  • Kubernetes Client: client-go (for the autoscaler controller)
  • Storage: PersistentVolumeClaims for Prometheus and Grafana

Running It

Option 1: Terraform (recommended)

cd infrastructure/terraform
terraform init
terraform plan
terraform apply

Option 2: Raw Manifests

kubectl create namespace monitoring
kubectl apply -f infrastructure/monitoring/prometheus-rbac.yaml
kubectl apply -f infrastructure/monitoring/prometheus-config.yaml
kubectl apply -f infrastructure/monitoring/prometheus-deployment.yaml
kubectl apply -f infrastructure/monitoring/grafana.yaml
kubectl apply -f infrastructure/monitoring/watcher-bot.yaml

Testing the Autoscaler

# Simulate load
for i in {1..25}; do curl http://localhost:8088/start_task; done

# Watch scaling happen
kubectl get deployment watcher-bot -n monitoring -w

# Watch controller logs
kubectl logs -f -n monitoring <autoscaler-pod-name>

What I Built, Phase by Phase

Phase 1 — Manual deployment: Prometheus + Grafana + Node Exporter + Kube-State-Metrics + Persistent storage + RBAC. Everything deployed as raw YAML manifests to understand exactly what was happening.

Phase 2 — Converted the entire stack to Terraform. Built WatcherBot (custom exporter in Go). Built the autoscaler controller. Dockerized both with multi-stage builds.


Hard Parts

  • RBAC debugging — Prometheus couldn't discover pods. Missing list permission. 2 hours.
  • PVC on MicroK8s — needs microk8s-hostpath storage class, not standard. Not obvious.
  • Terraform state — renamed a resource, Terraform wanted to destroy and recreate it. Now I always run plan first.
  • In-cluster config for the autoscaler — ServiceAccount wasn't mounted properly, controller was failing silently.

Future

  • Alerting rules in Prometheus + Grafana alerts wired to somewhere useful (Slack)
  • Helm chart so this is actually distributable
  • Cooldown window in the autoscaler to prevent thrashing at boundary values
  • HPA integration for comparison
     
