What metrics do you collect and how is the platform monitored?
Every node in the cluster runs a Prometheus node exporter and a Datadog agent. The kube-prometheus-stack continuously collects:
- Node-level: CPU, memory, disk I/O, network throughput
- Kubernetes: pod health, deployment status, PVC usage, API server latency, kubelet metrics
- Application (per dyno): CPU %, memory % against allocated limits, over configurable rolling windows
- Addon-level:
  - PostgreSQL: active connections, transaction commits, cache hit ratio, database size
  - Elasticsearch: query rate, fetch time, indexing time, JVM memory used, cluster health, document count
  - Redis: memory usage, connections
Grafana dashboards provide real-time visibility into all of the above.
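To illustrate what "CPU % or memory % against allocated limits, over a rolling window" means in practice, here is a minimal Python sketch. It is not the platform's actual implementation: the function name, the sample format, and the choice of a simple mean over the last N samples are all assumptions made for the example.

```python
from collections import deque


def rolling_percent_of_limit(samples, limit, window):
    """Yield mean usage over the last `window` samples as a % of `limit`.

    Hypothetical helper for illustration only. `samples` is an iterable of
    raw usage values (for example, bytes of memory) and `limit` is the
    allocated limit in the same unit.
    """
    if limit <= 0 or window <= 0:
        raise ValueError("limit and window must be positive")
    recent = deque(maxlen=window)  # retains only the newest `window` samples
    for value in samples:
        recent.append(value)
        mean_usage = sum(recent) / len(recent)
        yield 100.0 * mean_usage / limit


# Example: a dyno with a 1024 MB limit, averaged over a 2-sample window.
percentages = list(rolling_percent_of_limit([256, 512, 768], limit=1024, window=2))
```

A shorter window reacts faster to spikes, while a longer one smooths them out, which is why the window is described as configurable.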
How does alerting work?
Honeybadger monitors application errors and infrastructure health. Alerts are routed to PagerDuty (for on-call paging) and Slack (for team visibility). The Deploy team is on call for infrastructure incidents.
Users can also set their own consumption alerts: configurable CPU, memory, and storage thresholds per app or addon, with notifications by email, Slack, or PagerDuty at frequencies ranging from every 5 minutes to weekly.
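The logic behind a consumption alert can be sketched roughly as below. This is an illustrative Python sketch only: the `ConsumptionAlert` fields and the `should_notify` function are hypothetical names, and it assumes that an alert fires when usage is at or above the threshold and that the configured frequency acts as a minimum interval between notifications.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsumptionAlert:
    """Hypothetical shape of a user-defined alert (not the platform's schema)."""
    metric: str           # "cpu", "memory", or "storage"
    threshold_pct: float  # notify when usage reaches this percentage
    channel: str          # "email", "slack", or "pagerduty"
    frequency_s: int      # assumed: minimum seconds between notifications


def should_notify(alert: ConsumptionAlert, usage_pct: float,
                  now_s: int, last_sent_s: Optional[int] = None) -> bool:
    """Return True if usage meets the threshold and the interval has elapsed."""
    if usage_pct < alert.threshold_pct:
        return False  # below threshold: nothing to report
    if last_sent_s is None:
        return True   # first breach: notify immediately
    return now_s - last_sent_s >= alert.frequency_s


# Example: memory alert at 80%, notifying on Slack at most every 5 minutes.
alert = ConsumptionAlert("memory", 80.0, "slack", frequency_s=5 * 60)
```

With this alert, a reading of 85% would trigger a notification on the first breach and then again only once 300 seconds have passed since the last one.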