Kubernetes Observability Stack: Prometheus + Grafana (Done the Right Way)
A practical guide to deploying Prometheus and Grafana in Kubernetes with sane defaults: ServiceMonitor discovery, alerting, dashboards, retention, scaling, and the gotchas that cause noisy pages.
Prometheus + Grafana is the most common "first observability stack" for Kubernetes, and it works extremely well when you set it up with clear goals: reliable scraping, consistent labels, actionable alerts, and predictable retention/costs. Here's the approach we use for production clusters.
1) Use the Prometheus Operator (kube-prometheus-stack)
In most clusters, the fastest path to a reliable baseline is the Prometheus Operator (via the kube-prometheus-stack Helm chart). You get Prometheus, Alertmanager, Grafana, and CRDs (ServiceMonitor/PodMonitor) built for Kubernetes-native discovery.
- Prefer ServiceMonitor/PodMonitor over static scrape configs.
- Ship with kube-state-metrics + node-exporter for core cluster signals.
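As a rough illustration of the ServiceMonitor preference above, the following Python sketch builds a ServiceMonitor manifest as a plain dictionary and prints it as JSON, which `kubectl apply -f` accepts alongside YAML. The service name `checkout`, the namespace `shop`, the port name `metrics`, and the `release` label value are all assumptions for illustration; the label your Prometheus instance actually selects on depends on how the chart was installed.

```python
import json


def service_monitor(name: str, namespace: str, app_label: str,
                    port_name: str = "metrics", interval: str = "30s",
                    release: str = "kube-prometheus-stack") -> dict:
    """Build a ServiceMonitor manifest for Services labelled with `app_label`."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: Prometheus selects ServiceMonitors by this label.
            "labels": {"release": release},
        },
        "spec": {
            # Match Services (not Pods) carrying this label.
            "selector": {"matchLabels": {"app.kubernetes.io/name": app_label}},
            "namespaceSelector": {"matchNames": [namespace]},
            # `port` is the *name* of the Service port that exposes metrics.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical service; pipe the output to `kubectl apply -f -`.
    print(json.dumps(service_monitor("checkout", "shop", "checkout"), indent=2))
```

Generating the manifest from one function keeps the port name, interval, and selector label identical across services, which is most of what "consistent discovery" means in practice.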
2) Alert on symptoms, not noise (SLO-first)
Define SLOs per service and alert on user-impacting symptoms. The goal is fewer alerts with clear action, not "more charts."
- Golden signals: latency, traffic, errors, saturation.
- Every alert should include a runbook link.
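One way to turn an SLO into a symptom-based alert is a burn-rate rule: page only when the error budget is being consumed much faster than the SLO allows, checked over a long and a short window so the alert clears quickly once the problem stops. The Python sketch below generates such a rule with a `runbook_url` annotation. The metric `http_requests_total`, its `service` and `code` labels, the service name, and the runbook URL are assumptions for illustration; the 14.4x factor over 1h and 5m windows is a commonly used starting point for a 30-day SLO, not a value taken from this guide.

```python
import json


def burn_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error ratio at which the error budget burns `burn_rate` times too fast."""
    return burn_rate * (1.0 - slo_target)


def error_ratio(service: str, window: str) -> str:
    """PromQL for the share of 5xx responses over `window` (metric name assumed)."""
    selector = f'service="{service}"'
    return (
        f'sum(rate(http_requests_total{{{selector},code=~"5.."}}[{window}]))'
        f" / sum(rate(http_requests_total{{{selector}}}[{window}]))"
    )


def fast_burn_alert(service: str, slo_target: float, runbook_url: str,
                    burn_rate: float = 14.4) -> dict:
    """Alert rule that fires only when both the 1h and 5m windows exceed the threshold."""
    threshold = round(burn_rate_threshold(slo_target, burn_rate), 6)
    expr = (
        f"({error_ratio(service, '1h')} > {threshold})"
        f" and ({error_ratio(service, '5m')} > {threshold})"
    )
    return {
        "alert": "ErrorBudgetFastBurn",
        "expr": expr,
        "for": "2m",
        "labels": {"severity": "page", "service": service},
        "annotations": {
            "summary": f"{service} is burning its error budget {burn_rate}x too fast",
            "runbook_url": runbook_url,
        },
    }


def prometheus_rule(name: str, namespace: str, rules: list) -> dict:
    """Wrap alert rules in a PrometheusRule object for the Prometheus Operator."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"groups": [{"name": f"{name}.rules", "rules": rules}]},
    }


if __name__ == "__main__":
    # Hypothetical service and runbook location.
    rule = fast_burn_alert("checkout", 0.999, "https://runbooks.example.com/checkout-errors")
    print(json.dumps(prometheus_rule("checkout-slo", "shop", [rule]), indent=2))
```

For a 99.9% target the threshold works out to an error ratio of 0.0144, so short error spikes that do not threaten the budget never page anyone.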
3) Retention, storage, and scaling that won't surprise you
- Set retention intentionally (e.g., 7 to 15 days of hot metrics).
- Use persistent volumes for Prometheus and plan capacity (disk + memory).
- Control cardinality early (avoid labels like request_id, user_id, or paths that contain IDs).
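The capacity and cardinality points above can be made concrete with a little arithmetic. The Python sketch below uses the sizing rule from the Prometheus storage documentation (retention seconds x ingested samples per second x bytes per sample, with roughly 1 to 2 bytes per sample) and a worst-case series count computed as the product of label cardinalities. The series count, scrape interval, and label value counts are made-up inputs for illustration, and the result is a planning estimate, not a guarantee.

```python
def estimate_disk_bytes(active_series: int, scrape_interval_s: float,
                        retention_days: float, bytes_per_sample: float = 2.0) -> float:
    """Rough disk need: retention seconds * samples per second * bytes per sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


def worst_case_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    total = 1
    for distinct_values in label_cardinalities.values():
        total *= distinct_values
    return total


if __name__ == "__main__":
    # Hypothetical cluster: 1M active series, 30s scrape interval, 15 days retention.
    disk = estimate_disk_bytes(1_000_000, 30, 15)
    print(f"estimated disk: {disk / 1e9:.1f} GB")

    # Hypothetical HTTP metric with bounded labels...
    bounded = {"method": 5, "status": 10, "route": 50}
    print("bounded labels:", worst_case_series(bounded))

    # ...and the same metric after someone adds a user_id label.
    unbounded = dict(bounded, user_id=100_000)
    print("with user_id:", worst_case_series(unbounded))
```

With these inputs the disk estimate is about 86 GB before headroom, and a single unbounded label raises the worst case for one metric from 2,500 series to 250 million, which is why cardinality is worth controlling before it reaches production.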
4) When you outgrow single-cluster Prometheus: remote-write
For multi-cluster, long-term retention, or global queries, keep Prometheus "hot + local" and remote-write to Thanos/Mimir/Cortex/VictoriaMetrics.
- Use remote-write for long retention and cross-cluster dashboards.
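A minimal sketch of the "hot + local plus remote-write" setup, expressed as Helm values for kube-prometheus-stack and generated from Python. It sets short local retention, a `cluster` external label so series from different clusters stay distinguishable in the remote store, and one remote-write target. The endpoint URL and cluster name are placeholders, and the exact push path depends on which backend you choose. The output is JSON, which is also valid YAML, so it should be usable as a values file; treat that as an assumption to verify against your Helm version.

```python
import json


def remote_write_values(cluster: str, remote_url: str, retention: str = "15d") -> dict:
    """Helm values fragment: local retention, cluster label, and one remote-write target."""
    return {
        "prometheus": {
            "prometheusSpec": {
                # Keep only recent ("hot") data on the local volume.
                "retention": retention,
                # Attached to every series sent to the remote store.
                "externalLabels": {"cluster": cluster},
                "remoteWrite": [{"url": remote_url}],
            }
        }
    }


if __name__ == "__main__":
    # Hypothetical cluster name and endpoint.
    values = remote_write_values("prod-eu-1", "https://metrics.example.com/api/v1/push")
    print(json.dumps(values, indent=2))
```

Generating one values file per cluster from the same function keeps the `cluster` label consistent, which cross-cluster dashboards depend on.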
5) Grafana: secure, repeatable, and drift-free
- Enable auth (SSO if available) and restrict admin rights.
- Provision dashboards/datasources as code to avoid click-ops drift.
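To illustrate dashboards as code, the Python sketch below generates a small Grafana dashboard JSON model with a stable `uid` and wraps it in a ConfigMap. It assumes the kube-prometheus-stack Grafana sidecar is enabled and watches for ConfigMaps labelled `grafana_dashboard: "1"`; check your chart values, since that label is configurable. The dashboard title, panel queries, and namespace are hypothetical, and panels rely on the default datasource.

```python
import hashlib
import json


def stable_uid(title: str) -> str:
    """Derive a deterministic uid so re-generating never creates a duplicate dashboard."""
    return hashlib.sha1(title.encode("utf-8")).hexdigest()[:12]


def dashboard(title: str, panels: list) -> dict:
    """Build a dashboard model from (panel title, PromQL expression) pairs."""
    return {
        "uid": stable_uid(title),
        "title": title,
        "editable": False,  # discourage UI edits that would drift from the repo
        "time": {"from": "now-6h", "to": "now"},
        "panels": [
            {
                "id": i + 1,
                "type": "timeseries",
                "title": panel_title,
                # Two panels per row on Grafana's 24-column grid.
                "gridPos": {"h": 8, "w": 12, "x": 12 * (i % 2), "y": 8 * (i // 2)},
                "targets": [{"refId": "A", "expr": expr}],
            }
            for i, (panel_title, expr) in enumerate(panels)
        ],
    }


def dashboard_configmap(name: str, namespace: str, dash: dict) -> dict:
    """Wrap the dashboard in a ConfigMap for the Grafana sidecar (label is an assumption)."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": name,
            "namespace": namespace,
            "labels": {"grafana_dashboard": "1"},
        },
        # sort_keys keeps the output byte-stable, so git diffs show real changes only.
        "data": {f"{dash['uid']}.json": json.dumps(dash, indent=2, sort_keys=True)},
    }


if __name__ == "__main__":
    dash = dashboard("Checkout Overview", [
        ("Request rate", 'sum(rate(http_requests_total{service="checkout"}[5m]))'),
        ("Error rate", 'sum(rate(http_requests_total{service="checkout",code=~"5.."}[5m]))'),
    ])
    print(json.dumps(dashboard_configmap("checkout-dashboard", "monitoring", dash), indent=2))
```

Because the `uid` and key order are deterministic, the generated file can be committed and reviewed like any other change, and re-applying it updates the existing dashboard.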