Observability • 9 min read • Jan 2026 • 1.8K views

Kubernetes Observability Stack: Prometheus + Grafana (Done the Right Way)

A practical guide to deploying Prometheus and Grafana in Kubernetes with sane defaults: ServiceMonitor discovery, alerting, dashboards, retention, scaling, and the gotchas that cause noisy pages.

Kevalix Team
DevOps & Platform Engineering

Prometheus + Grafana is the most common "first observability stack" for Kubernetes, and it works extremely well when you set it up with clear goals: reliable scraping, consistent labels, actionable alerts, and predictable retention and costs. Here's the approach we use for production clusters.

1) Use the Prometheus Operator (kube-prometheus-stack)

In most clusters, the fastest path to a reliable baseline is the Prometheus Operator (via the kube-prometheus-stack Helm chart). You get Prometheus, Alertmanager, Grafana, and CRDs (ServiceMonitor/PodMonitor) built for Kubernetes-native discovery.

  • Prefer ServiceMonitor/PodMonitor over static scrape configs.
  • Ship with kube-state-metrics + node-exporter for core cluster signals.
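
To make the ServiceMonitor idea concrete, here is a minimal sketch that builds a ServiceMonitor manifest in Python and prints it as JSON, which `kubectl apply -f -` accepts. The service name `checkout`, the namespace `shop`, the port name `http-metrics`, and the `release: kube-prometheus-stack` label are all assumptions for illustration; the label in particular depends on how your Helm release is configured to select ServiceMonitors.

```python
import json


def service_monitor(name, namespace, app_label, port_name, interval="30s"):
    """Return a ServiceMonitor manifest (monitoring.coreos.com/v1) as a dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: the Prometheus instance selects ServiceMonitors
            # by this label. Adjust it to match your Helm release.
            "labels": {"release": "kube-prometheus-stack"},
        },
        "spec": {
            # Match Services carrying this label (hypothetical label value).
            "selector": {"matchLabels": {"app.kubernetes.io/name": app_label}},
            "namespaceSelector": {"matchNames": [namespace]},
            # "port" refers to a named port on the Service, not a number.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": interval}
            ],
        },
    }


if __name__ == "__main__":
    manifest = service_monitor("checkout", "shop", "checkout", "http-metrics")
    print(json.dumps(manifest, indent=2))
```

Saved as a script, the output can be piped straight into `kubectl apply -f -`. Generating the manifest from a function like this is one way to keep labels and scrape intervals consistent across services; a plain YAML file in Git works just as well.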

2) Alert on symptoms, not noise (SLO-first)

Define SLOs per service and alert on user-impacting symptoms. The goal is fewer alerts with clear action, not "more charts."

  • Golden signals: latency, traffic, errors, saturation.
  • Every alert should include a runbook link.
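
One way to turn an SLO into a symptom-based alert is an error-budget burn-rate rule. The sketch below builds a PrometheusRule manifest in Python and checks that every alert carries a runbook link. The metric `http_requests_total` with a `code` label, the service name `checkout`, the runbook URL, and the `release` label are assumptions. The 14.4 multiplier is the commonly cited fast-burn threshold for a 1-hour window against a 30-day SLO (it spends about 2% of the budget in one hour); treat it as a starting point rather than a fixed rule.

```python
import json

# Assumption: fast-burn multiplier for a 1h window against a 30-day SLO.
FAST_BURN_RATE = 14.4


def error_budget(slo_target):
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate_alert(service, slo_target, runbook_url, window="1h"):
    """Build one alerting rule that fires when the error ratio burns budget too fast."""
    threshold = FAST_BURN_RATE * error_budget(slo_target)
    # Hypothetical metric and label names; substitute your own.
    errors = f'sum(rate(http_requests_total{{job="{service}",code=~"5.."}}[{window}]))'
    total = f'sum(rate(http_requests_total{{job="{service}"}}[{window}]))'
    return {
        "alert": "HighErrorBudgetBurn",
        "expr": f"{errors} / {total} > {threshold:.5f}",
        "for": "2m",
        "labels": {"severity": "page", "service": service},
        "annotations": {
            "summary": f"{service} is burning its error budget too fast",
            "runbook_url": runbook_url,
        },
    }


def prometheus_rule(name, namespace, rules):
    """Wrap alerting rules in a PrometheusRule manifest (monitoring.coreos.com/v1)."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: rule selection by release label, as with ServiceMonitors.
            "labels": {"release": "kube-prometheus-stack"},
        },
        "spec": {"groups": [{"name": f"{name}.rules", "rules": rules}]},
    }


def missing_runbooks(rules):
    """Return the names of alerts that have no runbook_url annotation."""
    return [r["alert"] for r in rules
            if not r.get("annotations", {}).get("runbook_url")]


if __name__ == "__main__":
    rule = burn_rate_alert(
        "checkout", 0.999,
        "https://runbooks.example.com/checkout/error-budget-burn",
    )
    assert not missing_runbooks([rule])
    print(json.dumps(prometheus_rule("checkout-slo", "shop", [rule]), indent=2))
```

A check like `missing_runbooks` can run in CI so that an alert without a runbook never reaches production.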

3) Retention, storage, and scaling that won't surprise you

  • Set retention intentionally (e.g., 7–15 days of hot metrics).
  • Use persistent volumes for Prometheus and plan capacity (disk + memory).
  • Control cardinality early (avoid labels like request_id/user_id/path IDs).
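
Capacity planning and cardinality control can both be reasoned about with simple arithmetic. The sketch below uses the rule of thumb from the Prometheus storage documentation (disk is roughly retention seconds × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample) and a worst-case series estimate that multiplies label cardinalities together. The label names and counts are made-up inputs, and real series counts are usually lower than the worst case because not every label combination occurs.

```python
from math import prod

# Prometheus docs cite roughly 1-2 bytes per sample after compression;
# 2 is the conservative end of that range.
BYTES_PER_SAMPLE = 2


def series_count(label_cardinalities):
    """Worst-case series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def disk_bytes(active_series, scrape_interval_s, retention_days,
               bytes_per_sample=BYTES_PER_SAMPLE):
    """Estimate on-disk TSDB size for a given retention window."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical label sets for a single HTTP metric.
    safe = {"method": 5, "status": 6, "pod": 20}
    risky = {**safe, "user_id": 10_000}
    print("series without user_id:", series_count(safe))
    print("series with user_id:   ", series_count(risky))

    # Hypothetical cluster: 1M active series, 30s scrapes, 15 days retention.
    estimate = disk_bytes(1_000_000, 30, 15)
    print(f"estimated disk: {estimate / 1e9:.1f} GB")
```

With these inputs, adding a `user_id` label takes one metric from 600 to 6,000,000 potential series, and the hypothetical cluster needs roughly 86 GB of disk before headroom. Memory scales mainly with active series, so the same cardinality numbers drive both budgets.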

4) When you outgrow single-cluster Prometheus: remote-write

For multi-cluster, long-term retention, or global queries, keep Prometheus "hot + local" and remote-write to Thanos/Mimir/Cortex/VictoriaMetrics.

  • Use remote-write for long retention and cross-cluster dashboards.
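
As an illustration of the "hot + local, then remote-write" pattern, the sketch below generates a Helm values fragment for kube-prometheus-stack in Python and prints it as JSON (Helm reads JSON values files because JSON is valid YAML). The endpoint URL, the cluster name, and the dropped metric pattern are assumptions; the exact push path depends on which backend you choose.

```python
import json


def remote_write_values(cluster, url, retention="15d", drop_regex=None):
    """Build a kube-prometheus-stack values fragment with remote-write enabled."""
    remote = {"url": url}
    if drop_regex:
        # Drop noisy series before they leave the cluster to control cost.
        remote["writeRelabelConfigs"] = [
            {"sourceLabels": ["__name__"], "regex": drop_regex, "action": "drop"}
        ]
    return {
        "prometheus": {
            "prometheusSpec": {
                # Short local retention keeps the in-cluster Prometheus "hot".
                "retention": retention,
                # External labels let the remote backend tell clusters apart.
                "externalLabels": {"cluster": cluster},
                "remoteWrite": [remote],
            }
        }
    }


if __name__ == "__main__":
    values = remote_write_values(
        cluster="prod-eu-1",                              # hypothetical name
        url="https://metrics.example.com/api/v1/push",    # hypothetical endpoint
        drop_regex="go_gc_.*",
    )
    print(json.dumps(values, indent=2))
```

The output can be saved to a file and passed to `helm upgrade` with `-f`. Setting a distinct `cluster` external label per cluster is what makes cross-cluster dashboards and global queries possible on the remote side.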

5) Grafana: secure, repeatable, and drift-free

  • Enable auth (SSO if available) and restrict admin rights.
  • Provision dashboards/datasources as code to avoid click-ops drift.
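
Provisioning dashboards as code can be sketched as follows: a Python function wraps a dashboard JSON document in a ConfigMap that the Grafana dashboard sidecar shipped with kube-prometheus-stack can load. The `grafana_dashboard: "1"` label is the sidecar's usual default but is configurable, so treat it as an assumption. The dashboard itself (title, uid, the single panel, and the `http_requests_total` query) is a deliberately tiny, hypothetical example.

```python
import json


def dashboard_configmap(uid, title, namespace, panels):
    """Wrap a Grafana dashboard JSON document in a Kubernetes ConfigMap."""
    dashboard = {"uid": uid, "title": title, "panels": panels}
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"dashboard-{uid}",
            "namespace": namespace,
            # Assumption: the Grafana sidecar watches for this label.
            "labels": {"grafana_dashboard": "1"},
        },
        # One file per dashboard; the value is the dashboard JSON as a string.
        "data": {f"{uid}.json": json.dumps(dashboard, indent=2)},
    }


if __name__ == "__main__":
    panels = [
        {
            "type": "timeseries",
            "title": "Request rate",
            # Hypothetical metric name and job label.
            "targets": [
                {"expr": 'sum(rate(http_requests_total{job="checkout"}[5m]))'}
            ],
        }
    ]
    cm = dashboard_configmap("checkout-overview", "Checkout overview",
                             "monitoring", panels)
    print(json.dumps(cm, indent=2))
```

Keeping the dashboard source in Git and applying it through a ConfigMap means edits made in the Grafana UI do not silently become the source of truth, which is the drift this section is about.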
