Prometheus + Grafana is the most common "first observability stack" for Kubernetes, and it works well when you set it up with clear goals: reliable scraping, consistent labels, actionable alerts, and predictable retention and costs. Here's the approach we use for production clusters.

1) Use the Prometheus Operator (kube-prometheus-stack)

In most clusters, the fastest path to a reliable baseline is the Prometheus Operator (via the kube-prometheus-stack Helm chart). You get Prometheus, Alertmanager, Grafana, and CRDs (ServiceMonitor/PodMonitor) built for Kubernetes-native discovery.

✓ Prefer ServiceMonitor/PodMonitor over static scrape configs.
✓ Ship with kube-state-metrics + node-exporter for core cluster signals.
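
To make the ServiceMonitor point concrete, here is a minimal sketch that builds a ServiceMonitor manifest as a Python dict and prints it as JSON (which `kubectl apply -f` also accepts). The service name `checkout`, namespace `shop`, port name `http-metrics`, and the `release` label value are hypothetical; whether your operator needs a `release` label at all depends on how its ServiceMonitor selector is configured.

```python
import json


def service_monitor(name, namespace, app_label, port_name,
                    release="kube-prometheus-stack"):
    """Build a ServiceMonitor manifest as a plain dict."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "ServiceMonitor",
        "metadata": {
            "name": name,
            "namespace": namespace,
            # Assumption: Prometheus is configured to pick up ServiceMonitors
            # carrying this label; adjust or drop it to match your selector.
            "labels": {"release": release},
        },
        "spec": {
            # Matches Services (not Pods) that carry this label.
            "selector": {"matchLabels": {"app.kubernetes.io/name": app_label}},
            # "port" refers to the *named* port on the Service.
            "endpoints": [
                {"port": port_name, "path": "/metrics", "interval": "30s"}
            ],
        },
    }


# Hypothetical service used purely for illustration.
manifest = service_monitor("checkout", "shop", "checkout", "http-metrics")
print(json.dumps(manifest, indent=2))
```

Generating the manifest from one function keeps label and port naming consistent across services, which is most of what "reliable scraping" comes down to in practice.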

2) Alert on symptoms, not noise (SLO-first)

Define SLOs per service and alert on user-impacting symptoms. The goal is fewer alerts with clear action, not "more charts."

✓ Golden signals: latency, traffic, errors, saturation.
✓ Every alert should include a runbook link.
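
One common way to turn an SLO into a symptom-based alert is to alert on the error-budget burn rate rather than on raw error counts. The sketch below shows the arithmetic and a small check that flags alert rules without a runbook link. The rule names, PromQL expressions, recording-rule names, and runbook URL are all hypothetical; the 14.4 threshold is the widely used "2% of a 30-day budget in one hour" value, shown here as an example rather than a recommendation for your services.

```python
def burn_rate(error_ratio, slo_target):
    """Return how many times faster than 'exactly on budget' errors are occurring.

    A burn rate of 1.0 spends the whole error budget exactly over the SLO window.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def missing_runbooks(rules):
    """Return the names of alert rules that have no runbook_url annotation."""
    return [
        rule["alert"]
        for rule in rules
        if not rule.get("annotations", {}).get("runbook_url")
    ]


# Hypothetical rules; the expressions reference made-up recording rules.
rules = [
    {
        "alert": "CheckoutHighErrorBurn",
        "expr": 'job:http_errors:ratio_rate1h{job="checkout"} > 14.4 * 0.001',
        "annotations": {
            "runbook_url": "https://runbooks.example.com/checkout-error-burn"
        },
    },
    {
        "alert": "CheckoutLatencyHigh",
        "expr": 'job:http_latency:p99_5m{job="checkout"} > 0.5',
        "annotations": {},
    },
]

# 0.5% of requests failing against a 99.9% SLO burns budget about 5x too fast.
print(round(burn_rate(0.005, 0.999), 2))
print(missing_runbooks(rules))
```

A check like `missing_runbooks` can run in CI so that an alert without a clear action never reaches production.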

3) Retention, storage, and scaling that won't surprise you

✓ Set retention intentionally (e.g., 7–15 days of hot metrics).
✓ Use persistent volumes for Prometheus and plan capacity (disk + memory).
✓ Control cardinality early (avoid unbounded labels such as request_id, user_id, or URL paths that embed IDs).
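
For a rough sense of the numbers behind these three points, the sketch below applies the usual rule of thumb for Prometheus disk sizing (retention time × ingested samples per second × bytes per sample) and shows how one unbounded label multiplies the series count. The ingestion rate, the 2 bytes per sample, and the label value counts are illustrative assumptions; measure your own values before sizing a volume.

```python
import math


def disk_bytes(retention_days, samples_per_second, bytes_per_sample=2.0):
    """Rule-of-thumb disk estimate: retention x ingestion rate x sample size."""
    return retention_days * 86_400 * samples_per_second * bytes_per_sample


def series_count(label_value_counts):
    """Worst-case series for one metric: product of distinct values per label."""
    return math.prod(label_value_counts.values())


# Assumed workload: 100,000 samples/s kept for 15 days at ~2 bytes per sample.
estimate = disk_bytes(15, 100_000)
print(f"{estimate / 1e9:.1f} GB")

# The same metric with and without a per-user label (value counts are made up).
bounded = series_count({"method": 5, "status": 6, "route": 50})
unbounded = series_count(
    {"method": 5, "status": 6, "route": 50, "user_id": 10_000}
)
print(bounded, unbounded)
```

The estimate covers sample data only, so leave headroom for the write-ahead log and compaction. Memory tends to track the number of active series, which is why the cardinality line matters as much as the retention line.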

4) When you outgrow single-cluster Prometheus: remote-write

For multi-cluster, long-term retention, or global queries, keep Prometheus "hot + local" and remote-write to Thanos/Mimir/Cortex/VictoriaMetrics.

✓ Use remote-write for long retention and cross-cluster dashboards.
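
As a sketch of what "hot + local plus remote-write" can look like with kube-prometheus-stack, the snippet below builds the relevant part of a Helm values file: short local retention, a `cluster` external label so series from different clusters stay distinguishable, and one remote-write target. The cluster name and endpoint URL are placeholders, and the exact push path depends on the backend you choose.

```python
import json


def prometheus_values(cluster, remote_write_url, retention="15d"):
    """Build the prometheusSpec portion of kube-prometheus-stack Helm values."""
    return {
        "prometheus": {
            "prometheusSpec": {
                # Local "hot" retention; long-term data lives in the remote store.
                "retention": retention,
                # Attached to every series sent out, so global queries can
                # filter and group by cluster.
                "externalLabels": {"cluster": cluster},
                # Placeholder endpoint: substitute your backend's push URL.
                "remoteWrite": [{"url": remote_write_url}],
            }
        }
    }


values = prometheus_values(
    "prod-eu-1", "https://metrics.example.com/api/v1/push"
)
print(json.dumps(values, indent=2))
```

JSON is accepted wherever YAML is by most YAML parsers, so the output can typically be saved to a file and passed to `helm upgrade -f`. Giving every cluster a unique `cluster` label is the piece most often forgotten, and without it cross-cluster dashboards cannot tell sources apart.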

5) Grafana: secure, repeatable, and drift-free

✓ Enable auth (SSO if available) and restrict admin rights.
✓ Provision dashboards/datasources as code to avoid click-ops drift.
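
Provisioning as code can be as simple as a datasource file that Grafana loads from its provisioning directory at startup. The sketch below generates one such definition; the datasource name and the in-cluster Prometheus URL are assumptions for illustration, and the helper function itself is hypothetical.

```python
import json


def datasource_provisioning(name, url):
    """Build a Grafana datasource provisioning document (apiVersion 1)."""
    return {
        "apiVersion": 1,
        "datasources": [
            {
                "name": name,
                "type": "prometheus",
                # "proxy" means Grafana's backend makes the requests,
                # so browsers never talk to Prometheus directly.
                "access": "proxy",
                "url": url,
                "isDefault": True,
                # Prevents UI edits so the file stays the source of truth.
                "editable": False,
            }
        ],
    }


# Assumed in-cluster address; check the Service name in your own install.
doc = datasource_provisioning(
    "Prometheus", "http://prometheus-operated.monitoring.svc:9090"
)
print(json.dumps(doc, indent=2))
```

Setting `editable` to false is what keeps the datasource drift-free: any change has to go through the file, and therefore through review.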

Ready to Implement This?

Want this implemented fast without the usual alert noise and dashboard sprawl? We can deploy a production-ready Prometheus + Grafana baseline (with retention, labeling, and alert routing) and deliver a short SLO-first alert set tailored to your services.

Kubernetes Observability Stack: Prometheus + Grafana (Done the Right Way)