
Grafana HA & Database Backend Tuning Prompt

Run Grafana in HA: multiple stateless replicas, a shared database (PostgreSQL/MySQL), database-backed login sessions, and alerting HA between replicas.

Target user
Platform engineers operating Grafana at scale
Difficulty
Advanced
Tools
Claude, ChatGPT

The prompt

You are a senior platform engineer who has scaled Grafana — moved from SQLite to PostgreSQL/MySQL, deployed multiple replicas, tuned for many concurrent users.

I will provide:
- Current Grafana setup
- User load
- Symptom (slow, OOM, database errors)

Your job:

1. **HA basics**:
   - Multiple Grafana replicas
   - Shared DB (Postgres/MySQL; NOT SQLite)
   - Login sessions shared through the DB (auth tokens; current Grafana has no separate session store)
   - Stateless replicas
2. **For database choice**:
   - SQLite — single-replica, small
   - PostgreSQL — recommended HA
   - MySQL — also supported
3. **For DB connection pool**:
   - `max_open_conn`, `max_idle_conn`
   - Per-replica
   - Total = replicas × max_open
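4. **For sessions**:
   - Auth tokens stored in the shared DB (default); no separate session store to configure
   - `[remote_cache]` on Redis or Memcached for shared cache data
5. **For load balancer**:
   - Sticky sessions not required once all replicas share the DB
   - Health check on /api/health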
6. **For Alertmanager / alerting HA**:
   - Each Grafana replica evaluates every rule
   - Replicas share alert state by gossip between peers (`ha_peers` under `[unified_alerting]`), not via the DB
   - That shared state deduplicates notifications
7. **For caching**:
   - Query result cache (Enterprise)
   - Reduces DS load
8. **For monitoring Grafana itself**:
   - `/metrics` endpoint
   - Prometheus scrape

Mark DESTRUCTIVE: SQLite to PostgreSQL migration without backup, multiple replicas with non-shared DB (data inconsistency), upgrade without a DB backup (schema migrations run automatically at startup).

---

Current setup: [DESCRIBE]
User load: [DESCRIBE]
Symptom: [DESCRIBE]

Why this prompt works

Scaling Grafana means moving off its defaults: SQLite, a single instance, and a small connection pool. This prompt walks through each of those changes in order.

How to use it

  1. Back up the existing DB.
  2. Migrate to PostgreSQL.
  3. Add replicas behind LB.
  4. Tune connection pool.
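
The connection-pool step comes down to the arithmetic in the prompt: total connections = replicas × `max_open_conn`. A minimal sketch of that check follows. The function name `pool_budget` and the default of 10 connections reserved for other clients are assumptions for illustration, not Grafana or PostgreSQL tooling.

```shell
# pool_budget REPLICAS MAX_OPEN_CONN PG_MAX_CONNECTIONS [RESERVED]
# Prints the total connections Grafana may open and the remaining headroom.
# Returns 0 if the total fits under the usable limit, 1 otherwise.
# RESERVED (default 10, an assumed value) leaves room for admin and backup sessions.
pool_budget() {
  local replicas="$1" max_open="$2" pg_max="$3" reserved="${4:-10}"
  local total=$(( replicas * max_open ))
  local usable=$(( pg_max - reserved ))
  echo "total=${total} usable=${usable} headroom=$(( usable - total ))"
  [ "$total" -le "$usable" ]
}
```

For example, `pool_budget 3 100 100` reports a negative headroom, because three replicas at 100 connections each cannot fit under PostgreSQL's default `max_connections` of 100.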

Useful commands

# DB connection test
psql -h <db-host> -U grafana -d grafana -c '\dt'

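# Grafana metrics
curl -s http://grafana:3000/metrics | head

# Active connections (Postgres)
psql -h <db-host> -U grafana -d grafana -c "SELECT count(*) FROM pg_stat_activity WHERE datname='grafana';"

# Slow queries (needs the pg_stat_statements extension; the column is mean_time before PostgreSQL 13)
psql -h <db-host> -U grafana -d grafana -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"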

# Backup
pg_dump -h <host> -U grafana grafana > grafana-backup-$(date +%F).sql
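
Since the prompt treats a migration without a backup as destructive, it may be worth checking that the dump actually finished before going further. The sketch below assumes plain-format `pg_dump` output, which ends with a "PostgreSQL database dump complete" comment. The function name `verify_dump` is hypothetical.

```shell
# verify_dump FILE
# Succeeds only if FILE is non-empty and its tail contains the completion
# marker that plain-format pg_dump writes at the end of a finished dump.
# This does not prove the dump restores cleanly; it only catches truncated files.
verify_dump() {
  local file="$1"
  if [ ! -s "$file" ]; then
    echo "empty or missing: $file" >&2
    return 1
  fi
  if ! tail -n 10 "$file" | grep -q 'PostgreSQL database dump complete'; then
    echo "no completion marker: $file" >&2
    return 1
  fi
  echo "ok: $file"
}
```

A restore into a scratch database remains the stronger test; this check is only a cheap first gate.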

Config (PostgreSQL)

[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = $__file{/etc/grafana/db-password}
ssl_mode = require
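# Example values, set per replica: replicas × max_open_conn must stay below the server's max_connections
max_open_conn = 30
max_idle_conn = 10
conn_max_lifetime = 14400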

[remote_cache]
type = redis
connstr = addr=redis:6379,pool_size=100

# No [session] section: current Grafana stores login tokens in the [database] backend

[server]
root_url = https://grafana.example.com/

# Image rendering (if needed)
[rendering]
server_url = http://renderer:8081/render
callback_url = http://grafana:3000/

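# HA settings belong to unified alerting; the legacy [alerting] section has no ha_* options
[unified_alerting]
enabled = true
ha_peers = grafana-1:9094,grafana-2:9094
ha_listen_address = 0.0.0.0:9094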
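
The `ha_peers` value has to list every replica, so it is easy to forget when the replica count changes. A small sketch that builds the list is shown below. It assumes replicas are reachable as `NAME-1`, `NAME-2`, and so on, as in the config above. The function name `ha_peers` is hypothetical.

```shell
# ha_peers NAME COUNT [PORT]
# Prints a comma-separated peer list: NAME-1:PORT,NAME-2:PORT,...,NAME-COUNT:PORT
# PORT defaults to 9094, the port used in the config above.
ha_peers() {
  local name="$1" count="$2" port="${3:-9094}"
  local peers="" i=1
  while [ "$i" -le "$count" ]; do
    # Add a comma separator only when the list already has an entry.
    peers="${peers}${peers:+,}${name}-${i}:${port}"
    i=$(( i + 1 ))
  done
  echo "$peers"
}
```

Running `ha_peers grafana 2` reproduces the two-peer value shown in the config.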

Helm values (HA)

grafana:
  replicas: 3
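  # Spread replicas across nodes with a standard Kubernetes affinity block
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        - labelSelector:
            matchLabels:
              app.kubernetes.io/name: grafana
          topologyKey: kubernetes.io/hostname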

  grafana.ini:
    database:
      type: postgres
      host: postgres:5432
      name: grafana
      user: grafana
      password: $__file{/etc/secrets/db-password}
    remote_cache:
      type: redis
      connstr: addr=redis:6379

  persistence:
    enabled: false                  # no PV needed with HA + external DB

  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      memory: 1Gi
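
Once several replicas are running, each one should pass the `/api/health` check by itself, not only through the load balancer. The sketch below assumes the endpoint returns a JSON body containing `"database": "ok"` when the replica can reach its database. The replica URLs and both function names are placeholders.

```shell
# health_ok
# Reads an /api/health response body on stdin.
# Succeeds if it reports the database as "ok".
health_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# check_replicas URL...
# Probes each replica directly (needs curl and network access).
# Returns 1 if any replica is unreachable or unhealthy.
check_replicas() {
  local url rc=0
  for url in "$@"; do
    if curl -fsS --max-time 5 "${url}/api/health" | health_ok; then
      echo "healthy: ${url}"
    else
      echo "UNHEALTHY: ${url}"
      rc=1
    fi
  done
  return "$rc"
}
```

A call might look like `check_replicas http://grafana-1:3000 http://grafana-2:3000`, using whatever addresses the replicas expose in your environment.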

Common findings this catches

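  • HA replicas with SQLite → each replica keeps its own state, or a shared file gets corrupted; move to PostgreSQL/MySQL.
  • DB connection pool too small → requests queue while waiting for a connection.
  • Replicas not sharing one DB → users are logged out on replica change.
  • No image renderer → rendered panels in reports and alert notifications fail.
  • DB CPU spikes → find the slow queries; add indexes.
  • Alerts firing twice → alerting HA peers not configured or not reachable.
  • Slow startup → DB migrations run at start; expect this after upgrades.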
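
Several of these findings show up first in Grafana's own `/metrics` output. A minimal sketch for pulling one unlabelled value out of that output follows. The function name `metric_value` is hypothetical, and `process_resident_memory_bytes` is used only as an example of a standard process metric.

```shell
# metric_value NAME
# Reads Prometheus text-format metrics on stdin.
# Prints the value of the first unlabelled sample named exactly NAME.
# Exits non-zero if no such sample exists.
# Comment lines (# HELP, # TYPE) never match because their first field is "#".
metric_value() {
  awk -v name="$1" '
    $1 == name { print $2; found = 1; exit }
    END { exit !found }
  '
}
```

It can be fed from the metrics command above, for example `curl -s http://grafana:3000/metrics | metric_value process_resident_memory_bytes`, and the result compared against the pod's memory limit.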

When to escalate

  • Major migration to HA: plan a staged rollout.
  • DB infrastructure changes: involve a DBA.
  • Monitoring at scale: do capacity planning.
