Grafana HA & Database Backend Tuning Prompt
Run Grafana in HA — multiple replicas, shared database (PostgreSQL/MySQL), session storage, Alertmanager cluster integration.
Target user: Platform engineers operating Grafana at scale
Difficulty: Advanced
Tools: Claude, ChatGPT
The prompt
You are a senior platform engineer who has scaled Grafana: moved from SQLite to PostgreSQL/MySQL, deployed multiple replicas, tuned for many concurrent users.

I will provide:
- Current Grafana setup
- User load
- Symptom (slow, OOM, database errors)

Your job:

1. **HA basics**:
   - Multiple Grafana replicas
   - Shared DB (Postgres/MySQL; NOT SQLite)
   - Shared session storage
   - Stateless replicas

2. **For database choice**:
   - SQLite: single replica, small installs
   - PostgreSQL: recommended for HA
   - MySQL: also supported

3. **For DB connection pool**:
   - `max_open_conn`, `max_idle_conn`
   - Set per replica
   - Total = replicas × `max_open_conn`

4. **For sessions**:
   - DB-backed login tokens (default since Grafana 6.0)
   - Redis for higher perf

5. **For load balancer**:
   - Sticky sessions OR shared sessions (not needed when tokens live in the shared DB)
   - Health check on /api/health

6. **For Alertmanager / alerting HA**:
   - Each Grafana replica evaluates rules
   - Coordination between replicas (shared DB; `ha_peers` gossip in unified alerting)
   - Deduplication to prevent duplicate notifications

7. **For caching**:
   - Query result cache (Enterprise)
   - Reduces data source load

8. **For monitoring Grafana itself**:
   - `/metrics` endpoint
   - Prometheus scrape

Mark DESTRUCTIVE: SQLite to PostgreSQL migration without backup, multiple replicas with non-shared DB (data inconsistency), upgrade without DB migration step.

---

Current setup: [DESCRIBE]
User load: [DESCRIBE]
Symptom: [DESCRIBE]
Why this prompt works
Scaling Grafana means moving off its defaults: SQLite, a single instance, and an untuned connection pool. This prompt walks through each of those changes in turn.
How to use it
- Back up the database before migrating, and keep backing it up afterwards.
- Migrate to PostgreSQL.
- Add replicas behind a load balancer.
- Tune the connection pool.
Useful commands
# DB connection test (lists Grafana's tables)
psql -h <db-host> -U grafana -d grafana -c '\dt'
# Grafana metrics
curl -s http://grafana:3000/metrics | head
# Active connections (Postgres)
psql -h <db-host> -U grafana -d grafana -c "SELECT count(*) FROM pg_stat_activity WHERE datname='grafana';"
# Slow queries (needs the pg_stat_statements extension; the column is mean_time before PostgreSQL 13)
psql -h <db-host> -U grafana -d grafana -c "SELECT query, calls, mean_exec_time FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;"
# Backup
pg_dump -h <db-host> -U grafana grafana > grafana-backup-$(date +%F).sql
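The "Total = replicas × max_open" rule from the prompt can be checked before rollout. The following is a minimal sketch, not part of Grafana or PostgreSQL: `pool_total` and `pool_fits` are hypothetical helper names, and the `reserved` headroom of 10 connections for admin and other clients is an assumption to adjust.

```shell
#!/usr/bin/env bash
# Sketch: does replicas x max_open_conn fit under the Postgres connection limit?
# The helper names and the default headroom are illustrative assumptions.

# usage: pool_total <replicas> <max_open_conn>
pool_total() {
  echo $(( $1 * $2 ))
}

# usage: pool_fits <replicas> <max_open_conn> <pg_max_connections> [reserved]
# Succeeds when the combined pools leave at least <reserved> connections free.
pool_fits() {
  local total reserved
  total=$(pool_total "$1" "$2")
  reserved=${4:-10}
  [ "$total" -le $(( $3 - reserved )) ]
}

# Example: 3 replicas at max_open_conn = 100 against a server limit of 100
# (100 is the stock PostgreSQL max_connections).
if pool_fits 3 100 100; then
  echo "pool budget OK"
else
  echo "pool budget exceeds max_connections: lower max_open_conn or raise the server limit"
fi
```

The server limit itself can be read with `psql -h <db-host> -U grafana -d grafana -c "SHOW max_connections;"`.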
Config (PostgreSQL)
[database]
type = postgres
host = postgres.example.com:5432
name = grafana
user = grafana
password = $__file{/etc/grafana/db-password}
ssl_mode = require
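# Pool limits apply per replica: total connections = replicas × max_open_conn,
# which must stay under the database server's max_connections.
max_open_conn = 100
max_idle_conn = 100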
conn_max_lifetime = 14400
[remote_cache]
type = redis
connstr = addr=redis:6379,pool_size=100
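# Only read by Grafana versions before 6.0; later versions keep login tokens in the shared database.
[session]
provider = redis
provider_config = addr=redis:6379,pool_size=100,db=2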
[server]
root_url = https://grafana.example.com/
# Image rendering (if needed)
[rendering]
server_url = http://renderer:8081/render
callback_url = http://grafana:3000/
[unified_alerting]
ha_peers = grafana-1:9094,grafana-2:9094
ha_listen_address = 0.0.0.0:9094
Helm values (HA)
grafana:
  replicas: 3
  podAntiAffinity: hard
  grafana.ini:
    database:
      type: postgres
      host: postgres:5432
      name: grafana
      user: grafana
      password: $__file{/etc/secrets/db-password}
    session:
      provider: redis
      provider_config: addr=redis:6379
  persistence:
    enabled: false # no PV needed with HA + external DB
  resources:
    requests:
      cpu: 200m
      memory: 512Mi
    limits:
      memory: 1Gi
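Once the replicas are running, the load balancer health check on /api/health can be reproduced by hand against each replica. This is a rough sketch: `health_db_ok` and `check_replicas` are hypothetical helper names, the replica URLs in the usage note are placeholders, and the check assumes the response body carries a `"database": "ok"` field when the replica can reach its database.

```shell
#!/usr/bin/env bash
# Sketch: probe /api/health on every Grafana replica.
# Helper names and replica URLs are illustrative assumptions.

# Reads an /api/health JSON body on stdin; succeeds only if "database" is "ok".
health_db_ok() {
  grep -Eq '"database"[[:space:]]*:[[:space:]]*"ok"'
}

# usage: check_replicas <base-url>...
# Prints one OK/FAIL line per replica and returns non-zero if any replica failed.
check_replicas() {
  local url rc=0
  for url in "$@"; do
    # curl -f turns HTTP errors into an empty body, which health_db_ok rejects.
    if curl -fsS --max-time 5 "$url/api/health" 2>/dev/null | health_db_ok; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return $rc
}
```

Example call, with placeholder hostnames: `check_replicas http://grafana-0:3000 http://grafana-1:3000 http://grafana-2:3000`.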
Common findings this catches
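- HA replicas on SQLite → each replica sees different data, or the file corrupts and locks when placed on shared storage.
- DB connection pool too small → requests wait for a free connection.
- Sessions not shared between replicas → users logged out when the load balancer switches replica.
- No image renderer → reports fail.
- DB CPU spikes → find the slow queries, tune them, add indexes.
- Alerts firing twice → alerting HA not coordinating; check `ha_peers` and that port 9094 is reachable between replicas.
- Slow startup → database migrations running on every restart.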
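Pool pressure usually shows up in Grafana's own `/metrics` output before users notice it. The sketch below pulls a single value out of Prometheus text-format output; `metric_value` is a hypothetical helper, and the metric names in the usage note are assumptions to verify against your Grafana version's actual `/metrics` output.

```shell
#!/usr/bin/env bash
# Sketch: extract one un-labelled sample from Prometheus text exposition.
# The helper name is illustrative; it is not a Grafana or Prometheus tool.

# usage: <metrics text on stdin> | metric_value <metric_name>
# Prints the value of the first sample whose name matches exactly.
# "# HELP" / "# TYPE" lines and labelled samples (name{...}) are skipped
# because their first field differs from the bare metric name.
metric_value() {
  awk -v name="$1" '$1 == name { print $2; exit }'
}
```

Example call, assuming the metric exists under this name: `curl -s http://grafana:3000/metrics | metric_value grafana_database_conn_wait_count_total`. A steadily rising wait count alongside an in-use figure close to `max_open_conn` would point at the "pool too small" finding.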
When to escalate
- Major migration to HA: plan a staged rollout rather than a single cutover.
- DB infrastructure changes: involve a DBA.
- Monitoring at scale: involve capacity planning.
Related prompts
- Grafana Dashboard Performance Prompt: Optimize Grafana dashboards (query parallelism, refresh rates, variable design, panel count, data source pressure).
- Grafana Version Upgrade & Migration Prompt: Upgrade Grafana major versions (DB migrations, plugin compatibility, deprecated features, alert migration).
- Prometheus HA & Deduplication Prompt: Run Prometheus in HA with paired servers, deduplication strategies (Thanos query, Alertmanager cluster, federation), and failover.