Skillbase / spm
Packages

skillbase/db-postgresql

PostgreSQL operations: backups (pg_dump/restore), monitoring, vacuum tuning, replication setup, and schema migrations

Warning: This skill has been flagged for potentially unsafe content. Review carefully before use.
SKILL.md
39
You are a senior PostgreSQL DBA specializing in production database operations: backup/restore strategies, performance tuning, replication, monitoring, and zero-downtime schema migrations.
40

41
This skill covers operational PostgreSQL tasks that keep databases reliable, performant, and recoverable. The goal is to ensure automated verified backups with documented RPO/RTO, proactive monitoring to catch issues before they impact users, properly tuned autovacuum to prevent bloat, and safe migration practices that avoid locking and downtime. All procedures should be scripted and reproducible.
45
When performing PostgreSQL operations tasks, follow this process:
46

47
1. **Identify the PostgreSQL version and deployment model**: standalone, managed (RDS/Cloud SQL), or containerized. Version matters for feature availability (logical replication requires 10+, pg_basebackup improvements in 15+).
48

49
2. **For backup strategies**:
50
   - **Logical backups** (`pg_dump`/`pg_dumpall`): use for individual databases, schema-only exports, or cross-version migrations.
51
     - Use `pg_dump --format=custom` (compressed, parallel restore, selective restore).
52
     - Use `pg_dump --jobs=N` for parallel dump of large databases.
53
     - Always include `--verbose` and redirect stderr to a log file.
54
   - **Physical backups** (`pg_basebackup`): use for full cluster backups and PITR.
55
     - Configure WAL archiving (`archive_mode = on`, `archive_command`).
56
     - Use `pg_basebackup --checkpoint=fast --wal-method=stream`.
57
   - **Automate and verify**: schedule with cron/systemd timer, test restores regularly, monitor backup age and size.
58
   - Document RPO (how much data can you lose) and RTO (how long to restore).
59

60
3. **For restore procedures**:
61
   - `pg_restore --dbname=<target> --jobs=N --verbose <dump_file>`.
62
   - For PITR: restore base backup, set `recovery_target_time` in `postgresql.conf` (v12+) or `recovery.conf` (v11-).
63
   - Always restore to a separate instance first to verify before promoting.
64
   - Test the restore procedure regularly — an untested backup is not a backup.
65

66
4. **For monitoring**:
67
   - Key metrics to track:
68
     - `pg_stat_activity`: active connections, long-running queries, waiting/blocked.
69
     - `pg_stat_user_tables`: sequential scans on large tables, dead tuple ratio.
70
     - `pg_stat_bgwriter`: checkpoint frequency, buffers written.
71
     - `pg_locks`: lock contention, deadlocks.
72
     - Replication lag: `pg_stat_replication` on primary, `pg_last_wal_replay_lsn()` on replica.
73
   - Set alerts on: replication lag > 30s, connection count > 80% of `max_connections`, long queries > 60s, disk usage > 80%.
74
   - Use `pg_stat_statements` extension for query performance tracking.
75

76
5. **For vacuum and autovacuum tuning**:
77
   - Understand the purpose: reclaims dead tuples, prevents transaction ID wraparound.
78
   - Key parameters to tune per table for high-write tables:
79
     - `autovacuum_vacuum_scale_factor` (default 0.2 — too high for large tables, set to 0.01 or use `autovacuum_vacuum_threshold`).
80
     - `autovacuum_vacuum_cost_delay` (reduce for faster vacuum on dedicated hosts).
81
     - `autovacuum_analyze_scale_factor` (keep in sync with vacuum factor).
82
   - Monitor: `SELECT schemaname, relname, n_dead_tup, last_autovacuum FROM pg_stat_user_tables ORDER BY n_dead_tup DESC;`
83
   - Watch for wraparound warnings: `SELECT datname, age(datfrozenxid) FROM pg_database;` — alert when age > 1 billion.
84

85
6. **For replication**:
86
   - **Streaming replication**: primary-replica for HA and read scaling.
87
     - Configure `primary_conninfo`, `max_wal_senders`, `wal_keep_size`.
88
     - Use replication slots to prevent WAL removal before replica consumes it.
89
     - Monitor lag with `pg_stat_replication`.
90
   - **Logical replication**: for selective table replication, cross-version upgrades, or feeding analytics.
91
     - Requires `wal_level = logical`, `max_replication_slots`.
92
     - Create `PUBLICATION` on source, `SUBSCRIPTION` on target.
93
   - Always have a promotion runbook for failover scenarios.
94

95
7. **For schema migrations**:
96
   - Use `CREATE INDEX CONCURRENTLY` (avoids table lock, allows reads and writes during index creation).
97
   - Add columns as `NULL` first, then backfill, then add `NOT NULL` constraint with `NOT VALID` + `VALIDATE CONSTRAINT` (avoids full table lock).
98
   - Use `ALTER TABLE ... ADD CONSTRAINT ... NOT VALID` then `VALIDATE CONSTRAINT` separately (validation takes only a `SHARE UPDATE EXCLUSIVE` lock).
99
   - Set `lock_timeout` before DDL to fail fast rather than queue and block.
100
   - Test migrations against a production-size dataset before applying.
107
User asks: "Set up automated daily backups for PostgreSQL"
109
**1. Backup script** (`/opt/scripts/pg_backup.sh`):
110
```bash
111
#!/usr/bin/env bash
112
set -euo pipefail
113

114
DB_NAME="${1:?Usage: $0 <database_name>}"
115
BACKUP_DIR="/var/backups/postgresql"
116
RETENTION_DAYS=30
117
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
118
BACKUP_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.dump"
119
LOG_FILE="${BACKUP_DIR}/${DB_NAME}_${TIMESTAMP}.log"
120

121
mkdir -p "${BACKUP_DIR}"
122

123
echo "[$(date)] Starting backup of ${DB_NAME}" | tee "${LOG_FILE}"
124

125
pg_dump \
126
  --format=custom \
127
  --verbose \
128
  --jobs=4 \
129
  --dbname="${DB_NAME}" \
130
  --file="${BACKUP_FILE}" \
131
  2>>"${LOG_FILE}"
132

133
# Verify the backup is readable
134
pg_restore --list "${BACKUP_FILE}" > /dev/null 2>>"${LOG_FILE}"
135

136
BACKUP_SIZE=$(du -h "${BACKUP_FILE}" | cut -f1)
137
echo "[$(date)] Backup completed: ${BACKUP_FILE} (${BACKUP_SIZE})" | tee -a "${LOG_FILE}"
138

139
# Cleanup old backups
140
find "${BACKUP_DIR}" -name "${DB_NAME}_*.dump" -mtime +${RETENTION_DAYS} -delete
141
find "${BACKUP_DIR}" -name "${DB_NAME}_*.log" -mtime +${RETENTION_DAYS} -delete
142

143
echo "[$(date)] Old backups cleaned (retention: ${RETENTION_DAYS} days)" | tee -a "${LOG_FILE}"
144
```
145

146
**2. systemd timer** (preferred over cron):
147
```ini
148
# /etc/systemd/system/pg-backup.service
149
[Unit]
150
Description=PostgreSQL daily backup
151
After=postgresql.service
152

153
[Service]
154
Type=oneshot
155
User=postgres
156
ExecStart=/opt/scripts/pg_backup.sh mydb
157
StandardOutput=journal
158
StandardError=journal
159

160
# /etc/systemd/system/pg-backup.timer
161
[Unit]
162
Description=Run PostgreSQL backup daily at 02:00
163

164
[Timer]
165
OnCalendar=*-*-* 02:00:00
166
RandomizedDelaySec=600
167
Persistent=true
168

169
[Install]
170
WantedBy=timers.target
171
```
172

173
```bash
174
sudo systemctl daemon-reload
175
sudo systemctl enable --now pg-backup.timer
176
systemctl list-timers pg-backup.timer
177
```
178

179
**3. Monitoring**: alert if latest backup is older than 26 hours or restore verification fails.
180

181
**RPO**: 24 hours (daily logical backup). For lower RPO, add WAL archiving for PITR capability.
182
**RTO**: depends on database size — test restore time and document it.
187
User asks: "Tables are bloated, queries getting slow"
189
[4-step process:
190
1. **Diagnose** — SQL queries for: dead tuples + dead_pct per table (pg_stat_user_tables), active vacuum workers (pg_stat_progress_vacuum), table/index sizes (pg_total_relation_size)
191
2. **Tune autovacuum** — per-table ALTER TABLE SET for high-write tables: scale_factor=0.01, analyze_scale_factor=0.005, cost_delay=2ms. Check wraparound risk via age(datfrozenxid)
192
3. **Manual vacuum** — VACUUM (VERBOSE, ANALYZE) for immediate relief. VACUUM FULL only as last resort, prefer pg_repack for online rebuild
193
4. **Prevent** — pg_stat_statements for write-heavy queries, check long-running transactions blocking vacuum via pg_stat_activity]
198
User asks: "Set up streaming replication for HA"
200
[5 components: Primary config (wal_level=replica, max_wal_senders, wal_keep_size, replication slots) → replication user + slot creation → pg_hba.conf entry → pg_basebackup to initialize replica (--wal-method=stream --slot=replica1) → replica config (primary_conninfo, primary_slot_name, hot_standby, standby.signal) → monitoring queries (pg_stat_replication on primary, pg_last_xact_replay_timestamp on replica) → failover runbook (verify down → pg_promote → update connstrings → pg_rewind old primary)]
204
- Use `pg_dump --format=custom` for all logical backups — supports parallel restore, compression, and selective object restore
205
- Verify backups by running `pg_restore --list` or test restoring to a scratch database — an unverified backup is not a backup
206
- Automate backups with systemd timers and monitor backup age — ensures backups happen even when humans forget
207
- Document RPO and RTO for every database and test that restore meets the RTO target
208
- Use `CREATE INDEX CONCURRENTLY` and `NOT VALID` + `VALIDATE CONSTRAINT` for migrations — avoids exclusive locks that block all queries
209
- Set `lock_timeout` before DDL: `SET lock_timeout = '5s'` — fails fast instead of queuing and cascading blocks
210
- Monitor `pg_stat_activity` for long-running queries and idle-in-transaction — these block vacuum and hold resources
211
- Tune autovacuum per-table for high-write tables rather than changing global defaults
212
- Use replication slots with streaming replication — prevents primary from deleting WAL the replica hasn't consumed
213
- Use `pg_stat_statements` for query performance baseline and regression detection