Runbooks

Step-by-step procedures for common operational tasks and incident response.

Contents

  1. Emergency Access - Regaining access in emergencies
  2. Database Failover - PostgreSQL failover procedures
  3. Cluster Recovery - Kubernetes cluster recovery
  4. Secret Rotation - Rotating secrets and credentials
  5. Incident Response - Handling incidents

Runbook Format

Each runbook follows this structure:

  1. Overview - What the runbook covers
  2. Prerequisites - What you need before starting
  3. Procedure - Step-by-step instructions
  4. Verification - How to verify success
  5. Rollback - How to undo if needed
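
As a rough illustration, a runbook draft could be checked against this structure with a small script. This is only a sketch: the helper name `missing_sections` and the assumption that each section heading sits on a line of its own are hypothetical and not part of the runbooks themselves.

```python
from typing import List

# The five sections every runbook is expected to contain, in order.
REQUIRED_SECTIONS = [
    "Overview",
    "Prerequisites",
    "Procedure",
    "Verification",
    "Rollback",
]


def missing_sections(text: str) -> List[str]:
    """Return the required section headings that do not appear in `text`.

    Assumption: each heading is written on a line of its own.
    """
    headings = {line.strip() for line in text.splitlines()}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


if __name__ == "__main__":
    draft = "Overview\nWhat this covers.\n\nProcedure\n1. Do the thing.\n"
    print(missing_sections(draft))
```

A check like this would flag a draft that lacks, for example, a Rollback section before it is published.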

Severity Levels

Level  Description                 Response Time
P1     Critical - Complete outage  Immediate
P2     Major - Partial outage      < 30 min
P3     Minor - Degraded service    < 4 hours
P4     Low - Cosmetic issues       Next business day
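
If alerting or ticketing tooling needs these levels in machine-readable form, they could be encoded roughly as follows. This is a minimal sketch, not an existing module: the names `Severity`, `RESPONSE_TARGETS`, and `response_target` are hypothetical, "Immediate" is modelled as a zero-length target, and "Next business day" is left as `None` because it is not a fixed duration.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Severity levels with their descriptions."""
    P1 = "Critical - Complete outage"
    P2 = "Major - Partial outage"
    P3 = "Minor - Degraded service"
    P4 = "Low - Cosmetic issues"


# Response-time targets per level.
# Assumption: "Immediate" is represented as timedelta(0), and
# "Next business day" as None, since it depends on the calendar.
RESPONSE_TARGETS = {
    Severity.P1: timedelta(0),
    Severity.P2: timedelta(minutes=30),
    Severity.P3: timedelta(hours=4),
    Severity.P4: None,
}


def response_target(level: str) -> Optional[timedelta]:
    """Look up the response-time target for a level name such as 'P2'."""
    return RESPONSE_TARGETS[Severity[level.upper()]]


if __name__ == "__main__":
    for severity in Severity:
        print(severity.name, severity.value, RESPONSE_TARGETS[severity])
```

Keeping the descriptions and targets in one place means that a change to a response time only has to be made once.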