Runbooks¶

Runbooks document operational procedures for maintaining and troubleshooting systems. They provide step-by-step guidance for common tasks, incident response, and system recovery. Good runbooks enable consistent operations and reduce mean time to recovery during incidents.

Runbook Purpose¶

Runbooks help operations teams:

Respond to incidents consistently
Perform routine maintenance safely
Onboard new team members
Reduce dependency on tribal knowledge
Meet compliance requirements

Runbook Types¶

Incident Response Runbooks¶

For handling alerts and outages:

# Runbook: High CPU Alert

## Alert Details
- **Alert Name**: High CPU Usage
- **Threshold**: CPU > 90% for 5 minutes
- **Severity**: High
- **Escalation**: Page on-call engineer

## Initial Assessment

1. Check if alert is real or false positive
   ```bash
   ssh prod-server
   top -bn1 | head -20
   ```

2. Identify the process consuming CPU
   ```bash
   ps aux --sort=-%cpu | head -10
   ```

3. Check if this correlates with traffic
   ```bash
   curl -s localhost:9090/metrics | grep requests_total
   ```

## Common Causes and Solutions

### Cause: Traffic Spike

**Symptoms**: Request rate significantly above normal

**Actions**:
1. Verify this is legitimate traffic (not attack)
2. Scale horizontally if possible
   ```bash
   kubectl scale deployment api --replicas=5
   ```
3. Enable rate limiting if needed

### Cause: Runaway Process

**Symptoms**: Single process consuming most CPU

**Actions**:
1. Identify the process and its purpose
2. Check logs for errors
   ```bash
   journalctl -u myservice -n 100
   ```
3. Restart if safe
   ```bash
   systemctl restart myservice
   ```

### Cause: Memory Pressure

**Symptoms**: OOM killer active, swap usage high

**Actions**:
1. Check memory usage
   ```bash
   free -h
   ```
2. Identify memory-hungry processes
3. Restart affected services

## Escalation

If unable to resolve within 15 minutes:
1. Page secondary on-call
2. Create incident channel: #incident-YYYYMMDD
3. Notify stakeholders via status page

Maintenance Runbooks¶

For routine operations:

# Runbook: Database Backup Verification

## Purpose
Weekly verification that database backups are completing
and restorable.

## Schedule
Every Monday, 10:00 AM

## Prerequisites
- Access to backup storage
- Restore environment available
- 2 hours of uninterrupted time

## Procedure

### 1. Verify Backup Completion

Check that nightly backup completed:

```bash
aws s3 ls s3://backups/database/$(date -d yesterday +%Y%m%d)/

Expected: backup file present, size > 1GB

2. Download Latest Backup¶

aws s3 cp s3://backups/database/$(date -d yesterday +%Y%m%d)/db.sql.gz ./

3. Restore to Test Environment¶

gunzip -c db.sql.gz | psql -h test-db -U admin testdb

Expected: No errors, restore completes

4. Verify Data Integrity¶

Run validation queries:

-- Check record counts
SELECT COUNT(*) FROM users;  -- Expected: > 10000
SELECT COUNT(*) FROM orders; -- Expected: > 50000

-- Check recent data
SELECT MAX(created_at) FROM orders;
-- Expected: Yesterday's date

5. Document Results¶

Record in backup log: - Backup date - File size - Restore time - Verification status - Any issues encountered

6. Clean Up¶

rm db.sql.gz
psql -h test-db -U admin -c "DROP DATABASE testdb; CREATE DATABASE testdb;"

Troubleshooting¶

Backup missing¶

Check backup job logs
Verify storage permissions
Escalate if backup job failed

Restore fails¶

Check disk space on restore target
Verify database permissions

Check for corruption in backup file

### Deployment Runbooks

For release procedures:

```markdown
# Runbook: Production Deployment

## Pre-Deployment Checklist

- [ ] All tests passing in CI
- [ ] Code review completed
- [ ] Changelog updated
- [ ] Stakeholders notified
- [ ] Rollback plan reviewed

## Deployment Steps

### 1. Notify Team

Post in #deployments:
> Starting production deployment of v1.2.3

### 2. Create Deployment Snapshot

```bash
kubectl get deployment api -o yaml > pre-deploy-snapshot.yaml

3. Deploy¶

kubectl set image deployment/api api=myapp:v1.2.3

4. Monitor Rollout¶

kubectl rollout status deployment/api

Wait for "successfully rolled out" message.

5. Verify Deployment¶

Run smoke tests:

./scripts/smoke-test.sh production

Check key metrics: - Error rate: Should be < 0.1% - Latency p99: Should be < 500ms - Request rate: Should be stable

6. Announce Completion¶

Post in #deployments:

Production deployment of v1.2.3 complete ✓

Rollback Procedure¶

If issues detected:

Immediate Rollback¶

kubectl rollout undo deployment/api

Restore from Snapshot¶

If rollback fails:

kubectl apply -f pre-deploy-snapshot.yaml

Post-Rollback¶

Notify team of rollback
Create incident report

Investigate root cause

## Writing Effective Runbooks

### Be Specific

Provide exact commands:

**Vague:**
> Check the logs for errors

**Specific:**
> Check application logs:
> ```bash
> kubectl logs -l app=api --tail=100 | grep ERROR
> ```

### Include Expected Output

Show what success looks like:

```markdown
Run health check:
```bash
curl http://localhost:8080/health

Expected output:

{"status": "healthy", "version": "1.2.3"}

If status is not "healthy", proceed to troubleshooting.

### Provide Context

Explain why, not just what:

```markdown
### 2. Enable Maintenance Mode

Before proceeding, enable maintenance mode to prevent data
corruption during the migration:

```bash
./scripts/maintenance-mode.sh enable

This redirects traffic to a static page and blocks API writes.

### Cover Failure Cases

Document what to do when things go wrong:

```markdown
## Troubleshooting

### Command hangs

If the migration command hangs for more than 10 minutes:

1. Do NOT kill the process (may cause corruption)
2. Check database locks:
   ```sql
   SELECT * FROM pg_locks WHERE NOT granted;
   ```
3. Contact DBA for assessment

Include Contact Information¶

Provide escalation paths:

## Escalation

| Issue | Contact | Method |
|-------|---------|--------|
| Database | DBA Team | #dba-oncall |
| Infrastructure | Platform Team | Page via PagerDuty |
| Business Impact | Product Manager | Slack @pm-oncall |

Runbook Maintenance¶

Keep Current¶

Runbooks must reflect current systems:

Update when systems change
Review after incidents
Test procedures periodically
Remove obsolete content

Version Control¶

Track runbook changes:

Use git for runbook storage
Require review for changes
Link to related system changes

Test Runbooks¶

Verify procedures work:

Run through procedures regularly
Use game days to test incident runbooks
Update based on findings

Summary¶

Runbooks guide operational procedures:

Provide specific, copy-paste commands
Include expected outputs
Cover failure scenarios
Keep procedures current
Test regularly

Good runbooks reduce incident response time and ensure consistent operations.

Next: Internal Documentation covers team-facing documentation.