DevOps Documentation

DevOps documentation bridges development and operations. It covers infrastructure, deployment processes, monitoring, and incident response. Good DevOps docs help teams deploy confidently and respond to issues quickly.

Types of DevOps Documentation

Infrastructure Documentation

Document your infrastructure:

# Infrastructure Overview

## Architecture

[Diagram of infrastructure components]

## Components

### Web Servers
- **Type**: AWS EC2 t3.large
- **Count**: 3 (auto-scaling 3-10)
- **Region**: us-east-1
- **AMI**: ami-0123456789

### Database
- **Type**: AWS RDS PostgreSQL 15
- **Instance**: db.r6g.xlarge
- **Storage**: 500GB SSD
- **Multi-AZ**: Yes

### Cache
- **Type**: AWS ElastiCache Redis
- **Node**: cache.r6g.large
- **Cluster**: 3 nodes

Runbooks

Step-by-step operational procedures:

# Runbook: Database Failover

## When to Use
Use this runbook when the primary database is unresponsive
and automatic failover hasn't occurred.

## Prerequisites
- AWS console access
- Database admin credentials
- PagerDuty access to update incident

## Steps

### 1. Verify Primary is Down
aws rds describe-db-instances --db-instance-identifier prod-db

Check `DBInstanceStatus` in the output. If it is anything other than "available", proceed.

### 2. Initiate Failover
aws rds reboot-db-instance \
  --db-instance-identifier prod-db \
  --force-failover

### 3. Monitor Failover
Watch for status change to "available" (typically 2-5 minutes).

### 4. Verify Application
Check application health endpoints.
curl https://api.example.com/health

### 5. Update Incident
Note failover completion time in PagerDuty.

## Rollback
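A forced failover promotes the standby to primary, so there is nothing to undo. To move the primary back to its original Availability Zone, repeat step 2 once the instance is "available".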
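
Step 3 of the runbook above can be scripted rather than watched by hand. The following is a minimal sketch, assuming the AWS CLI is installed and configured and that `prod-db` is the instance identifier from the runbook. The helper names `db_status` and `wait_for_status`, and the attempt and delay values, are illustrative and not part of any AWS tooling.

```shell
# Print the current status of an RDS instance.
# --query and --output are standard AWS CLI options.
db_status() {
  aws rds describe-db-instances \
    --db-instance-identifier "$1" \
    --query 'DBInstances[0].DBInstanceStatus' \
    --output text
}

# wait_for_status INSTANCE WANTED MAX_ATTEMPTS DELAY_SECONDS
# Polls until the instance reports WANTED; returns 1 if it never does.
wait_for_status() {
  instance=$1
  wanted=$2
  attempts=$3
  delay=$4
  i=1
  while [ "$i" -le "$attempts" ]; do
    status=$(db_status "$instance") || status="unknown"
    echo "attempt $i: $status"
    if [ "$status" = "$wanted" ]; then
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt is coming
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}
```

A call such as `wait_for_status prod-db available 30 10` would poll every 10 seconds for up to 5 minutes. The AWS CLI also ships a built-in waiter, `aws rds wait db-instance-available --db-instance-identifier prod-db`, which may be enough when no per-attempt logging is needed.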

Deployment Documentation

Document deployment processes:

# Deployment Guide

## Deployment Pipeline

Code → Build → Test → Staging → Production

## Environments

| Environment | URL | Branch |
|-------------|-----|--------|
| Development | dev.example.com | develop |
| Staging | staging.example.com | main |
| Production | example.com | main (tagged) |

## Deploying to Staging

Staging deploys automatically when code merges to main.

## Deploying to Production

### Prerequisites
- All tests passing
- Staging verified
- Deployment window (Tuesday-Thursday, 10am-4pm)

### Steps

1. Create release tag
   git tag -a v1.2.3 -m "Release 1.2.3"
   git push origin v1.2.3

2. Monitor deployment
   - GitHub Actions: [link]
   - Deployment dashboard: [link]

3. Verify deployment
   curl https://example.com/health
   # Check version matches

4. Monitor metrics
   - Error rates
   - Response times
   - CPU/memory usage

## Rollback

### Automatic
Deployment rolls back if health checks fail.

### Manual
Deploy previous version:
./scripts/deploy.sh v1.2.2
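
The "Check version matches" comment in step 3 above can be turned into a check that fails loudly. This is a sketch only: it assumes the health endpoint returns JSON containing a `"version"` string field (the guide does not show the response format) and that the reported version omits the tag's leading `v`. `extract_version` and `verify_version` are hypothetical helper names.

```shell
# Pull the value of the "version" field out of a JSON document on stdin.
# A sed one-liner keeps the sketch dependency-free; a real JSON parser
# is the safer choice if the response is more complex.
extract_version() {
  sed -n 's/.*"version"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

# verify_version URL TAG
# Succeeds only if the deployed version equals TAG without its leading "v".
verify_version() {
  expected=${2#v}
  # -f: fail on HTTP errors, -sS: quiet but still show errors
  actual=$(curl -fsS "$1" | extract_version)
  if [ "$actual" = "$expected" ]; then
    echo "OK: running $actual"
  else
    echo "MISMATCH: expected $expected, got ${actual:-nothing}" >&2
    return 1
  fi
}
```

Usage after pushing the tag from step 1: `verify_version https://example.com/health v1.2.3`. A non-zero exit status makes the check usable as a pipeline gate.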

Monitoring Documentation

Document monitoring and alerting:

# Monitoring Guide

## Dashboards

| Dashboard | Purpose | Link |
|-----------|---------|------|
| Overview | System health | [Grafana] |
| API | API performance | [Grafana] |
| Database | DB metrics | [Grafana] |
| Errors | Error tracking | [Sentry] |

## Key Metrics

### API Response Time
- **Target**: p99 < 500ms
- **Alert**: p99 > 1s for 5 minutes
- **Dashboard**: API Performance

### Error Rate
- **Target**: < 0.1%
- **Alert**: > 1% for 5 minutes
- **Dashboard**: Error Tracking

### Database Connections
- **Target**: < 80% of max
- **Alert**: > 90% for 5 minutes
- **Dashboard**: Database

## Alert Response

### High Error Rate
1. Check Sentry for error details
2. Review recent deployments
3. Check dependent services
4. See runbook: [Error Rate Spike]

### High Latency
1. Check database query times
2. Review cache hit rates
3. Check external service latency
4. See runbook: [Latency Investigation]
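
The alert conditions above all have the form "over a threshold for 5 minutes", which means a single bad sample should not page anyone. The sketch below illustrates that semantics only; real alerting would be configured in the monitoring system itself. It assumes one sample per minute on stdin, and `sustained_breach` is a hypothetical name.

```shell
# sustained_breach THRESHOLD WINDOW
# Reads one numeric sample per line from stdin (assumed: one per minute).
# Exits 0 if WINDOW consecutive samples exceed THRESHOLD, otherwise 1.
sustained_breach() {
  awk -v limit="$1" -v window="$2" '
    {
      if ($1 + 0 > limit + 0) run++   # extend the current run of breaches
      else run = 0                    # any healthy sample resets the run
      if (run >= window) found = 1
    }
    END { exit (found ? 0 : 1) }
  '
}
```

For the error-rate alert (> 1% for 5 minutes), `printf '1.5\n1.6\n0.4\n1.7\n1.8\n1.9\n' | sustained_breach 1 5` does not fire, because the healthy 0.4 sample breaks the run before it reaches five.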

Incident Response

Document how to handle incidents:

# Incident Response Guide

## Severity Levels

| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV1 | Complete outage | Immediate | Site down, data loss |
| SEV2 | Major degradation | 15 minutes | Feature broken, slow |
| SEV3 | Minor issue | 4 hours | UI bug, minor feature |

## Incident Process

### 1. Acknowledge
Claim the incident in PagerDuty within the response time for its severity level.

### 2. Assess
- What's the impact?
- What's the scope?
- What changed recently?

### 3. Communicate
- Update status page
- Notify stakeholders
- Set expectations

### 4. Mitigate
Focus on restoring service, not root cause.

### 5. Resolve
Confirm service is restored and stable.

### 6. Post-Mortem
Schedule within 48 hours for SEV1/SEV2.

## Communication Templates

### Initial Status
We're investigating reports of [issue]. Updates to follow.

### Update
We've identified the cause as [cause]. Working on resolution.
ETA: [time].

### Resolution
The issue has been resolved. [Brief explanation].
We'll publish a post-mortem within 48 hours.
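
If paging or chat tooling needs the severity table in machine-readable form, it can be encoded directly. A small sketch that mirrors the table and the post-mortem rule above; the function names `response_target` and `needs_postmortem` are illustrative.

```shell
# response_target SEVERITY
# Prints the response time from the severity table.
response_target() {
  case "$1" in
    SEV1) echo "Immediate" ;;
    SEV2) echo "15 minutes" ;;
    SEV3) echo "4 hours" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# needs_postmortem SEVERITY
# Succeeds for the levels that require a post-mortem (SEV1 and SEV2).
needs_postmortem() {
  case "$1" in
    SEV1|SEV2) return 0 ;;
    *)         return 1 ;;
  esac
}
```

Keeping the lookup next to the written table makes it obvious when one is updated and the other is not.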

Writing DevOps Documentation

Be Precise

Vague instructions cause problems under pressure:

# Bad
Restart the service if there are issues.

# Good
sudo systemctl restart api-server

Verify the service is running:
sudo systemctl status api-server

Expected output should show "active (running)".
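
When the "Good" version is run from a script rather than typed by hand, the verification can be made explicit with `systemctl is-active`, which exits 0 only when the unit is active. A minimal sketch, assuming a systemd host and the `api-server` unit from the example; `restart_and_verify` is a hypothetical helper name.

```shell
# restart_and_verify SERVICE
# Restarts a systemd unit, then confirms it came back up.
restart_and_verify() {
  sudo systemctl restart "$1" || return 1
  # is-active --quiet prints nothing and reports state via exit status
  if systemctl is-active --quiet "$1"; then
    echo "$1 is active"
  else
    echo "$1 failed to start; inspect with: journalctl -u $1" >&2
    return 1
  fi
}
```

Usage: `restart_and_verify api-server`. The failure message points at the next diagnostic step, which is the same precision the section asks of written instructions.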

Include Context

Explain why, not just how:

## Database Connection Limits

Current limit: 100 connections

Why this limit:
- RDS instance supports max 150
- Reserve 30 for admin/monitoring
- Buffer of 20 for spikes

How to increase:
1. Modify the RDS parameter group
2. Update the connection pooler config
3. Restart the application
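
The reasoning behind the limit is simple arithmetic, and writing it out lets a reader re-derive the number when any input changes. The variable names below are illustrative; the values come from the example above.

```shell
# Connection budget from the example above
rds_max=150        # maximum the RDS instance supports
admin_reserved=30  # held back for admin and monitoring sessions
spike_buffer=20    # headroom for traffic spikes

app_limit=$((rds_max - admin_reserved - spike_buffer))
echo "application connection limit: $app_limit"
# prints: application connection limit: 100
```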

Assume Stress

People use runbooks during incidents:

  • Use numbered steps
  • One action per step
  • Include verification after each step
  • Provide rollback instructions

Infrastructure as Code Documentation

Document Your IaC

# Terraform Structure

infrastructure/
├── modules/
│   ├── vpc/           # Network configuration
│   ├── ecs/           # Container orchestration
│   └── rds/           # Database resources
├── environments/
│   ├── dev/           # Development environment
│   ├── staging/       # Staging environment
│   └── prod/          # Production environment
└── global/            # Shared resources

## Making Changes

1. Modify configuration
2. Run plan
   terraform plan -out=plan.tfplan

3. Review changes
4. Apply (requires approval for prod)
   terraform apply plan.tfplan

## Module Documentation

Each module has a README with:
- Purpose
- Inputs/outputs
- Example usage
- Dependencies
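
The plan and review steps under "Making Changes" can be wrapped so that a script knows whether there is anything to apply. This sketch relies on `terraform plan -detailed-exitcode`, which exits 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The function name `plan_changes` is hypothetical, and `plan.tfplan` is the file name used in the steps above.

```shell
# plan_changes
# Runs terraform plan, saves the plan, and reports whether it has changes.
plan_changes() {
  rc=0
  terraform plan -detailed-exitcode -out=plan.tfplan || rc=$?
  case "$rc" in
    0) echo "no changes" ;;
    2) echo "changes pending: review with 'terraform show plan.tfplan'" ;;
    *) echo "plan failed" >&2; return 1 ;;
  esac
}
```

Applying the saved file with `terraform apply plan.tfplan` then executes exactly what was reviewed, which is the reason the steps above write the plan to disk instead of planning again at apply time.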

Configuration Documentation

Document configuration options:

# Application Configuration

## Environment Variables

| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| REDIS_URL | Yes | - | Redis connection string |
| LOG_LEVEL | No | info | Logging level |
| MAX_WORKERS | No | 4 | Worker process count |

## Secrets Management

Secrets stored in AWS Secrets Manager:
- `prod/database` - Database credentials
- `prod/api-keys` - Third-party API keys
- `prod/jwt-secret` - JWT signing key

Rotation: Automated monthly for database credentials.
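
A documented variable table is easier to keep honest when a startup check enforces it. The sketch below assumes the application is launched from a shell wrapper; `check_config` is a hypothetical name, and the variables and defaults are taken from the table above.

```shell
# check_config
# Fails if a required variable is unset or empty; applies documented defaults.
check_config() {
  missing=0
  for name in DATABASE_URL REDIS_URL; do
    # Indirect lookup of the variable named in $name
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  # Optional variables: assign the default only if unset or empty
  : "${LOG_LEVEL:=info}"
  : "${MAX_WORKERS:=4}"
  export LOG_LEVEL MAX_WORKERS
  return "$missing"
}
```

Calling `check_config || exit 1` before starting the application reports every missing required variable in one pass rather than failing on the first.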

Keeping DevOps Docs Current

Update Triggers

Update documentation when:

  • Infrastructure changes
  • New services added
  • Incidents reveal gaps
  • Processes change
  • Team members change

Review Schedule

| Doc Type | Review Frequency |
|----------|------------------|
| Runbooks | After each use |
| Architecture | Quarterly |
| Deployment | After process changes |
| Incident response | After each incident |
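
A time-based review, such as the quarterly architecture check, can be prompted by listing files that have not changed recently. This is a rough sketch, assuming the docs are Markdown files under one directory and that file modification time is a fair proxy for last review; `stale_docs` is a hypothetical name.

```shell
# stale_docs DIR DAYS
# Lists Markdown files under DIR last modified more than DAYS days ago.
stale_docs() {
  find "$1" -type f -name '*.md' -mtime +"$2"
}
```

Usage: `stale_docs docs 90` for a quarterly cadence. In a fresh git checkout, modification times reflect the checkout rather than the last edit, so commit dates from `git log` are a better signal there.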

Summary

Effective DevOps documentation:

  • Documents infrastructure clearly
  • Provides step-by-step runbooks for operations
  • Covers deployment and rollback procedures
  • Explains monitoring and alerting
  • Prepares teams for incident response
  • Stays current with infrastructure changes

Good DevOps documentation helps teams operate confidently and respond to issues quickly.