DevOps Documentation¶
DevOps documentation bridges development and operations. It covers infrastructure, deployment processes, monitoring, and incident response. Good DevOps docs help teams deploy confidently and respond to issues quickly.
Types of DevOps Documentation¶
Infrastructure Documentation¶
Document your infrastructure:
# Infrastructure Overview
## Architecture
[Diagram of infrastructure components]
## Components
### Web Servers
- **Type**: AWS EC2 t3.large
- **Count**: 3 (auto-scaling 3-10)
- **Region**: us-east-1
- **AMI**: ami-0123456789
### Database
- **Type**: AWS RDS PostgreSQL 15
- **Instance**: db.r6g.xlarge
- **Storage**: 500GB SSD
- **Multi-AZ**: Yes
### Cache
- **Type**: AWS ElastiCache Redis
- **Node**: cache.r6g.large
- **Cluster**: 3 nodes
Runbooks¶
Step-by-step operational procedures:
# Runbook: Database Failover
## When to Use
Use this runbook when the primary database is unresponsive
and automatic failover hasn't occurred.
## Prerequisites
- AWS console access
- Database admin credentials
- PagerDuty access to update incident
## Steps
### 1. Verify Primary is Down
aws rds describe-db-instances --db-instance-identifier prod-db
Check `DBInstanceStatus`. If not "available", proceed.
### 2. Initiate Failover
aws rds reboot-db-instance \
--db-instance-identifier prod-db \
--force-failover
### 3. Monitor Failover
Watch for status change to "available" (typically 2-5 minutes).
### 4. Verify Application
Check application health endpoints.
curl https://api.example.com/health
### 5. Update Incident
Note failover completion time in PagerDuty.
## Rollback
Failover is automatic. No manual rollback needed.
Deployment Documentation¶
Document deployment processes:
# Deployment Guide
## Deployment Pipeline
Code → Build → Test → Staging → Production
## Environments
| Environment | URL | Branch |
|-------------|-----|--------|
| Development | dev.example.com | develop |
| Staging | staging.example.com | main |
| Production | example.com | main (tagged) |
## Deploying to Staging
Staging deploys automatically when code merges to main.
## Deploying to Production
### Prerequisites
- All tests passing
- Staging verified
- Deployment window (Tuesday-Thursday, 10am-4pm)
### Steps
1. Create release tag
git tag -a v1.2.3 -m "Release 1.2.3"
git push origin v1.2.3
2. Monitor deployment
- GitHub Actions: [link]
- Deployment dashboard: [link]
3. Verify deployment
curl https://example.com/health
# Check version matches
4. Monitor metrics
- Error rates
- Response times
- CPU/memory usage
## Rollback
### Automatic
Deployment rolls back if health checks fail.
### Manual
Deploy previous version:
./scripts/deploy.sh v1.2.2
Monitoring Documentation¶
Document monitoring and alerting:
# Monitoring Guide
## Dashboards
| Dashboard | Purpose | Link |
|-----------|---------|------|
| Overview | System health | [Grafana] |
| API | API performance | [Grafana] |
| Database | DB metrics | [Grafana] |
| Errors | Error tracking | [Sentry] |
## Key Metrics
### API Response Time
- **Target**: p99 < 500ms
- **Alert**: p99 > 1s for 5 minutes
- **Dashboard**: API Performance
### Error Rate
- **Target**: < 0.1%
- **Alert**: > 1% for 5 minutes
- **Dashboard**: Error Tracking
### Database Connections
- **Target**: < 80% of max
- **Alert**: > 90% for 5 minutes
- **Dashboard**: Database
## Alert Response
### High Error Rate
1. Check Sentry for error details
2. Review recent deployments
3. Check dependent services
4. See runbook: [Error Rate Spike]
### High Latency
1. Check database query times
2. Review cache hit rates
3. Check external service latency
4. See runbook: [Latency Investigation]
Incident Response¶
Document how to handle incidents:
# Incident Response Guide
## Severity Levels
| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| SEV1 | Complete outage | Immediate | Site down, data loss |
| SEV2 | Major degradation | 15 minutes | Feature broken, slow |
| SEV3 | Minor issue | 4 hours | UI bug, minor feature |
## Incident Process
### 1. Acknowledge
Claim the incident in PagerDuty within SLA.
### 2. Assess
- What's the impact?
- What's the scope?
- What changed recently?
### 3. Communicate
- Update status page
- Notify stakeholders
- Set expectations
### 4. Mitigate
Focus on restoring service, not root cause.
### 5. Resolve
Confirm service is restored and stable.
### 6. Post-Mortem
Schedule within 48 hours for SEV1/SEV2.
## Communication Templates
### Initial Status
We're investigating reports of [issue]. Updates to follow.
### Update
We've identified the cause as [cause]. Working on resolution.
ETA: [time].
### Resolution
The issue has been resolved. [Brief explanation].
We'll publish a post-mortem within 48 hours.
Writing DevOps Documentation¶
Be Precise¶
Vague instructions cause problems under pressure:
# Bad
Restart the service if there are issues.
# Good
sudo systemctl restart api-server
Verify the service is running:
sudo systemctl status api-server
Expected output should show "active (running)".
Include Context¶
Explain why, not just how:
## Database Connection Limits
Current limit: 100 connections
Why this limit:
- RDS instance supports max 150
- Reserve 30 for admin/monitoring
- Buffer of 20 for spikes
How to increase:
1. Modify RDS parameter group
2. Connection pooler config update
3. Application restart required
Assume Stress¶
People use runbooks during incidents:
- Use numbered steps
- One action per step
- Include verification after each step
- Provide rollback instructions
Infrastructure as Code Documentation¶
Document Your IaC¶
# Terraform Structure
infrastructure/
├── modules/
│ ├── vpc/ # Network configuration
│ ├── ecs/ # Container orchestration
│ └── rds/ # Database resources
├── environments/
│ ├── dev/ # Development environment
│ ├── staging/ # Staging environment
│ └── prod/ # Production environment
└── global/ # Shared resources
## Making Changes
1. Modify configuration
2. Run plan
terraform plan -out=plan.tfplan
3. Review changes
4. Apply (requires approval for prod)
terraform apply plan.tfplan
## Module Documentation
Each module has a README with:
- Purpose
- Inputs/outputs
- Example usage
- Dependencies
Configuration Documentation¶
Document configuration options:
# Application Configuration
## Environment Variables
| Variable | Required | Default | Description |
|----------|----------|---------|-------------|
| DATABASE_URL | Yes | - | PostgreSQL connection string |
| REDIS_URL | Yes | - | Redis connection string |
| LOG_LEVEL | No | info | Logging level |
| MAX_WORKERS | No | 4 | Worker process count |
## Secrets Management
Secrets stored in AWS Secrets Manager:
- `prod/database` - Database credentials
- `prod/api-keys` - Third-party API keys
- `prod/jwt-secret` - JWT signing key
Rotation: Automated monthly for database credentials.
Keeping DevOps Docs Current¶
Update Triggers¶
Update documentation when:
- Infrastructure changes
- New services added
- Incidents reveal gaps
- Processes change
- Team members change
Review Schedule¶
| Doc Type | Review Frequency |
|---|---|
| Runbooks | After each use |
| Architecture | Quarterly |
| Deployment | After process changes |
| Incident response | After each incident |
Summary¶
Effective DevOps documentation:
- Documents infrastructure clearly
- Provides step-by-step runbooks for operations
- Covers deployment and rollback procedures
- Explains monitoring and alerting
- Prepares teams for incident response
- Stays current with infrastructure changes
Good DevOps documentation helps teams operate confidently and respond to issues quickly.