---
title: "Operations Runbook"
description: "Build a comprehensive operations runbook covering deployment, monitoring, incident response, and rollback procedures. This lesson simulates real incidents and teaches you to operate a production bot with confidence."
canonical_url: "https://vercel.com/academy/slack-agents/operations-runbook"
md_url: "https://vercel.com/academy/slack-agents/operations-runbook.md"
docset_id: "vercel-academy"
doc_version: "1.0"
last_updated: "2026-04-09T11:08:45.301Z"
content_type: "lesson"
course: "slack-agents"
course_title: "Slack Agents on Vercel with the AI SDK"
prerequisites:  []
---

# Operations Runbook

# Build Your Operations Runbook So Anyone Can Fix Outages

Your bot just crashed during the quarterly board meeting demo. The on-call engineer doesn't know TypeScript. They need a step-by-step guide to diagnose, mitigate, and resolve the incident. Without a runbook, they're guessing. With one, they're following a proven playbook that gets the bot back online in minutes, not hours.

## Outcome

Create a comprehensive `RUNBOOK.md` with SLOs and incident procedures, then simulate a production incident to validate your response process.

## Fast Track

1. Create `RUNBOOK.md` with sections: Setup, Secrets, Deploy, Incidents, Rollback
2. Define SLOs: ack < 3s (p99), response < 15s (p95), error rate < 1%
3. Simulate a rate limit incident and follow the runbook to resolution

## Building on Previous Lessons

Your runbook leverages everything we've built:

- **From [error handling](./error-handling-and-resilience)**: Retry logic and rate limit handling procedures
- **From [deploy to Vercel](./deploy-to-vercel)**: Deployment and rollback commands
- **From [structured logs](./scopes-and-structured-logs)**: Structured logs for incident investigation

## Hands-On Exercise 5.3

Create an operations runbook and validate it with incident simulation:

**Requirements:**

1. Create `RUNBOOK.md` with all operational procedures
2. Define SLOs with specific thresholds
3. Document incident response flowchart
4. Include rollback procedures with verification steps
5. Simulate a rate limit incident and resolve using the runbook

**Implementation hints:**

- Use actual Vercel commands and log queries
- Include correlation ID search examples
- Add specific error patterns to look for
- Create a decision tree for common failures
- Test the runbook by following it exactly
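
A decision tree for common failures can be as small as a function that maps an error pattern to the next runbook step. The sketch below is illustrative only: the `LogEntry` shape, the error strings, and the action names are assumptions, not fields or identifiers taken from the course code.

```typescript
// Hypothetical shape of a structured log entry; field names are assumptions.
interface LogEntry {
  error?: string;      // e.g. 'rate_limited', 'timeout', 'missing_scope'
  statusCode?: number; // HTTP status from the upstream call, if any
}

// Hypothetical action names, one per branch of the decision tree.
type RunbookAction =
  | 'apply-backoff'
  | 'check-provider-status'
  | 'update-manifest'
  | 'escalate';

// Map an error pattern to the next runbook step.
// Checks run top to bottom, so the first matching pattern wins.
function nextAction(entry: LogEntry): RunbookAction {
  if (entry.statusCode === 429 || entry.error === 'rate_limited') {
    return 'apply-backoff';
  }
  const isServerError = entry.statusCode !== undefined && entry.statusCode >= 500;
  if (isServerError || entry.error === 'timeout') {
    return 'check-provider-status';
  }
  if (entry.error === 'missing_scope') {
    return 'update-manifest';
  }
  // Anything unrecognized goes to a human.
  return 'escalate';
}
```

Keeping the tree in code is optional. The point is that every branch ends in a concrete action, including an explicit fallback for unknown errors.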

**SLOs to define:**

```yaml
slos:
  acknowledgment:
    target: 99%
    threshold: 3000ms
    measurement: "Time to ack() Slack events"
  
  response_time:
    target: 95%
    threshold: 15000ms
    measurement: "Time from event to final response"
  
  error_rate:
    target: < 1%
    measurement: "Percentage of failed responses"
  
  availability:
    target: 99.9%
    measurement: "Bot responding to mentions"
```
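
To make these targets checkable, each SLO can be evaluated against per-request records derived from your structured logs. The following is a minimal sketch under assumptions: the `RequestRecord` shape and the function names are hypothetical, and the thresholds are copied from the YAML above.

```typescript
// Hypothetical per-request record derived from structured logs.
interface RequestRecord {
  ackMs: number;      // time to ack() the Slack event
  responseMs: number; // time from event to final response
  failed: boolean;    // true if the response ultimately failed
}

// Fraction (0..1) of values strictly below the threshold.
function fractionWithin(values: number[], thresholdMs: number): number {
  if (values.length === 0) return 1; // no traffic means nothing violated the SLO
  return values.filter((v) => v < thresholdMs).length / values.length;
}

// Compare observed behavior with the SLO targets defined above.
function evaluateSlos(records: RequestRecord[]) {
  const ackOk = fractionWithin(records.map((r) => r.ackMs), 3000);
  const responseOk = fractionWithin(records.map((r) => r.responseMs), 15000);
  const errorRate =
    records.length === 0
      ? 0
      : records.filter((r) => r.failed).length / records.length;

  return {
    acknowledgment: { observed: ackOk, met: ackOk >= 0.99 },
    responseTime: { observed: responseOk, met: responseOk >= 0.95 },
    errorRate: { observed: errorRate, met: errorRate < 0.01 },
  };
}
```

Availability is left out because it is usually measured by an external probe rather than computed from request logs.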

## Try It

1. **Create comprehensive runbook:**

   ````markdown title="/slack-agent/RUNBOOK.md"
   # Slack Bot Operations Runbook

   ## Quick Reference
   - **Production URL**: https://slack-bot-prod.vercel.app
   - **Health Check**: https://slack-bot-prod.vercel.app/health
   - **Logs**: https://vercel.com/team/slack-bot-prod/functions
   - **On-call**: @oncall-slack-bot (PagerDuty)

   ## SLOs (Service Level Objectives)

   | Metric | Target | Threshold | Alert |
   |--------|--------|-----------|-------|
   | Event Acknowledgment | 99% | < 3s | PagerDuty High |
   | Response Time (p95) | 95% | < 15s | PagerDuty Low |
   | Error Rate | < 1% | - | PagerDuty Medium |
   | Availability | 99.9% | - | PagerDuty High |

   ## Common Issues Quick Fix

   ### Bot Not Responding
   1. Check health endpoint: `curl https://slack-bot-prod.vercel.app/health`
   2. Verify in Vercel dashboard: Functions tab → Check for errors
   3. Check Slack App config: Event Subscriptions → URL verified?
   4. If URL not verified: Redeploy with `pnpm dlx vercel --prod --force`

   ### Rate Limit Errors (429)
   1. Check logs for `rateLimitWaitMs` > 0
   2. Verify retry logic: `grep "retryAttempt" logs | tail -20`
   3. Temporary mitigation: Scale down concurrent requests
   4. Long-term: Implement request queuing

   ## Deployment Procedures

   ### Normal Deploy
   ```bash
   git pull origin main
   pnpm test
   pnpm dlx vercel --prod
   # Verify: curl https://slack-bot-prod.vercel.app/health
   ```

   ### Emergency Rollback

   ```bash
   # List recent deployments
   pnpm dlx vercel ls

   # Rollback to previous version
   pnpm dlx vercel rollback

   # Verify rollback
   curl https://slack-bot-prod.vercel.app/health
   # Check logs for normal operation
   ```

   ## Incident Response Flowchart

   ```
   ALERT FIRED
       ↓
   [Check Health] → Failed → [Check Vercel Status]
       ↓ OK                      ↓
   [Check Logs]              [Await Resolution]
       ↓
   [Correlation Search]
       ↓
   Error Pattern?
     ├─ 429/Rate Limit → [Apply Backoff]
     ├─ 5xx/Timeout → [Check OpenAI Status]
     ├─ Missing Scope → [Update Manifest]
     └─ Unknown → [Escalate to Senior]
   ```

   ## Log Investigation Commands

   ### Find Recent Errors

   ```
   Filter: level:50 OR level:40
   Time: Last 1 hour
   ```

   ### Track Specific Request

   ```
   Filter: correlationId:"EVENT_ID_TIMESTAMP"
   Shows: Full request lifecycle
   ```

   ### Check AI Performance

   ```
   Filter: operation:respondToMessage
   Aggregate: AVG(latencyMs), MAX(retryAttempt)
   ```

   ## Secret Rotation

   1. Generate new token in Slack App Config
   2. Update in Vercel: `pnpm dlx vercel env rm SLACK_BOT_TOKEN`
   3. Add new: `pnpm dlx vercel env add SLACK_BOT_TOKEN`
   4. Redeploy: `pnpm dlx vercel --prod --force`
   5. Verify: Test bot mention in Slack

   ## Monitoring Setup

   ### Vercel Monitoring

   - Enable Monitoring in project settings
   - Set alert for Function errors > 1%
   - Set alert for Function duration > 10s (p95)

   ### Custom Health Checks

   - Endpoint: `/health`
   - Frequency: Every 60 seconds
   - Alert: 2 consecutive failures

   ## Incident Communication

   ### Status Updates

   - Initial: "#incidents - Investigating bot responsiveness issues"
   - Update: "#incidents - Identified rate limiting, applying fixes"
   - Resolution: "#incidents - Resolved, bot operating normally"

   ### Post-Mortem Template

   - **Duration**: Start time - End time
   - **Impact**: % of requests affected
   - **Root Cause**: Specific technical issue
   - **Resolution**: Steps taken
   - **Prevention**: Long-term fixes
   ````

2. **Simulate rate limit incident:**

   ```typescript title="/slack-agent/scripts/simulate-incident.ts"
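   import { WebClient } from '@slack/web-api';

   // Assumes SLACK_BOT_TOKEN is set; run this against a test workspace only
   const client = new WebClient(process.env.SLACK_BOT_TOKEN);

   // Trigger multiple rapid requests to hit rate limit
   for (let i = 0; i < 50; i++) {
     await client.chat.postMessage({
       channel: 'C_TEST_CHANNEL', // placeholder: replace with a real test channel ID
       text: `@bot test message ${i}`
     });
   }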
   ```

   Expected logs:

   ```
   [WARN] bolt-app {
     correlationId: 'ev_1234_1733456789',
     operation: 'respondToMessage',
     error: 'rate_limited',
     retryAfter: 30000,
     retryAttempt: 1
   } Rate limited, waiting 30s

   [INFO] bolt-app {
     correlationId: 'ev_1234_1733456789',
     operation: 'respondToMessage', 
     retryAttempt: 2,
     rateLimitWaitMs: 30000,
     model: 'gpt-4o-mini'
   } Retry successful after backoff
   ```

3. **Follow runbook to resolve:**
   ```bash
   # 1. Identify issue in logs
   pnpm dlx vercel logs --follow | grep "rate_limited"

   # 2. Check retry metrics
   # Filter: retryAttempt:>0
   # Shows: 47 requests with retries

   # 3. Verify backoff working
   # Filter: rateLimitWaitMs:>0
   # Shows: Proper exponential backoff applied

   # 4. Confirm resolution
   # Recent logs show normal operation
   ```

4. **Update incident log:**
   ```markdown title="/slack-agent/INCIDENTS.md"
   ## 2024-12-06: Rate Limit Event

   **Duration**: 14:30 - 14:45 UTC (15 minutes)
   **Impact**: 47 messages delayed by 5-30 seconds
   **Root Cause**: Burst of 50 messages exceeded OpenAI rate limit

   **Timeline**:
   - 14:30 - Alert: Error rate spike to 94%
   - 14:31 - Identified rate_limited errors in logs
   - 14:32 - Confirmed retry logic engaging
   - 14:35 - Backoff successful, queue clearing
   - 14:45 - Normal operation restored

   **Resolution**: 
   - Automatic retry with exponential backoff handled issue
   - No manual intervention required

   **Action Items**:
   - [ ] Implement request queue to smooth bursts
   - [ ] Add pre-emptive rate limit monitoring
   - [ ] Update alerts to distinguish retriable errors
   ```
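
The impact figures in an incident log are easier to trust when they are computed from the logs rather than estimated. Below is a rough sketch of that calculation. The `RetryLog` shape mirrors the fields shown in the expected logs above, but the type and function names are hypothetical, and how you export log entries from Vercel is left open.

```typescript
// Hypothetical structured log entry; fields mirror the expected logs above.
interface RetryLog {
  correlationId: string;
  retryAttempt?: number;
  rateLimitWaitMs?: number;
}

// Summarize how many distinct requests were retried and the longest wait.
function summarizeIncident(logs: RetryLog[], totalRequests: number) {
  const retried = new Set<string>();
  let maxWaitMs = 0;

  for (const log of logs) {
    // Count each request once, no matter how many retry lines it produced.
    if ((log.retryAttempt ?? 0) > 0) retried.add(log.correlationId);
    maxWaitMs = Math.max(maxWaitMs, log.rateLimitWaitMs ?? 0);
  }

  return {
    affectedRequests: retried.size,
    affectedPercent:
      totalRequests === 0
        ? 0
        : Math.round((retried.size / totalRequests) * 100),
    maxWaitMs,
  };
}
```

Grouping by `correlationId` matters here: a single request that retried twice should count as one affected message, not two.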

## Troubleshooting

**Runbook Not Helping:**

- Ensure it covers actual production scenarios
- Add more specific error patterns from real incidents
- Include actual commands that work, not theoretical ones

**SLO Measurement Issues:**

- Use structured logs to calculate metrics accurately
- Set up Vercel Analytics for automatic tracking
- Consider external monitoring (Datadog, New Relic)

**Incident Simulation Too Disruptive:**
Use a staging environment:

```bash
# Deploy a preview build (omitting --prod targets the preview environment)
pnpm dlx vercel

# Run tests against the preview URL printed by the deploy command
SLACK_BOT_URL=https://staging.vercel.app pnpm run test:incident
```

## Commit

```bash
git add -A
git commit -m "feat(ops): comprehensive operations runbook with incident procedures

- Create RUNBOOK.md with deployment and rollback procedures
- Define SLOs: 3s ack (p99), 15s response (p95), <1% errors
- Document incident response flowchart
- Add log investigation queries and commands
- Include secret rotation and monitoring setup
- Validate with simulated rate limit incident"
```

## Done-When

- [x] `RUNBOOK.md` exists with all sections
- [x] SLOs defined with specific thresholds
- [x] Incident response flowchart documented
- [x] Rollback procedure tested and verified
- [x] Rate limit incident simulated and resolved

## Solution

The complete runbook structure includes:

1. **Quick reference section** with all critical URLs and contacts
2. **SLO definitions** with specific metrics and thresholds
3. **Common issues** with step-by-step fixes
4. **Deployment procedures** including rollback
5. **Incident flowchart** for systematic response
6. **Log queries** for investigation
7. **Secret rotation** procedures
8. **Monitoring setup** instructions
9. **Communication templates** for incidents

Key operational commands:

```bash
# Health check
curl https://slack-bot-prod.vercel.app/health

# View logs
pnpm dlx vercel logs --follow

# List deployments
pnpm dlx vercel ls

# Rollback
pnpm dlx vercel rollback

# Force redeploy
pnpm dlx vercel --prod --force
```
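
The runbook's custom health check rule (poll `/health` every 60 seconds, alert after 2 consecutive failures) can be sketched as a small state machine. This is one possible shape, not a prescribed implementation: `recordCheck`, `startMonitor`, and the `notify` hook are hypothetical names, and the actual probe is injected so the alerting logic stays testable.

```typescript
interface HealthState {
  consecutiveFailures: number;
}

// Pure state transition: returns the next state and whether to alert.
function recordCheck(state: HealthState, ok: boolean, alertAfter = 2) {
  const consecutiveFailures = ok ? 0 : state.consecutiveFailures + 1;
  return {
    state: { consecutiveFailures },
    // Alert once, when the failure streak first reaches the limit.
    shouldAlert: consecutiveFailures === alertAfter,
  };
}

// Poll on an interval. `check` resolves to true when the service is healthy;
// `notify` is a hypothetical alert hook (PagerDuty, a Slack webhook, ...).
function startMonitor(
  check: () => Promise<boolean>,
  notify: (message: string) => void,
  intervalMs = 60_000,
) {
  let state: HealthState = { consecutiveFailures: 0 };

  return setInterval(async () => {
    // Treat a rejected check the same as an unhealthy response.
    const ok = await check().catch(() => false);
    const result = recordCheck(state, ok);
    state = result.state;
    if (result.shouldAlert) notify('Health check failed twice in a row');
  }, intervalMs);
}
```

On Node 18 or later, `check` could be as simple as `() => fetch(healthUrl).then((res) => res.ok)`. Alerting only when the streak first reaches the limit avoids paging the on-call engineer again on every subsequent failed poll.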

**Side Quest: Chaos Engineering: Break It Before Users Do**

## Key Takeaways

- Runbooks must be tested regularly or they become fiction
- SLOs should be measurable from your existing logs
- Incident response is about systematic investigation, not heroes
- Rollback capability is mandatory for production systems
- Chaos engineering finds problems before your users do


