Episode
Solving incidents with one-time ephemeral runbooks
- Podcast
- Adventures in DevOps
- Published
- Oct 20, 2025
- Duration seconds
- 2999
- Processing state
- processed
- Canonical source
- https://adventuresindevops.com/episodes/2025/10/20/solving-incidents-with-one-time-ephemeral-runbooks
Actions
POST https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md
Read the agent-friendly Markdown representation of this episode resource.
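As a rough sketch, the two actions above could be prepared from Python's standard library as follows. The helper names are hypothetical, and the page does not say whether the POST endpoint expects a request body, headers, or authentication, so none are set here; the requests are only constructed, not sent.

```python
from urllib.request import Request

BASE = "https://stenobird.com"
PODCAST = "adventures-in-devops"
EPISODE = "solving-incidents-with-one-time-ephemeral-runbooks"


def transcription_request(podcast: str, episode: str) -> Request:
    """Build (but do not send) the POST that asks for a transcript.

    Assumption: no body or auth header is required; the page does not say.
    """
    url = (
        f"{BASE}/v1/public/podcasts/{podcast}"
        f"/episodes/{episode}/transcription-requests"
    )
    return Request(url, method="POST")


def markdown_request(podcast: str, episode: str) -> Request:
    """Build (but do not send) the GET for the Markdown representation."""
    url = f"{BASE}/podcast/{podcast}/{episode}.md"
    return Request(url, method="GET")
```

Passing either object to `urllib.request.urlopen` would send it. Since the page describes the POST as idempotent, repeating it should be safe, though the response format is not documented here.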
Summary
Incident.io's Lawrence Jones explains how to move beyond static documentation by using AI-driven RAG to generate ephemeral, context-aware runbooks during outages. The discussion covers the transition from Heroku to GCP for high-compliance environments and the technical hurdles of automating incident investigation.
Topics
- Incident Response
- Retrieval-Augmented Generation
- Site Reliability Engineering
- DevOps Automation
- Cloud Infrastructure
- Service Catalogs
- FinTech Compliance
- AI Agents
Highlights
- Main idea: Effective incident response in regulated industries like FinTech requires high-rigor processes and automated, verifiable documentation
- Technical strategy: Use a knowledge graph—combining service catalogs, CRM data, and GitHub webhooks—to power RAG instead of relying solely on unstructured vector embeddings
- Practical takeaway: Ephemeral runbooks should be dynamically generated by analyzing recent PR diffs, Slack discussions, and telemetry to surface relevant dashboards instantly
- Failure mode: Avoid 'low-confidence' AI hallucinations in incident channels; implement background verification by cloning codebases to validate assumptions before alerting engineers
- Lesson learned: The bottleneck in modern DevOps is no longer the LLM's reasoning capability, but the engineering effort required to build structured, modular data pipelines
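To make the highlights above concrete, here is a minimal sketch of the general idea: walk a small knowledge graph outward from the affected service to gather context, then include only findings that clear a confidence threshold and pass a verification check against the PR diff. This is not Incident.io's implementation. The graph contents, the `Finding` type, the 0.7 threshold, and the substring-based verification are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    summary: str          # what the investigation claims
    evidence_path: str    # file the claim depends on (hypothetical field)
    confidence: float     # 0.0 to 1.0, hypothetical score


# Hypothetical knowledge graph: node -> list of (relation, target) edges.
GRAPH = {
    "service:payments-api": [
        ("has_dashboard", "dashboard:payments-latency"),
        ("owned_by", "team:payments"),
        ("recent_pr", "pr:example-1"),
    ],
    "pr:example-1": [("touches", "file:billing/retry.py")],
}


def collect_context(graph, start, max_depth=2):
    """Breadth-first walk returning (source, relation, target) edges."""
    seen = {start}
    frontier = [start]
    edges = []
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for relation, target in graph.get(node, []):
                edges.append((node, relation, target))
                if target not in seen:
                    seen.add(target)
                    next_frontier.append(target)
        frontier = next_frontier
    return edges


def is_verified(finding, diff_text):
    """Stand-in for background verification: the cited file must be in the diff."""
    return finding.evidence_path in diff_text


def build_runbook(service, graph, findings, diff_text, min_confidence=0.7):
    """Assemble a one-off runbook: graph context plus verified findings only."""
    lines = [f"Runbook for {service}"]
    for _source, relation, target in collect_context(graph, service):
        lines.append(f"{relation}: {target}")
    for finding in findings:
        if finding.confidence >= min_confidence and is_verified(finding, diff_text):
            lines.append(f"finding: {finding.summary}")
    return lines
```

A real system would replace `is_verified` with something much stronger, such as the background codebase checks the episode describes. The point of the sketch is only that unverified or low-confidence findings never reach the runbook.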
Chapters
- 1:00 The High Stakes of FinTech Incidents: Why regulatory obligations and the cost of downtime in financial services demand much higher incident response discipline than other industries.
- 12:20 Scaling Infrastructure for Security: The transition from Heroku to GCP and Kubernetes to meet the security and compliance needs of enterprise customers.
- 19:40 Automating Investigation with AI: An introduction to AI SRE and how the system uses a service catalog to guide automated incident investigations.
- 23:30 Building RAG with Knowledge Graphs: How to use GitHub integrations, webhooks, and universal adapters to feed relevant context into a RAG-based runbook generator.
- 31:10 Verifying AI Assumptions: The importance of using background processes to double-check PR changes and code diffs before presenting findings to responders.
- 35:00 The Future of LLMs in DevOps: Why the focus is shifting from model intelligence to the engineering of structured, modular systems and objective metrics.