Episode

Solving incidents with one-time ephemeral runbooks

Podcast: Adventures in DevOps
Published: Oct 20, 2025
Duration seconds: 2999
Processing state: processed
Canonical source: https://adventuresindevops.com/episodes/2025/10/20/solving-incidents-with-one-time-ephemeral-runbooks
Audio: https://dts.podtrac.com/redirect.mp3/api.spreaker.com/download/episode/68206117/download.mp3
JSON: /v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks
Markdown: /podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md

Actions

POST https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md
Read the agent-friendly Markdown representation of this episode resource.

Summary

Incident.io's Lawrence Jones explains how to move beyond static documentation by using AI-driven RAG to generate ephemeral, context-aware runbooks during outages. The discussion covers the transition from Heroku to GCP for high-compliance environments and the technical hurdles of automating incident investigation.

Topics

Incident Response
Retrieval-Augmented Generation
Site Reliability Engineering
DevOps Automation
Cloud Infrastructure
Service Catalogs
FinTech Compliance
AI Agents

Highlights

Main idea: Effective incident response in regulated industries like FinTech requires high-rigor processes and automated, verifiable documentation
Technical strategy: Use a knowledge graph—combining service catalogs, CRM data, and GitHub webhooks—to power RAG instead of relying solely on unstructured vector embeddings
Practical takeaway: Ephemeral runbooks should be dynamically generated by analyzing recent PR diffs, Slack discussions, and telemetry to surface relevant dashboards instantly
Failure mode: Posting low-confidence or hallucinated AI findings in incident channels; verify assumptions in the background first, for example by cloning the codebase, before alerting engineers
Lesson learned: The bottleneck in modern DevOps is no longer the LLM's reasoning capability, but the engineering effort required to build structured, modular data pipelines
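
As a rough illustration of the knowledge-graph strategy in the highlights above, the sketch below walks outward from an affected service and groups whatever it finds nearby (pull requests, dashboards, owning teams) into a throwaway runbook. It is a minimal sketch and not incident.io's implementation: the Node and KnowledgeGraph classes, the build_runbook function, and every service, PR, dashboard, and team name are hypothetical, and a real system would populate the graph from a service catalog, CRM data, and GitHub webhooks rather than by hand.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    kind: str  # e.g. "service", "pull_request", "dashboard", "team"
    name: str


class KnowledgeGraph:
    """Undirected graph of catalog entries (hypothetical, for illustration)."""

    def __init__(self):
        self._edges = {}

    def link(self, a, b):
        self._edges.setdefault(a, set()).add(b)
        self._edges.setdefault(b, set()).add(a)

    def neighbourhood(self, start, max_hops=2):
        """Breadth-first walk: map each node within max_hops to its distance."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            if seen[node] == max_hops:
                continue  # do not expand past the hop limit
            for nxt in self._edges.get(node, ()):
                if nxt not in seen:
                    seen[nxt] = seen[node] + 1
                    queue.append(nxt)
        return seen


def build_runbook(graph, service, max_hops=2):
    """Group nearby nodes by kind into a plain-text, one-off runbook."""
    sections = {}
    for node, hops in graph.neighbourhood(service, max_hops).items():
        if node != service:
            sections.setdefault(node.kind, []).append((hops, node.name))
    lines = [f"Ephemeral runbook for {service.name}"]
    for kind in sorted(sections):
        lines.append(f"{kind}:")
        for hops, name in sorted(sections[kind]):
            lines.append(f"  - {name} (hops: {hops})")
    return "\n".join(lines)


# Hypothetical data standing in for catalog and webhook feeds.
payments = Node("service", "payments-api")
ledger = Node("service", "ledger")
owners = Node("team", "ledger-owners")

graph = KnowledgeGraph()
graph.link(payments, Node("dashboard", "payments-latency"))
graph.link(payments, Node("pull_request", "retry-policy-change"))
graph.link(payments, ledger)
graph.link(ledger, owners)
graph.link(owners, Node("dashboard", "team-oncall-board"))  # three hops away

runbook = build_runbook(graph, payments, max_hops=2)
print(runbook)
```

The hop limit is the design choice worth noting: structured edges let the generator bound how far context spreads from the failing service, which is harder to do when retrieval relies only on embedding similarity.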

Chapters

1:00 The High Stakes of FinTech Incidents: Why regulatory obligations and the cost of downtime in financial services demand much stricter incident response discipline than in other industries.
12:20 Scaling Infrastructure for Security: The transition from Heroku to GCP and Kubernetes to meet the security and compliance needs of enterprise customers.
19:40 Automating Investigation with AI: An introduction to AI SRE and how the system uses a service catalog to guide automated incident investigations.
23:30 Building RAG with Knowledge Graphs: How to use GitHub integrations, webhooks, and universal adapters to feed relevant context into a RAG-based runbook generator.
31:10 Verifying AI Assumptions: The importance of using background processes to double-check PR changes and code diffs before presenting findings to responders.
35:00 The Future of LLMs in DevOps: Why the focus is shifting from model intelligence to the engineering of structured, modular systems and objective metrics.
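
The background verification described in the 31:10 chapter could look roughly like the sketch below, which checks each AI-generated finding against a local checkout before anything is posted to responders. This is an assumption-heavy illustration, not the system discussed in the episode: the Finding class, the verify and triage functions, and the file names and contents are all hypothetical, and a temporary directory stands in for a freshly cloned repository.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Finding:
    summary: str        # what the model believes happened
    path: str           # file the claim depends on, relative to the repo root
    expected_text: str  # text that must appear in that file for the claim to hold


def verify(finding, checkout):
    """Return True only if the claim is backed by the checked-out code."""
    target = Path(checkout) / finding.path
    if not target.is_file():
        return False
    return finding.expected_text in target.read_text(encoding="utf-8")


def triage(findings, checkout):
    """Split findings into those safe to post and those to hold back."""
    post, hold = [], []
    for finding in findings:
        (post if verify(finding, checkout) else hold).append(finding)
    return post, hold


# Demo: a temporary directory plays the role of the cloned codebase.
with tempfile.TemporaryDirectory() as repo:
    (Path(repo) / "config.py").write_text("TIMEOUT_SECONDS = 1\n", encoding="utf-8")
    findings = [
        Finding("Timeout was lowered to 1s", "config.py", "TIMEOUT_SECONDS = 1"),
        Finding("Retries were disabled", "client.py", "MAX_RETRIES = 0"),
    ]
    post, hold = triage(findings, repo)

posted = [f.summary for f in post]
held = [f.summary for f in hold]
print("post to channel:", posted)
print("hold for review:", held)
```

Only the first finding is posted, because the second refers to a file that does not exist in the checkout. A substring check is deliberately crude; the point is that the gate runs before responders are alerted, not after.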