# Solving incidents with one-time ephemeral runbooks

Page: https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks
Text version: https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md
Podcast: [Adventures in DevOps](https://stenobird.com/podcast/adventures-in-devops)
Published: 2025-10-20T00:00:00+00:00
Episode link: https://adventuresindevops.com/episodes/2025/10/20/solving-incidents-with-one-time-ephemeral-runbooks
Audio file: https://dts.podtrac.com/redirect.mp3/api.spreaker.com/download/episode/68206117/download.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks
Duration seconds: 2999

## Resource

Incident.io's Lawrence Jones explains how to move beyond static documentation by using AI-driven RAG to generate ephemeral, context-aware runbooks during outages. The discussion covers the transition from Heroku to GCP for high-compliance environments and the technical hurdles of automating incident investigation.
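To make the idea of a one-time, context-aware runbook concrete, here is a minimal sketch of the retrieval step only: a structured lookup over a small service catalog that collects dashboards and recently merged changes for the affected service and its dependencies. It makes no model call, and it is not incident.io's implementation. Every name (`Service`, `PullRequest`, `related_services`, `build_runbook`), the 24-hour window, and the data shapes are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class PullRequest:
    # Hypothetical record of a merged change; not a real GitHub payload.
    title: str
    merged_at: datetime


@dataclass
class Service:
    # Hypothetical service-catalog entry.
    name: str
    dashboards: list[str] = field(default_factory=list)
    depends_on: list[str] = field(default_factory=list)
    recent_prs: list[PullRequest] = field(default_factory=list)


def related_services(catalog: dict[str, Service], start: str) -> list[str]:
    """Follow depends_on edges breadth-first, starting at the alerting service."""
    seen = [start]
    queue = [start]
    while queue:
        current = queue.pop(0)
        for dep in catalog[current].depends_on:
            if dep in catalog and dep not in seen:
                seen.append(dep)
                queue.append(dep)
    return seen


def build_runbook(
    catalog: dict[str, Service],
    service: str,
    alert_time: datetime,
    window: timedelta = timedelta(hours=24),  # assumed look-back window
) -> str:
    """Assemble a throwaway runbook from catalog context for one incident."""
    lines = [f"Runbook for {service} (generated {alert_time.isoformat()})"]
    for name in related_services(catalog, service):
        node = catalog[name]
        for url in node.dashboards:
            lines.append(f"- Check dashboard for {name}: {url}")
        for pr in node.recent_prs:
            age = alert_time - pr.merged_at
            # Keep only changes merged inside the look-back window.
            if timedelta(0) <= age <= window:
                lines.append(f"- Review recent change in {name}: {pr.title}")
    return "\n".join(lines)
```

In a fuller pipeline, the returned text would be one input to a generation step rather than the final runbook; the sketch stops at retrieval so that every line of output can be traced back to a catalog entry.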
## Highlights

- Main idea: Effective incident response in regulated industries like FinTech requires high-rigor processes and automated, verifiable documentation
- Technical strategy: Use a knowledge graph (combining service catalogs, CRM data, and GitHub webhooks) to power RAG instead of relying solely on unstructured vector embeddings
- Practical takeaway: Ephemeral runbooks should be dynamically generated by analyzing recent PR diffs, Slack discussions, and telemetry to surface relevant dashboards instantly
- Failure mode: Avoid 'low-confidence' AI hallucinations in incident channels; implement background verification by cloning codebases to validate assumptions before alerting engineers
- Lesson learned: The bottleneck in modern DevOps is no longer the LLM's reasoning capability, but the engineering effort required to build structured, modular data pipelines

## Topics

Incident Response, Retrieval-Augmented Generation, Site Reliability Engineering, DevOps Automation, Cloud Infrastructure, Service Catalogs, FinTech Compliance, AI Agents

## Chapters

- 1:00 - The High Stakes of FinTech Incidents: Why regulatory obligations and the cost of downtime in financial services demand much higher incident response discipline than other industries.
- 12:20 - Scaling Infrastructure for Security: The transition from Heroku to GCP and Kubernetes to meet the security and compliance needs of enterprise customers.
- 19:40 - Automating Investigation with AI: An introduction to AI SRE and how the system uses a service catalog to guide automated incident investigations.
- 23:30 - Building RAG with Knowledge Graphs: How to use GitHub integrations, webhooks, and universal adapters to feed relevant context into a RAG-based runbook generator.
- 31:10 - Verifying AI Assumptions: The importance of using background processes to double-check PR changes and code diffs before presenting findings to responders.
- 35:00 - The Future of LLMs in DevOps: Why the focus is shifting from model intelligence to the engineering of structured, modular systems and objective metrics.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks/transcription-requests` - Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md` - Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
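
Because the transcript is not published here, an agent that needs one has to call `request_transcript` itself. The following is a minimal sketch of how the two listed actions might be issued from Python's standard library. The URLs and HTTP methods come from this page; the helper names, the empty POST body, and the timeout are assumptions, since the page does not document request bodies, headers, or response formats.

```python
from urllib.request import Request, urlopen

BASE = "https://stenobird.com"
EPISODE_PATH = (
    "/podcast/adventures-in-devops/"
    "solving-incidents-with-one-time-ephemeral-runbooks"
)
API_PATH = (
    "/v1/public/podcasts/adventures-in-devops/episodes/"
    "solving-incidents-with-one-time-ephemeral-runbooks"
)


def build_read_markdown() -> Request:
    """Build the read_markdown action: GET the Markdown version of the page."""
    return Request(f"{BASE}{EPISODE_PATH}.md", method="GET")


def build_request_transcript() -> Request:
    """Build the request_transcript action.

    Assumption: an empty body is sent, because no payload is documented.
    The action is described as idempotent, so repeating it should be safe.
    """
    return Request(
        f"{BASE}{API_PATH}/transcription-requests",
        data=b"",
        method="POST",
    )


def send(req: Request, timeout: float = 10.0) -> tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

A caller would run `send(build_request_transcript())` once and later `send(build_read_markdown())` to re-read the page. Building the requests separately from sending them keeps the URLs inspectable without touching the network.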