# Solving incidents with one-time ephemeral runbooks

Page: https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks
Text version: https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md
Podcast: [Adventures in DevOps](https://stenobird.com/podcast/adventures-in-devops)
Published: 2025-10-20T00:00:00+00:00
Episode link: https://adventuresindevops.com/episodes/2025/10/20/solving-incidents-with-one-time-ephemeral-runbooks
Audio file: https://dts.podtrac.com/redirect.mp3/api.spreaker.com/download/episode/68206117/download.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks
Duration seconds: 2999

## Resource

Incident.io's Lawrence Jones explains how to move beyond static documentation by using retrieval-augmented generation (RAG) to produce ephemeral, context-aware runbooks during an outage. The discussion also covers the move from Heroku to GCP to serve customers with strict compliance requirements, and the technical hurdles of automating incident investigation.

## Highlights
- Main idea: Effective incident response in regulated industries like FinTech requires rigorous processes and automated, verifiable documentation
- Technical strategy: Use a knowledge graph that combines service catalogs, CRM data, and GitHub webhooks to power RAG, instead of relying solely on unstructured vector embeddings
- Practical takeaway: Ephemeral runbooks should be dynamically generated by analyzing recent PR diffs, Slack discussions, and telemetry to surface relevant dashboards instantly
- Failure mode: Keep low-confidence or hallucinated AI findings out of incident channels; verify assumptions in the background, for example by cloning the codebase, before alerting engineers
- Lesson learned: The bottleneck in modern DevOps is no longer the LLM's reasoning capability, but the engineering effort required to build structured, modular data pipelines
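
The knowledge-graph and verification ideas above can be made concrete with a small sketch. It is not incident.io's implementation: the `KnowledgeGraph` class, the relation names (`owned_by`, `monitored_by`, `changed_by`), and all sample data are hypothetical, chosen only to show how structured lookups plus a verification gate could assemble a one-time runbook.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Node:
    kind: str  # hypothetical kinds: "service", "team", "dashboard", "pull_request"
    name: str


@dataclass
class KnowledgeGraph:
    """Toy in-memory graph; a real system would sync this from a catalog and webhooks."""
    nodes: Dict[str, Node] = field(default_factory=dict)
    edges: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    def add_node(self, node_id: str, kind: str, name: str) -> None:
        self.nodes[node_id] = Node(kind, name)

    def add_edge(self, src: str, relation: str, dst: str) -> None:
        self.edges.setdefault(src, []).append((relation, dst))

    def neighbours(self, src: str, relation: str) -> List[Node]:
        # Follow only edges with the requested relation, in insertion order.
        return [self.nodes[dst] for rel, dst in self.edges.get(src, []) if rel == relation]


def build_runbook(graph: KnowledgeGraph, service_id: str, verified_prs: Set[str]) -> List[str]:
    """Assemble a one-time runbook for the affected service.

    Changes are listed only if a background check has already confirmed them
    (represented here by membership in `verified_prs`), so unverified guesses
    never reach the incident channel.
    """
    lines = [f"Runbook for {graph.nodes[service_id].name}"]
    for team in graph.neighbours(service_id, "owned_by"):
        lines.append(f"Owner: {team.name}")
    for dashboard in graph.neighbours(service_id, "monitored_by"):
        lines.append(f"Dashboard: {dashboard.name}")
    for pr in graph.neighbours(service_id, "changed_by"):
        if pr.name in verified_prs:
            lines.append(f"Suspect change (verified): {pr.name}")
    return lines


# Hypothetical sample data for illustration only.
graph = KnowledgeGraph()
graph.add_node("svc:payments", "service", "payments-api")
graph.add_node("team:core", "team", "Core Payments")
graph.add_node("dash:latency", "dashboard", "Payments latency")
graph.add_node("pr:a", "pull_request", "PR A")
graph.add_node("pr:b", "pull_request", "PR B")
graph.add_edge("svc:payments", "owned_by", "team:core")
graph.add_edge("svc:payments", "monitored_by", "dash:latency")
graph.add_edge("svc:payments", "changed_by", "pr:a")
graph.add_edge("svc:payments", "changed_by", "pr:b")

runbook = build_runbook(graph, "svc:payments", verified_prs={"PR A"})
print("\n".join(runbook))
```

In this sketch the structured edges do the retrieval work that a pure vector search would otherwise have to guess at, and "PR B" is withheld because nothing has verified it yet.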

## Topics

Incident Response, Retrieval-Augmented Generation, Site Reliability Engineering, DevOps Automation, Cloud Infrastructure, Service Catalogs, FinTech Compliance, AI Agents

## Chapters
- 1:00 — The High Stakes of FinTech Incidents: Why regulatory obligations and the cost of downtime in financial services demand much higher incident response discipline than other industries.
- 12:20 — Scaling Infrastructure for Security: The transition from Heroku to GCP and Kubernetes to meet the security and compliance needs of enterprise customers.
- 19:40 — Automating Investigation with AI: An introduction to AI SRE and how the system uses a service catalog to guide automated incident investigations.
- 23:30 — Building RAG with Knowledge Graphs: How to use GitHub integrations, webhooks, and universal adapters to feed relevant context into a RAG-based runbook generator.
- 31:10 — Verifying AI Assumptions: The importance of using background processes to double-check PR changes and code diffs before presenting findings to responders.
- 35:00 — The Future of LLMs in DevOps: Why the focus is shifting from model intelligence to the engineering of structured, modular systems and objective metrics.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.
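
As a rough illustration, an agent could invoke `request_transcript` with Python's standard library as sketched below. The endpoint and method come from the action listed above; sending no request body and no authentication headers is an assumption, and the response format is not documented here, so the sketch returns only the HTTP status code. The function names are hypothetical.

```python
import urllib.request

# Episode resource URL taken from the JSON link on this page.
EPISODE_URL = (
    "https://stenobird.com/v1/public/podcasts/adventures-in-devops/"
    "episodes/solving-incidents-with-one-time-ephemeral-runbooks"
)


def build_transcript_request(episode_url: str) -> urllib.request.Request:
    """Build the POST request without sending it.

    Assumption: the endpoint needs no body and no auth headers.
    """
    return urllib.request.Request(f"{episode_url}/transcription-requests", method="POST")


def request_transcript(episode_url: str, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code.

    urlopen raises urllib.error.HTTPError for 4xx/5xx responses.
    The action is described as idempotent, so retrying should be safe.
    """
    req = build_transcript_request(episode_url)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


# Building the request has no side effects; call request_transcript(EPISODE_URL)
# only when the episode actually needs processing.
request = build_transcript_request(EPISODE_URL)
print(request.get_method(), request.full_url)
```

Separating request construction from sending keeps the network call explicit, which matches the note that nothing is enqueued until the action is invoked.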

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.