Episode

Solving incidents with one-time ephemeral runbooks

Podcast
Adventures in DevOps
Published
Oct 20, 2025
Duration seconds
2999
Processing state
processed
Canonical source
https://adventuresindevops.com/episodes/2025/10/20/solving-incidents-with-one-time-ephemeral-runbooks
Audio
https://dts.podtrac.com/redirect.mp3/api.spreaker.com/download/episode/68206117/download.mp3
JSON
/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks
Markdown
/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/solving-incidents-with-one-time-ephemeral-runbooks/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/adventures-in-devops/solving-incidents-with-one-time-ephemeral-runbooks.md
    Read the agent-friendly Markdown representation of this episode resource.
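
The two actions above can be sketched in Python with only the standard library. This is a minimal sketch under stated assumptions: the page does not document authentication, request headers, a request body, or the response format, so the empty POST body and the helper names (transcription_request, markdown_request, send) are illustrative only. The URLs are the ones listed above.

```python
from urllib.request import Request, urlopen

# URLs copied from the Actions list on this page.
TRANSCRIPTION_URL = (
    "https://stenobird.com/v1/public/podcasts/adventures-in-devops/episodes/"
    "solving-incidents-with-one-time-ephemeral-runbooks/transcription-requests"
)
MARKDOWN_URL = (
    "https://stenobird.com/podcast/adventures-in-devops/"
    "solving-incidents-with-one-time-ephemeral-runbooks.md"
)


def transcription_request() -> Request:
    """Build the POST that asks for low-priority transcript generation.

    Assumption: an empty body is acceptable; the page does not specify one.
    """
    return Request(TRANSCRIPTION_URL, data=b"", method="POST")


def markdown_request() -> Request:
    """Build the GET for the Markdown representation of the episode."""
    return Request(MARKDOWN_URL, method="GET")


def send(req: Request, timeout: float = 10.0):
    """Send a prepared request and return (status code, decoded body)."""
    with urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Because the POST is described as idempotent, calling send(transcription_request()) more than once should be safe to retry. Building the request separately from sending it keeps the sketch easy to inspect without touching the network.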

Summary

Incident.io's Lawrence Jones explains how to move beyond static documentation by using AI-driven RAG to generate ephemeral, context-aware runbooks during outages. The discussion covers the transition from Heroku to GCP for high-compliance environments and the technical hurdles of automating incident investigation.

Topics

  • Incident Response
  • Retrieval-Augmented Generation
  • Site Reliability Engineering
  • DevOps Automation
  • Cloud Infrastructure
  • Service Catalogs
  • FinTech Compliance
  • AI Agents

Highlights

  • Main idea: Effective incident response in regulated industries like FinTech requires high-rigor processes and automated, verifiable documentation
  • Technical strategy: Use a knowledge graph that combines service catalogs, CRM data, and GitHub webhooks to power RAG, instead of relying solely on unstructured vector embeddings
  • Practical takeaway: Ephemeral runbooks should be dynamically generated by analyzing recent PR diffs, Slack discussions, and telemetry to surface relevant dashboards instantly
  • Failure mode: Low-confidence AI hallucinations posted into incident channels; mitigate with background verification, such as cloning the codebase to validate assumptions, before alerting engineers
  • Lesson learned: The bottleneck in modern DevOps is no longer the LLM's reasoning capability, but the engineering effort required to build structured, modular data pipelines
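
The knowledge-graph and verification ideas in the highlights can be illustrated with a small Python sketch. It is not the system discussed in the episode: the Service and PullRequest types, the field names, and the path-prefix check are all hypothetical stand-ins. The sketch walks service-catalog dependency edges from the alerting service, then keeps only the pull requests whose changed files actually fall under a service's code paths before listing them in a generated runbook.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PullRequest:
    # Hypothetical shape of a PR record, e.g. populated from webhooks.
    number: int
    title: str
    changed_files: List[str]


@dataclass
class Service:
    # Hypothetical service-catalog entry; depends_on holds graph edges.
    name: str
    owner: str
    dashboards: List[str]
    code_paths: List[str]
    depends_on: List[str] = field(default_factory=list)


def related_services(catalog: Dict[str, Service], name: str, depth: int = 1) -> List[str]:
    """Walk dependency edges outward from the alerting service."""
    seen = [name]
    frontier = [name]
    for _ in range(depth):
        next_frontier = []
        for current in frontier:
            for dep in catalog[current].depends_on:
                if dep in catalog and dep not in seen:
                    seen.append(dep)
                    next_frontier.append(dep)
        frontier = next_frontier
    return seen


def verified_prs(service: Service, prs: List[PullRequest]) -> List[PullRequest]:
    """Keep only PRs whose diff touches one of the service's code paths."""
    return [
        pr for pr in prs
        if any(path.startswith(prefix)
               for path in pr.changed_files
               for prefix in service.code_paths)
    ]


def build_runbook(catalog: Dict[str, Service], prs: List[PullRequest], alerting: str) -> str:
    """Assemble a one-off, plain-text runbook for the alerting service."""
    lines = [f"Ephemeral runbook for {alerting}"]
    for name in related_services(catalog, alerting):
        svc = catalog[name]
        lines.append(f"- {svc.name} (owner: {svc.owner})")
        for dashboard in svc.dashboards:
            lines.append(f"  dashboard: {dashboard}")
        for pr in verified_prs(svc, prs):
            lines.append(f"  recent change: #{pr.number} {pr.title}")
    return "\n".join(lines)
```

The structured lookup stands in for the retrieval step: context is selected by following explicit edges rather than by embedding similarity alone, and the path check is a deliberately simple placeholder for the heavier background verification the highlights describe.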

Chapters

  1. 1:00 The High Stakes of FinTech Incidents: Why regulatory obligations and the cost of downtime in financial services demand far stricter incident response discipline than in other industries.
  2. 12:20 Scaling Infrastructure for Security: The transition from Heroku to GCP and Kubernetes to meet the security and compliance needs of enterprise customers.
  3. 19:40 Automating Investigation with AI: An introduction to AI SRE and how the system uses a service catalog to guide automated incident investigations.
  4. 23:30 Building RAG with Knowledge Graphs: How to use GitHub integrations, webhooks, and universal adapters to feed relevant context into a RAG-based runbook generator.
  5. 31:10 Verifying AI Assumptions: The importance of using background processes to double-check PR changes and code diffs before presenting findings to responders.
  6. 35:00 The Future of LLMs in DevOps: Why the focus is shifting from model intelligence to the engineering of structured, modular systems and objective metrics.