# Building Systems That Work Even When Everything Breaks with Ben Hartshorne

- Page: https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne
- Text version: https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md
- Podcast: [Screaming in the Cloud](https://stenobird.com/podcast/screaming-in-the-cloud)
- Published: 2026-01-15T11:00:00+00:00
- Episode link: https://share.transistor.fm/s/5e1542c7
- Audio file: https://dts.podtrac.com/redirect.mp3/media.transistor.fm/5e1542c7/1e3981f5.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne
- Duration seconds: 2182

## Resource

Resilience in the cloud requires designing for failure rather than attempting to prevent it. Ben Hartshorne explains how to use observability to survive AWS outages and slash Lambda costs.

## Highlights

- Main idea: True system resilience comes from local caching and fallback defaults that function even when upstream dependencies are unreachable
- Practical takeaway: Use granular instrumentation to track specific cost drivers, like S3 access patterns, to drive significant cloud savings
- Failure mode: Centralizing infrastructure in a single region creates a massive blast radius that can take down a disproportionate amount of the global internet
- Practical takeaway: Implement rate limiting and circuit breakers to prevent recovering services from being crushed by a 'thundering herd' of retries
- Main idea: High-velocity deployment pipelines are critical for incident response; if a fix takes days to reach production, you cannot effectively resolve bugs

## Topics

AWS Outages, Cloud Cost Optimization, Observability, System Resilience, AWS Lambda, FinOps, Infrastructure Engineering, Incident Response

## Chapters

- 1:00 - Designing for Dependency Failure: How SDKs like LaunchDarkly use local caching and code-based defaults to remain functional during upstream outages.
- 3:55 - The Power of Spreadsheets in FinOps: Why exporting data to CSV and using tools like Pandas is often more effective for cost optimization than complex dashboards.
- 6:35 - Balancing Feature Velocity and Cost: Navigating the continuum between investing in new product features and managing cloud infrastructure spend.
- 9:15 - Observability During AWS Outages: The difficulty of determining if a system failure is internal or caused by a major cloud provider outage.
- 11:55 - The Impact of Telemetry Disruptions: How outages can break the very tools (like OpenTelemetry collectors) needed to monitor the incident.
- 14:40 - The Risks of Multi-Region Strategies: Evaluating the trade-offs between the high cost of multi-region redundancy and the risks of regional dependency.
- 17:25 - The Complexity of Third-Party Dependencies: Why testing your own durability isn't enough when your critical path relies on a massive web of external vendors.
- 20:10 - Preventing the Thundering Herd: Lessons from systems engineering on how recovering services can be immediately overwhelmed by queued requests.

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne/transcription-requests` - Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md` - Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
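
## Example Sketch

The first highlight (local caching plus code-based defaults that keep working when an upstream dependency is unreachable) can be illustrated with a minimal sketch. This is not taken from the episode and is not the LaunchDarkly SDK's API; `FlagClient`, `fetch`, `ttl_seconds`, and `clock` are hypothetical names chosen for illustration. It assumes the upstream call raises an exception when the dependency is down.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FlagClient:
    """Hypothetical client: last-known-good cache plus caller-supplied defaults."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # assumed to raise when upstream is unreachable
        self._ttl = ttl_seconds      # how long a cached value counts as fresh
        self._clock = clock          # injectable so the behavior can be tested
        self._cache: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str, default: Any) -> Any:
        now = self._clock()
        cached = self._cache.get(key)

        # Fresh cache entry: answer locally without touching upstream.
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]

        try:
            value = self._fetch(key)
        except Exception:
            # Upstream failed: prefer the stale cached value, else the
            # default that the caller hard-coded at the call site.
            return cached[1] if cached is not None else default

        self._cache[key] = (now, value)
        return value
```

The design choice worth noting is the order of fallbacks: a stale cached value is preferred over the code-based default, on the assumption that the last value actually served by upstream is a better guess than a compile-time constant. A real client might also cap how stale a value may be.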