# Building Systems That Work Even When Everything Breaks with Ben Hartshorne

Page: https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne
Text version: https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md
Podcast: [Screaming in the Cloud](https://stenobird.com/podcast/screaming-in-the-cloud)
Published: 2026-01-15T11:00:00+00:00
Episode link: https://share.transistor.fm/s/5e1542c7
Audio file: https://dts.podtrac.com/redirect.mp3/media.transistor.fm/5e1542c7/1e3981f5.mp3
Processing state: processed
JSON: https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne
Duration seconds: 2182

## Resource

Resilience in the cloud requires designing for failure rather than attempting to prevent it. Ben Hartshorne explains how to use observability to survive AWS outages and slash Lambda costs.

## Highlights
- Main idea: True system resilience comes from local caching and fallback defaults that function even when upstream dependencies are unreachable
- Practical takeaway: Use granular instrumentation to track specific cost drivers, like S3 access patterns, to drive significant cloud savings
- Failure mode: Centralizing infrastructure in a single region creates a massive blast radius; an outage in that one region can take down a disproportionate share of the internet
- Practical takeaway: Implement rate limiting and circuit breakers to prevent recovering services from being crushed by a 'thundering herd' of retries
- Main idea: High-velocity deployment pipelines are critical for incident response; if a fix takes days to reach production, you cannot resolve an incident effectively
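
The first highlight (local caching plus fallback defaults) can be illustrated with a minimal sketch. This is not LaunchDarkly's SDK or anything shown in the episode; `FlagClient`, `variation`, and the `fetch` callable are hypothetical names chosen for illustration. The idea is only that a lookup tries the upstream service first, falls back to the last value it saw, and finally falls back to a default written into the calling code.

```python
from typing import Callable, Dict


class FlagClient:
    """Hypothetical flag client: upstream lookup, then local cache, then code default."""

    def __init__(self, fetch: Callable[[str], bool]) -> None:
        self._fetch = fetch                  # assumed call to the upstream flag service
        self._cache: Dict[str, bool] = {}    # last known good value per flag key

    def variation(self, key: str, default: bool) -> bool:
        try:
            value = self._fetch(key)
        except Exception:
            # Upstream unreachable: prefer the last known value, else the code default.
            return self._cache.get(key, default)
        self._cache[key] = value             # remember the value for future outages
        return value
```

With this shape, a flag that was read at least once keeps its last known value during an outage, and a flag that was never read resolves to the default supplied by the caller, so the application keeps running either way.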

## Topics

AWS Outages, Cloud Cost Optimization, Observability, System Resilience, AWS Lambda, FinOps, Infrastructure Engineering, Incident Response

## Chapters
- 1:00 — Designing for Dependency Failure: How SDKs like LaunchDarkly use local caching and code-based defaults to remain functional during upstream outages.
- 3:55 — The Power of Spreadsheets in FinOps: Why exporting data to CSV and analyzing it with tools like pandas is often more effective for cost optimization than complex dashboards.
- 6:35 — Balancing Feature Velocity and Cost: Navigating the continuum between investing in new product features and managing cloud infrastructure spend.
- 9:15 — Observability During AWS Outages: The difficulty of determining if a system failure is internal or caused by a major cloud provider outage.
- 11:55 — The Impact of Telemetry Disruptions: How outages can break the very tools (like OpenTelemetry collectors) needed to monitor the incident.
- 14:40 — The Risks of Multi-Region Strategies: Evaluating the trade-offs between the high cost of multi-region redundancy and the risks of regional dependency.
- 17:25 — The Complexity of Third-Party Dependencies: Why testing your own durability isn't enough when your critical path relies on a massive web of external vendors.
- 20:10 — Preventing the Thundering Herd: Lessons from systems engineering on how recovering services can be immediately overwhelmed by queued requests.
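
One common client-side mitigation for the thundering herd described in the last chapter is retrying with exponential backoff and random jitter, so that clients do not all retry at the same instant when a service comes back. The sketch below is an assumption about how that might look, not something taken from the episode; `backoff_delay`, `call_with_retries`, and their parameters are hypothetical names.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 10.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform in [0, min(cap, base * 2**attempt)] seconds."""
    source = rng or random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(op: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op, sleeping a jittered, growing delay between failed attempts."""
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise                        # out of attempts: surface the error
            sleep(backoff_delay(attempt))    # spread retries out over time
    raise RuntimeError("max_attempts must be at least 1")
```

The jitter matters as much as the exponential growth: without it, every client that failed at the same moment would also retry at the same moment. Server-side rate limiting and circuit breakers, mentioned in the highlights, complement this but are not shown here.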

## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne/transcription-requests` — Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md` — Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.