Episode

Building Systems That Work Even When Everything Breaks with Ben Hartshorne

Podcast: Screaming in the Cloud
Published: Jan 15, 2026
Duration seconds: 2182
Processing state: processed
Canonical source: https://share.transistor.fm/s/5e1542c7
Audio: https://dts.podtrac.com/redirect.mp3/media.transistor.fm/5e1542c7/1e3981f5.mp3
JSON: /v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne
Markdown: /podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md

Summary

Resilience in the cloud requires designing for failure rather than attempting to prevent it. Ben Hartshorne explains how to use observability to survive AWS outages and slash Lambda costs.

Topics

AWS Outages
Cloud Cost Optimization
Observability
System Resilience
AWS Lambda
FinOps
Infrastructure Engineering
Incident Response

Highlights

Main idea: True system resilience comes from local caching and fallback defaults that function even when upstream dependencies are unreachable
Practical takeaway: Use granular instrumentation to track specific cost drivers, like S3 access patterns, to drive significant cloud savings
Failure mode: Centralizing infrastructure in a single region creates a massive blast radius, so one regional outage can take down a disproportionate share of the internet
Practical takeaway: Implement rate limiting and circuit breakers to prevent recovering services from being crushed by a 'thundering herd' of retries
Main idea: High-velocity deployment pipelines are critical for incident response; if a fix takes days to reach production, you cannot resolve bugs quickly enough to limit their impact
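
As a rough illustration of the first highlight, here is a minimal Python sketch of a client that tries the upstream service, falls back to a locally cached value, and finally falls back to a default supplied in code. The `FlagClient` class, its `fetch` callable, and the `ttl_seconds` parameter are hypothetical names invented for this example; this is not LaunchDarkly's SDK or API, only one way the pattern could look.

```python
import time
from typing import Callable, Dict, Tuple


class FlagClient:
    """Hypothetical client: fresh cache, then upstream, then stale cache, then default."""

    def __init__(self, fetch: Callable[[str], bool], ttl_seconds: float = 30.0) -> None:
        self._fetch = fetch  # assumed upstream lookup; may raise when unreachable
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[bool, float]] = {}  # key -> (value, fetched_at)

    def get(self, key: str, default: bool) -> bool:
        cached = self._cache.get(key)
        now = time.monotonic()
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit, no upstream call needed
        try:
            value = self._fetch(key)
        except Exception:
            # Upstream is unreachable: a stale cached value beats the code default,
            # and the code default beats failing the caller.
            return cached[0] if cached is not None else default
        self._cache[key] = (value, now)
        return value
```

The caller always gets an answer: `client.get("new-checkout", default=False)` returns the code default if the flag has never been fetched and the upstream is down.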

Chapters

1:00 Designing for Dependency Failure: How SDKs like LaunchDarkly use local caching and code-based defaults to remain functional during upstream outages.
3:55 The Power of Spreadsheets in FinOps: Why exporting data to CSV and using tools like pandas is often more effective for cost optimization than complex dashboards.
6:35 Balancing Feature Velocity and Cost: Navigating the continuum between investing in new product features and managing cloud infrastructure spend.
9:15 Observability During AWS Outages: The difficulty of determining if a system failure is internal or caused by a major cloud provider outage.
11:55 The Impact of Telemetry Disruptions: How outages can break the very tools (like OpenTelemetry collectors) needed to monitor the incident.
14:40 The Risks of Multi-Region Strategies: Evaluating the trade-offs between the high cost of multi-region redundancy and the risks of regional dependency.
17:25 The Complexity of Third-Party Dependencies: Why testing your own durability isn't enough when your critical path relies on a massive web of external vendors.
20:10 Preventing the Thundering Herd: Lessons from systems engineering on how recovering services can be immediately overwhelmed by queued requests.
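
One common way to soften a thundering herd on the client side is to retry with exponential backoff plus random jitter, so that clients do not all retry at the same instant when a service comes back. The Python sketch below shows that general technique; it is not taken from the episode, and the function names, `base`, `cap`, and `max_attempts` values are assumptions chosen for illustration.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter delay: uniform between 0 and min(cap, base * 2**attempt)."""
    source = rng if rng is not None else random
    return source.uniform(0.0, min(cap, base * (2 ** attempt)))


def call_with_retries(op: Callable[[], T], max_attempts: int = 5,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Call op(), retrying failures with jittered backoff (illustrative values)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return op()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Randomized wait spreads retries out instead of synchronizing them.
            sleep(backoff_delay(attempt))
    raise AssertionError("unreachable")
```

Jitter only spreads load out; server-side rate limiting or a circuit breaker is still needed to cap how much traffic a recovering service accepts.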