Episode

Building Systems That Work Even When Everything Breaks with Ben Hartshorne

Podcast
Screaming in the Cloud
Published
Jan 15, 2026
Duration seconds
2182
Processing state
processed
Canonical source
https://share.transistor.fm/s/5e1542c7
Audio
https://dts.podtrac.com/redirect.mp3/media.transistor.fm/5e1542c7/1e3981f5.mp3
JSON
/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne
Markdown
/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

Resilience in the cloud requires designing for failure rather than attempting to prevent it. Ben Hartshorne explains how to use observability to survive AWS outages and slash Lambda costs.

Topics

  • AWS Outages
  • Cloud Cost Optimization
  • Observability
  • System Resilience
  • AWS Lambda
  • FinOps
  • Infrastructure Engineering
  • Incident Response

Highlights

  • Main idea: True system resilience comes from local caching and fallback defaults that function even when upstream dependencies are unreachable
  • Practical takeaway: Use granular instrumentation to track specific cost drivers, like S3 access patterns, to drive significant cloud savings
  • Failure mode: Centralizing infrastructure in a single region creates a massive blast radius, so one regional outage can take down a disproportionate share of the global internet
  • Practical takeaway: Implement rate limiting and circuit breakers to prevent recovering services from being crushed by a 'thundering herd' of retries
  • Main idea: High-velocity deployment pipelines are critical for incident response; if a fix takes days to reach production, you cannot effectively resolve bugs
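
The first highlight, local caching plus fallback defaults, can be sketched roughly as follows. This is a minimal illustration and not LaunchDarkly's SDK or anything shown in the episode: `FlagClient`, its `fetch` callable, and the `ttl_seconds` parameter are hypothetical names chosen for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class FlagClient:
    """Hypothetical client that keeps answering when its upstream is unreachable."""

    def __init__(self, fetch: Callable[[str], Any], ttl_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch          # assumed: raises an exception when upstream is down
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[str, Tuple[Any, float]] = {}  # key -> (value, fetched_at)

    def get(self, key: str, default: Any) -> Any:
        cached = self._cache.get(key)
        now = self._clock()
        if cached is not None and now - cached[1] < self._ttl:
            return cached[0]  # fresh cache hit, no upstream call
        try:
            value = self._fetch(key)
        except Exception:
            # Upstream unreachable: prefer a stale cached value, else the in-code default.
            return cached[0] if cached is not None else default
        self._cache[key] = (value, now)
        return value
```

A caller always passes a default in code, for example `client.get("new-checkout", False)`, so the call returns something sensible even if the flag service has never been reachable from this process.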

Chapters

  1. 1:00 Designing for Dependency Failure: How SDKs like LaunchDarkly use local caching and code-based defaults to remain functional during upstream outages.
  2. 3:55 The Power of Spreadsheets in FinOps: Why exporting data to CSV and using tools like Pandas is often more effective for cost optimization than complex dashboards.
  3. 6:35 Balancing Feature Velocity and Cost: Navigating the continuum between investing in new product features and managing cloud infrastructure spend.
  4. 9:15 Observability During AWS Outages: The difficulty of determining if a system failure is internal or caused by a major cloud provider outage.
  5. 11:55 The Impact of Telemetry Disruptions: How outages can break the very tools (like OpenTelemetry collectors) needed to monitor the incident.
  6. 14:40 The Risks of Multi-Region Strategies: Evaluating the trade-offs between the high cost of multi-region redundancy and the risks of regional dependency.
  7. 17:25 The Complexity of Third-Party Dependencies: Why testing your own durability isn't enough when your critical path relies on a massive web of external vendors.
  8. 20:10 Preventing the Thundering Herd: Lessons from systems engineering on how recovering services can be immediately overwhelmed by queued requests.
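
The circuit breaker named in the highlights, one of the guards against the thundering herd described in chapter 8, might look roughly like the sketch below. It is a single-threaded illustration under stated assumptions, not code from the episode: `CircuitBreaker`, `CircuitOpenError`, `failure_threshold`, and `reset_timeout` are hypothetical names.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class CircuitOpenError(RuntimeError):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    """Hypothetical breaker: fail fast while a struggling dependency recovers."""

    def __init__(self, failure_threshold: int = 3, reset_timeout: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._threshold = failure_threshold
        self._reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None  # None means the circuit is closed

    def call(self, op: Callable[[], T]) -> T:
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_timeout:
                raise CircuitOpenError("dependency still cooling down")
            # Timeout elapsed: fall through and let this call probe the dependency.
        try:
            result = op()
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._threshold:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate `CircuitOpenError` instead of adding load, which gives the recovering service room to come back. A production version would also need locking so that only one probe call goes through after the timeout.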