{"podcast":{"title":"Screaming in the Cloud","slug":"screaming-in-the-cloud","podcast_index_feed_id":512714,"rss_url":"https://feeds.transistor.fm/screaming-in-the-cloud","website_url":"https://screaminginthecloud.com","image_url":"https://img.transistorcdn.com/sjY7QBiTinCDr8X80gOsgDaM4fMY0WuZn87UxNTh6Fw/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS9zaG93/LzE0OTQvMTU4Mzg2/OTQ4My1hcnR3b3Jr/LmpwZw.jpg","author":"Corey Quinn","episode_count":673,"summary":"Screaming in the Cloud with Corey Quinn features conversations with domain experts in the world of Cloud Computing. Topics discussed include AWS, GCP, Azure, Oracle Cloud, and the \"why\" behind how businesses are coming to think about the Cloud.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/screaming-in-the-cloud"},"episode":{"title":"Building Systems That Work Even When Everything Breaks with Ben Hartshorne","slug":"building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne","published_at":"2026-01-15T11:00:00+00:00","page_url":"https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne","show_page_url":"https://stenobird.com/podcast/screaming-in-the-cloud","url":"https://share.transistor.fm/s/5e1542c7","audio_url":"https://dts.podtrac.com/redirect.mp3/media.transistor.fm/5e1542c7/1e3981f5.mp3","summary":"Resilience in the cloud requires designing for failure rather than attempting to prevent it. Ben Hartshorne explains how to use observability to survive AWS outages and slash Lambda costs.","meta_description":"Learn how to build resilient systems that survive AWS outages and how Honeycomb cut Lambda costs by 50% using granular instrumentation and spreadsheets.","key_points":["Main idea: True system resilience comes from local caching and fallback defaults that function even when upstream dependencies are unreachable","Practical takeaway: Use granular instrumentation to track specific cost drivers, like S3 access patterns, to drive significant cloud savings","Failure mode: Centralizing infrastructure in a single region creates a massive blast radius that can take down a disproportionate amount of the global internet","Practical takeaway: Implement rate limiting and circuit breakers to prevent recovering services from being crushed by a 'thundering herd' of retries","Main idea: High-velocity deployment pipelines are critical for incident response; if a fix takes days to reach production, you cannot effectively resolve bugs"],"chapters":[{"start_ms":60000,"title":"Designing for Dependency Failure","summary":"How SDKs like LaunchDarkly use local caching and code-based defaults to remain functional during upstream outages."},{"start_ms":235000,"title":"The Power of Spreadsheets in FinOps","summary":"Why exporting data to CSV and using tools like Pandas is often more effective for cost optimization than complex dashboards."},{"start_ms":395000,"title":"Balancing Feature Velocity and Cost","summary":"Navigating the continuum between investing in new product features and managing cloud infrastructure spend."},{"start_ms":555000,"title":"Observability During AWS Outages","summary":"The difficulty of determining if a system failure is internal or caused by a major cloud provider outage."},{"start_ms":715000,"title":"The Impact of Telemetry Disruptions","summary":"How outages can break the very tools (like OpenTelemetry collectors) needed to monitor the incident."},{"start_ms":880000,"title":"The Risks of Multi-Region Strategies","summary":"Evaluating the trade-offs between the high cost of multi-region redundancy and the risks of regional dependency."},{"start_ms":1045000,"title":"The Complexity of Third-Party Dependencies","summary":"Why testing your own durability isn't enough when your critical path relies on a massive web of external vendors."},{"start_ms":1210000,"title":"Preventing the Thundering Herd","summary":"Lessons from systems engineering on how recovering services can be immediately overwhelmed by queued requests."}],"topics":["AWS Outages","Cloud Cost Optimization","Observability","System Resilience","AWS Lambda","FinOps","Infrastructure Engineering","Incident Response"],"duration_seconds":2182,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/screaming-in-the-cloud/episodes/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/screaming-in-the-cloud/building-systems-that-work-even-when-everything-breaks-with-ben-hartshorne.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}