{"podcast":{"title":"The Data Exchange with Ben Lorica","slug":"the-data-exchange-with-ben-lorica","podcast_index_feed_id":1196000,"rss_url":"https://rss.buzzsprout.com/682433.rss","website_url":"https://thedataexchange.media/","image_url":"https://storage.buzzsprout.com/ljk0yj7r22pi61grsmelnsoa9084?.jpg","author":"Ben Lorica","episode_count":345,"summary":"A series of informal conversations with thought leaders, researchers, practitioners, and writers on a wide range of topics in technology, science, and of course big data, data science, artificial intelligence, and related applications. Anchored by Ben Lorica (@BigData), the Data Exchange also features a roundup of the most important stories from the worlds of data, machine learning and AI. Detailed show notes for each episode can be found on https://thedataexchange.media/ The Data Exchange podcast is a production of Gradient Flow [https://gradientflow.com/].","last_synced_at":null,"page_url":"https://stenobird.com/podcast/the-data-exchange-with-ben-lorica"},"episode":{"title":"The Hidden Challenges of Running AI at Scale in Production","slug":"the-hidden-challenges-of-running-ai-at-scale-in-production","published_at":"2026-03-12T11:00:00+00:00","page_url":"https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production","show_page_url":"https://stenobird.com/podcast/the-data-exchange-with-ben-lorica","url":"https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3","audio_url":"https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3","summary":"Moving AI from pilot to production requires a fundamental shift from experimentation to managing complex, multi-node infrastructure. Chen Goldberg explains how optimizing 'goodput' and observability is critical for scaling AI workloads effectively.","meta_description":"Learn how to scale AI from pilot to production with Chen Goldberg (CoreWeave) on infrastructure complexity, GPU efficiency, and the future of engineering.","key_points":["Main idea: Scaling AI requires moving beyond single-node thinking to managing complex multi-node orchestration and networking","Practical takeaway: Focus on 'goodput'—the actual time GPUs spend performing useful work—by optimizing data throughput and caching","Failure mode: Relying on bad benchmarks or high-level abstractions without visibility into the underlying hardware bottlenecks","Main idea: The transition to AI-first clouds is driven by the need for specialized hardware orchestration that general-purpose clouds lack","Practical takeaway: Use AI-driven observability to unify telemetry across storage, network, and workloads to accelerate troubleshooting"],"chapters":[{"start_ms":60000,"title":"The Reality of AI Production","summary":"Debunking the myth that AI is stuck in the pilot phase and discussing the shift toward real-world production use cases."},{"start_ms":210000,"title":"Choosing an AI-First Cloud","summary":"When enterprises should move away from established general-purpose cloud providers toward specialized AI infrastructure."},{"start_ms":500000,"title":"Optimizing GPU Goodput","summary":"How to maximize compute efficiency by addressing bottlenecks in data volume, throughput, and caching mechanisms."},{"start_ms":640000,"title":"The Complexity of Multi-Node Systems","summary":"The engineering challenges introduced by moving from single-node tasks to highly available, distributed AI orchestration."},{"start_ms":920000,"title":"Unified Observability and Mission Control","summary":"Using integrated telemetry to gain transparency into the entire stack, from storage to workload performance."},{"start_ms":1650000,"title":"Navigating Technical Debt and Career Growth","summary":"Advice for engineers on leveraging new AI tools to augment expertise rather than replacing the need for deep domain knowledge."}],"topics":["AI Infrastructure","GPU Computing","Cloud Engineering","Machine Learning Operations","Distributed Systems","Kubernetes","Data Observability","CoreWeave"],"duration_seconds":1941,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}