Episode

The Hidden Challenges of Running AI at Scale in Production

Podcast
The Data Exchange with Ben Lorica
Published
Mar 12, 2026
Duration seconds
1941
Processing state
processed
Canonical source
https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3
Audio
https://dts.podtrac.com/redirect.mp3/www.buzzsprout.com/682433/episodes/18789806-the-hidden-challenges-of-running-ai-at-scale-in-production.mp3
JSON
/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production
Markdown
/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/the-data-exchange-with-ben-lorica/episodes/the-hidden-challenges-of-running-ai-at-scale-in-production/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/the-data-exchange-with-ben-lorica/the-hidden-challenges-of-running-ai-at-scale-in-production.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

Moving AI from pilot to production requires a fundamental shift from experimenting with models to operating complex, multi-node infrastructure. Chen Goldberg explains why optimizing 'goodput' and observability is critical to scaling AI workloads effectively.

Topics

  • AI Infrastructure
  • GPU Computing
  • Cloud Engineering
  • Machine Learning Operations
  • Distributed Systems
  • Kubernetes
  • Data Observability
  • CoreWeave

Highlights

  • Main idea: Scaling AI requires moving beyond single-node thinking to managing complex multi-node orchestration and networking
  • Practical takeaway: Focus on 'goodput' (the time GPUs actually spend performing useful work) by optimizing data throughput and caching
  • Failure mode: Relying on bad benchmarks or high-level abstractions without visibility into the underlying hardware bottlenecks
  • Main idea: The transition to AI-first clouds is driven by the need for specialized hardware orchestration that general-purpose clouds lack
  • Practical takeaway: Use AI-driven observability to unify telemetry across storage, network, and workloads to accelerate troubleshooting
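
As a rough illustration of the 'goodput' takeaway above, the sketch below treats goodput as the share of GPU wall-clock time spent on useful work. The record shape (`GpuInterval`), the state names, and the sample numbers are assumptions made for this example; they are not taken from the episode or from any vendor's telemetry format.

```python
from dataclasses import dataclass


@dataclass
class GpuInterval:
    """One slice of wall-clock time on a GPU (hypothetical record shape)."""
    seconds: float
    state: str  # assumed labels: "compute", "data_wait", "checkpoint", "idle"


def goodput(intervals, useful_states=("compute",)):
    """Return the fraction of total time spent in states counted as useful work."""
    total = sum(i.seconds for i in intervals)
    if total == 0:
        return 0.0
    useful = sum(i.seconds for i in intervals if i.state in useful_states)
    return useful / total


def time_by_state(intervals):
    """Sum seconds per state, to show where the non-useful time goes."""
    totals = {}
    for i in intervals:
        totals[i.state] = totals.get(i.state, 0.0) + i.seconds
    return totals


# Illustrative numbers only, not measurements from the episode.
sample = [
    GpuInterval(600.0, "compute"),
    GpuInterval(250.0, "data_wait"),
    GpuInterval(100.0, "checkpoint"),
    GpuInterval(50.0, "idle"),
]

print(goodput(sample))
print(time_by_state(sample))
```

In this made-up sample, a quarter of the time is spent waiting on data, which is the kind of gap that better data throughput or caching would aim to close.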

Chapters

  1. 1:00 The Reality of AI Production: Debunking the myth that AI is stuck in the pilot phase and discussing the shift toward real-world production use cases.
  2. 3:30 Choosing an AI-First Cloud: When enterprises should move away from established general-purpose cloud providers toward specialized AI infrastructure.
  3. 8:20 Optimizing GPU Goodput: How to maximize compute efficiency by addressing bottlenecks in data volume, throughput, and caching mechanisms.
  4. 10:40 The Complexity of Multi-Node Systems: The engineering challenges introduced by moving from single-node tasks to highly available, distributed AI orchestration.
  5. 15:20 Unified Observability and Mission Control: Using integrated telemetry to gain transparency into the entire stack, from storage to workload performance.
  6. 27:30 Navigating Technical Debt and Career Growth: Advice for engineers on using new AI tools to augment their expertise rather than to replace the need for deep domain knowledge.