Episode

Fixing GPU Starvation in Large-Scale Distributed Training

Podcast
MLOps.community
Published
Apr 3, 2026
Duration seconds
3168
Processing state
processed
Canonical source
https://podcasters.spotify.com/pod/show/mlops/episodes/Fixing-GPU-Starvation-in-Large-Scale-Distributed-Training-e3hcn48
Audio
https://anchor.fm/s/174cb1b8/podcast/play/117905992/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-3%2F421342927-44100-2-a2a86203b9d3d.mp3
JSON
/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training
Markdown
/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md

Actions

  • POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training/transcription-requests
    Idempotently request low-priority transcript generation for this episode.
  • GET https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md
    Read the agent-friendly Markdown representation of this episode resource.

Summary

Scaling machine learning models is limited less often by model architecture than by the underlying infrastructure and data I/O bottlenecks. Kashish Mittal details how optimizing data loading and caching layers can drastically increase GPU utilization and reduce training time.

Topics

  • MLOps
  • Distributed Training
  • GPU Utilization
  • Data Engineering
  • Machine Learning Infrastructure
  • Petastorm
  • System Profiling
  • Large-scale Systems

Highlights

  • Main idea: Infrastructure, specifically data loading and I/O, is the primary hidden constraint when scaling ML workloads
  • Practical takeaway: Redesigning caching layers to bypass CPU transformation bottlenecks can boost GPU utilization from low levels to over 60%
  • Failure mode: Increasing parallelism too aggressively can make data order non-deterministic and skew the label distribution, degrading model quality
  • Trade-off: Balancing latency and throughput in serving means weighing the throughput gains of larger batch sizes against the serialization cost of fetching features
  • Lesson: Efficient distributed training requires a full-stack profiling approach to identify whether the bottleneck lies in the producer, consumer, or network
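
The profiling lesson above can be illustrated with a minimal sketch that splits each training step into time spent waiting for the next batch and time spent computing on it. This is not the guest's tooling: the names measure_wait_fraction and classify_bottleneck, the 0.5 threshold, and the injectable clock are all assumptions made for illustration. The sketch only separates input-bound from compute-bound loops. It does not separate network time from producer time, and on a real GPU the training step would need to block until the device finishes for the compute timing to mean anything.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_wait_fraction(
    batches: Iterable,
    train_step: Callable,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[float, float]:
    """Return (seconds spent waiting for data, seconds spent in train_step)."""
    wait_s = 0.0
    compute_s = 0.0
    it = iter(batches)
    while True:
        t0 = clock()
        try:
            batch = next(it)  # blocks while the data producer is behind
        except StopIteration:
            break
        t1 = clock()
        train_step(batch)  # assumed to block until the step has finished
        t2 = clock()
        wait_s += t1 - t0
        compute_s += t2 - t1
    return wait_s, compute_s


def classify_bottleneck(wait_s: float, compute_s: float, threshold: float = 0.5) -> str:
    """Label the loop by where most time goes (threshold is an arbitrary choice)."""
    total = wait_s + compute_s
    if total == 0:
        return "unknown"
    return "producer" if wait_s / total > threshold else "consumer"
```

If most of the wall-clock time lands in the wait bucket, the data loader (or the storage and network behind it) is the place to look first. If it lands in the compute bucket, speeding up the loader is unlikely to help.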

Chapters

  1. 1:00 The Economics of AI Infrastructure: A discussion on the rising costs of compute and the decision-making process regarding human engineers versus AI agents.
  2. 4:55 The Hidden Constraint: Data I/O: Why scaling models is often an infrastructure problem rather than an architectural one, focusing on data bottlenecks.
  3. 8:40 Hardware-Specific Optimization: Exploring the differences between TPU and GPU environments and how hardware affects optimization strategies.
  4. 12:30 Profiling the Data Pipeline: A deep dive into the producer-consumer architecture of Petastorm and identifying where data gets clogged.
  5. 16:40 Implementing Local Caching: Strategies for using local disk/SSD to cache data and reduce expensive remote file system calls.
  6. 24:30 The Cost of Batching: Analyzing the trade-offs between CPU and GPU execution and the overhead of increasing batch sizes.
  7. 28:35 Risks of High Parallelism: How extreme parallelism can lead to data order non-determinism and impact model training stability.
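
The local caching idea from chapter 5 can be sketched as a read-through cache on local disk: the first read of a remote file pays the remote call, and later reads are served from the local copy. This is an illustrative sketch only, not the design discussed in the episode. The cached_read function, its fetch callback, and the hashing of the path into a file name are hypothetical, and the sketch caches raw bytes rather than already-transformed data.

```python
import hashlib
import os
import tempfile
from typing import Callable


def cached_read(remote_path: str, cache_dir: str, fetch: Callable[[str], bytes]) -> bytes:
    """Return the bytes for remote_path, calling fetch only on a cache miss."""
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the remote path so any URI maps to a safe local file name.
    key = hashlib.sha256(remote_path.encode("utf-8")).hexdigest()
    local_path = os.path.join(cache_dir, key)

    if os.path.exists(local_path):
        with open(local_path, "rb") as f:
            return f.read()

    data = fetch(remote_path)  # the expensive remote file system call
    # Write to a temporary file, then rename, so a concurrent reader
    # never observes a partially written cache entry.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, local_path)
    return data
```

A cache like this has no eviction or invalidation, so it assumes the remote files are immutable for the duration of training and that the local disk can hold the working set.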