Episode
Fixing GPU Starvation in Large-Scale Distributed Training
- Podcast: MLOps.community
- Published: Apr 3, 2026
- Duration seconds: 3168
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
Scaling machine learning models is rarely limited by model architecture; the usual constraint is the underlying infrastructure, especially data I/O bottlenecks. Kashish Mittal details how optimizing data loading and caching layers can drastically increase GPU utilization and reduce training time.
Topics
- MLOps
- Distributed Training
- GPU Utilization
- Data Engineering
- Machine Learning Infrastructure
- Petastorm
- System Profiling
- Large-scale Systems
Highlights
- Main idea: Infrastructure, specifically data loading and I/O, is the primary hidden constraint when scaling ML workloads
- Practical takeaway: Redesigning caching layers to bypass CPU transformation bottlenecks can boost GPU utilization from low levels to over 60%
- Failure mode: Increasing parallelism too aggressively can introduce non-determinism and label skewness, degrading model quality
- Trade-off: Balancing latency and throughput in serving requires weighing larger batch sizes against the serialization costs of fetching features
- Lesson: Efficient distributed training requires a full-stack profiling approach to identify whether the bottleneck lies in the producer, consumer, or network
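The producer/consumer profiling idea in the last highlight can be illustrated with a toy pipeline. The sketch below is not Petastorm code and not the method described in the episode; `profile_pipeline`, its parameters, and the sleep-based stand-ins for data loading and training steps are all hypothetical. It only shows one way to tell which side is waiting: time spent blocked on a full queue points at the consumer, time spent blocked on an empty queue points at the producer. The network stage is left out for brevity.

```python
import queue
import threading
import time


def profile_pipeline(produce_s, consume_s, n_items=20, queue_size=4):
    """Run a toy producer/consumer pipeline and report where time is lost.

    produce_s and consume_s are hypothetical per-item costs in seconds.
    """
    q = queue.Queue(maxsize=queue_size)
    # Each thread writes only its own key, so no lock is needed here.
    waits = {"producer_blocked_s": 0.0, "consumer_starved_s": 0.0}

    def producer():
        for i in range(n_items):
            time.sleep(produce_s)  # stand-in for read + decode + transform
            t0 = time.perf_counter()
            q.put(i)  # blocks while the queue is full
            waits["producer_blocked_s"] += time.perf_counter() - t0
        q.put(None)  # sentinel: no more items

    def consumer():
        while True:
            t0 = time.perf_counter()
            item = q.get()  # blocks while the queue is empty
            waits["consumer_starved_s"] += time.perf_counter() - t0
            if item is None:
                break
            time.sleep(consume_s)  # stand-in for one training step

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    total = time.perf_counter() - start

    starved = waits["consumer_starved_s"]
    blocked = waits["producer_blocked_s"]
    diagnosis = "producer-bound" if starved > blocked else "consumer-bound"
    return {"total_s": total, **waits, "diagnosis": diagnosis}


if __name__ == "__main__":
    print(profile_pipeline(produce_s=0.01, consume_s=0.0))
    print(profile_pipeline(produce_s=0.0, consume_s=0.01))
```

In a real training job the consumer-side wait corresponds to the accelerator sitting idle, which is why a large starved time is the symptom to look for first.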
Chapters
- 1:00 The Economics of AI Infrastructure: A discussion on the rising costs of compute and the decision-making process regarding human engineers versus AI agents.
- 4:55 The Hidden Constraint: Data I/O: Why scaling models is often an infrastructure problem rather than an architectural one, focusing on data bottlenecks.
- 8:40 Hardware-Specific Optimization: Exploring the differences between TPU and GPU environments and how hardware affects optimization strategies.
- 12:30 Profiling the Data Pipeline: A deep dive into the producer-consumer architecture of Petastorm and identifying where data gets clogged.
- 16:40 Implementing Local Caching: Strategies for using local disk/SSD to cache data and reduce expensive remote file system calls.
- 24:30 The Cost of Batching: Analyzing the trade-offs between CPU and GPU execution and the overhead of increasing batch sizes.
- 28:35 Risks of High Parallelism: How extreme parallelism can lead to data order non-determinism and impact model training stability.
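The local caching chapter describes keeping data on local disk or SSD so that repeated reads avoid the remote file system. A minimal read-through cache along those lines might look like the sketch below. It is an illustration under stated assumptions, not the implementation discussed in the episode and not a Petastorm API: `cached_read`, `fetch_remote`, and `cache_dir` are hypothetical names, and the remote store is represented by any callable that returns bytes.

```python
import hashlib
import os
import tempfile


def cached_read(key, fetch_remote, cache_dir):
    """Return bytes for `key`, trying local disk first and the remote store on a miss.

    fetch_remote is a hypothetical callable: key (str) -> bytes.
    """
    os.makedirs(cache_dir, exist_ok=True)
    # Hash the key so arbitrary remote paths map to safe local file names.
    name = hashlib.sha256(key.encode("utf-8")).hexdigest()
    path = os.path.join(cache_dir, name)

    if os.path.exists(path):
        with open(path, "rb") as f:  # cache hit: no remote call
            return f.read()

    data = fetch_remote(key)  # cache miss: pay the remote cost once

    # Write to a temp file in the same directory, then rename, so a
    # concurrent reader never sees a half-written cache entry.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp_path, path)
    return data


if __name__ == "__main__":
    remote_calls = []

    def fake_remote(key):
        remote_calls.append(key)
        return b"payload-for-" + key.encode("utf-8")

    cache = tempfile.mkdtemp()
    cached_read("shard-0", fake_remote, cache)
    cached_read("shard-0", fake_remote, cache)
    print(len(remote_calls))  # the second read is served from local disk
```

The sketch has no eviction or size limit; a production cache would need both, plus a policy for when remote data changes.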