{"podcast":{"title":"MLOps.community","slug":"mlops-community","podcast_index_feed_id":28679,"rss_url":"https://anchor.fm/s/174cb1b8/podcast/rss","website_url":"https://mlops.community","image_url":"https://d3t3ozftmdmh3i.cloudfront.net/production/podcast_uploaded_nologo/3809022/3809022-1612190855115-e91f8b881173f.jpg","author":"Demetrios","episode_count":516,"summary":"Relaxed Conversations around getting AI into production, whatever shape that may come in (agentic, traditional ML, LLMs, Vibes, etc)","last_synced_at":null,"page_url":"https://stenobird.com/podcast/mlops-community"},"episode":{"title":"Fixing GPU Starvation in Large-Scale Distributed Training","slug":"fixing-gpu-starvation-in-large-scale-distributed-training","published_at":"2026-04-03T17:00:28+00:00","page_url":"https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training","show_page_url":"https://stenobird.com/podcast/mlops-community","url":"https://podcasters.spotify.com/pod/show/mlops/episodes/Fixing-GPU-Starvation-in-Large-Scale-Distributed-Training-e3hcn48","audio_url":"https://anchor.fm/s/174cb1b8/podcast/play/117905992/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-3-3%2F421342927-44100-2-a2a86203b9d3d.mp3","summary":"Scaling machine learning models is rarely limited by model architecture, but rather by the underlying infrastructure and data I/O bottlenecks. Kashish Mittal details how optimizing data loading and caching layers can drastically increase GPU utilization and reduce training time.","meta_description":"Learn how to solve GPU starvation in large-scale distributed training by optimizing data pipelines, caching layers, and overcoming I/O bottlenecks.","key_points":["Main idea: Infrastructure, specifically data loading and I/O, is the primary hidden constraint when scaling ML workloads","Practical takeaway: Redesigning caching layers to bypass CPU transformation bottlenecks can boost GPU utilization from low levels to over 60%","Failure mode: Increasing parallelism too aggressively can introduce non-determinism and label skewness, degrading model quality","Trade-off: Balancing latency and throughput in serving requires deciding between larger batch sizes and the serialization costs of fetching features","Lesson: Efficient distributed training requires a full-stack profiling approach to identify whether the bottleneck lies in the producer, consumer, or network"],"chapters":[{"start_ms":60000,"title":"The Economics of AI Infrastructure","summary":"A discussion on the rising costs of compute and the decision-making process regarding human engineers versus AI agents."},{"start_ms":295000,"title":"The Hidden Constraint: Data I/O","summary":"Why scaling models is often an infrastructure problem rather than an architectural one, focusing on data bottlenecks."},{"start_ms":520000,"title":"Hardware-Specific Optimization","summary":"Exploring the differences between TPU and GPU environments and how hardware affects optimization strategies."},{"start_ms":750000,"title":"Profiling the Data Pipeline","summary":"A deep dive into the producer-consumer architecture of Petastorm and identifying where data gets clogged."},{"start_ms":1000000,"title":"Implementing Local Caching","summary":"Strategies for using local disk/SSD to cache data and reduce expensive remote file system calls."},{"start_ms":1470000,"title":"The Cost of Batching","summary":"Analyzing the trade-offs between CPU and GPU execution and the overhead of increasing batch sizes."},{"start_ms":1715000,"title":"Risks of High Parallelism","summary":"How extreme parallelism can lead to data order non-determinism and impact model training stability."}],"topics":["MLOps","Distributed Training","GPU Utilization","Data Engineering","Machine Learning Infrastructure","Petastorm","System Profiling","Large-scale Systems"],"duration_seconds":3168,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/mlops-community/episodes/fixing-gpu-starvation-in-large-scale-distributed-training/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/mlops-community/fixing-gpu-starvation-in-large-scale-distributed-training.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}