{"podcast":{"title":"Agentic AI Podcast","slug":"agentic-ai-podcast","podcast_index_feed_id":7288877,"rss_url":"https://feeds.transistor.fm/agentic-ai-podcast","website_url":"http://www.lowtouch.ai","image_url":"https://img.transistorcdn.com/aeWdXvkVLrVCLe32rK52NOQ_RaVF70zMoXZLjLC2UwI/rs:fill:0:0:1/w:1400/h:1400/q:60/mb:500000/aHR0cHM6Ly9pbWct/dXBsb2FkLXByb2R1/Y3Rpb24udHJhbnNp/c3Rvci5mbS85N2M0/MmIzYmQwY2Q5ZThj/OTUyZDQ3NDkyODky/ZDRjNi5wbmc.jpg","author":"lowtouch.ai","episode_count":69,"summary":"Discover how agentic AI is transforming businesses! Hosted by lowtouch.ai, the Agentic AI Podcast dives into real-world applications, success stories, and expert insights on no-code automation, enterprise AI adoption, and the future of intelligent agents. Perfect for CXOs, innovators, and tech enthusiasts looking to stay ahead in the AI era.","last_synced_at":null,"page_url":"https://stenobird.com/podcast/agentic-ai-podcast"},"episode":{"title":"What is vLLM? | Agentic AI Podcast by lowtouch.ai","slug":"what-is-vllm-agentic-ai-podcast-by-lowtouch-ai","published_at":"2026-02-14T09:00:00+00:00","page_url":"https://stenobird.com/podcast/agentic-ai-podcast/what-is-vllm-agentic-ai-podcast-by-lowtouch-ai","show_page_url":"https://stenobird.com/podcast/agentic-ai-podcast","url":"https://share.transistor.fm/s/a6b84bb0","audio_url":"https://media.transistor.fm/a6b84bb0/011236b7.mp3","summary":"vLLM solves the massive memory fragmentation and latency issues inherent in serving autoregressive LLMs. 
By implementing PagedAttention and continuous batching, it transforms LLM inference from a high-waste process into a high-throughput, cost-effective engine for production agents.","meta_description":"Learn how vLLM uses PagedAttention and continuous batching to reduce memory waste from 60% to under 4% and increase LLM throughput by up to 24x.","key_points":["Main idea: vLLM addresses the 'fragmentation trap' where traditional KV cache allocation wastes massive amounts of GPU memory","Technical breakthrough: PagedAttention functions like virtual memory in an OS, allowing non-contiguous physical memory blocks to appear contiguous to the model","Performance optimization: Continuous batching eliminates 'head-of-line blocking' by scheduling requests token by token rather than waiting for entire batches to finish","Practical takeaway: High-efficiency inference reduces the 'hardware tax,' making private, self-hosted enterprise AI economically viable","Failure mode: Static batching creates a bottleneck where a single long-running request can hold an entire GPU cluster hostage, spiking latency for all users"],"chapters":[{"start_ms":60000,"title":"The Infrastructure Crisis","summary":"An introduction to why scaling agentic AI requires moving beyond model intelligence to focus on the underlying inference infrastructure."},{"start_ms":135000,"title":"The Problem with State","summary":"Explaining why autoregressive LLMs are harder to serve than stateless web traffic due to the heavy mathematical overhead of the KV cache."},{"start_ms":210000,"title":"The Fragmentation Trap","summary":"A deep dive into how traditional memory allocation leads to massive internal fragmentation and wasted GPU real estate."},{"start_ms":350000,"title":"PagedAttention Explained","summary":"How vLLM uses techniques from operating systems to manage memory blocks efficiently, reducing waste from 60% to under 4%."},{"start_ms":505000,"title":"Continuous Batching","summary":"Moving from 'static bus' 
batching to a 'conveyor belt' model to prevent long requests from blocking short ones."},{"start_ms":580000,"title":"Pre-fill vs. Decode","summary":"Analyzing the compute-bound pre-fill phase and the memory-bound decode phase of token generation."},{"start_ms":785000,"title":"The Future of Private AI","summary":"How increased efficiency enables the democratization of AI through cost-effective, self-hosted, and multimodal deployment."}],"topics":["vLLM","PagedAttention","LLM Inference","KV Cache","Continuous Batching","GPU Memory Management","Agentic AI","Machine Learning Infrastructure"],"duration_seconds":1018,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/agentic-ai-podcast/episodes/what-is-vllm-agentic-ai-podcast-by-lowtouch-ai/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/agentic-ai-podcast/what-is-vllm-agentic-ai-podcast-by-lowtouch-ai.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}