{"podcast":{"title":"Latent Space: The AI Engineer Podcast","slug":"latent-space-ai-engineer","podcast_index_feed_id":6058902,"rss_url":"https://api.substack.com/feed/podcast/1084089.rss","website_url":"https://www.latent.space/podcast","image_url":"https://substackcdn.com/feed/podcast/1084089/ca7468da5614a246d2906ee8926f6de7.jpg","author":"Latent.Space","episode_count":204,"summary":"The AI Engineer newsletter + Top technical AI podcast. How leading labs build Agents, Models, Infra, & AI for Science. See https://latent.space/about for highlights from Greg Brockman, Andrej Karpathy, George Hotz, Simon Willison, Soumith Chintala et al!","last_synced_at":null,"page_url":"https://stenobird.com/podcast/latent-space-ai-engineer"},"episode":{"title":"Everything you need to run Mission Critical Inference (ft. DeepSeek v3 + SGLang)","slug":"everything-you-need-to-run-mission-critical-inference-ft-deepseek-v3-sglang","published_at":"2025-01-19T04:00:15+00:00","page_url":"https://stenobird.com/podcast/latent-space-ai-engineer/everything-you-need-to-run-mission-critical-inference-ft-deepseek-v3-sglang","show_page_url":"https://stenobird.com/podcast/latent-space-ai-engineer","url":"https://www.latent.space/p/baseten","audio_url":"https://api.substack.com/feed/podcast/155135149/573a1b749e9ab7ea811cb6daf30c53e4.mp3","summary":"Sponsorships and applications for the AI Engineer Summit in NYC are live! (Speaker CFPs have closed) If you are building AI agents or leading teams of AI Engineers, this will be the single highest-signal conference of the year for you. Right after Christmas, the Chinese Whale Bros ended 2024 by dropping the last big model launch of the year: DeepSeek v3. Right now on LM Arena, DeepSeek v3 has a score of 1319, right under the full o1 model, Gemini 2, and 4o latest. This makes it the best open weights model in the world in January 2025.
There has been a big recent trend of Chinese labs releasing very large open weights models, with Tencent releasing Hunyuan-Large in November and Hailuo releasing MiniMax-Text this week, both over 400B in size. However, these extra-large language models are very difficult to serve. Baseten was the first of the inference neocloud startups to get DeepSeek V3 online, because of their H200 clusters, their close collaboration with the DeepSeek team, and their early support of SGLang, a relatively new vLLM alternative that is also used at frontier labs like X.ai. Each H200 has 141 GB of VRAM with 4.8 TB per second of bandwidth, meaning that a node of 8 H200s can run inference on DeepSeek v3 in FP8 while still taking KV cache needs into account. We have been close to Baseten since Sarah Guo introduced Amir Haghighat to swyx, and they supported the very first Latent Space Demo Day in San Francisco, which was effectively the trial run for swyx and Alessio to work together! Since then, Philip Kiely also led a well-attended workshop on TensorRT LLM at the 2024 World's Fair. We worked with him to get two of their best representatives, Amir and Lead Model Performance Engineer Yineng Zhang, to discuss DeepSeek, SGLang, and everything they have learned running M…","meta_description":"Sponsorships and applications for the AI Engineer Summit in NYC are live!
(Speaker CFPs have closed) If you are building AI agents or leading teams of A…","key_points":[],"chapters":[],"topics":[],"duration_seconds":3604,"processing_state":"processed","actions":[{"name":"request_transcript","method":"POST","url":"https://stenobird.com/v1/public/podcasts/latent-space-ai-engineer/episodes/everything-you-need-to-run-mission-critical-inference-ft-deepseek-v3-sglang/transcription-requests","description":"Idempotently request low-priority transcript generation for this episode."},{"name":"read_markdown","method":"GET","url":"https://stenobird.com/podcast/latent-space-ai-engineer/everything-you-need-to-run-mission-critical-inference-ft-deepseek-v3-sglang.md","description":"Read the agent-friendly Markdown representation of this episode resource."}]}}