Episode

Scaling Graph Analytics Without ETL: Inside PuppyGraph’s Architecture

Podcast: Data Engineering Podcast
Published: Jun 1, 2026
Duration seconds: 3260
Canonical source: https://www.dataengineeringpodcast.com/puppygraph-zero-etl-graph-query-engine-episode-510
Audio: https://op3.dev/e/dts.podtrac.com/redirect.mp3/serve.podhome.fm/episode/f6ff0caa-931b-4c08-bfdd-08dc7f5cd336/6391586951131897518fbec695-0bb3-4a0e-8ebd-eec46b1f65cfv1.mp3
JSON: /v1/public/podcasts/data-engineering-podcast/episodes/scaling-graph-analytics-without-etl-inside-puppygraph-s-architecture
Markdown: /podcast/data-engineering-podcast/scaling-graph-analytics-without-etl-inside-puppygraph-s-architecture.md

Summary

In this episode Weimo Liu, co-founder of PuppyGraph, talks about the engineering behind their “zero-copy” graph querying engine for lakehouse and database sources. He explores how PuppyGraph lets you run Cypher and Gremlin traversals and graph algorithms directly on data in Iceberg, Delta, Hudi, Hive, and even MongoDB, without loading into a separate graph store. Weimo explains their edge-sharded, vectorized, MPP architecture that tackles hub nodes, multi-hop traversals, and shuffle at scale, targeting sub-second to single-digit-second workloads. He digs into practical graph data modeling on top of normalized and denormalized tables, logical views, and flexible mappings; strategies for caching, adaptive reads, and leveraging Iceberg metadata; and how PuppyGraph’s operator-based engine unifies query and algorithms. He also covers real-world applications, from cybersecurity log analysis to entity resolution and agentic workflows; when to choose embedded or transactional graph databases instead; and what’s next for enterprise features and broader warehouse integrations.

Announcements

Hello and welcome to the Data Engineering Podcast, the show about modern data management.

This episode is sponsored by DataDriven.io, the free data engineering interview prep platform built by data engineers for data engineers. Ever walked into a data engineering interview and gotten a question that has nothing to do with real data engineering work? Interviewing is its own skill, separate from the job. Watch your code execute live, inspect Spark internals, and whiteboard your data models and pipelines and defend your decisions. Unlike SQL-only or Python-only practice, DataDriven.io covers the full interview loop: star schemas, slowly changing dimensions, grain and fact table design, idemp…