Episode

Telemetry & Observability for Elixir Apps at Cars.com with Zack Kayser & Ethan Gunderson

Podcast
Elixir Wizards
Published
Dec 12, 2024
Duration seconds
2559
Processing state
processed
Canonical source
https://smartlogic.fireside.fm/s13-e09-observability-telemetry-elixir-cars-commerce
Audio
https://aphid.fireside.fm/d/1437767933/03a50f66-dc5e-4da4-ab6e-31895b6d4c9e/0fd8471e-c80e-4683-8410-e06ece191a31.mp3
JSON
/v1/public/podcasts/elixir-wizards/episodes/telemetry-observability-for-elixir-apps-at-cars-com-with-zack-kayser-ethan-gunderson
Markdown
/podcast/elixir-wizards/telemetry-observability-for-elixir-apps-at-cars-com-with-zack-kayser-ethan-gunderson.md

Summary

Learn how to implement effective observability in high-traffic Elixir environments using Telemetry and OpenTelemetry. Engineers from Cars.com share practical strategies for managing large-scale system visibility and avoiding deployment-driven traffic spikes.

Topics

  • Elixir
  • Telemetry
  • OpenTelemetry
  • Observability
  • Phoenix LiveView
  • Distributed Tracing
  • Microservices
  • System Monitoring

Highlights

  • Main idea: Observability should enable developers to ask unplanned questions of a system to diagnose incidents and prevent recurrence
  • Practical takeaway: Use OpenTelemetry instrumentation libraries to easily add vendor-agnostic tracing and spans to your Elixir applications
  • Failure mode: Relying on Phoenix LiveView's default auto-recovery during deployments can trigger massive, redundant downstream database or search engine queries
  • Practical takeaway: Leverage the Elixir Telemetry ecosystem to hook into events from libraries like Oban without needing to modify their internal source code
  • Trade-off: High-resolution data collection must be balanced against the storage costs and performance overhead of high-volume telemetry spans

Chapters

  1. 1:00 Introduction to Cars.com Scale: The guests discuss their experience transitioning from small-scale Elixir apps to managing high-throughput production environments at Cars.com.
  2. 4:15 The High-Stakes Switch: A look at the technical pressure and challenges of migrating traffic from legacy stacks to new Elixir-based infrastructure.
  3. 7:20 The Value of Contextual Tracing: Why simple log lines are insufficient for triaging incidents and how tracing allows you to follow a specific user's journey through downstream services.
  4. 10:35 Defining Observability Goals: Moving beyond simple incident diagnosis to using telemetry for proactive system understanding.
  5. 13:50 Managing Data Volume and Sampling: The challenges of handling massive amounts of telemetry data and the necessity of sampling strategies to manage costs.
  6. 16:50 LiveView and WebSocket Challenges: How Phoenix LiveView socket reconnections during deployments can create significant downstream load on services like Elasticsearch.
  7. 23:30 Scaling Instrumentation: Strategies for instrumenting large-scale applications and the importance of using standardized libraries like OpenTelemetry.
  8. 39:20 The Future of Elixir Telemetry: How the growing ecosystem of Telemetry-enabled libraries reduces the burden of building custom observability tools.