# Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale

- Page: https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale
- Text version: https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md
- Podcast: [MLOps.community](https://stenobird.com/podcast/mlops-community)
- Published: 2026-02-10T18:00:07+00:00
- Episode link: https://podcasters.spotify.com/pod/show/mlops/episodes/Software-Engineering-in-the-Age-of-Coding-Agents-Testing--Evals--and-Shipping-Safely-at-Scale-e3eta9q
- Audio file: https://anchor.fm/s/174cb1b8/podcast/play/115304186/https%3A%2F%2Fd3ctxlq1ktw2nl.cloudfront.net%2Fstaging%2F2026-1-10%2F417834561-44100-2-32c1411bf9507.mp3
- Processing state: processed
- JSON: https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale
- Duration seconds: 3444

## Resource

Engineering agentic systems requires a hybrid approach between traditional software engineering and non-deterministic machine learning. This discussion explores how to manage complexity, evaluate performance, and maintain trust in autonomous AI workflows.

## Highlights

- Main idea: Agentic systems are a hybrid of deterministic software engineering and non-deterministic predictive modeling
- Practical takeaway: Avoid over-engineering multi-agent graphs; use a single agent with well-defined 'skills' written in traditional code whenever possible
- Failure mode: Over-reliance on LLMs for logic that can be handled by pure business logic leads to increased costs and decreased testability
- Practical takeaway: Implement 'LLM as a judge' and integration tests that use real customer data to evaluate agent performance at scale
- Failure mode: Neglecting the UX of observability; users need clear audit trails and 'reasoning' visibility to trust autonomous actions

## Topics

AI Agents, Software Engineering, LLM Evaluation, MLOps, Agentic Workflows, Observability, Cybersecurity AI, System Architecture

## Chapters

- 1:00 - The Value of AI Coding Assistants: An exploration of how tools like Claude Code and Cursor are significantly increasing developer velocity and the ROI of AI-driven coding.
- 5:20 - The Shift to Hybrid Engineering: Discussing the transition from traditional software requirements to managing systems that blend deterministic code with probabilistic prompts.
- 18:10 - Observability and Audit Trails: The necessity of building transparent audit trails so users can understand the reasoning behind an agent's specific actions.
- 31:05 - Language Sensitivity in Prompts: How subtle changes in prompt wording can trigger different reasoning paths and the importance of versioning prompt lineage.
- 39:45 - Evaluating Agentic Workflows: Strategies for implementing two-level evaluations: unit-test style integration tests and large-scale evaluations against real-world data.
- 52:50 - Architectural Minimalism: A critique of complex multi-agent graphs and the argument for keeping architectures simple by managing context effectively within a single agent.

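The "single agent with well-defined skills" takeaway from the Highlights, together with the audit-trail point, can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the episode: the two skills, the `choose_skill` callable that stands in for the model call, and the shape of the returned record. The point it tries to show is that the model only selects a skill, while the business logic stays in plain code that can be unit-tested without any LLM.

```python
from typing import Any, Callable, Dict

# Hypothetical skill type: plain, deterministic code that takes arguments.
Skill = Callable[[Dict[str, Any]], str]


def refund_eligibility(args: Dict[str, Any]) -> str:
    """Pure business logic (illustrative rule): no model call is needed."""
    return "eligible" if args["days_since_purchase"] <= 30 else "not_eligible"


def order_status(args: Dict[str, Any]) -> str:
    """Placeholder skill; a real one would query an order system."""
    return f"order {args['order_id']} is in transit"


# Registry of skills available to the single agent (names are assumptions).
SKILLS: Dict[str, Skill] = {
    "refund_eligibility": refund_eligibility,
    "order_status": order_status,
}


def run_agent(
    user_message: str,
    choose_skill: Callable[[str], Dict[str, Any]],
) -> Dict[str, Any]:
    """Run one agent step and return an audit record.

    `choose_skill` stands in for the non-deterministic LLM call. It is
    assumed to return {"skill": <name>, "args": {...}}.
    """
    decision = choose_skill(user_message)
    name = decision["skill"]
    args = decision.get("args", {})
    record: Dict[str, Any] = {
        "input": user_message,
        "skill": name,
        "args": args,
        "result": None,
        "error": None,
    }
    if name not in SKILLS:
        # Reject anything outside the registry instead of improvising.
        record["error"] = "unknown skill"
        return record
    record["result"] = SKILLS[name](args)
    return record
```

In a test, `choose_skill` can be replaced by a stub that returns a fixed decision, so the deterministic half of the system is checked like ordinary code, and the returned record doubles as a simple audit-trail entry showing which skill ran and with which arguments.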
## Actions

- request_transcript: `POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale/transcription-requests` - Idempotently request low-priority transcript generation for this episode.
- read_markdown: `GET https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md` - Read the agent-friendly Markdown representation of this episode resource.

A page view does not enqueue transcription. Agents should invoke `request_transcript` explicitly when they need this episode processed.

## Transcript

Full transcripts are not published on public pages unless there is a clear rights basis.
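
The two entries under Actions can be sketched as plain HTTP requests using only the Python standard library. The methods and URLs are the ones listed above; everything else is an assumption, since the page does not document request bodies, headers, authentication, or response formats. The sketch assumes an empty POST body and no authentication, and the helper names are hypothetical.

```python
import urllib.request
from typing import Tuple

BASE = "https://stenobird.com"
SLUG = (
    "software-engineering-in-the-age-of-coding-agents-"
    "testing-evals-and-shipping-safely-at-scale"
)


def build_request_transcript() -> urllib.request.Request:
    """Build the request_transcript call (POST, listed as idempotent)."""
    url = (
        f"{BASE}/v1/public/podcasts/mlops-community/episodes/"
        f"{SLUG}/transcription-requests"
    )
    # Assumption: the endpoint accepts an empty body.
    return urllib.request.Request(url, data=b"", method="POST")


def build_read_markdown() -> urllib.request.Request:
    """Build the read_markdown call (GET of the Markdown representation)."""
    url = f"{BASE}/podcast/mlops-community/{SLUG}.md"
    return urllib.request.Request(url, method="GET")


def send(req: urllib.request.Request, timeout: float = 10.0) -> Tuple[int, str]:
    """Send a prepared request and return (status code, decoded body)."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")
```

Keeping request construction separate from `send` lets an agent inspect or log exactly what it is about to call before any network traffic happens. Because reading the page does not enqueue transcription, an agent that needs the episode processed would send the result of `build_request_transcript()` explicitly.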