Episode
Software Engineering in the Age of Coding Agents: Testing, Evals, and Shipping Safely at Scale
- Podcast: MLOps.community
- Published: Feb 10, 2026
- Duration: 3444 seconds
- Processing state: processed
Actions
POST https://stenobird.com/v1/public/podcasts/mlops-community/episodes/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/mlops-community/software-engineering-in-the-age-of-coding-agents-testing-evals-and-shipping-safely-at-scale.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
Engineering agentic systems requires a hybrid approach between traditional software engineering and non-deterministic machine learning. This discussion explores how to manage complexity, evaluate performance, and maintain trust in autonomous AI workflows.
Topics
- AI Agents
- Software Engineering
- LLM Evaluation
- MLOps
- Agentic Workflows
- Observability
- Cybersecurity AI
- System Architecture
Highlights
- Main idea: Agentic systems are a hybrid of deterministic software engineering and non-deterministic predictive modeling
- Practical takeaway: Avoid over-engineering multi-agent graphs; use a single agent with well-defined 'skills' written in traditional code whenever possible
- Failure mode: Over-reliance on LLMs for logic that can be handled by pure business logic leads to increased costs and decreased testability
- Practical takeaway: Implement 'LLM as a judge' and integration tests that use real customer data to evaluate agent performance at scale
- Failure mode: Neglecting the UX of observability; users need clear audit trails and 'reasoning' visibility to trust autonomous actions
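The single-agent, skills-in-code idea from the highlights can be sketched roughly as follows. This is an illustration under assumptions, not the architecture described in the episode: the `refund_amount` skill, the `Agent` class, and the `fake_model` stand-in for an LLM call are all hypothetical names. The point it tries to show is that business logic lives in plain functions that can be unit-tested without a model, while each action is recorded with its reasoning for later audit.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Tuple


# Hypothetical skill: pure business logic, no model call, so it can be
# unit-tested like any other function.
def refund_amount(order_total: float, days_since_purchase: int) -> float:
    """Full refund within 30 days, half within 90 days, otherwise nothing."""
    if days_since_purchase <= 30:
        return round(order_total, 2)
    if days_since_purchase <= 90:
        return round(order_total * 0.5, 2)
    return 0.0


@dataclass
class AuditEntry:
    """One recorded action: which skill ran, with what, and why."""
    skill: str
    arguments: Dict[str, Any]
    result: Any
    reasoning: str


@dataclass
class Agent:
    # Stand-in for the model: maps a request to (skill name, arguments, reasoning).
    choose_skill: Callable[[str], Tuple[str, Dict[str, Any], str]]
    skills: Dict[str, Callable[..., Any]] = field(default_factory=dict)
    audit_trail: List[AuditEntry] = field(default_factory=list)

    def run(self, request: str) -> Any:
        name, args, reasoning = self.choose_skill(request)
        if name not in self.skills:
            raise KeyError(f"unknown skill: {name}")
        result = self.skills[name](**args)
        # Record the action so a user can later see what happened and why.
        self.audit_trail.append(AuditEntry(name, args, result, reasoning))
        return result


# Assumption: a real system would call an LLM here; this stub is deterministic.
def fake_model(request: str) -> Tuple[str, Dict[str, Any], str]:
    return (
        "refund_amount",
        {"order_total": 80.0, "days_since_purchase": 45},
        "Customer asked for a refund on a 45-day-old order.",
    )


agent = Agent(choose_skill=fake_model, skills={"refund_amount": refund_amount})
result = agent.run("I would like a refund for my order.")
```

In this sketch the only non-deterministic step is choosing the skill and its arguments; the refund rule itself never passes through the model, which keeps cost down and keeps the rule testable.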
Chapters
- 1:00 The Value of AI Coding Assistants: An exploration of how tools like Claude Code and Cursor are significantly increasing developer velocity and the ROI of AI-driven coding.
- 5:20 The Shift to Hybrid Engineering: Discussing the transition from traditional software requirements to managing systems that blend deterministic code with probabilistic prompts.
- 18:10 Observability and Audit Trails: The necessity of building transparent audit trails so users can understand the reasoning behind an agent's specific actions.
- 31:05 Language Sensitivity in Prompts: How subtle changes in prompt wording can trigger different reasoning paths and the importance of versioning prompt lineage.
- 39:45 Evaluating Agentic Workflows: Strategies for implementing two-level evaluations: unit-test style integration tests and large-scale evaluations against real-world data.
- 52:50 Architectural Minimalism: A critique of complex multi-agent graphs and the argument for keeping architectures simple by managing context effectively within a single agent.
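The two-level evaluation and prompt-lineage ideas from the chapters could look something like the sketch below. Everything named here is an assumption made for illustration: `stub_agent` and `stub_judge` stand in for a real agent and an LLM judge (the stub judge is a plain exact-match check, not a model), the alert cases are invented rather than real customer data, and `prompt_version` is just one simple way to tie results to the exact prompt wording.

```python
import hashlib
from typing import Callable, Dict, Iterable, List


def prompt_version(prompt: str) -> str:
    """Short content hash, so any change in wording yields a new version id."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def run_eval(
    agent: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],
    cases: Iterable[Dict[str, str]],
    prompt: str,
) -> Dict[str, object]:
    """Level 2: score the agent over a whole dataset and report a pass rate."""
    case_list = list(cases)
    failures: List[str] = []
    for case in case_list:
        output = agent(prompt, case["input"])
        if not judge(case["input"], output, case["expected"]):
            failures.append(case["input"])
    total = len(case_list)
    passed = total - len(failures)
    return {
        "prompt_version": prompt_version(prompt),
        "total": total,
        "pass_rate": passed / total if total else 0.0,
        "failures": failures,
    }


# Assumption: a real agent would call a model with the prompt; this stub
# uses a keyword rule so the example stays deterministic.
def stub_agent(prompt: str, text: str) -> str:
    return "block" if "malware" in text.lower() else "allow"


# Assumption: a real judge might be an LLM grading the output; this stub
# only checks exact agreement with the expected label.
def stub_judge(text: str, output: str, expected: str) -> bool:
    return output == expected


PROMPT = "Classify the alert as block or allow."
cases = [
    {"input": "Malware beacon detected on host-7", "expected": "block"},
    {"input": "Routine login from known device", "expected": "allow"},
    {"input": "Phishing link clicked by user", "expected": "block"},
]

# Level 1: a unit-test style check on one known case.
assert stub_agent(PROMPT, cases[0]["input"]) == "block"

# Level 2: aggregate evaluation over the dataset, tagged with the prompt version.
report = run_eval(stub_agent, stub_judge, cases, PROMPT)
```

Storing the prompt hash alongside each report is one way to compare pass rates across prompt revisions; the failing inputs in `report["failures"]` then point to where a wording change altered behavior.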