Episode
Controlling AI Models from the Inside
- Podcast
- Practical AI
- Published
- Jan 20, 2026
- Duration seconds
- 2635
- Processing state
- processed
- Canonical source
- https://share.transistor.fm/s/df33214d
Actions
POST https://stenobird.com/v1/public/podcasts/practical-ai/episodes/controlling-ai-models-from-the-inside/transcription-requests
Idempotently request low-priority transcript generation for this episode.
GET https://stenobird.com/podcast/practical-ai/controlling-ai-models-from-the-inside.md
Read the agent-friendly Markdown representation of this episode resource.
Summary
Traditional AI safety relies on external filters that monitor prompts and responses, often creating latency and high costs. This episode explores a model-native approach using runtime instrumentation to detect problematic neuron activation inside the 'black box' before bad outputs are even generated.
Topics
- AI Safety
- Large Language Models
- Model Interpretability
- Runtime Security
- AI Guardrails
- Machine Learning Infrastructure
- Cybersecurity
- AI Governance
Highlights
- Main idea: Current AI safety is limited to the 'gatekeeper' layer, analyzing only inputs and outputs
- Failure mode: External guardrails can be bypassed by jailbreaks and are often too expensive or slow for production
- Practical takeaway: Monitoring internal model subspaces allows for intervention during the generation process, not just after
- Technical concept: Model-native safety involves instrumenting the model to identify specific subregions that trigger during toxic or unauthorized content generation
- Future vision: Creating a standardized safety layer that enables the use of LLMs in highly regulated industries like healthcare
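The "technical concept" highlight can be illustrated with a toy sketch. This is not the guest's method or any product's API: the `SubspaceMonitor` class, the `flagged_direction` vector, the threshold, and the hand-written hidden states are all hypothetical stand-ins. It only shows the general shape of the idea, assuming a problematic region can be approximated by a single direction in activation space: score each step's internal state against that direction and stop before the token is emitted.

```python
import math

def project(activation, direction):
    """Scalar projection of an activation vector onto a direction (normalized here)."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(a * d for a, d in zip(activation, direction)) / norm

class SubspaceMonitor:
    """Hypothetical runtime monitor: flags a step when the hidden state
    aligns too strongly with a direction labeled as problematic."""
    def __init__(self, direction, threshold):
        self.direction = direction
        self.threshold = threshold
        self.scores = []  # kept for inspection or logging

    def check(self, activation):
        score = project(activation, self.direction)
        self.scores.append(score)
        return score > self.threshold

def generate(hidden_states, tokens, monitor):
    """Emit tokens one at a time, stopping before any token whose
    hidden state trips the monitor."""
    out = []
    for h, tok in zip(hidden_states, tokens):
        if monitor.check(h):
            out.append("[blocked]")
            break
        out.append(tok)
    return out

# Toy data: in a real system the hidden states would come from the model itself.
flagged_direction = [0.0, 1.0, 0.0]  # assumed, not learned
monitor = SubspaceMonitor(flagged_direction, threshold=0.8)
hidden_states = [[1.0, 0.1, 0.0], [0.5, 0.2, 0.3], [0.1, 0.9, 0.0], [0.9, 0.0, 0.1]]
tokens = ["The", "answer", "is", "here"]
result = generate(hidden_states, tokens, monitor)
```

In this toy run the third hidden state crosses the threshold, so generation stops at that step rather than after a finished response has been filtered, which is the contrast with input/output gatekeeping drawn in the highlights.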
Chapters
- 1:00 Introduction: Hosts Daniel and Chris introduce Alizishaan Khatri, founder of Wrynx, and set the stage for discussing the future of AI model safety.
- 4:20 AI for Security vs. Security for AI: Distinguishing between using AI to solve security problems and the challenge of securing the AI models themselves as they enter the tech stack.
- 7:25 The Limits of Prompt Filtering: An analysis of why current 'gatekeeper' solutions, which analyze only prompts and responses, are insufficient against sophisticated jailbreaks.
- 17:45 Model-Native Instrumentation: Exploring the concept of 'cameras inside the building' by monitoring internal model subspaces and neuron activation at runtime.
- 24:15 The Burden of Custom Training: Discussing why customers cannot simply train new models to avoid certain topics and the need for a more scalable safety layer.
- 33:50 Detecting Toxicity via Subspaces: How identifying specific model regions that trigger during toxic generation allows for proactive intervention.
- 40:35 The Future of Model Safety: Alizishaan outlines his vision for a de facto safety layer that enables LLM adoption in sensitive sectors like healthcare.