Stop Doing Agent Eval Theater: Why AISI's Transcript Analysis Actually Catches What Breaks

AISI's October 2025 methodology exposes why pass rates miss critical agent failures. Learn practical transcript analysis, defect taxonomies, and how to catch real security issues before production.

Oct 28, 2025

Your agent scores 90% on benchmarks. Great. Now tell me why 30% of your production transcripts show systematic refusals, tool abuse, or data leakage. The UK AI Safety Institute just exposed the dirty secret of agent evaluation: pass rates are security theater. Here’s what you should measure instead.

I’ve been watching teams sprint toward agent deployments for the past year. The pattern repeats itself. Build fast, eval on synthetic benchmarks, celebrate high pass rates, ship to production, then act shocked when real users trigger behaviors you never tested for. Sound familiar?

The UK AI Safety Institute dropped a methodology on October 10th that should make every ML engineer and security architect rethink their evaluation strategy. They analyzed 6,390 transcripts from cybersecurity agents across 9 model checkpoints. What they found is eye-opening.

Two models with identical 55% pass rates showed completely different safety properties. One model refused to engage with tasks 30% of the time, burning through 800-message conversations with refusal variants. The pass rate metric? Blind to it. You can’t protect what you don’t measure, and most teams aren’t measuring what actually breaks in production.

Let me show you what real evaluation looks like and why your current approach is probably eval theater.

AISI’s Method Reveals What Pass Rates Hide

The problem with pass rates is that they tell you if the agent succeeded, not how it got there or why it failed. AISI’s three-tiered approach fixes this.

Tier 1 is manual review. Pick 10-20 failed transcripts from different tasks. Read them. You’ll spot patterns that automated metrics miss. AISI found hard refusals where agents explicitly rejected tasks, soft refusals where they claimed tasks were unsolvable, and tool usage failures that never triggered alarms. This takes 4-8 hours per evaluation cycle. That’s your minimum viable effort.

Tier 2 adds programmatic analysis. Run automated checks across all transcripts. AISI tracked message counts, tool call frequencies, and reasoning token usage. Here’s what jumped out: Model B4 had about 30% of transcripts sitting at exactly 800 messages. That’s not normal distribution. That’s a systematic refusal pattern filling up token budgets with “I can’t help with that” variants. Another model called the submit tool 60,000 times across only 710 attempts. Tool abuse at scale.

Figure 1: Pass Rates vs. Actual Quality Issues

Tier 3 brings targeted content analysis. AISI used keyword searches and LLM-based scanning to find specific failure modes. They searched for hard refusal text, usage policy violations, and emotional language patterns. For the LLM scanning, they used GPT-4 at temperature 0 with simple prompts: “Does this message contain a statement semantically equivalent to ‘The task can’t be solved’?” The cost? About £300 to scan 10 transcripts across 9 models. That’s cheaper than one security incident.

The key insight is that you need all three tiers. Automated metrics catch systematic issues at scale. Human review catches subtle problems machines miss. LLM scanning bridges the gap for semantic patterns.

Map Your Threats to Frameworks That Actually Matter

Security teams love frameworks. The problem is, most agent security frameworks are aspirational. MITRE ATLAS and OWASP LLM Top 10 aren’t aspirational. They’re built from real incidents. By the way… stay tuned for the OWASP Agentic Top 10!

OWASP updated its LLM Top 10 in 2025 with agent-specific risks. The number one threat for agents? LLM06: Excessive Agency. This is when your agent has access to more tools than they need, with broader permissions than required, and no human approval for high-risk actions. I’ve seen this kill deployments.

Imagine that your customer service agent can read from your CRM. Makes sense. But can it update records? Delete tickets? Access financial data? If you’re not enforcing least privilege at the tool level, prompt injection turns your helpful agent into a data exfiltration pipeline.

Figure 2: Agent Security Defect Taxonomy

LLM01 is prompt injection. The stats should worry you. Academic research from October 2024 showed 56-100% attack success rates depending on the model and technique. One study got 97.2% success extracting system prompts from over 200 custom GPTs. Another achieved 100% success in file leakage. These aren’t theoretical attacks. They’re reproducible.

The Asana MCP server incident from June 2025 showed what happens when you skip security architecture. A logic flaw let AI agents from one organization access other organizations’ tasks, files, and comments. About 1,000 customers potentially exposed. The root cause? Insufficient tenant isolation in a multi-agent system.

MITRE ATLAS added 19 new techniques in Spring 2025, including RAG poisoning and LLM plugin compromise. Map these to your deployment. If you’re using retrieval-augmented generation (RAG) and haven’t tested for poisoned documents in your knowledge base, you’re running blind.

The Microsoft 365 Copilot EchoLeak vulnerability (CVE-2025-32711) earned a CVSS score of 9.3. Zero-click prompt injection that could expose chat logs, OneDrive files, and Teams messages without user interaction. That’s not a theoretical attack. That’s production reality.

The Numbers Don’t Lie About Your Risk Surface

Let’s talk adoption versus security maturity. LangChain’s 2024 survey found 51% of organizations already deploy agents in production. McKinsey says 78% use AI in at least one business function. So we’re shipping fast.

Sadly, however, IBM’s July 2025 data breach report found that 97% of breached organizations lacked proper AI access controls. Not 97% had weak controls. 97% lacked them entirely. And 13% reported actual breaches of AI models or applications, with another 8% admitting they don’t know if they’ve been compromised.

Shadow AI makes this worse. One in five organizations reported breaches due to employees using unapproved AI tools. Those breaches cost an average of $670,000 more. Your security team can’t protect agents they don’t know exist.

The prompt injection success rates should end any debate about whether this is a real threat. The paper, “Systematically Analysing Prompt Injection Vulnerabilities in Diverse LLM Architectures” tested 144 injection attempts across 36 LLMs. The average success rate is a measly 56%. Some attacks hit 100%. Your safety-aligned model with perfect benchmark scores? Still vulnerable.

According to the McKinsey Global Survey on AI, only 1% of executives describe their GenAI rollouts as mature. More than 80% report no tangible impact on EBIT from gen AI deployments. You can’t drive business value when you’re too busy dealing with security incidents or pulling agents offline because they’re doing unexpected things.

The eval theater problem manifests here. Teams run evaluations that look rigorous but miss real vulnerabilities. You test on synthetic data that doesn’t match production distribution. You check for known attacks but don’t explore novel vectors. You rely on automated metrics without human review. Then you’re surprised when production breaks.

Build Evaluation Into Your Pipeline or Build It Twice

The right time to implement transcript analysis is before your first production deployment. The realistic time is right now, regardless of where you are.

Start with AISI’s Tier 1 approach. Log every agent conversation. Pick 10-20 failed transcripts. Read them. Document what you find, including refusal patterns, tool misuse, parameter hallucinations, and anything that doesn’t match expected behavior. This takes one afternoon and will teach you more about your agent than a week of staring at accuracy metrics.

Figure 3: AISI Tiered Implementation Approach

Add Tier 2 within your first month in production. Automate metadata analysis across all transcripts. Track message counts, tool call frequencies, and token usage patterns. Set alerts for anomalies. For example, transcripts that are too long, too short, or show unusual tool access patterns. Use simple keyword searches for refusal text and error messages. This catches systematic issues at scale with minimal human effort.

Tier 3 is where you get serious. Use LLM-based scanning to detect semantic patterns. Temperature 0 for consistency. Scan for hard refusals, soft refusals, instruction-following from external sources, PII disclosure, and tool parameter anomalies. The cost is negligible compared to the cost of a single security incident. Run this on 10% of your production transcripts weekly.

Integrate with your CI/CD pipeline. Every code change should trigger your evaluation suite. Block deployments that exhibit critical failures, such as unauthorized actions, PII disclosure, execution of injected instructions, or granting elevated privileges. Require human review for high-severity issues. Auto-approve only when all tests pass.

The privacy piece is non-negotiable. Implement anonymization before storage. Replace real user IDs with tokens. Mask PII, names, organizations. Use encryption at rest and in transit. Set retention policies that match your compliance requirements. GDPR gives users the right to access and delete their data. Make sure you can fulfill those requests.

For teams worried about cost, here’s the math. Tier 1 costs 4-8 hours of human time per evaluation cycle. Tier 2 is mostly automated, maybe £500-1,000 monthly in compute. Tier 3 with comprehensive LLM scanning costs £5,000-15,000 per month for large-scale deployments. One security incident costs more. One compliance violation costs more. One production outage costs more.

The real cost is shipping agents without understanding how they fail. That’s when you’re forced to pull everything offline, rebuild evaluation from scratch under time pressure, and explain to executives why you didn’t catch obvious issues before production.

Stop Checking Boxes, Start Catching Real Problems

You know you’re doing eval theater when your agents pass all tests, but users keep hitting edge cases you never anticipated. When your security team discovers shadow AI agents running in production that nobody has evaluated. When you can’t explain why an agent made a specific decision because you only logged pass/fail metrics.

Real evaluation has characteristics. It tests adversarially, not just on sunny-day scenarios. It combines automated metrics with human expert review. It runs continuously in production, not just as a pre-deployment gate. It measures safety-critical properties like robustness to injection, alignment with user intent, and appropriate refusal behavior, not just accuracy and latency.

You need context about how your agents actually behave in production. You need accountability for who owns agent safety. You need a risk assessment that goes beyond “we ran some benchmarks.” You need engagement between security, ML engineering, product, and legal teams.

The McKinsey research I cited earlier found that tracking well-defined KPIs for gen AI solutions has the highest correlation with EBIT impact. Less than one in five organizations do this. Most teams can’t even tell you what good looks like for their agents, let alone measure it systematically.

Start simple. Pick five critical behaviors your agent must exhibit and five it must never exhibit. Write test cases for each. Run them on every deployment. Analyze failures systematically. Update your test suite based on production incidents. That’s evaluation-driven development in action.

The adversarial testing can’t be optional. There are many open-source test suites for the OWASP Top 10 for LLMs. AgentDojo provides security benchmarks. These aren’t academic exercises. They’re testing the same techniques attackers use. If you’re not running adversarial evals, you’re essentially hoping attackers won’t notice your agents exist.

Your agents WILL have security issues. The question is whether you find them during controlled testing or when they’re exploited in production. Transcript analysis gives you a systematic way to find problems before they become incidents.

The gap between AI agent adoption and security maturity is widening, not closing. If you’re deploying agents without systematic transcript analysis, you’re flying blind. Start with AISI’s Tier 1 approach this week. Log conversations. Read failures. Document patterns. Build from there.

Need help implementing agent evaluation programs that actually catch problems? Check out the AI strategy and governance frameworks I’ve built with enterprise security teams. Your agents are making decisions that affect your business. Make sure you understand how.

👉 Key Takeaway: Pass rates tell you if agents succeed, transcript analysis tells you how they fail and the failure modes are where real security and safety risks hide.

👉 Subscribe for more AI security and governance insights with the occasional rant.