Designing Evals for Agentic Systems: A Practitioner’s Field Guide
Table of Contents
- Front Matter
- The TRACES Framework
- Appendices
Introduction
- Why This Guide Exists
- The TRACES Framework
- How to Use This Guide — Three Navigation Systems
- Intellectual Foundations
- What This Guide Is Not
Part I: The TRACES Framework
T — Triage: Look at Your Data Before Building Anything
- T.1 Opening Vignette
- T.2 Core Principles
- T.3 The 30-Minute Error Analysis Workflow
- T.4 What to Look For in Every Trace
- T.5 Documentation Practices
- T.6 Sampling Strategies for Your First Review
- T.7 When to Go Deeper vs. When 30 Minutes Suffices
- T.8 Anti-Patterns
- T.9 Quick-Reference Cards
- T.10 Cross-References
R — Recognize: From Raw Observations to Failure Taxonomy
- R.1 Opening Vignette
- R.2 Core Principles
- R.3 Methodology
- R.4 Practical Examples
- R.5 Anti-Patterns
- R.6 Taxonomy Evolution — Criteria Drift and Its Implications
- R.7 Quick-Reference Cards
- R.8 Cross-References
A — Assess: “How Do I Measure This Automatically?”
- A.1 Opening Vignette
- A.2 Core Principles
- A.3 Code-Based Assertions — When Determinism Is Enough
- A.4 The Critique Shadowing Process — Building LLM Judges That Work
- A.5 Judge Prompt Engineering — A Complete Template
- A.6 The Benevolent Dictator Model for Judge Governance
- A.7 Specialized vs. General Judges — A Data-Driven Decision
- A.8 RAG-Specific Evaluators — Jason Liu’s 3-Tier Framework
- A.9 Anti-Patterns — What Goes Wrong and How to Escape
- Quick-Reference Card: LLM Judge Prompt Template
- Quick-Reference Card: Binary vs. Likert Decision Tree
- Quick-Reference Card: Critique Shadowing 7-Step Checklist
- Cross-References
C — Calibrate: Can You Trust Your Evaluators?
- C.1 Opening Vignette
- C.2 Core Principles
- C.3 The Validation Imperative
- C.4 Held-Out Validation Set Design
- C.5 The 3-Iteration Calibration Pattern: Honeycomb Case Study
- C.6 When Judges Drift: Detection and Response
- C.7 Refresh Cadence and Recalibration Triggers
- C.8 Anti-Patterns
- Quick-Reference Card: Judge Validation Metrics Cheat Sheet
- Quick-Reference Card: Am I Ready to Automate? Checklist
- Cross-References
E — Evolve: How Is Agent Evaluation Different?
- E.1 Opening Vignette
- E.2 Core Principles
- E.3 Methodology
- E.4 Practical Examples
- E.5 Anti-Patterns
- E.6 Quick-Reference Cards
- E.7 Agent vs. Simple LLM Evaluation: Detailed Comparison
- E.8 Cross-References
S — Sustain: Keeping Evals Running in Production
- S.1 Opening Vignette
- S.2 Core Principles
- S.3 Production Monitoring and Sampling Strategies
- S.4 CI/CD Integration
- S.5 Data Flywheel Design
- S.6 Synthetic Data for Regression Testing
- S.7 Cost Optimization at Scale
- S.8 Eval Platform Selection
- S.9 Guardrails vs. Evaluators
- S.10 Experiment Culture
- S.11 Anti-Patterns
- Quick-Reference Card: Production Sampling Strategy Selector
- Quick-Reference Card: Eval Maturity Assessment
- Quick-Reference Card: Data Flywheel Design Template
Part II: Appendices
Appendix A: Anti-Pattern Reference
- Category 1: Premature Action (Doing Things Too Early)
- Category 2: Measurement Failures (Measuring the Wrong Things)
- Category 3: Governance Failures (Wrong People, Wrong Process)
- Category 4: Architecture Failures (Building the Wrong Thing)
- Category 5: Staleness Failures (Not Updating)
- Category 6: Operational Failures (Running Unsustainably)
- Quick Diagnostic: Warning Signs Summary
Appendix B: Cross-Reference Index
- Core Concepts
- Cross-Chapter Dependencies
Appendix C: Quick-Reference Card Compilation
- Card Directory
- Recommended Card Sets by Role
- Card 1: The 30-Minute Error Analysis Checklist
- Card 2: What to Look For in Every Trace
- Card 3: Am I Falling Into an Anti-Pattern?
- Card 4: Open Coding Worksheet
- Card 5: Failure Taxonomy Template
- Card 6: LLM Judge Prompt Template
- Card 7: Binary vs. Likert Decision Tree
- Card 8: Critique Shadowing 7-Step Checklist
- Card 9: Judge Validation Metrics Cheat Sheet
- Card 10: Am I Ready to Automate?
- Card 11: Transition Failure Matrix Template
- Card 12: Agent Eval vs. Simple LLM Eval
- Card 13: Two-Phase Agent Eval Checklist
- Card 14: Capability Funnel Setup
- Card 15: Production Sampling Strategy Selector
- Card 16: Eval Maturity Assessment
- Card 17: Data Flywheel Design Template