Designing Evals for Agentic Systems

A Practitioner's Field Guide

Your agent's dashboard is green. Your evaluators report an 87% pass rate. Then a customer complaint reveals the system has been confidently fabricating regulatory citations for three weeks. The evaluators weren't broken; they were measuring the wrong things. This field guide exists because the gap between "we have evals" and "our evals actually protect us" is larger than most teams realize.

About the Book

Your AI agent passes every evaluator and still fails in production, because those evaluators were built from assumptions, not observations. This field guide gives you TRACES: a battle-tested six-phase methodology for designing evaluation systems that catch what actually breaks, built specifically for agentic systems that fail at the seams between steps. It includes 17 quick-reference cards. Start with your data, not your metrics.

About the Author

Rhys Fisher

maker of fizzy things

Table of Contents

  1. Front Matter
  2. The TRACES Framework
  3. Appendices

Introduction

  1. Why This Guide Exists
  2. The TRACES Framework
  3. How to Use This Guide — Three Navigation Systems
  4. Intellectual Foundations
  5. What This Guide Is Not

Part I: The TRACES Framework

T — Triage: Look at Your Data Before Building Anything

  1. T.1 Opening Vignette
  2. T.2 Core Principles
  3. T.3 The 30-Minute Error Analysis Workflow
  4. T.4 What to Look For in Every Trace
  5. T.5 Documentation Practices
  6. T.6 Sampling Strategies for Your First Review
  7. T.7 When to Go Deeper vs. When 30 Minutes Suffices
  8. T.8 Anti-Patterns
  9. T.9 Quick-Reference Cards
  10. T.10 Cross-References

R — Recognize: From Raw Observations to Failure Taxonomy

  1. R.1 Opening Vignette
  2. R.2 Core Principles
  3. R.3 Methodology
  4. R.4 Practical Examples
  5. R.5 Anti-Patterns
  6. R.6 Taxonomy Evolution — Criteria Drift and Its Implications
  7. R.7 Quick-Reference Cards
  8. R.8 Cross-References

A — Assess: “How Do I Measure This Automatically?”

  1. A.1 Opening Vignette
  2. A.2 Core Principles
  3. A.3 Code-Based Assertions — When Determinism Is Enough
  4. A.4 The Critique Shadowing Process — Building LLM Judges That Work
  5. A.5 Judge Prompt Engineering — A Complete Template
  6. A.6 The Benevolent Dictator Model for Judge Governance
  7. A.7 Specialized vs. General Judges — A Data-Driven Decision
  8. A.8 RAG-Specific Evaluators — Jason Liu’s 3-Tier Framework
  9. A.9 Anti-Patterns — What Goes Wrong and How to Escape
  10. Quick-Reference Card: LLM Judge Prompt Template
  11. Quick-Reference Card: Binary vs. Likert Decision Tree
  12. Quick-Reference Card: Critique Shadowing 7-Step Checklist
  13. Cross-References

C — Calibrate: Can You Trust Your Evaluators?

  1. C.1 Opening Vignette
  2. C.2 Core Principles
  3. C.3 The Validation Imperative
  4. C.4 Held-Out Validation Set Design
  5. C.5 The 3-Iteration Calibration Pattern: Honeycomb Case Study
  6. C.6 When Judges Drift: Detection and Response
  7. C.7 Refresh Cadence and Recalibration Triggers
  8. C.8 Anti-Patterns
  9. Quick-Reference Card: Judge Validation Metrics Cheat Sheet
  10. Quick-Reference Card: Am I Ready to Automate? Checklist
  11. Cross-References

E — Evolve: How Is Agent Evaluation Different?

  1. E.1 Opening Vignette
  2. E.2 Core Principles
  3. E.3 Methodology
  4. E.4 Practical Examples
  5. E.5 Anti-Patterns
  6. E.6 Quick-Reference Cards
  7. E.7 Agent vs. Simple LLM Evaluation: Detailed Comparison
  8. E.8 Cross-References

S — Sustain: Keeping Evals Running in Production

  1. S.1 Opening Vignette
  2. S.2 Core Principles
  3. S.3 Production Monitoring and Sampling Strategies
  4. S.4 CI/CD Integration
  5. S.5 Data Flywheel Design
  6. S.6 Synthetic Data for Regression Testing
  7. S.7 Cost Optimization at Scale
  8. S.8 Eval Platform Selection
  9. S.9 Guardrails vs. Evaluators
  10. S.10 Experiment Culture
  11. S.11 Anti-Patterns
  12. Quick-Reference Card: Production Sampling Strategy Selector
  13. Quick-Reference Card: Eval Maturity Assessment
  14. Quick-Reference Card: Data Flywheel Design Template

Part II: Appendices

Appendix A: Anti-Pattern Reference

  1. Category 1: Premature Action (Doing Things Too Early)
  2. Category 2: Measurement Failures (Measuring the Wrong Things)
  3. Category 3: Governance Failures (Wrong People, Wrong Process)
  4. Category 4: Architecture Failures (Building the Wrong Thing)
  5. Category 5: Staleness Failures (Not Updating)
  6. Category 6: Operational Failures (Running Unsustainably)
  7. Quick Diagnostic: Warning Signs Summary

Appendix B: Cross-Reference Index

  1. Core Concepts
  2. Cross-Chapter Dependencies

Appendix C: Quick-Reference Card Compilation

  1. Card Directory
  2. Recommended Card Sets by Role
  3. Card 1: The 30-Minute Error Analysis Checklist
  4. Card 2: What to Look For in Every Trace
  5. Card 3: Am I Falling Into an Anti-Pattern?
  6. Card 4: Open Coding Worksheet
  7. Card 5: Failure Taxonomy Template
  8. Card 6: LLM Judge Prompt Template
  9. Card 7: Binary vs. Likert Decision Tree
  10. Card 8: Critique Shadowing 7-Step Checklist
  11. Card 9: Judge Validation Metrics Cheat Sheet
  12. Card 10: Am I Ready to Automate?
  13. Card 11: Transition Failure Matrix Template
  14. Card 12: Agent Eval vs. Simple LLM Eval
  15. Card 13: Two-Phase Agent Eval Checklist
  16. Card 14: Capability Funnel Setup
  17. Card 15: Production Sampling Strategy Selector
  18. Card 16: Eval Maturity Assessment
  19. Card 17: Data Flywheel Design Template

Appendix D: Glossary

Colophon
