Name: Designing Evals for Agentic Systems
Brand: Leanpub
Price: 19.00 USD
Availability: InStock

Your agent's dashboard is green. Your evaluators report an 87% pass rate. Then a customer complaint reveals the system has been confidently fabricating regulatory citations for three weeks. The evaluators weren't broken — they were measuring the wrong things. This field guide exists because the gap between "we have evals" and "our evals actually protect us" is larger than most teams realize.

Your AI agent passes every evaluator and still fails in production — because those evaluators were built from assumptions, not observations. This field guide gives you TRACES: a battle-tested six-phase methodology for designing evaluation systems that catch what actually breaks, specifically built for agentic systems that fail at the seams between steps. 17 quick-reference cards included. Start with your data, not your metrics.

Designing Evals for Agentic Systems

A Practitioner’s Field Guide

Front Matter
The TRACES Framework
Appendices

Introduction

Why This Guide Exists
The TRACES Framework
How to Use This Guide — Three Navigation Systems
Intellectual Foundations
What This Guide Is Not

Part I: The TRACES Framework

T — Triage: Look at Your Data Before Building Anything

T.1 Opening Vignette
T.2 Core Principles
T.3 The 30-Minute Error Analysis Workflow
T.4 What to Look For in Every Trace
T.5 Documentation Practices
T.6 Sampling Strategies for Your First Review
T.7 When to Go Deeper vs. When 30 Minutes Suffices
T.8 Anti-Patterns
T.9 Quick-Reference Cards
T.10 Cross-References

R — Recognize: From Raw Observations to Failure Taxonomy

R.1 Opening Vignette
R.2 Core Principles
R.3 Methodology
R.4 Practical Examples
R.5 Anti-Patterns
R.6 Taxonomy Evolution — Criteria Drift and Its Implications
R.7 Quick-Reference Cards
R.8 Cross-References

A — Assess: “How Do I Measure This Automatically?”

A.1 Opening Vignette
A.2 Core Principles
A.3 Code-Based Assertions — When Determinism Is Enough
A.4 The Critique Shadowing Process — Building LLM Judges That Work
A.5 Judge Prompt Engineering — A Complete Template
A.6 The Benevolent Dictator Model for Judge Governance
A.7 Specialized vs. General Judges — A Data-Driven Decision
A.8 RAG-Specific Evaluators — Jason Liu’s 3-Tier Framework
A.9 Anti-Patterns — What Goes Wrong and How to Escape
Quick-Reference Card: LLM Judge Prompt Template
Quick-Reference Card: Binary vs. Likert Decision Tree
Quick-Reference Card: Critique Shadowing 7-Step Checklist
Cross-References

C — Calibrate: Can You Trust Your Evaluators?

C.1 Opening Vignette
C.2 Core Principles
C.3 The Validation Imperative
C.4 Held-Out Validation Set Design
C.5 The 3-Iteration Calibration Pattern: Honeycomb Case Study
C.6 When Judges Drift: Detection and Response
C.7 Refresh Cadence and Recalibration Triggers
C.8 Anti-Patterns
Quick-Reference Card: Judge Validation Metrics Cheat Sheet
Quick-Reference Card: Am I Ready to Automate? Checklist
Cross-References

E — Evolve: How Is Agent Evaluation Different?

E.1 Opening Vignette
E.2 Core Principles
E.3 Methodology
E.4 Practical Examples
E.5 Anti-Patterns
E.6 Quick-Reference Cards
E.7 Agent vs. Simple LLM Evaluation: Detailed Comparison
E.8 Cross-References

S — Sustain: Keeping Evals Running in Production

S.1 Opening Vignette
S.2 Core Principles
S.3 Production Monitoring and Sampling Strategies
S.4 CI/CD Integration
S.5 Data Flywheel Design
S.6 Synthetic Data for Regression Testing
S.7 Cost Optimization at Scale
S.8 Eval Platform Selection
S.9 Guardrails vs. Evaluators
S.10 Experiment Culture
S.11 Anti-Patterns
Quick-Reference Card: Production Sampling Strategy Selector
Quick-Reference Card: Eval Maturity Assessment
Quick-Reference Card: Data Flywheel Design Template

Part II: Appendices

Appendix A: Anti-Pattern Reference

Category 1: Premature Action (Doing Things Too Early)
Category 2: Measurement Failures (Measuring the Wrong Things)
Category 3: Governance Failures (Wrong People, Wrong Process)
Category 4: Architecture Failures (Building the Wrong Thing)
Category 5: Staleness Failures (Not Updating)
Category 6: Operational Failures (Running Unsustainably)
Quick Diagnostic: Warning Signs Summary

Appendix B: Cross-Reference Index

Core Concepts
Cross-Chapter Dependencies

Appendix C: Quick-Reference Card Compilation

Card Directory
Recommended Card Sets by Role
Card 1: The 30-Minute Error Analysis Checklist
Card 2: What to Look For in Every Trace
Card 3: Am I Falling Into an Anti-Pattern?
Card 4: Open Coding Worksheet
Card 5: Failure Taxonomy Template
Card 6: LLM Judge Prompt Template
Card 7: Binary vs. Likert Decision Tree
Card 8: Critique Shadowing 7-Step Checklist
Card 9: Judge Validation Metrics Cheat Sheet
Card 10: Am I Ready to Automate?
Card 11: Transition Failure Matrix Template
Card 12: Agent Eval vs. Simple LLM Eval
Card 13: Two-Phase Agent Eval Checklist
Card 14: Capability Funnel Setup
Card 15: Production Sampling Strategy Selector
Card 16: Eval Maturity Assessment
Card 17: Data Flywheel Design Template

Appendix D: Glossary

Colophon
