Defining Metrics for Success Using LLMs

In the early days of generative AI, evaluation was often a matter of “vibes.” You would run a prompt, read the output, nod your head, and say, “That looks about right.” For a hobbyist, that is fine. For a software engineer building reliable systems with Small AI, it is unacceptable.

When we work with “Small AI” (i.e., usually local models, quantized weights, and targeted use cases), we are trading the raw, brute-force false-omniscience of massive models for efficiency, privacy, and speed. To make that trade-off intelligent, we must measure exactly what we are losing in generality and what we are gaining in efficiency. We need rigorous metrics.

This chapter defines the scorecard for your Small AI projects. We will look at two categories of metrics: Operational Metrics (is it fast and cheap?) and Quality Metrics (is it smart enough?).

Operational Metrics: The Small AI Advantage

This is where Small AI shines. If you are running a 7-billion-parameter model locally versus calling a 2-trillion-parameter model in the cloud, these are the numbers that may justify your architectural choice of Small AI:

  • Time to First Token (TTFT): This measures the latency between the user hitting “Enter” and seeing the first word appear. For interactive applications, this is your “snappiness” metric.
      • Target: < 200ms for a “real-time” feel.
      • Small AI Win: A local model typically eliminates network latency, often beating cloud APIs significantly on TTFT.
  • Tokens Per Second (TPS): Once generation starts, how fast does it flow? This is your throughput (see the measurement sketch after this list).
      • Context: Human reading speed is roughly 3–5 tokens per second. If your local model generates at 50 TPS, it is effectively instantaneous to the user.
  • Cost Per 1k Successful Transactions: Notice I said “Successful” transactions. A cheap model that fails 50% of the time is expensive.
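
Both TTFT and TPS are easy to measure yourself if your inference server supports streaming. The sketch below is a minimal example, not a benchmark harness: it assumes a local Ollama server, reuses the litellm library and the gemma3:1b student model from the code example later in this chapter, and approximates the token count by counting streamed chunks.

import time
from litellm import completion

# Assumptions: a local Ollama server and the same student model used later in
# this chapter; each streamed chunk is treated as roughly one token.
MODEL = "ollama_chat/gemma3:1b"
OLLAMA_API = "http://localhost:11434"

def measure_latency(prompt):
    """Return (TTFT in milliseconds, approximate tokens per second)."""
    start = time.perf_counter()
    first_token_time = None
    chunk_count = 0

    # Stream the response so we can observe when the first token arrives.
    for chunk in completion(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        api_base=OLLAMA_API,
        stream=True,
    ):
        delta = chunk.choices[0].delta.content
        if delta:
            if first_token_time is None:
                first_token_time = time.perf_counter()  # first visible output
            chunk_count += 1

    end = time.perf_counter()
    if first_token_time is None:
        return None, 0.0  # the model produced no tokens
    ttft_ms = (first_token_time - start) * 1000
    generation_seconds = max(end - first_token_time, 1e-6)
    return ttft_ms, chunk_count / generation_seconds

if __name__ == "__main__":
    ttft, tps = measure_latency("Summarize the benefits of running an LLM locally.")
    print(f"TTFT: {ttft:.0f} ms | Throughput: {tps:.1f} tokens/sec (approximate)")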

Quality Metrics: The “LLM-as-a-Judge” Pattern

How do you measure whether your local, fine-tuned model (Llama-3 or otherwise) is “smart” enough? You can’t use standard unit tests because the output varies. You could use human review, but that doesn’t scale for some use cases.

The solution is the LLM-as-a-Judge pattern. In this workflow, you use a massive, frontier model (like GPT-5) to act as the teacher grading the homework of your Small AI.

The workflow has three steps:

  • Create a Dataset: 50–100 inputs (prompts) typical of your use case.
  • Generate Outputs: Run these through your Small AI model.
  • The Judge Steps In: Send the pair (Input, Small_AI_Output) to the Frontier Model with a grading rubric prompt.

Here is a judge prompt example: “Rate the following response on a scale of 1–5 for factual accuracy and brevity. Output ONLY the number.”

This gives you a quantifiable accuracy score (e.g., “We are at 4.2/5 quality relative to GPT-5”) without paying for GPT-5 on every production query. You only pay for it during development and testing.
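
Turning individual grades into that kind of headline number is just aggregation over your evaluation set. Here is a minimal sketch, using illustrative hard-coded scores in place of real judge output:

# Aggregate per-item judge scores (1-5 scale) into summary metrics.
# The scores below are illustrative placeholders for real judge output.
judge_scores = [5, 4, 4, 5, 3, 4, 5, 4]

mean_score = sum(judge_scores) / len(judge_scores)
pass_rate = sum(1 for s in judge_scores if s >= 4) / len(judge_scores)

print(f"Average quality: {mean_score:.1f}/5")      # e.g., "4.2/5 relative to the judge"
print(f"Pass rate (score >= 4): {pass_rate:.0%}")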

RAG Metrics: The Trinity of Trust

Many systems built with Small AI rely on RAG (Retrieval Augmented Generation) to inject knowledge into an LLM prompt. If your system fails, you need to know why: was the model stupid, or was the data missing? RAG architectures lend themselves to logging intermediate processing steps, which is useful during development and later for tracking down problems in production.

RAG systems typically use vector embedding databases, but in my projects and experiments I usually combine vector search with classic BM25 text search, as sketched below.
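
This is a minimal sketch of that hybrid combination, using the rank_bm25 package for the lexical side and reciprocal rank fusion to merge it with a stand-in vector ranking. The documents, the vector ranking, and the fusion constant are illustrative assumptions, not part of this chapter’s code.

# Hybrid retrieval sketch: BM25 lexical ranking fused with a hypothetical
# vector-store ranking via reciprocal rank fusion (RRF).
from rank_bm25 import BM25Okapi  # assumption: the rank_bm25 package is installed

documents = [
    "Project X kickoff notes and budget summary",
    "Quarterly report for Project Y",
    "Project X architecture decision record",
    "Office holiday schedule",
]

query = "project x architecture"

# Lexical side: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([doc.lower().split() for doc in documents])
bm25_scores = bm25.get_scores(query.lower().split())
bm25_ranking = sorted(range(len(documents)), key=lambda i: bm25_scores[i], reverse=True)

# Semantic side: a pretend ranking from a vector database (most similar first).
vector_ranking = [2, 0, 1, 3]  # assumption: illustrative only

def reciprocal_rank_fusion(rankings, k=60):
    """Combine ranked lists of document indices; higher fused score is better."""
    fused = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            fused[doc_id] = fused.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(fused, key=fused.get, reverse=True)

for doc_id in reciprocal_rank_fusion([bm25_ranking, vector_ranking]):
    print(documents[doc_id])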

We separate these concerns using three metrics (inspired by the RAGAS framework):

  • Context Precision (The Retrieval Metric): When the user asked about “Project X,” did your vector database actually retrieve documents about Project X? Low Score means: Your embeddings or search strategy are broken. The LLM never stood a chance.
  • Faithfulness (The Hallucination Metric): Did the LLM answer using only the retrieved context, or did it make things up? Low Score means: Your Small AI is hallucinating. You may need to adjust the system prompt or lower the temperature. (A minimal judge-based check for this metric follows the list.)
  • Answer Relevance (The Utility Metric): Did the answer actually help the user? Low Score means: The model might be polite and factual, but it’s answering the wrong question.
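
Faithfulness is a natural fit for the LLM-as-a-Judge pattern described above: give the judge the retrieved context and the generated answer, and ask whether every claim is supported. The sketch below reuses the litellm/Ollama setup from the judge example later in this chapter; the rubric wording and the 0.0–1.0 scale are my assumptions.

# A minimal sketch of a judge-based Faithfulness check.
import json
from litellm import completion

JUDGE_MODEL = "ollama_chat/rnj-1:latest"
OLLAMA_API = "http://localhost:11434"

def faithfulness_score(context, answer):
    """Ask the judge whether every claim in the answer is supported by the context."""
    rubric = (
        "You check whether an answer is fully supported by the given context. "
        'Return ONLY a JSON object: {"faithfulness": <float between 0.0 and 1.0>}, '
        "where 1.0 means every claim in the answer appears in the context."
    )
    response = completion(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": rubric},
            {"role": "user", "content": f"[Context]: {context}\n[Answer]: {answer}"},
        ],
        api_base=OLLAMA_API,
        json_mode=True,  # same JSON-enforcement flag used in the judge example below
    )
    return json.loads(response.choices[0].message.content)["faithfulness"]

if __name__ == "__main__":
    ctx = "Project X launched in March 2024 with a budget of $50,000."
    ans = "Project X launched in March 2024 and was led by Dr. Smith."
    print(f"Faithfulness: {faithfulness_score(ctx, ans):.2f}")  # the Dr. Smith claim is unsupported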

Deterministic Guardrails

Never underestimate the power of simple, boolean checks. For Small AI, which can be more prone to instruction drift than massive models, these are safety nets (a sketch of these checks follows the list):

  • JSON Validity Rate: If your app expects JSON, what percentage of outputs parse correctly?
  • Refusal Rate: How often does the model say “I cannot help with that”? (Too high = over-aligned/useless; Too low = unsafe).
  • Brevity Penalty: For Small AI, verbosity is a double cost: it looks bad and it hogs resources. You can measure this by simply counting the average token length of correct answers.
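
These checks are plain Python; no judge model is required. Here is a minimal sketch over a batch of illustrative outputs:

# Deterministic guardrail checks over a batch of model outputs.
# The sample outputs are illustrative placeholders for real responses.
import json

outputs = [
    '{"status": "ok", "items": 3}',
    'Sure! Here is your JSON: {"status": "ok"}',   # chatty preamble breaks strict parsing
    '{"status": "error"}',
    "I cannot help with that request.",
]

def is_valid_json(text):
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

json_validity_rate = sum(is_valid_json(o) for o in outputs) / len(outputs)
refusal_rate = sum("i cannot help" in o.lower() for o in outputs) / len(outputs)
average_words = sum(len(o.split()) for o in outputs) / len(outputs)  # crude proxy for token length

print(f"JSON validity rate: {json_validity_rate:.0%}")
print(f"Refusal rate:       {refusal_rate:.0%}")
print(f"Average length:     {average_words:.1f} words")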

The Golden Ratio: Accuracy per Watt

Finally, I propose a metric specifically for this book: Accuracy per Watt.

If a 70B parameter model gives you 98% accuracy but burns 400 watts, and an 8B model gives you 96% accuracy consuming 40 watts, the “Small AI” choice is 10x more efficient for a 2% quality drop.
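
The arithmetic is trivial, but making it explicit keeps the trade-off honest:

# The Accuracy-per-Watt comparison from the paragraph above, as a calculation.
configs = {
    "70B model": {"accuracy": 0.98, "watts": 400},
    "8B model":  {"accuracy": 0.96, "watts": 40},
}

for name, c in configs.items():
    print(f"{name}: {c['accuracy'] / c['watts']:.5f} accuracy per watt")

# 0.98 / 400 = 0.00245
# 0.96 / 40  = 0.02400  -> the 8B model is roughly 10x more efficient
# for a two-point drop in accuracy.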

In engineering, we rarely hunt for “perfect.” We hunt for “optimal.” By using these metrics early, you stop chasing the impossible performance of a trillion-parameter god-model and start building highly efficient, reliable tools that win on the metrics that actually matter for your business.

Code Example: the Judge Pattern

To implement the “LLM-as-a-Judge” pattern, we can create a Python script that automates the feedback loop between a lightweight target model and a more robust evaluator model. The following example uses the litellm library to interface with local Ollama instances, establishing a clear separation of duties: a smaller “Student” model generates answers to specific prompts, while a larger “Judge” model consumes those answers along with a strict grading rubric. The script enforces quality control programmatically by injecting a system prompt that explicitly defines criteria such as accuracy and brevity, then parsing the judge’s output into structured JSON for immediate programmatic analysis.

The student model (gemma3:1b) has roughly one billion parameters, while the judge model has eight billion.

from litellm import completion
import json

DEBUG = False
DEBUG_LENGTH = 300

# --- CONFIGURATION ---
# The "Small AI" we are testing (Student).
# STUDENT_MODEL = "ollama_chat/qwen3:1.7b"   # alternative small model
STUDENT_MODEL = "ollama_chat/gemma3:1b"

# The "Judge" model.
# Best Practice: Use a larger model here (e.g., Llama3.1-8B, Mistral-Large, or GPT-4o)
# for the most reliable grading.
JUDGE_MODEL = "ollama_chat/rnj-1:latest"  # 8B model
OLLAMA_API = "http://localhost:11434"

def get_student_response(user_prompt):
    """
    Generates a response from the small, target model.
    """
    response = completion(
        model=STUDENT_MODEL,
        messages=[{"role": "user", "content": user_prompt}],
        api_base=OLLAMA_API,
        temperature=0.3  # Low temperature for more repeatable outputs during testing
    )
    return response.choices[0].message.content

def judge_response(user_prompt, student_answer):
    """
    Uses the stronger Judge model to evaluate the Student's work.
    Returns a JSON object with score and reasoning.
    """

    # The Rubric Prompt
    # This is the core of the "LLM-as-a-Judge" pattern.
    system_rubric = """
    You are an impartial AI Judge. You evaluate the quality of AI responses.

    Criteria for evaluation:
    1. Accuracy: Is the code or text factually correct?
    2. Brevity: Is the answer concise without unnecessary fluff?

    Output Format:
    Return ONLY a JSON object with the following keys:
    {
        "score": (int 1-10),
        "reasoning": (string, max 40 words),
        "pass": (boolean, true if score > 5),
        "suggested_prompt": (string, a DIFFERENT and improved version of the user question that would help the AI give a better response - must NOT be identical to the original question)
    }
    """

    evaluation_prompt = f"""
    [User Question]: {user_prompt}
    [AI Response]: {student_answer}

    Evaluate the AI Response based on the rubric.
    """

    response = completion(
        model=JUDGE_MODEL,
        messages=[
            {"role": "system", "content": system_rubric},
            {"role": "user", "content": evaluation_prompt}
        ],
        api_base=OLLAMA_API,
        json_mode=True  # Enforce JSON output for easy parsing
    )

    return json.loads(response.choices[0].message.content)

# --- MAIN EXECUTION ---
if __name__ == "__main__":
    test_cases = [
        "Write a Python function to calculate the Fibonacci sequence.",
        "What are the names of President Lincoln's children?",
        "Compare Quantum Entanglement with the spirituality of meditation in two sentences."
    ]

    print(f"👨‍🎓 Student Model: {STUDENT_MODEL}")
    print(f"⚖️  Judge Model:   {JUDGE_MODEL}\n")
    print("-" * 60)

    for i, test_prompt in enumerate(test_cases, 1):
        print(f"Test #{i}: {test_prompt}")

        # 1. Run the Student
        student_ans = get_student_response(test_prompt)
        if DEBUG:
            print(f"\n>> Student Answer: {student_ans}")
        else:
            print(f"\n>> Partial Student Answer: {student_ans[:DEBUG_LENGTH]}...")

        # 2. Run the Judge
        try:
            evaluation = judge_response(test_prompt, student_ans)

            # 3. Report Card
            status = "✅ PASS" if evaluation.get('pass', evaluation['score'] > 5) else "❌ FAIL"
            print(f">> Judge Rating: {evaluation['score']}/10 | {status}")
            print(f">> Critique: {evaluation['reasoning']}")
            if 'suggested_prompt' in evaluation:
                suggested = evaluation['suggested_prompt']
                print(f">> Suggested Prompt: {suggested}")

                # 4. Re-run Student with the improved prompt (only if different)
                if suggested.strip().lower() != test_prompt.strip().lower():
                    improved_ans = get_student_response(suggested)
                    if DEBUG:
                        print(f">> Improved Student Answer: {improved_ans}")
                    else:
                        print(f">> Partial Improved Student Answer Using New Prompt: {improved_ans[:DEBUG_LENGTH]}...")
                else:
                    print(">> (Skipped re-run: suggested prompt is same as original)")
        except Exception as e:
            print(f">> Judge Error: {e}")

        print("-" * 60)

The core logic resides within the judge_response function, where the “System Rubric” acts as the unit test definition. By setting json_mode=True in the litellm completion call, we force the Judge model to output structured data rather than free-form text, eliminating the fragile string parsing often required when dealing with conversational AI outputs. This ensures that the returned score and reasoning can be immediately utilized by downstream application logic, effectively treating subjective quality assessments as deterministic data points.

This architecture also highlights the strategic advantage of model stratification. The “Student” model is intentionally selected to be small (e.g., Gemma 3 1B), optimizing for inference speed and low resource consumption, while the “Judge” model is significantly larger to ensure nuanced reasoning and adherence to the prompt’s evaluation criteria. By decoupling generation from evaluation, developers can iterate rapidly on the smaller model’s prompts or fine-tuning, using the larger model as a consistent baseline for regression testing without incurring the latency of using a large model for every user interaction.

The following output is edited for brevity:

$ uv run judge.py
👨‍🎓 Student Model: ollama_chat/gemma3:1b
⚖️  Judge Model:   ollama_chat/rnj-1:latest

------------------------------------------------------------
Test #1: Write a Python function to calculate the Fibonacci sequence.

>> Partial Student Answer:
def fibonacci(n):
  """
  Calculates the nth Fibonacci number.

  Args:
    n: The index of the desired Fibonacci number (non-negative integer).

  Returns:
    The nth Fibonacci number.  Returns 0 if n is 0.
    Returns 1 if n is 1.
  """
  if n <= 0:
    return 0
  elif n == 1:
    retur...
>> Judge Rating: 8/10 | ✅ PASS
>> Critique: The AI response provides a well-structured, efficient, and correct implementation of the Fibonacci sequence calculation. It includes a clear docstring, handles edge cases, uses an iterative approach for efficiency, and is well-documented. The only minor improvement could be adding a brief explanation of the iterative approach efficiency in the docstring.
>> Suggested Prompt: Write a Python function to calculate the Fibonacci sequence. Explain the efficiency of the iterative approach compared to a recursive approach.
>> Partial Improved Student Answer Using New Prompt: ```python
def fibonacci_iterative(n):
  """
  Calculates the nth Fibonacci number using an iterative approach.

  Args:
    n: The index of the desired Fibonacci number (non-negative integer).

  Returns:
    The nth Fibonacci number.
  """
  if n <= 1:
    return n
------------------------------------------------------------
Test #2: What are the names of President Lincoln's children?

>> Partial Student Answer: Abraham Lincoln had six children:

1.  **Robert Todd Lincoln:** He was the eldest and was a prominent figure in the abolitionist movement.
2.  **Edward Lincoln:** He was a lawyer and became a prominent figure in the Illinois legislature.
3.  **William Henry Lincoln:** He was a successful businessman...
>> Judge Rating: 7/10 | ✅ PASS
>> Critique: The response is accurate and provides detailed information about Lincolns children. However, it could be more concise by focusing on the names and brief roles of each child.
>> Suggested Prompt: What were the names and roles of Abraham Lincolns children?
>> Partial Improved Student Answer Using New Prompt: Okay, let's break down Abraham Lincolns children. It is s a significant and often fascinating part of his life and legacy. Here's a list of his children, along with their roles and a little more detail:

**1. Mary Todd Lincoln (1818-1865)**

*   **Role:** Wife and First Lady of the United States. She ...
------------------------------------------------------------
Test #3: Compare Quantum Entanglement with the spirituality of meditation in two sentences.

>> Partial Student Answer: Quantum entanglement, where two particles become linked regardless of distance, mirrors the meditative experience of aligning with a deeper, interconnected reality, suggesting a unified field of consciousness.  Meditation, through focused awareness and stillness, can create a similar sense of resona...
>> Judge Rating: 8/10 | ✅ PASS
>> Critique: The response accurately draws a meaningful parallel between quantum entanglement and meditation, using concise language. It effectively conveys the concept of interconnectedness in both physics and spirituality.
>> Suggested Prompt: How does the concept of quantum entanglement relate to the spiritual practice of meditation in terms of interconnectedness and unity?
>> Partial Improved Student Answer Using New Prompt: Okay, this is a fascinating and increasingly explored intersection of quantum physics and spirituality. The connection between quantum entanglement and meditation is a complex and evolving area of research and discussion, but here is a breakdown of how it is being viewed, focusing on interconnectednes...
------------------------------------------------------------

Wrap Up for LLM Metrics

Ultimately, defining metrics is the act of maturing your project from a fascinating experiment into a reliable software asset. In the realm of “Small AI,” where we deliberately trade the wider-range, sometimes omniscient-seeming capabilities of trillion-parameter models for the targeted efficiency of local systems, these metrics become your navigation tools. They allow you to pinpoint exactly where a smaller model punches above its weight, proving that while a local 8B model might lack the generic poetry writing skills of a frontier model, it can match or exceed it in specific, domain-constrained tasks when measured by accuracy per watt and latency. By moving beyond subjective “vibe checks” to rigorous standards like the LLM-as-a-Judge pattern and precise operational baselines, you gain the confidence to deploy appropriately sized models. You are no longer crossing your fingers and hoping the model works; you are engineering a system with known tolerances, predictable costs, and a clear definition of what “winning” looks like for your specific use case.