Automatic Evaluation of LLM Results: More Tool Examples

As Large Language Models (LLMs) become increasingly integrated into production systems and workflows, the ability to systematically evaluate their performance becomes crucial. While qualitative assessment of LLM outputs remains important, organizations need robust, quantitative methods to measure and compare model performance across different prompts, use cases, and deployment scenarios. This has led to the development of specialized tools and frameworks designed specifically for LLM evaluation.

The evaluation of LLM outputs presents unique challenges that set it apart from traditional natural language processing metrics. Unlike straightforward classification or translation tasks, LLM responses often require assessment across multiple dimensions, including factual accuracy, relevance, coherence, creativity, and adherence to specified formats or constraints. Furthermore, the stochastic nature of LLM outputs means that the same prompt can generate different responses across multiple runs, necessitating evaluation methods that can account for this variability.

Modern LLM evaluation tools address these challenges through a combination of automated metrics, human-in-the-loop validation, and specialized frameworks for prompt testing and response analysis. These tools can help developers and researchers understand how well their prompts perform, identify potential failure modes, and optimize prompt engineering strategies. By providing quantitative insights into LLM performance, these evaluation tools enable more informed decisions about model selection, prompt design, and system architecture in LLM-powered applications.

In this chapter we take a simple approach:

  • Capture the chat history, including the model output, for an interaction with an LLM.
  • Generate a prompt containing the chat history and model output, and ask a different LLM to evaluate the output generated by the first LLM. We request that the second LLM end its response with a single letter, ‘Y’ or ‘N’, judging whether the first LLM’s output is accurate.

We look at several examples in this chapter of approaches you might want to experiment with.
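
The capture step in the first bullet above is not shown as a separate tool in this chapter, so here is a minimal sketch of what it might look like. The capture_chat helper and the model name are hypothetical choices, not code from the examples repository:

# Minimal sketch: run one chat turn with the Ollama Python client and keep
# the full history (including the model's reply) so a second "judge" LLM
# can evaluate it later. The helper name and model are assumptions.
from typing import Dict, List
import ollama

def capture_chat(user_prompt: str, model: str = "llama3.2:latest") -> List[Dict[str, str]]:
    """Send one prompt to an LLM and return the chat history including the reply."""
    history = [{"role": "user", "content": user_prompt}]
    response = ollama.chat(model=model, messages=history)
    history.append({"role": "assistant", "content": response.message.content})
    return history

if __name__ == "__main__":
    chat_history = capture_chat("What is the capital of France?")
    # chat_history now holds both the prompt and the model output, ready to
    # be passed to one of the judging tools developed in this chapter.
    print(chat_history)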

A Tool for Judging LLM Results

Here we implement our simple approach of using a second LLM to evaluate the output that a first LLM generated in response to user input.

The following listing shows the tool tool_judge_results.py:

 1 """
 2 Judge results from LLM generation from prompts
 3 """
 4 
 5 from typing import Optional, Dict, Any
 6 from pathlib import Path
 7 import json
 8 import re
 9 from pprint import pprint
10 
11 import ollama
12 
13 client = ollama.Client()
14 
15 def judge_results(original_prompt: str, llm_gen_results: str) -> Dict[str, str]:
16     """
17     Takes an original prompt to a LLM and the output results
18 
19     Args:
20         original_prompt (str): original prompt to a LLM
21         llm_gen_results (str): output from the LLM that this function judges for accuracy
22 
23     Returns:
24         result: Dict[str, str] with two keys:
25             - 'judgement': 'Y' (good result), 'N' (bad result), or 'E' (error)
26             - 'reasoning': the judging LLM's explanation of its judgement
27     """
28     try:
29         messages = [
30             {"role": "system", "content": "Always judge this output for correctness."},
31             {"role": "user", "content": f"Evaluate this output:\n\n{llm_gen_results}\n\nfor this prompt:\n\n{original_prompt}\n\nDouble check your work and explain your thinking in a few sentences. End your output with a Y or N answer"},
32         ]
33 
34         response = client.chat(
35             model="qwen2.5-coder:14b", # "llama3.2:latest",
36             messages=messages,
37         )
38 
39         r = response.message.content.strip()
40         print(f"\n\noriginal COT response:\n\n{r}\n\n")
41 
42         # look at the end of the response for the Y or N judgement
43         s = r.lower()
44         # remove all non-alphabetic characters:
45         s = re.sub(r'[^a-zA-Z]', '', s).strip()
46 
47         return {'judgement': s[-1].upper(), 'reasoning': r}
48 
49     except Exception as e:
50         print(f"\n\n***** {e=}\n\n")
51         return {'judgement': 'E', 'reasoning': str(e)}  # on any error, assign 'E' result

This Python code defines a function judge_results that takes an original prompt sent to a Large Language Model (LLM) and the generated response from the LLM, then attempts to judge the accuracy of the response.

Here’s a breakdown of the code:

The main function judge_results takes two parameters:

  • original_prompt: The initial prompt sent to an LLM
  • llm_gen_results: The output from the LLM that needs evaluation

The function judge_results returns a dictionary with two keys:

  • judgement: Single character (‘Y’ for a good result, ‘N’ for a bad result, ‘E’ for an error)
  • reasoning: Detailed explanation of the judgment

The evaluation process works as follows (a hypothetical driver script is sketched after this list):

  • Creates a conversation with two messages: a system message that sets the context for evaluation, and a user message that combines the original prompt and the generated results for evaluation
  • Uses the Qwen 2.5 Coder (14B parameter) model through Ollama
  • Expects a Y/N response at the end of the evaluation
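
A hypothetical driver script in the spirit of example_judge.py (the actual script is not listed in this chapter; the test prompt and outputs are reconstructed from the sample output below) could look like this:

# Hypothetical driver: judge one correct and one incorrect LLM answer
# using the judge_results tool defined above.
from tool_judge_results import judge_results

def separator(title: str):
    print(f"\n{'=' * 50}\n {title}\n{'=' * 50}")

if __name__ == "__main__":
    separator("Judge output from a LLM")

    prompt = ("Sally is 55, John is 18, and Mary is 31. What are pairwise "
              "combinations of the absolute value of age differences?")

    separator("First test: should be Y, or good")
    good_output = "Sally and John: 37. Sally and Mary: 24. John and Mary: 13."
    judgement = judge_results(prompt, good_output)
    print(f"\n** JUDGEMENT ***\n\njudgement={judgement}")

    separator("Second test: should be N, or bad")
    bad_output = ("Sally and John: 55 - 18 = 31. Sally and Mary: 55 - 31 = 24. "
                  "John and Mary: 31 - 18 = 10.")
    judgement = judge_results(prompt, bad_output)
    print(f"\n** JUDGEMENT ***\n\njudgement={judgement}")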

Sample Output

 1 $ cd OllamaEx
 2 $ python example_judge.py 
 3 
 4 ==================================================
 5  Judge output from a LLM
 6 ==================================================
 7 
 8 ==================================================
 9  First test: should be Y, or good
10 ==================================================
11 
12 
13 original COT response:
14 
15 The given output correctly calculates the absolute value of age differences for each pair:
16 
17 - Sally (55) and John (18): \( |55 - 18| = 37 \)
18 - Sally (55) and Mary (31): \( |55 - 31| = 24 \)
19 - John (18) and Mary (31): \( |31 - 18| = 13 \)
20 
21 These calculations are accurate, matching the prompt's requirements. Therefore, the answer is Y.
22 
23 
24 
25 ** JUDGEMENT ***
26 
27 judgement={'judgement': 'Y', 'reasoning': "The given output correctly calculates the absolute value of age differences for each pair:\n\n- Sally (55) and John (18): \\( |55 - 18| = 37 \\)\n- Sally (55) and Mary (31): \\( |55 - 31| = 24 \\)\n- John (18) and Mary (31): \\( |31 - 18| = 13 \\)\n\nThese calculations are accurate, matching the prompt's requirements. Therefore, the answer is Y."}
28 
29 ==================================================
30  Second test: should be N, or bad
31 ==================================================
32 
33 
34 original COT response:
35 
36 Let's evaluate the given calculations step by step:
37 
38 1. Sally (55) - John (18) = 37. The difference is calculated as 55 - 18, which equals 37.
39 2. Sally (55) - Mary (31) = 24. The difference is calculated as 55 - 31, which equals 24.
40 3. John (18) - Mary (31) = -13. However, the absolute value of this difference is |18 - 31| = 13.
41 
42 The given output shows:
43 - Sally and John: 55 - 18 = 31. This should be 37.
44 - Sally and Mary: 55 - 31 = 24. This is correct.
45 - John and Mary: 31 - 18 = 10. This should be 13.
46 
47 The output contains errors in the first and third calculations. Therefore, the answer is:
48 
49 N
50 
51 ** JUDGEMENT ***
52 
53 judgement={'judgement': 'N', 'reasoning': "Let's evaluate the given calculations step by step:\n\n1. Sally (55) - John (18) = 37. The difference is calculated as 55 - 18, which equals 37.\n2. Sally (55) - Mary (31) = 24. The difference is calculated as 55 - 31, which equals 24.\n3. John (18) - Mary (31) = -13. However, the absolute value of this difference is |18 - 31| = 13.\n\nThe given output shows:\n- Sally and John: 55 - 18 = 31. This should be 37.\n- Sally and Mary: 55 - 31 = 24. This is correct.\n- John and Mary: 31 - 18 = 10. This should be 13.\n\nThe output contains errors in the first and third calculations. Therefore, the answer is:\n\nN"}

Evaluating LLM Responses Given a Chat History

Here we try a different approach: we ask a second “judge” LLM to evaluate the output of the first LLM against specific criteria such as “Response accuracy”, “Helpfulness”, and so on.

The following listing shows the tool utility tool_llm_eval.py:

  1 import json
  2 from typing import Any, List, Dict, Optional, Iterator
  3 import ollama
  4 from ollama import GenerateResponse
  5 
  6 
  7 def clean_json_response(response: str) -> str:
  8     """
  9     Cleans the response string by removing markdown code blocks and other formatting
 10     """
 11     # Remove markdown code block indicators
 12     response = response.replace("```json", "").replace("```", "")
 13     # Strip whitespace
 14     response = response.strip()
 15     return response
 16 
 17 def evaluate_llm_conversation(
 18     chat_history: List[Dict[str, str]],
 19     evaluation_criteria: Optional[List[str]] = None,
 20     model: str = "llama3.1" # older model that is good at generating JSON
 21 ) -> Dict[str, Any]:
 22     """
 23     Evaluates a chat history using Ollama to run the evaluation model.
 24 
 25     Args:
 26         chat_history: List of dictionaries containing the conversation
 27         evaluation_criteria: Optional list of specific criteria to evaluate
 28         model: Ollama model to use for evaluation
 29 
 30     Returns:
 31         Dictionary containing evaluation results
 32     """
 33     if evaluation_criteria is None:
 34         evaluation_criteria = [
 35             "Response accuracy",
 36             "Coherence and clarity",
 37             "Helpfulness",
 38             "Task completion",
 39             "Natural conversation flow"
 40         ]
 41 
 42     # Format chat history for evaluation
 43     formatted_chat = "\n".join([
 44         f"{'User' if msg['role'] == 'user' else 'Assistant'}: {msg['content']}"
 45         for msg in chat_history
 46     ])
 47 
 48     # Create evaluation prompt
 49     evaluation_prompt = f"""
 50     Please evaluate the following conversation between a user and an AI assistant.
 51     Focus on these criteria: {', '.join(evaluation_criteria)}
 52 
 53     Conversation:
 54     {formatted_chat}
 55 
 56     Provide a structured evaluation with:
 57     1. Scores (1-10) for each criterion
 58     2. Brief explanation for each score
 59     3. Overall assessment
 60     4. Suggestions for improvement
 61 
 62     Format your response as JSON.
 63     """
 64 
 65     try:
 66         # Get evaluation from Ollama
 67         response: GenerateResponse | Iterator[GenerateResponse] = ollama.generate(
 68             model=model,
 69             prompt=evaluation_prompt,
 70             system="You are an expert AI evaluator. Provide detailed, objective assessments in JSON format."
 71         )
 72 
 73         response_clean: str = clean_json_response(response['response'])
 74 
 75         # Parse the response to extract JSON
 76         try:
 77             evaluation_result = json.loads(response_clean)
 78         except json.JSONDecodeError:
 79             # Fallback if response isn't proper JSON
 80             evaluation_result = {
 81                 "error": "Could not parse evaluation as JSON",
 82                 "raw_response": response_clean
 83             }
 84 
 85         return evaluation_result
 86 
 87     except Exception as e:
 88         return {
 89             "error": f"Evaluation failed: {str(e)}",
 90             "status": "failed"
 91         }
 92 
 93 # Example usage
 94 if __name__ == "__main__":
 95     # Sample chat history
 96     sample_chat = [
 97         {"role": "user", "content": "What's the capital of France?"},
 98         {"role": "assistant", "content": "The capital of France is Paris."},
 99         {"role": "user", "content": "Tell me more about it."},
100         {"role": "assistant", "content": "Paris is the largest city in France and serves as the country's political, economic, and cultural center. It's known for landmarks like the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral."}
101     ]
102 
103     # Run evaluation
104     result = evaluate_llm_conversation(sample_chat)
105     print(json.dumps(result, indent=2))

By default we use these five evaluation criteria:

  • Response accuracy
  • Coherence and clarity
  • Helpfulness
  • Task completion
  • Natural conversation flow

The main function evaluate_llm_conversation performs these steps (a usage sketch with custom criteria follows the list):

  • Receives chat history and optional parameters
  • Formats the conversation into a readable string
  • Creates a detailed evaluation prompt
  • Sends prompt to Ollama for evaluation
  • Cleans and parses the response
  • Returns structured evaluation results
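
Because evaluation_criteria and model are optional parameters, a caller can supply domain-specific criteria and a different judging model. The following usage sketch is hypothetical; the chat content and criteria names are invented for illustration:

# Hypothetical usage of evaluate_llm_conversation with custom criteria.
import json
from tool_llm_eval import evaluate_llm_conversation

support_chat = [
    {"role": "user", "content": "My order never arrived. What can I do?"},
    {"role": "assistant", "content": "I'm sorry to hear that. You can request a refund or a replacement, and I can start either process for you now."},
]

result = evaluate_llm_conversation(
    support_chat,
    evaluation_criteria=[
        "Empathy and tone",
        "Policy compliance",
        "Actionable next steps",
    ],
    model="llama3.1",  # any local Ollama model that reliably emits JSON
)
print(json.dumps(result, indent=2))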

Sample Output

 1 $ cd OllamaEx 
 2 $ python tool_llm_eval.py 
 3 {
 4   "evaluation": {
 5     "responseAccuracy": {
 6       "score": 9,
 7       "explanation": "The assistant correctly answered the user's question about the capital of France, and provided accurate information when the user asked for more details."
 8     },
 9     "coherenceAndClarity": {
10       "score": 8,
11       "explanation": "The assistant's responses were clear and easy to understand. However, there was a slight shift in tone from a simple answer to a more formal description."
12     },
13     "helpfulness": {
14       "score": 9,
15       "explanation": "The assistant provided relevant information that helped the user gain a better understanding of Paris. The response was thorough and answered the user's follow-up question."
16     },
17     "taskCompletion": {
18       "score": 10,
19       "explanation": "The assistant completed both tasks: providing the capital of France and elaborating on it with additional context."
20     },
21     "naturalConversationFlow": {
22       "score": 7,
23       "explanation": "While the responses were clear, they felt a bit abrupt. The assistant could have maintained a more conversational tone or encouraged further discussion."
24     }
25   },
26   "overallAssessment": {
27     "score": 8.5,
28     "explanation": "The assistant demonstrated strong technical knowledge and was able to provide accurate information on demand. However, there were some minor lapses in natural conversation flow and coherence."
29   },
30   "suggestionsForImprovement": [
31     {
32       "improvementArea": "NaturalConversationFlow",
33       "description": "Consider using more conversational language or prompts to engage users further."
34     },
35     {
36       "improvementArea": "CoherenceAndClarity",
37       "description": "Use transitional phrases and maintain a consistent tone throughout the conversation."
38     }
39   ]
40 }

A Tool for Detecting Hallucinations

Here we use a text template file templates/anti_hallucination.txt to define the prompt template for checking a user input, a context, and the resulting output from another LLM (most of the file is not shown for brevity):

 1 You are a fair judge and an expert at identifying hallucinations and false information. You are tasked with evaluating the accuracy of an AI-generated answer given a context. Analyze the provided INPUT, CONTEXT, and OUTPUT to determine if the OUTPUT contains any hallucinations or false information.
 2 
 3 Guidelines:
 4 1. The OUTPUT must not contradict any information given in the CONTEXT.
 5 2. The OUTPUT must not introduce new information beyond what's provided in the CONTEXT.
 6 3. The OUTPUT should not contradict well-established facts or general knowledge.
 7 4. Check that the OUTPUT doesn't oversimplify or generalize information in a way that changes its meaning or accuracy.
 8 
 9 Analyze the text thoroughly and assign a hallucination score between 0 and 1, where:
10 - 0.0: The OUTPUT is unfaithful to or contradicts the CONTEXT and the user's INPUT
11 - 1.0: The OUTPUT is entirely accurate and faithful to the CONTEXT and the user's INPUT
12 
13 INPUT:
14 {input}
15 
16 CONTEXT:
17 {context}
18 
19 OUTPUT:
20 {output}
21 
22 Provide your judgement in JSON format:
23 {{
24     "score": <your score between 0.0 and 1.0>,
25     "reason": [
26         <list your reasoning as Python strings>
27     ]
28 }}

Here is the tool tool_anti_hallucination.py that uses this template:

 1 """
 2 Provides functions for detecting hallucinations in the output of other LLMs
 3 """
 4 
 5 from typing import Optional, Dict, Any
 6 from pathlib import Path
 7 from pprint import pprint
 8 import json
 9 from ollama import ChatResponse
10 from ollama import chat
11 
12 def read_anti_hallucination_template() -> str:
13     """
14     Reads the anti-hallucination template file and returns the content
15     """
16     template_path = Path(__file__).parent / "templates" / "anti_hallucination.txt"
17     with template_path.open("r", encoding="utf-8") as f:
18         content = f.read()
19         return content
20 
21 TEMPLATE = read_anti_hallucination_template()
22 
23 def detect_hallucination(user_input: str, context: str, output: str) -> Dict[str, Any]:
24     """
25     Given user input, context, and LLM output, detect hallucination
26 
27     Args:
28         user_input (str): User's input text prompt
29         context (str): Context text for LLM
30         output (str): LLM's output text that is to be evaluated for hallucinations
31 
32     Returns: JSON data:
33      {
34        "score": <your score between 0.0 and 1.0>,
35        "reason": [
36          <list your reasoning as bullet points>
37        ]
38      }
39     """
40     prompt = TEMPLATE.format(input=user_input, context=context, output=output)
41     response: ChatResponse = chat(
42         model="llama3.2:latest",
43         messages=[
44             {"role": "system", "content": prompt},
45             {"role": "user", "content": output},
46         ],
47     )
48     try:
49         return json.loads(response.message.content)
50     except json.JSONDecodeError:
51         print(f"Error decoding JSON: {response.message.content}")
52     return {"score": 0.0, "reason": ["Error decoding JSON"]}
53 
54 
55 # Export the functions
56 __all__ = ["detect_hallucination"]
57 
58 ## Test only code:
59 
60 def main():
61     def separator(title: str):
62         """Prints a section separator"""
63         print(f"\n{'=' * 50}")
64         print(f" {title}")
65         print('=' * 50)
66 
67     # Test hallucination detection
68     separator("Detect hallucination from a LLM")
69 
70     test_prompt = "Sally is 55, John is 18, and Mary is 31. What are pairwise combinations of the absolute value of age differences?"
71     test_context = "Double check all math results."
72     test_output = "Sally and John:  55 - 18 = 31. Sally and Mary:  55 - 31 = 24. John and Mary:  31 - 18 = 10."
73     judgement = detect_hallucination(test_prompt, test_context, test_output)
74     print(f"\n** JUDGEMENT ***\n")
75     pprint(judgement)
76 
77 if __name__ == "__main__":
78     try:
79         main()
80     except Exception as e:
81         print(f"An error occurred: {str(e)}")

This code implements a hallucination detection system for Large Language Models (LLMs) using the Ollama framework. The core functionality revolves around the detect_hallucination function, which takes three parameters: user input, context, and LLM output, and evaluates whether the output contains hallucinated content by utilizing another LLM (llama3.2) as a judge. The system reads a template from a file to structure the evaluation prompt.

The implementation includes type hints and error handling, particularly for JSON parsing of the response. The output is structured as a JSON object containing a hallucination score (between 0.0 and 1.0) and a list of reasoning points. The code also includes a test harness that demonstrates the system’s usage with a mathematical example, checking for accuracy in age difference calculations. The modular design allows for easy integration into larger systems through the explicit export of the detect_hallucination function.
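
One practical way to use detect_hallucination in a larger system is as a gate that withholds or flags low-scoring answers before they reach the user. The sketch below is hypothetical: the guarded_answer helper and the 0.7 threshold are arbitrary choices, not part of the tool itself:

# Hypothetical guard: only pass an LLM answer through when the
# hallucination score is at or above a chosen threshold.
from tool_anti_hallucination import detect_hallucination

def guarded_answer(user_input: str, context: str, llm_output: str,
                   threshold: float = 0.7) -> str:
    judgement = detect_hallucination(user_input, context, llm_output)
    if judgement.get("score", 0.0) >= threshold:
        return llm_output
    reasons = "; ".join(judgement.get("reason", []))
    return f"[withheld: possible hallucination] {reasons}"

if __name__ == "__main__":
    print(guarded_answer(
        "What is the capital of France?",
        "Answer questions about European geography.",
        "The capital of France is Paris.",
    ))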

Running the test code at the bottom of tool_anti_hallucination.py produces output something like this:

 1 python /Users/markw/GITHUB/OllamaExamples/tool_anti_hallucination.py 
 2 
 3 ==================================================
 4  Detect hallucination from a LLM
 5 ==================================================
 6 
 7 ** JUDGEMENT ***
 8 
 9 {'reason': ['The OUTPUT claims that the absolute value of age differences are '
10             '31, 24, and 10 for Sally and John, Sally and Mary, and John and '
11             'Mary respectively. However, this contradicts the CONTEXT, as the '
12             'CONTEXT asks to double-check math results.',
13             'The OUTPUT does not introduce new information, but it provides '
14             'incorrect calculations: Sally and John: 55 - 18 = 37, Sally and '
15             'Mary: 55 - 31 = 24, John and Mary: 31 - 18 = 13. Therefore, the '
16             'actual output should be recalculated to ensure accuracy.',
17             'The OUTPUT oversimplifies the age differences by not considering '
18             "the order of subtraction (i.e., John's age subtracted from "
19             "Sally's or Mary's). However, this is already identified as a "
20             'contradiction in point 1.'],
21  'score': 0.0}

Wrap Up

In this chapter we looked at several examples of using one LLM to rate the accuracy, usefulness, and faithfulness of another LLM’s output, given an input prompt. There are two topics in this book that I spend most of my personal LLM research time on: automatic evaluation of LLM results, and tool-using agents (the subject of the next chapter).