I recently bought the book Designing Multi-Agent Systems: Principles, Patterns, and Implementation for AI Agents by Victor Dibia.

In chapter 10, “Evaluating Multi-Agent Systems”, he discusses how to build evaluation frameworks with LLM-as-judge and metrics. This was a new topic for me, and I wanted to learn more about what Agent Evaluators are and why they matter. That is why I wrote this blog post. I hope this is also an interesting topic for other people who are building Agentic AI Solutions.
I highly recommend reading this book written by Victor Dibia!
Before we can start explaining evaluators, we first need to discuss observability. According to Victor’s book, users want to observe what agents did (their actions) and why. This relates directly to the question of how we can trust AI in applications: to build that trust, we need to access and monitor the content created by AI systems.
That’s where observability comes into play. AI observability refers to the ability to monitor, understand, and troubleshoot AI systems throughout their lifecycle. It involves collecting and analyzing signals such as evaluation metrics, logs, traces, and model and agent outputs to gain visibility into performance, quality, safety, and operational health.
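As a small illustration of what “collecting signals” can look like in practice, here is a minimal, framework-agnostic sketch that logs one agent call as a structured record with timing information. It uses only the Python standard library; the function and field names are mine, purely for illustration.

import json
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("agent-observability")

def log_agent_call(agent_name: str, query: str, response: str, started: float) -> None:
    """Emit one structured log record per agent call (illustrative only)."""
    record = {
        "agent": agent_name,
        "query": query,
        "response_preview": response[:80],
        "latency_ms": round((time.time() - started) * 1000, 1),
    }
    logger.info(json.dumps(record))

# Example usage with dummy values
start = time.time()
log_agent_call("demo-agent", "How many 'e's in 'Mercedes-Benz'?", "4", start)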
Now it’s time to discuss evaluators.
Evaluators are specialized tools that measure the quality, safety, and reliability of AI responses. By implementing systematic evaluations throughout the AI development lifecycle, teams can identify and address potential issues before they impact users.
When you build Agentic Solutions, your agent or agents explore a series of actions, and you want to know which steps take place and whether those steps lead to success or failure. With an evaluation framework you want to be able to capture the end-to-end process, which Victor Dibia calls the “trajectory”: the sequence of reasoning messages and actions executed by your Agentic solution.
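To make the trajectory idea concrete, you can think of it as an ordered list of steps. The sketch below is purely illustrative; the class and field names are mine and not taken from the book or any SDK.

from dataclasses import dataclass, field

@dataclass
class TrajectoryStep:
    """One step in an agent run: a reasoning message, tool call, or final answer."""
    role: str      # e.g. "assistant", "tool"
    kind: str      # e.g. "reasoning", "tool_call", "final_answer"
    content: str

@dataclass
class Trajectory:
    """End-to-end record of what the agent did, in order."""
    query: str
    steps: list[TrajectoryStep] = field(default_factory=list)

trajectory = Trajectory(query="How many 'e's in 'Mercedes-Benz'?")
trajectory.steps.append(TrajectoryStep("assistant", "reasoning", "Spell the word and count."))
trajectory.steps.append(TrajectoryStep("assistant", "final_answer", "4"))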
With systematic evaluation we want to build trust. Victor shows how to create an evaluation framework with his Python library called picoagents. In this blog post I’ll be using another evaluation framework: the evaluation functionality of Microsoft Foundry.
The Azure AI Evaluation SDK lets you run evaluations locally on your machine and in the cloud. In this blog post I’m using local evaluations, and you will learn how to run built-in evaluators locally on simple agent data or agent messages.
The following built-in evaluators are available in the Azure AI Evaluation SDK. They are out-of-the-box evaluators provided by Microsoft:
| Category | Evaluator class |
|---|---|
| Performance and quality (AI-assisted) | GroundednessEvaluator, RelevanceEvaluator, CoherenceEvaluator, FluencyEvaluator, SimilarityEvaluator, RetrievalEvaluator |
| Performance and quality (NLP) | F1ScoreEvaluator, RougeScoreEvaluator, GleuScoreEvaluator, BleuScoreEvaluator, MeteorScoreEvaluator |
| Risk and safety (AI-assisted) | ViolenceEvaluator, SexualEvaluator, SelfHarmEvaluator, HateUnfairnessEvaluator, IndirectAttackEvaluator, ProtectedMaterialEvaluator |
| Composite | QAEvaluator, ContentSafetyEvaluator |
For more in-depth information on each evaluator definition and how it’s calculated, see Evaluation and monitoring metrics for generative AI.
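To get a feel for local evaluation, the NLP evaluators are the simplest starting point because they run entirely on your machine and don’t need a judge model. Here is a minimal sketch using F1ScoreEvaluator after installing the azure-ai-evaluation package; the exact keys in the returned dictionary can vary slightly between SDK versions.

from azure.ai.evaluation import F1ScoreEvaluator

# NLP evaluators run fully locally and need no LLM-as-judge configuration
f1_evaluator = F1ScoreEvaluator()

result = f1_evaluator(
    response="The letter e appears four times in Mercedes-Benz.",
    ground_truth="4 times (M-e-rc-e-d-e-s-B-e-nz)",
)
print(result)  # e.g. {'f1_score': ...}

Note how a token-overlap metric like F1 can penalize an answer that is semantically correct but worded differently, which is exactly the gap the similarity evaluator addresses below.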
In this blog post I will start with a simple textual similarity evaluator.
Similarity measures the degree of semantic similarity between the generated text and its ground truth with respect to a query. Compared to other text-similarity metrics that also require a ground truth, this metric focuses on the semantics of a response instead of simple overlap in tokens or n-grams, and it also considers the broader context of the query.
The Similarity Evaluator provides the following metrics.
| Property | What it means | Common uses |
|---|---|---|
| similarity | 1–5 semantic alignment score | KPI, gating, aggregation |
| gpt_similarity | Legacy mirror of similarity | Backward compatibility; migrate away |
| similarity_result | "pass"/"fail" vs threshold | CI/CD gates; pass-rate trends |
| similarity_threshold | Decision boundary (default 3) | Tune per scenario; document version |
| similarity_prompt_tokens | Tokens in evaluator input | Cost & latency tracking; prompt hygiene |
| similarity_completion_tokens | Tokens in evaluator output | Sanity checks on completion size |
| similarity_total_tokens | Sum of prompt+completion | Cost dashboards; outlier detection |
| similarity_finish_reason | LLM stop code (e.g., length) | Troubleshoot truncation & limits |
| similarity_model | Evaluator model ID | Reproducibility; drift checks |
| similarity_sample_input | Serialized evaluator input | Auditing; re‑runs; dataset mapping checks |
| similarity_sample_output | Serialized evaluator output | Low‑level trace of evaluator result |
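In practice, a single evaluator call returns these properties as one dictionary. The values below are illustrative only, not real output, and simply show the shape you can expect.

# Illustrative shape of a SimilarityEvaluator result (values are made up)
eval_result = {
    "similarity": 4.0,
    "gpt_similarity": 4.0,
    "similarity_result": "pass",
    "similarity_threshold": 3,
    "similarity_prompt_tokens": 312,
    "similarity_completion_tokens": 5,
    "similarity_total_tokens": 317,
    "similarity_finish_reason": "stop",
    "similarity_model": "gpt-5-chat",
    # plus similarity_sample_input / similarity_sample_output with the serialized judge call
}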
Below is the scenario I chose to demo the similarity evaluator.
I chose the Similarity Evaluator for this demonstration because it perfectly illustrates a critical challenge in AI development: different models may answer the same question differently, even when both answers are semantically correct.
The test case uses a seemingly simple question: “How many times does the letter ‘e’ appear in ‘Mercedes-Benz’?” with a ground truth of “4 times (M-e-rc-e-d-e-s-B-e-nz)”. While this has a definitive answer, different LLMs might phrase their answer differently, add extra explanation, or even miscount, because letter counting is surprisingly hard for models due to tokenization.
This is where the Similarity Evaluator stands out. Traditional software tools often fail if the wording isn’t an exact match, or they only count how many words look the same. The Similarity Evaluator is different because it uses AI to act like a human judge, understanding the actual meaning of the answer rather than just the words used. It recognizes that “4 times” and “The letter e appears four times in Mercedes-Benz” mean the same thing, even though they are written differently.
By testing multiple models (GPT-4o, Phi-4, GPT-5, Mistral-Small) with the same question and evaluating their responses against the ground truth, we can see how each model phrases its answer and whether the evaluator still recognizes semantically correct responses regardless of wording.
This multi-model comparison demonstrates how evaluation frameworks help us understand model behavior and build confidence in AI systems—exactly the kind of systematic evaluation Victor Dibia emphasizes in his book for building trustworthy multi-agent systems.

The full Python script multi_model_evaluation.py is available in this GitHub Gist.
Here are the key components of the evaluation script:
First, we initialize the client for GitHub Models and create our agent.
# OpenAIChatClient and ChatAgent come from the Microsoft Agent Framework;
# see the full script in the Gist for the imports and the model list.

# Initialize GitHub Models client for this specific model
openai_chat_client = OpenAIChatClient(
    model_id=model_id,
    api_key=os.environ.get("GITHUB_TOKEN"),
    base_url=os.environ.get("GITHUB_ENDPOINT"),
)

# Create AI Agent
agent = ChatAgent(
    chat_client=openai_chat_client,
    instructions=instructions,
    stream=True
)
We define a clear query and a comprehensive ground truth. The ground truth doesn’t just give the number; it explains the reasoning which helps the evaluator judge semantic correctness even if the phrasing varies.
# Define the test case
query = "How many times does the letter 'e' appear in 'Mercedes-Benz'?"
ground_truth = "4 times (M-e-rc-e-d-e-s-B-e-nz)"
We configure an Azure OpenAI model (e.g., GPT-5-chat) to act as the “Judge”. You cannot use reasoning models for evaluation.
# Configure Azure OpenAI for the evaluator (LLM-as-Judge)
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint=os.environ.get("AZURE_OPENAI_ENDPOINT"),
    api_key=os.environ.get("AZURE_OPENAI_KEY"),
    azure_deployment=os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    api_version=os.environ.get("AZURE_OPENAI_VERSION"),
)
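Because both the agent client and the judge configuration read from environment variables, a quick sanity check (my own addition, not part of the original script) can save some debugging time:

import os

# Fail fast if any of the environment variables used above are missing
required = [
    "GITHUB_TOKEN", "GITHUB_ENDPOINT",
    "AZURE_OPENAI_ENDPOINT", "AZURE_OPENAI_KEY",
    "AZURE_OPENAI_DEPLOYMENT", "AZURE_OPENAI_VERSION",
]
missing = [name for name in required if not os.environ.get(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")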
instructions = "You are being evaluated on your ability to answer questions accurately and follow instructions precisely."
We initialize the SimilarityEvaluator with our judge configuration. We set a threshold (e.g., 3 out of 5) to determine pass/fail criteria.
# Create SimilarityEvaluator with threshold of 3
similarity = SimilarityEvaluator(model_config=model_config, threshold=3)
# ... inside evaluation loop ...
# Evaluate the response
eval_result = similarity(
    query=query,
    response=agent_answer,
    ground_truth=ground_truth
)
Finally, we parse the evaluation results. The evaluator returns a score (1-5) and a boolean pass/fail result based on our threshold.
# Store results
score = eval_result.get('gpt_similarity', eval_result.get('similarity'))
result = eval_result.get('similarity_result', 'N/A')
print(f"Similarity Score: {score}/5")
print(f"Result: {result.upper()}")
By implementing this SimilarityEvaluator, we moved from ‘feeling’ that our agent is working to ‘knowing’ it is working based on quantifiable metrics. This is the first step towards building robust, production-ready Agentic AI systems.
In the next blog post, I’ll explore how to use evaluators as a feedback loop to improve prompt instructions. I also plan to investigate how we can use these metrics to compare different agent architectures—testing whether complex workflows or tool-enabled agents actually deliver better results than simpler implementations.
As a fun exercise, you can also try changing the prompt to see if the tests will pass.
instructions = """You are being evaluated on your ability to answer questions accurately and follow instructions precisely.
CRITICAL for letter/character counting questions:
- These are difficult for AI models due to tokenization
- ALWAYS spell out the word letter-by-letter with separators (e.g., M-e-r-c-e-d-e-s)
- Count each occurrence carefully
- Verify your count twice
- Provide only the final number
For other questions:
- Provide direct, concise answers without preambles
- Match the requested format exactly
- Be precise and literal
Examples:
- Question: "How many 'e's in 'Mercedes-Benz'?" → Answer: "4"
- Question: "Spell 'cat' backwards" → Answer: "tac"
Provide only the final answer in your response."""
Let me know in the comments how you are using evaluations.