
May 30, 2025
Trace & Evaluate Your Agent with Arize Phoenix
Ever wondered how well your AI agent is actually performing? It can collect data, make choices, and automate tasks, but how do you know whether it is efficient, accurate, or trustworthy?
That's where tracing and evaluation come in. Building an agent is one thing; understanding what it does under the hood is another. Much like driving a car, you want to know its speed, fuel consumption, and direction.
Arize Phoenix is your AI agent debugging toolkit. Phoenix lets you track your agent's actions, evaluate its decisions, and improve its performance. In this post, I will explain how to build an agent, trace its process, and evaluate its performance in a simple yet useful way.
Setting Up Your AI Agent
Before we can trace and evaluate anything, we need an agent. If you already have one, feel free to skip ahead. Otherwise, let's build one!
Install the needed package first:
pip install -q smolagents
Next, let's add some important tools. Our agent will use DuckDuckGoSearchTool to search for information and VisitWebpageTool to browse web pages.
from smolagents import CodeAgent, DuckDuckGoSearchTool, VisitWebpageTool, HfApiModel
hf_model = HfApiModel()
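HfApiModel calls the Hugging Face Inference API under the hood, so your environment may need to be authenticated first. One minimal way to do that, assuming you have a Hugging Face access token:
from huggingface_hub import login

# Prompts for a Hugging Face access token and caches it for subsequent API calls
login()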
Now it is time to create the agent. Our AI assistant will search the internet:
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=hf_model,
    add_base_tools=True
)
Let's try it with a real-world task. We will ask it to get a line graph of Google's share prices from 2020 to 2024.
agent.run("Fetch the share price of Google from 2020 to 2024, and create a line graph")
The agent is working, but we have no idea how well it is performing. That's exactly why we need tracing and evaluation!
Enabling Tracing with Arize Phoenix
Your agent is up and running. But what happens if it answers incorrectly? Or worse, what if it fails completely? This is where tracing helps.
Tracing shows how the agent makes its decisions. It logs everything from the input through each tool call to the final answer, which helps us find bugs and boost performance.
First, install smolagents' telemetry extras, which include the instrumentation Phoenix needs:
pip install -q 'smolagents[telemetry]'
Let's start a local Phoenix instance to track our agent:
python -m phoenix.server.main serve
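By default, the Phoenix UI is served at http://localhost:6006, and traces will appear there as the agent runs.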
After that, we register a tracer provider so the agent's traces are sent to Phoenix.
from phoenix.otel import register
from openinference.instrumentation.smolagents import SmolagentsInstrumentor
tracer_provider = register(project_name="my-smolagents-app")
SmolagentsInstrumentor().instrument(tracer_provider=tracer_provider)
Every agent interaction is now tracked in real time. Test it with a simple query:
agent.run("What time is it in Tokyo right now?")
Boom! We can now see every step the agent takes. If anything goes wrong, we will know exactly where.
Evaluating Your Agent's Performance
Tracing shows us what the agent is doing, but how do we measure how well it is doing it? That's where evaluation comes in.
Evaluations help us answer important questions:
- Is the agent's answer relevant?
- Is it factually correct?
- How fast does it provide results?
First, let's set up GPT-4o as our evaluation model.
pip install -q openai
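The evaluation model reads your OpenAI API key from the environment; one minimal way to provide it, assuming it isn't already exported:
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")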
Next, we pull the spans where our agent called the DuckDuckGo search tool from Phoenix. This lets us evaluate how well the agent retrieves information.
from phoenix.trace.dsl import SpanQuery
import phoenix as px
import json
query = SpanQuery().where("name == 'DuckDuckGoSearchTool'").select(
    input="input.value",
    reference="output.value"
)
tool_spans = px.Client().query_spans(query, project_name="my-smolagents-app")
tool_spans["input"] = tool_spans["input"].apply(lambda x: json.loads(x))
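Before running the evaluation, it can help to eyeball a few of the captured tool calls; the column names below follow the aliases defined in the query above:
# Peek at what the agent searched for and what came back
print(tool_spans[["input", "reference"]].head())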
Next, we load Phoenix's RAG relevancy prompt template, which we will use to judge whether the retrieved results are relevant.
from phoenix.evals import (
RAG_RELEVANCY_PROMPT_TEMPLATE,
OpenAIModel,
llm_classify
)
Now for the actual magic: GPT-4o will determine whether the agent's search results are relevant to each query.
eval_model = OpenAIModel(model="gpt-4o")
eval_results = llm_classify(
    dataframe=tool_spans,
    model=eval_model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=["relevant", "unrelated"],
    concurrency=10,
    provide_explanation=True
)
Here is what happens:
- GPT-4o reads each search query (input) and the results the agent retrieved (output).
- It labels each result as relevant or unrelated and explains its reasoning.
- The labels can be mapped to a binary score (1 for relevant, 0 for unrelated), as shown below.
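If you want a numeric score for aggregation or for logging to Phoenix, one simple option (a small sketch on top of the dataframe returned by llm_classify) is to map the labels yourself:
# Map the text labels to a binary score: 1 for relevant, 0 for unrelated
eval_results["score"] = (eval_results["label"] == "relevant").astype(int)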
Want to see the results?
eval_results.head()
This shows us whether our agent is retrieving the right information. If it isn't, we can tweak its tools or prompts to improve accuracy.
Logging Evaluation Results to Phoenix
Finally, we can log the evaluation results back to Phoenix to visualize them alongside the traces.
from phoenix.trace import SpanEvaluations
px.Client().log_evaluations(SpanEvaluations(eval_name="DuckDuckGoSearch", dataframe=eval_results))
Once the evaluations are logged, Phoenix's dashboard lets you browse the results, follow your agent's progress, and spot where to improve.
Conclusion
All done! We built an AI agent, traced its every step, and evaluated its performance using Arize Phoenix. Tracing showed us how the agent makes its choices, and evaluation measured how effective those choices were and where there is room for improvement.
The real fun starts when you try other kinds of evaluations: Phoenix can also measure answer accuracy, factual correctness, and multi-step reasoning.
So, what's your AI agent's next move? Try these tools, refine your workflows, and watch your agent get better!
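For example, here is a rough sketch of a factual-correctness check using Phoenix's hallucination template; the toy dataframe and its column names (input, reference, output) are illustrative stand-ins for spans you would pull from your own project:
import pandas as pd
from phoenix.evals import HALLUCINATION_PROMPT_TEMPLATE, OpenAIModel, llm_classify

# Toy data standing in for spans collected from the agent's answers
qa_spans = pd.DataFrame({
    "input": ["What year did Google go public?"],
    "reference": ["Google held its IPO in August 2004."],
    "output": ["Google went public in 2004."],
})

hallucination_results = llm_classify(
    dataframe=qa_spans,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],
    provide_explanation=True,
)
print(hallucination_results[["label", "explanation"]])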