
May 30, 2025
Trace & Evaluate Your Agent with Arize Phoenix
Ever wondered how well your AI agent is actually performing? It can collect data, make choices, and automate tasks, but how do you know whether it is efficient, accurate, or trustworthy?
That's where tracing and evaluation come in. Building an agent is one thing; understanding what it does under the hood is another. Much like driving a car, you want to know its speed, fuel consumption, and direction.
Arize Phoenix is your AI agent debugging toolkit. Phoenix lets you track your agent's actions, evaluate its decisions, and improve its performance. In this post, I will explain how to build an agent, trace its process, and evaluate its performance in a simple yet useful way.
Setting Up Your AI Agent
Before we can trace and evaluate anything, we need an agent. If you already have one, feel free to skip ahead. Otherwise, let's build one!
Install the needed package first:
pip install -q smolagents
Next, let's add some important tools. Our agent will use DuckDuckGoSearchTool to search for information and VisitWebpageTool to browse web pages.
from smolagents import CodeAgent, DuckDuckGoSearchTool, VisitWebpageTool, HfApiModel
hf_model = HfApiModel()
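HfApiModel calls the Hugging Face Inference API under the hood, so your environment may need to be authenticated first. One minimal way to do that, assuming you have a Hugging Face access token:
from huggingface_hub import login

# Prompts for a Hugging Face access token and caches it for subsequent API calls
login()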
Now it is time to create the agent. Our AI assistant will search the internet:
agent = CodeAgent(
    tools=[DuckDuckGoSearchTool(), VisitWebpageTool()],
    model=hf_model,
    add_base_tools=True
)
Let's try it with a real-world task. We will ask it to get a line graph of Google's share prices from 2020 to 2024.
agent.run("Fetch the share price of Google from 2020 to 2024, and create a line graph")
The agent is working, but we have no idea how well it is performing. That's exactly why we need tracing and evaluation!
Enabling Tracing with Arize Phoenix
Your agent is up and running. But what happens if it answers incorrectly? Or worse, what if it fails completely? This is where tracing helps.
Tracing shows how the agent makes its decisions. It logs everything from the input through each tool call to the final answer, which helps us find bugs and boost performance.
First, install smolagents' telemetry extras, which include the instrumentation Phoenix needs:
pip install -q 'smolagents[telemetry]'
Let's start a local Phoenix instance to track our agent:
python -m phoenix.server.main serve
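By default, the Phoenix UI is served at http://localhost:6006, and traces will appear there as the agent runs.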
After that, we register a tracer provider so the agent's traces are sent to Phoenix.
from phoenix.otel import register
from openinference.instrumentation.smolagents import SmolagentsInstrumentor
tracer_provider = register(project_name="my-smolagents-app")
SmolagentsInstrumentor().instrument(tracer_provider=tracer_provider)
Every agent interaction is now tracked in real time. Test it with a simple query:
agent.run("What time is it in Tokyo right now?")
Boom! We can now see every step the agent takes. If anything goes wrong, we will know exactly where.
Evaluating Your Agent's Performance
Tracing shows us what the agent is doing, but how do we measure how well it is doing it? That's where evaluation comes in.
Evaluations help us answer important questions:
- Is the agent's answer relevant?
- Is it factually correct?
- How fast does it provide results?
First, let's set up GPT-4o as our evaluation model.
pip install -q openai
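The evaluation model reads your OpenAI API key from the environment; one minimal way to provide it, assuming it isn't already exported:
import os
from getpass import getpass

# Prompt for the key only if it isn't already set in the environment
if not os.environ.get("OPENAI_API_KEY"):
    os.environ["OPENAI_API_KEY"] = getpass("Enter your OpenAI API key: ")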
Next, we pull the spans where our agent called the DuckDuckGo search tool from Phoenix. This lets us evaluate how well the agent retrieves information.
from phoenix.trace.dsl import SpanQuery
import phoenix as px
import json
query = SpanQuery().where("name == 'DuckDuckGoSearchTool'").select(
    input="input.value",
    reference="output.value"
)
tool_spans = px.Client().query_spans(query, project_name="my-smolagents-app")
tool_spans["input"] = tool_spans["input"].apply(lambda x: json.loads(x))
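Before running the evaluation, it can help to eyeball a few of the captured tool calls; the column names below follow the aliases defined in the query above:
# Peek at what the agent searched for and what came back
print(tool_spans[["input", "reference"]].head())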
Next, we load Phoenix's RAG relevancy prompt template, which we will use to judge whether the retrieved results are relevant.
from phoenix.evals import (
RAG_RELEVANCY_PROMPT_TEMPLATE,
OpenAIModel,
llm_classify
)
Now for the actual magic: GPT-4o will determine whether the agent's search results are relevant to each query.
eval_model = OpenAIModel(model="gpt-4o")
eval_results = llm_classify(
    dataframe=tool_spans,
    model=eval_model,
    template=RAG_RELEVANCY_PROMPT_TEMPLATE,
    rails=["relevant", "unrelated"],
    concurrency=10,
    provide_explanation=True
)
Here is what happens:
- GPT-4o reads each search query (input) and the results the agent retrieved (output).
- It labels each result as relevant or unrelated and explains its reasoning.
- The labels can be mapped to a binary score (1 for relevant, 0 for unrelated), as shown below.
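If you want a numeric score for aggregation or for logging to Phoenix, one simple option (a small sketch on top of the dataframe returned by llm_classify) is to map the labels yourself:
# Map the text labels to a binary score: 1 for relevant, 0 for unrelated
eval_results["score"] = (eval_results["label"] == "relevant").astype(int)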
Want to see the results?
eval_results.head()
This shows us whether our agent is retrieving the right information. If it isn't, we can tweak its tools or prompts to improve accuracy.
Logging Evaluation Results to Phoenix
Finally, we can log the evaluation results back to Phoenix to visualize them alongside the traces.
from phoenix.trace import SpanEvaluations
px.Client().log_evaluations(SpanEvaluations(eval_name="DuckDuckGoSearch", dataframe=eval_results))
Once the evaluations are logged, Phoenix's dashboard lets you browse the results, follow your agent's progress, and spot where to improve.
Conclusion
All done! We built an AI agent, traced its every step, and evaluated its performance using Arize Phoenix. Tracing showed us how the agent makes its choices, and evaluation measured how effective those choices were and where there is room for improvement.
The real fun starts when you try other kinds of evaluations: Phoenix can also measure answer accuracy, factual correctness, and multi-step reasoning.
So, what's your AI agent's next move? Try these tools, refine your workflows, and watch your agent get better!
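For example, here is a rough sketch of a factual-correctness check using Phoenix's hallucination template; the toy dataframe and its column names (input, reference, output) are illustrative stand-ins for spans you would pull from your own project:
import pandas as pd
from phoenix.evals import HALLUCINATION_PROMPT_TEMPLATE, OpenAIModel, llm_classify

# Toy data standing in for spans collected from the agent's answers
qa_spans = pd.DataFrame({
    "input": ["What year did Google go public?"],
    "reference": ["Google held its IPO in August 2004."],
    "output": ["Google went public in 2004."],
})

hallucination_results = llm_classify(
    dataframe=qa_spans,
    model=OpenAIModel(model="gpt-4o"),
    template=HALLUCINATION_PROMPT_TEMPLATE,
    rails=["factual", "hallucinated"],
    provide_explanation=True,
)
print(hallucination_results[["label", "explanation"]])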