AI Agent Observability Stack for Production Teams in 2026

Most AI agent failures are not mysterious once you can see the run.

The problem is that many teams still ship agents with the same visibility they used for a chatbot prototype: a request log, a final answer, and maybe a token count.

That is not enough in production.

An agent does not just answer. It plans, calls tools, reads retrieved context, asks for approvals, retries after errors, hands off to other agents, and sometimes touches systems that have real side effects. If you cannot reconstruct that path after a bad result, you do not have observability. You have vibes with timestamps.

This guide is a practical setup path for a production-friendly AI agent observability stack in 2026: traces, spans, evals, approval logs, cost metrics, and privacy controls.

TL;DR

For production agents, track five layers:

Trace the run so you can replay the agent’s path from user request to final output.
Log tool calls with inputs, outputs, latency, status, and side-effect level.
Record approvals for anything that writes, sends, spends, deletes, deploys, or changes account state.
Run evals against real failure modes, not just generic answer quality.
Measure cost and reliability per workflow, model, tool, tenant, and release.

The mistake is treating observability as a dashboard you add later. For agents, observability is part of the safety boundary.

Who this is for

This setup is for teams building agents that do more than chat:

internal copilots with access to documents, tickets, repos, CRM, or cloud APIs
coding or operations agents that call tools and modify files
support agents that draft refunds, emails, or account changes
research agents that retrieve sources and produce reports
workflow agents that pause for human approval before sensitive actions

If your system is still a single prompt that returns text, you can start smaller. But if the agent can call tools, you need this shape before the first real production rollout.

The stack: what you need to capture

The clean mental model is:

[Figure: AI agent observability architecture. Traces, spans, tool-call logs, approval records, evals, cost metrics, privacy filters, and dashboards are connected around a production agent workflow.]

1. Traces: the end-to-end story

A trace should represent one meaningful agent workflow.

For example:

“answer this support ticket”
“research this company”
“prepare a pull request”
“summarize this contract”
“update this internal record after approval”

The OpenAI Agents SDK includes built-in tracing for agent runs, model generations, tool calls, handoffs, guardrails, and custom events. OpenTelemetry’s GenAI semantic conventions define AI-specific signals for model operations, agent spans, events, exceptions, and metrics. OpenInference also builds on OpenTelemetry to describe LLM calls, agent steps, tool invocations, retrieval operations, and token/cost fields in a trace-friendly way.

The practical point is simple: do not reduce an agent run to one log line.

You want a tree.

agent_run
  -> user_request
  -> retrieval_query
  -> model_generation
  -> tool_call
  -> approval_request
  -> tool_execution
  -> final_response

That tree gives you the first thing every production team needs after an incident: a timeline.
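As a minimal sketch of that tree, here is a plain-Python version. The `Span` dataclass and the millisecond offsets are hypothetical and do not come from any SDK. Nesting the approval and execution under the tool call is a choice made for this example. The point is only that parent/child spans turn into an indented timeline almost for free.

```python
from dataclasses import dataclass, field


@dataclass
class Span:
    """One step in an agent run (hypothetical shape, not an SDK type)."""
    name: str
    start_ms: int
    children: list["Span"] = field(default_factory=list)

    def add(self, name: str, start_ms: int) -> "Span":
        child = Span(name, start_ms)
        self.children.append(child)
        return child


def timeline(span: Span, depth: int = 0) -> list[str]:
    """Flatten the tree into indented lines, children ordered by start time."""
    lines = [f"{'  ' * depth}{span.start_ms:>5} ms  {span.name}"]
    for child in sorted(span.children, key=lambda s: s.start_ms):
        lines.extend(timeline(child, depth + 1))
    return lines


run = Span("agent_run", 0)
run.add("user_request", 0)
run.add("retrieval_query", 40)
run.add("model_generation", 210)
tool = run.add("tool_call", 900)
tool.add("approval_request", 910)
tool.add("tool_execution", 71_000)  # long gap: the run waited on a human
run.add("final_response", 71_900)

print("\n".join(timeline(run)))
```

In this made-up run, the gap between the approval request and the tool execution is the kind of thing a single log line would hide.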

2. Spans: the operations inside the run

Each important step should become a span or event:

LLM generation
retrieval query
tool call
guardrail check
handoff
approval request
approval decision
retry
fallback model route
final response assembly

For each span, capture:

Field	Why it matters
`workflow_name`	Groups traces by product feature or agent type.
`trace_id`	Lets support, engineering, and security talk about the same run.
`span_type`	Distinguishes model calls, tool calls, retrieval, approvals, and guardrails.
`model`	Shows which model actually handled the step.
`tool_name`	Shows which external system was touched.
`status`	Separates success, error, rejection, timeout, and fallback.
`latency_ms`	Finds slow model calls, slow tools, and stuck approvals.
`cost_usd` or token fields	Prevents hidden agent loops from becoming budget leaks.
`release_version`	Connects regressions to deployments.
`tenant_id` or team key	Scopes debugging to one tenant or team; hash it so raw user data stays out of the trace.

This is where OpenTelemetry-style thinking helps. You are not inventing a special “AI dashboard” from scratch. You are making agent work visible as distributed operations.
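One way to capture those fields consistently is a small wrapper around each step. The sketch below uses only the standard library; `span` and the in-memory `SPANS` list are hypothetical stand-ins for whatever tracer and exporter you actually use. It shows the two behaviors that matter most: latency is always measured, and a failed step still produces a span.

```python
import time
from contextlib import contextmanager

SPANS: list[dict] = []  # hypothetical stand-in for a real exporter


@contextmanager
def span(trace_id: str, span_type: str, **attrs):
    """Record one span with status and latency, even when the step raises."""
    record = {"trace_id": trace_id, "span_type": span_type, "status": "ok", **attrs}
    started = time.perf_counter()
    try:
        yield record  # the caller can attach token counts, cost, and so on
    except Exception as exc:
        record["status"] = "error"
        record["error"] = type(exc).__name__
        raise
    finally:
        record["latency_ms"] = round((time.perf_counter() - started) * 1000, 1)
        SPANS.append(record)


with span("trace_01", "tool_call", workflow_name="support_refund_assistant",
          tool_name="refund.create_draft", release_version="agent-2026.05.13") as s:
    s["tokens_in"] = 1240  # filled in once the step has run

try:
    with span("trace_01", "tool_call", tool_name="crm.lookup"):
        raise TimeoutError("upstream CRM did not respond")
except TimeoutError:
    pass  # the failure is still recorded as a span

for record in SPANS:
    print(record["tool_name"], record["status"], record["latency_ms"])
```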

Approval logs are not optional

If an agent can perform a sensitive action, approval records are part of the audit trail.

The OpenAI Agents SDK human-in-the-loop flow pauses a run when a tool call requires approval, returns interruptions, and lets the run resume later from the same run state. That pattern is exactly what production teams should preserve in their own logs.

At minimum, an approval record should include:

trace_id
tool_name
requested action
risk level
requester or agent identity
reviewer identity
decision: approved, rejected, edited, expired
decision time
reason or policy note
resulting tool execution status

Do not only log the final tool result. Log the request, the decision, and the execution.

That is the difference between “the agent changed a customer record” and “the agent proposed a record change, Maria approved it at 10:42, and the write succeeded 800 ms later.”
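A rough sketch of a record that keeps request, decision, and execution together might look like this. `ApprovalRecord` and its method names are hypothetical, and treating "edited" as an approving decision is an assumption of this example; adjust it to your own policy.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
from typing import Optional

DECISIONS = {"approved", "rejected", "edited", "expired"}


def utc_now() -> str:
    return datetime.now(timezone.utc).isoformat()


@dataclass
class ApprovalRecord:
    """Request, decision, and execution for one sensitive tool call."""
    trace_id: str
    tool_name: str
    requested_action: str
    risk_level: str
    requester: str
    requested_at: str
    reviewer: Optional[str] = None
    decision: Optional[str] = None
    decided_at: Optional[str] = None
    reason: Optional[str] = None
    execution_status: Optional[str] = None

    def decide(self, reviewer: str, decision: str, reason: str = "") -> None:
        if decision not in DECISIONS:
            raise ValueError(f"unknown decision: {decision}")
        self.reviewer, self.decision, self.reason = reviewer, decision, reason
        self.decided_at = utc_now()

    def record_execution(self, status: str) -> None:
        # Refuse to log an execution that no reviewer approved.
        if self.decision not in {"approved", "edited"}:
            raise RuntimeError("execution without an approving decision")
        self.execution_status = status


record = ApprovalRecord(
    trace_id="trace_agent_support_01",
    tool_name="refund.submit",
    requested_action="refund order to original payment method",
    risk_level="high",
    requester="agent:support_refund_assistant",
    requested_at=utc_now(),
)
record.decide(reviewer="support_lead", decision="approved", reason="within policy")
record.record_execution("ok")
print(asdict(record))
```

Making `record_execution` fail without an approving decision means an approval bypass shows up as an error in the logging path, not only as a gap someone has to notice later.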

A minimal event schema you can start with

Start with one shared event shape. You can refine later.

{
  "timestamp": "2026-05-13T09:30:00Z",
  "trace_id": "trace_agent_support_01",
  "workflow": "support_refund_assistant",
  "span_id": "span_tool_refund_01",
  "parent_span_id": "span_agent_plan_01",
  "event_type": "tool_call",
  "model": "reasoning-default",
  "tool_name": "refund.create_draft",
  "side_effect": "draft_only",
  "status": "ok",
  "latency_ms": 842,
  "input_redacted": true,
  "output_redacted": true,
  "tokens_in": 1240,
  "tokens_out": 310,
  "estimated_cost_usd": 0.018,
  "release": "agent-2026.05.13",
  "tenant_hash": "tenant_7f3a"
}

For approvals, keep the same trace and add an approval-specific event:

{
  "timestamp": "2026-05-13T09:31:11Z",
  "trace_id": "trace_agent_support_01",
  "event_type": "approval_decision",
  "tool_name": "refund.submit",
  "decision": "rejected",
  "reviewer_role": "support_lead",
  "reason": "Amount exceeded policy limit",
  "release": "agent-2026.05.13"
}

The important detail is not the exact field names. It is that every high-risk step can be joined back to the same trace.
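To make the join concrete, here is a small sketch that groups events by `trace_id` and orders them in time. The `join_by_trace` helper and the `REQUIRED` set are hypothetical; the events are trimmed versions of the two examples above. It assumes every timestamp is ISO 8601 in UTC with the same format, so string order matches time order.

```python
from collections import defaultdict

REQUIRED = {"timestamp", "trace_id", "event_type"}  # assumed minimum for a join


def join_by_trace(events: list[dict]) -> dict[str, list[dict]]:
    """Group events by trace_id and sort each group by timestamp."""
    traces = defaultdict(list)
    for event in events:
        missing = REQUIRED - set(event)
        if missing:
            raise ValueError(f"event missing fields: {sorted(missing)}")
        traces[event["trace_id"]].append(event)
    for steps in traces.values():
        steps.sort(key=lambda e: e["timestamp"])
    return dict(traces)


events = [
    {"timestamp": "2026-05-13T09:31:11Z", "trace_id": "trace_agent_support_01",
     "event_type": "approval_decision", "tool_name": "refund.submit",
     "decision": "rejected"},
    {"timestamp": "2026-05-13T09:30:00Z", "trace_id": "trace_agent_support_01",
     "event_type": "tool_call", "tool_name": "refund.create_draft", "status": "ok"},
]

traces = join_by_trace(events)
for step in traces["trace_agent_support_01"]:
    print(step["timestamp"], step["event_type"], step["tool_name"])
```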

Add evals where dashboards cannot help

Dashboards tell you what happened.

Evals tell you whether the behavior is good enough.

A production agent eval set should include:

common happy-path tasks
known failure cases
prompt-injection attempts
missing-context cases
bad-tool-output cases
approval-required actions
cost and latency thresholds
refusal or escalation behavior

For agent systems, a good eval is not just “did the final answer look nice?”

It should ask:

Did the agent use the right tool?
Did it avoid tools when it lacked permission?
Did it ask for approval at the right moment?
Did it cite or preserve source context?
Did it stop instead of looping?
Did it escalate when the risk was too high?

This pairs naturally with the testing pattern in “Promptfoo: LLM Evals + Red-Teaming Workflow” and the control model in “AI Coding Agents Need Guardrails, Not More Autonomy”.
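Several of the questions above can be checked directly against a recorded trace instead of the final answer. The sketch below is one possible shape, not a framework: `eval_trace`, the eval case keys, and the event fields are all hypothetical.

```python
def eval_trace(events: list[dict], case: dict) -> dict[str, bool]:
    """Score one recorded run against one eval case (both hypothetical shapes)."""
    tools_used = [e["tool_name"] for e in events if e["event_type"] == "tool_call"]
    asked_approval = any(e["event_type"] == "approval_request" for e in events)
    return {
        "used_expected_tool": case["expected_tool"] in tools_used,
        "avoided_forbidden_tools": set(tools_used).isdisjoint(case["forbidden_tools"]),
        "asked_for_approval": asked_approval or not case["needs_approval"],
        "stopped_without_looping": len(tools_used) <= case["max_tool_calls"],
    }


case = {
    "expected_tool": "refund.create_draft",
    "forbidden_tools": {"account.delete"},
    "needs_approval": True,
    "max_tool_calls": 5,
}

good_run = [
    {"event_type": "tool_call", "tool_name": "refund.lookup"},
    {"event_type": "tool_call", "tool_name": "refund.create_draft"},
    {"event_type": "approval_request", "tool_name": "refund.submit"},
]
# Seven calls to a forbidden tool and no approval request: fails every check.
bad_run = [{"event_type": "tool_call", "tool_name": "account.delete"}] * 7

print(eval_trace(good_run, case))
print(eval_trace(bad_run, case))
```

Checks like these are cheap because they reuse the trace events you already emit. Answer-quality grading can sit next to them as a separate check.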

Privacy: capture enough, not everything

Observability can become a data leak if you capture raw prompts, retrieved documents, tool outputs, and user identifiers without controls.

The OpenAI Agents SDK tracing docs note that generation spans and function spans may capture inputs and outputs, and expose configuration for sensitive data capture. OpenInference also treats privacy sensitivity as a first-class observability issue because prompts and completions can contain personal data.

The practical rule: log the structure of the run by default, and make full payload capture a controlled debug mode.

Use these defaults:

hash tenant and user identifiers
redact emails, tokens, phone numbers, and account IDs before export
store full prompts only for approved debug sessions
separate trace metadata from sensitive payload storage
set short retention for high-sensitivity traces
block secrets from serialized run state
never ship raw customer data to a vendor dashboard without policy approval

If a trace cannot be shared safely in an incident review, it is not production-ready telemetry.
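As an illustration of the first two defaults, here is a deliberately simple redaction pass that runs before export. The patterns, the `redact` and `hash_id` helpers, and the key are all hypothetical. Regexes like these will miss cases and occasionally over-match, and account IDs need rules specific to your own ID formats, so treat this as a starting point and not a complete PII filter.

```python
import hashlib
import hmac
import re

# Intentionally simple patterns; real deployments need broader coverage.
BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/-]+=*")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

HASH_KEY = b"load-this-from-a-secret-store"  # placeholder, not a real key


def redact(text: str) -> str:
    """Replace tokens, emails, and phone numbers with fixed placeholders."""
    text = BEARER.sub("[TOKEN]", text)
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)


def hash_id(raw_id: str) -> str:
    """Keyed hash, so the same tenant joins across traces without exposing the raw ID."""
    digest = hmac.new(HASH_KEY, raw_id.encode(), hashlib.sha256).hexdigest()
    return f"tenant_{digest[:12]}"


message = ("Refund jane.doe@example.com, call +1 (415) 555-0100, "
           "Authorization: Bearer abc.def-123")
print(redact(message))
print(hash_id("acme"))
```

A keyed hash (HMAC) is used here instead of a bare SHA-256 so that someone holding only the trace data cannot confirm a guessed tenant ID by hashing it themselves.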

Dashboard views that actually help

Start with four dashboards.

1. Workflow health

Show:

success rate by workflow
error rate by release
p95 and p99 latency
average tool calls per run
retries per run
fallback model usage

This is your “is the agent broken?” view.
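If your dashboard tool does not compute these for you, the numbers are simple to derive from per-run records. This is a sketch under assumptions: the run dictionaries, `workflow_health`, and the nearest-rank percentile method are choices made for this example, and other percentile definitions will give slightly different values on small samples.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the value at rank ceil(pct% of n) in sorted order."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def workflow_health(runs: list[dict]) -> dict[str, dict]:
    """Summarize success rate, p95 latency, and tool calls per workflow."""
    by_workflow = defaultdict(list)
    for run in runs:
        by_workflow[run["workflow"]].append(run)
    report = {}
    for name, items in by_workflow.items():
        ok = sum(1 for r in items if r["status"] == "ok")
        report[name] = {
            "runs": len(items),
            "success_rate": ok / len(items),
            "p95_latency_ms": percentile([r["latency_ms"] for r in items], 95),
            "avg_tool_calls": sum(r["tool_calls"] for r in items) / len(items),
        }
    return report


# Twenty synthetic runs: latencies 100..2000 ms, two of them failed.
runs = [
    {"workflow": "support_refund_assistant",
     "status": "error" if i in (3, 17) else "ok",
     "latency_ms": 100 * (i + 1),
     "tool_calls": 2}
    for i in range(20)
]
print(workflow_health(runs))
```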

2. Tool-risk view

Show:

tool calls by tool name
write/delete/send/spend actions
approval rate
rejection rate
failed execution after approval
top tools by incident count

This is your “what can the agent actually touch?” view.
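Approval and rejection rates fall out of the approval events directly. A minimal sketch, assuming `approval_decision` events shaped like the earlier schema example; the `approval_rates` helper is hypothetical.

```python
from collections import Counter, defaultdict


def approval_rates(events: list[dict]) -> dict[str, dict]:
    """Approval and rejection rate per tool, from approval_decision events."""
    counts = defaultdict(Counter)
    for e in events:
        if e["event_type"] == "approval_decision":
            counts[e["tool_name"]][e["decision"]] += 1
    report = {}
    for tool, decisions in counts.items():
        total = sum(decisions.values())
        report[tool] = {
            "decisions": total,
            "approval_rate": decisions["approved"] / total,
            "rejection_rate": decisions["rejected"] / total,
        }
    return report


events = [
    {"event_type": "approval_decision", "tool_name": "refund.submit", "decision": d}
    for d in ("approved", "approved", "rejected", "approved")
]
print(approval_rates(events))
```

A tool whose rejection rate climbs after a release is often a prompt or policy regression, which is why this view is worth splitting by release as well.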

3. Cost view

Show:

cost per workflow
cost per tenant or team
tokens per run
retry cost
fallback cost
top expensive traces

This catches invisible runaway loops before finance catches them for you.
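A sketch of the roll-up behind this view: sum `estimated_cost_usd` per workflow and per trace, then flag traces over a budget. The `cost_report` helper and the 0.50 USD per-trace budget are hypothetical; pick thresholds from your own cost data.

```python
from collections import defaultdict


def cost_report(events: list[dict], trace_budget_usd: float = 0.50):
    """Return cost per workflow and the trace IDs over budget, most expensive first."""
    per_workflow = defaultdict(float)
    per_trace = defaultdict(float)
    for e in events:
        cost = e.get("estimated_cost_usd", 0.0)
        per_workflow[e["workflow"]] += cost
        per_trace[e["trace_id"]] += cost
    over_budget = sorted(
        (t for t, c in per_trace.items() if c > trace_budget_usd),
        key=per_trace.get,
        reverse=True,
    )
    return dict(per_workflow), over_budget


# trace_a is a normal run; trace_b looped through 40 model calls.
events = (
    [{"trace_id": "trace_a", "workflow": "support_refund_assistant",
      "estimated_cost_usd": 0.018} for _ in range(2)]
    + [{"trace_id": "trace_b", "workflow": "research_report",
        "estimated_cost_usd": 0.02} for _ in range(40)]
)

per_workflow, over_budget = cost_report(events)
print(per_workflow)
print("over budget:", over_budget)
```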

4. Eval regression view

Show:

pass rate by eval suite
failures by release
failures by model route
approval-policy failures
prompt-injection failures
source-grounding failures

This is the view you check before rollout.
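The same comparison can run as a rollout gate. The sketch below compares pass rates between a baseline release and a candidate; the `regressions` helper, the suite names, the numbers, and the two-point tolerance are all made up for illustration.

```python
def regressions(baseline: dict[str, float], candidate: dict[str, float],
                tolerance: float = 0.02) -> dict[str, tuple[float, float]]:
    """Return suites whose pass rate dropped by more than `tolerance`."""
    dropped = {}
    for suite, base_rate in baseline.items():
        new_rate = candidate.get(suite, 0.0)  # a missing suite counts as failing
        if base_rate - new_rate > tolerance:
            dropped[suite] = (base_rate, new_rate)
    return dropped


baseline = {"prompt_injection": 0.98, "approval_policy": 1.0, "source_grounding": 0.91}
candidate = {"prompt_injection": 0.97, "approval_policy": 0.92, "source_grounding": 0.93}

print(regressions(baseline, candidate))
```

Counting a missing suite as a failure is deliberate in this sketch: a suite that silently stopped running should block a rollout just like one that regressed.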

Implementation starter checklist

Use this sequence:

Define your agent workflows by name.
Assign a trace_id to every run.
Instrument model calls, retrieval, tool calls, approvals, guardrails, and handoffs.
Redact sensitive fields before export.
Add approval records for every sensitive tool.
Create eval suites for the top five failure modes.
Add release/version tags to every trace.
Build dashboards for workflow health, tool risk, cost, and eval regressions.
Set alerts for approval bypass, high-cost loops, repeated tool failure, and eval regression.
Review traces weekly with product, engineering, and security until the failure modes stabilize.

Do not try to build the perfect observability stack in one weekend. Start with trace shape, approval records, and one eval suite. Then deepen the stack where incidents and user feedback point.
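Two of the alerts in the checklist, approval bypass and repeated tool failure, can be prototyped as a scan over the event stream before you wire them into a real alerting system. Everything here is an assumption for illustration: the `find_alerts` helper, the `side_effect` labels, and the failure threshold.

```python
from collections import Counter

SENSITIVE = {"write", "send", "spend", "delete", "deploy"}  # hypothetical labels


def find_alerts(events: list[dict], max_failures_per_tool: int = 3) -> list[tuple]:
    """Flag sensitive tool calls with no prior approval, and tools that keep failing."""
    alerts = []
    approved = set()  # (trace_id, tool_name) pairs with an approving decision
    failures = Counter()
    for e in sorted(events, key=lambda e: e["timestamp"]):
        key = (e["trace_id"], e.get("tool_name"))
        if e["event_type"] == "approval_decision" and e["decision"] == "approved":
            approved.add(key)
        elif e["event_type"] == "tool_call":
            if e.get("side_effect") in SENSITIVE and key not in approved:
                alerts.append(("approval_bypass", e["trace_id"], e["tool_name"]))
            if e["status"] == "error":
                failures[e["tool_name"]] += 1
    for tool, count in failures.items():
        if count >= max_failures_per_tool:
            alerts.append(("repeated_tool_failure", None, tool))
    return alerts


events = [
    {"timestamp": "2026-05-13T09:30:00Z", "trace_id": "t1",
     "event_type": "approval_decision", "tool_name": "refund.submit",
     "decision": "approved"},
    {"timestamp": "2026-05-13T09:30:01Z", "trace_id": "t1", "event_type": "tool_call",
     "tool_name": "refund.submit", "side_effect": "spend", "status": "ok"},
    {"timestamp": "2026-05-13T09:32:00Z", "trace_id": "t2", "event_type": "tool_call",
     "tool_name": "email.send", "side_effect": "send", "status": "ok"},
] + [
    {"timestamp": f"2026-05-13T09:4{i}:00Z", "trace_id": f"t{3 + i}",
     "event_type": "tool_call", "tool_name": "crm.lookup", "side_effect": "read",
     "status": "error"}
    for i in range(3)
]

for alert in find_alerts(events):
    print(alert)
```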

What not to do

Avoid these mistakes:

logging only the final answer
capturing raw prompts forever
treating approval as a UI event instead of an audit record
hiding provider/model routes behind vague aliases with no trace metadata
failing over to another model without marking it in the trace
measuring cost only at the account level
running evals only before launch, not after changes
giving support teams no trace link when users report a bad answer

The quiet killer is not one dramatic model failure. It is hundreds of small opaque failures that nobody can explain.

FAQ

Do I need OpenTelemetry for a small agent?

Not necessarily on day one. But you should still design your events as if they could become traces later. If the agent will call tools or touch customer data, use a trace-style structure from the start.

Should I store full prompts and outputs?

Only when you have a clear privacy and retention policy. For normal production, store metadata and redacted payloads. Use full capture for controlled debugging.

Are evals the same as monitoring?

No. Monitoring tells you what happened in production. Evals test whether the agent behaves correctly against known tasks and failure modes. You need both.

What is the first metric to alert on?

Start with tool failure rate and approval anomalies. If a sensitive tool is suddenly failing, bypassing review, or getting rejected more often, you want to know quickly.

Bottom line

Production agents need more than a chat transcript.

They need traces that explain the workflow, tool logs that show external actions, approval records that preserve human decisions, evals that catch regressions, and cost controls that stop runaway behavior.

The goal is not to watch the agent for fun.

The goal is to make the system explainable enough that when something goes wrong, your team can answer three questions fast:

What happened?
Why did it happen?
What should we change before the next run?

If your observability stack can answer those questions, the agent is much closer to being production-ready.


Charles Jasthyn De La Cueva / Founder of Open-TechStack
