If you ship LLM features, you eventually get the same three painful questions:

  1. Why did this answer happen? (replayable traces)
  2. Why did it get worse? (evaluations + regression tracking)
  3. Why did we go down, or why did cost spike? (routing + reliability controls)

That’s the real “LLM observability” job.

In 2026, three popular open-source-friendly options live in different parts of that stack:

  • Arize Phoenix: OpenTelemetry-first tracing + evaluation workflows.
  • Langfuse: a broader LLM engineering platform (tracing + prompts + evals) that can self-host at production scale.
  • Helicone: observability plus a gateway-first story (OpenAI-compatible API, routing, fallbacks) — with the AI Gateway explicitly labeled beta.

This post is a decision framework, not a feature dump.

If you want an “OTel-first tracing stack” deep dive with Phoenix specifically, read: LLM Tracing with OpenTelemetry + Phoenix (2026).

TL;DR: pick by architecture, not vibes

| If your constraint is… | Default pick | Why |
| --- | --- | --- |
| “We already run OpenTelemetry and want zero lock-in.” | Phoenix | Phoenix is a collector/UI that receives traces via OTLP and leans hard into OpenTelemetry + OpenInference conventions. |
| “We need prompt versioning + eval workflows + tracing in one product.” | Langfuse | Langfuse is explicitly built around tracing, prompt management, and evaluation workflows — with self-hosting that’s designed for production. |
| “We want one OpenAI-compatible API for many providers + routing/fallbacks.” | Helicone | Helicone’s AI Gateway is a unified OpenAI-style API for many providers with routing/fallbacks and built-in observability (but it’s beta). |

What “LLM observability” actually means (so you don’t buy the wrong thing)

Before you compare tools, define what you need to observe.

At minimum, you want a trace to answer these questions:

  • What happened? (prompt + tool calls + retrieved context + response)
  • Where did it happen? (service, worker, queue job, function)
  • When did it happen? (latency breakdown per step)
  • How expensive was it? (tokens, cached tokens, provider cost)
  • Who did it happen to? (session, user, org/tenant)

If your logging stack cannot answer those, you don’t have observability — you have screenshots of failures.

That’s why OpenTelemetry-style tracing matters: you get a standard mental model (traces/spans/attributes) and a standard transport (OTLP) that can move data between systems.

The most important distinction: OTel-first vs gateway-first

Almost all “which tool should we use?” debates collapse to one choice:

Option A: Instrumentation-first (OpenTelemetry)

You emit spans from your app, export them via OTLP, and then decide where they go.

  • Pros: portability, multi-destination, fits existing infra, easy to swap backends.
  • Cons: you still need product workflows (prompt versions, evals, annotations) somewhere.

Phoenix lives here, and both Phoenix and Langfuse emphasize OpenTelemetry compatibility.

Option B: Gateway-first (proxy everything)

You route LLM traffic through a gateway that can:

  • log every request centrally,
  • enforce keys/quotas,
  • do failover/routing,
  • standardize provider differences.

  • Pros: fast adoption (one integration point), reliability controls, unified logging.
  • Cons: you’re inserting a new critical path; some “provider-native” features can get awkward.

Helicone’s AI Gateway is explicitly positioned in this camp.

What each tool is “really for”

Phoenix: debugging-first tracing with OpenInference semantics

Phoenix’s docs describe it as a server that receives traces over OTLP, with instrumentation managed through OpenInference instrumentors. That’s a strong signal about intent: structured tracing of AI app behavior, not just API logs.

Phoenix is a great fit when:

  • you want distributed traces that connect retrieval, tools, and LLM calls,
  • you want an OTel-native workflow,
  • you want to keep the option to ship traces to other OTel backends later.

Practical note: Phoenix docs state it supports OTLP over HTTP for trace ingestion, which matters for simple deployments where you don’t want gRPC everywhere.

Where Phoenix shines is the “debugging loop”:

  • reproduce a user issue,
  • click through the trace tree,
  • find the span where context got lost (retrieval miss, tool failure, bad prompt version),
  • then fix the system, not just the prompt.

Langfuse: a product workflow layer on top of traces

Langfuse positions itself as an open-source “LLM engineering platform” with three built-in lanes:

  • Observability (traces, cost, latency)
  • Prompts (version control, deployment labels, playground)
  • Evaluation (datasets/experiments, production monitoring)

The key difference versus a tracing UI is that Langfuse is designed to be a collaboration surface: prompt versions, experiments, and evaluation outputs are first-class.
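To make "prompt versions are first-class" concrete, here is a tiny, library-free sketch of the idea. This is the concept, not Langfuse's actual API:

```python
from dataclasses import dataclass
from string import Template

@dataclass(frozen=True)
class PromptVersion:
    """A prompt version carries its template plus identity metadata."""
    name: str
    version: int
    label: str  # e.g. "production", "staging"
    template: Template

    def compile(self, **variables: str) -> str:
        return self.template.substitute(**variables)

prompt = PromptVersion(
    name="support-reply",
    version=7,
    label="production",
    template=Template("Reply politely to $customer_name about $topic."),
)

text = prompt.compile(customer_name="Ada", topic="billing")
# A trace would then record ("support-reply", 7) alongside the LLM call,
# so a regression can be pinned to a specific prompt version.
```

A platform like Langfuse manages this object lifecycle for you (versioning, labels, who changed what); the payoff is that traces and eval results can reference exact prompt versions instead of "whatever string was in the code that day."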

If you self-host, Langfuse’s published architecture is also a clue: it calls out a transactional store (Postgres) plus an OLAP store (ClickHouse), a cache/queue (Redis/Valkey), and blob storage (S3/Blob) for event persistence. That is the shape of a system that expects serious ingestion volume.

Langfuse is a great fit when:

  • you want observability and prompt lifecycle workflows in one place,
  • you want “product” features like datasets/experiments tied to production traces,
  • you’re willing to run a more involved self-host stack (or use their cloud).

Licensing reality check (important for planning): Langfuse’s docs describe core product capabilities as MIT licensed, with optional enterprise modules that require a commercial license when you self-host.

Helicone: observability plus reliability controls (with a beta gateway)

Helicone’s pitch is not just “monitor.” It explicitly frames production LLM pain as outages, cost spikes, and debugging issues — and pairs observability with routing/reliability controls.

Helicone’s AI Gateway docs describe:

  • an OpenAI-compatible API surface,
  • “intelligent routing” and “automatic fallbacks,”
  • and access to “100+” providers/models through a unified interface,
  • with a prominent note that the AI Gateway is in beta.

That beta label is not a deal-breaker — but it is a signal to be careful about “bet your production path on it” decisions.
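To be precise about what "automatic fallbacks" buys you, here is the logic in miniature. This is conceptually what a gateway runs on your behalf; with Helicone you configure this behavior rather than write it:

```python
def call_with_fallback(providers, request):
    """Try providers in priority order; return the first success."""
    errors = []
    for name, call in providers:
        try:
            return name, call(request)
        except Exception as exc:  # real routing also inspects status codes
            errors.append((name, exc))
    raise RuntimeError(f"all providers failed: {errors}")

# Stubbed providers for illustration:
def flaky_primary(request):
    raise TimeoutError("upstream timeout")

def stable_fallback(request):
    return {"text": "ok"}

used, response = call_with_fallback(
    [("primary", flaky_primary), ("fallback", stable_fallback)],
    {"prompt": "ping"},
)
# The caller never sees the primary's timeout.
```

Moving this logic into a shared gateway is the whole pitch: every app gets the same failover behavior, and every failover shows up in one log.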

Helicone is a great fit when:

  • you want a single integration point for multi-provider access,
  • you need routing/failover without wiring a custom proxy layer,
  • you want observability that’s naturally coupled to request routing decisions.

Self-hosting reality: complexity is part of the decision

If you’re choosing for a team, you’re also choosing an operational burden.

Here is the rule of thumb:

  • Phoenix is easiest to host because it looks like “collector + UI” and speaks OTLP.
  • Langfuse is heavier because it’s a workflow platform (OLAP + blob + queue/cache + workers).
  • A gateway adds a new critical-path service. Treat it like an API tier: health checks, autoscaling, rollbacks, and incident runbooks.

If your team is small and you don’t have infra bandwidth, the “best” tool is the one that you can run reliably.

Data governance: you probably need redaction, not just dashboards

LLM traces are dangerous because they tend to include:

  • raw user input (often PII),
  • retrieved documents (often sensitive),
  • tool outputs (often internal data),
  • model outputs (may contain secrets).

If you adopt tracing without a data policy, you eventually ship a “customer data leak” feature.

Practical baseline:

  • hash or redact obvious PII fields (emails, phone numbers) before exporting,
  • store prompt templates and prompt parameters separately when possible,
  • make “full prompt capture” an opt-in debug mode, not the default.

(OpenInference explicitly calls out privacy sensitivity and masking as a first-class need for AI traces.)
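That baseline can start as small as a scrubbing pass over span attributes before export. The two patterns below are illustrative only; real PII detection needs far more than a pair of regexes:

```python
import hashlib
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def pseudonymize(match: re.Match) -> str:
    # Stable hash: the same email always maps to the same token, so
    # traces stay correlatable without storing the raw value.
    digest = hashlib.sha256(match.group().encode()).hexdigest()[:8]
    return f"<redacted:{digest}>"

def redact(text: str) -> str:
    text = EMAIL.sub(pseudonymize, text)
    text = PHONE.sub(pseudonymize, text)
    return text

safe = redact("Contact ada@example.com or +1 415-555-0100 for help.")
print(safe)
```

Running this in a span processor (or the gateway) before anything leaves the app is the difference between "opt-in debug capture" and shipping raw PII to your observability vendor by default.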

A practical decision checklist (what I’d ask in a real team)

1) Do you already run OpenTelemetry?

  • Yes → start with an OTel-first plan and seriously consider Phoenix (or Langfuse if you need prompt/eval workflows).
  • No → you can still use OTel, but you may prefer a gateway/product integration first for speed.

2) Do you need prompt versioning and “non-engineer safe” iteration?

  • Yes → Langfuse is the most direct match (prompt management is explicitly a core feature).
  • No → Phoenix can be the simplest “show me what happened” layer.

3) Is multi-provider routing/failover a hard requirement?

  • Yes → the gateway-first path gets attractive quickly. Helicone is designed around that story.
  • No → keep the request path simple; instrument the app and export spans.

4) Are you self-hosting under security/compliance constraints?

If you need on-prem/VPC-only:

  • Phoenix tends to be a simpler self-host footprint (collector + UI model).
  • Langfuse can self-host, but plan for a heavier stack (OLAP + blob + queues).
  • A gateway adds a new critical-path service; make sure that’s acceptable.

Suggested “safe adoption” path (minimize regret)

If you want the lowest-regret sequence, ship in this order:

  1. Instrument with OpenTelemetry + AI semantic conventions (OpenInference is explicitly built to complement OpenTelemetry for AI apps).
  2. Export via OTLP (so you can swap destinations later).
  3. Start with Phoenix for fast “why did this happen?” debugging.
  4. Add a platform layer (Langfuse) if you need prompt lifecycle, experiments, and eval workflows tied to traces.
  5. Add a gateway (Helicone) only when routing/fallbacks and unified provider access become a real operational need.

This avoids the classic mistake: adopting a gateway because it feels “more enterprise,” then discovering it’s now the thing that breaks your app.

“Which one should I choose?” (quick recommendations)

  • Choose Phoenix if you want OTel-native tracing and a clean, portable observability foundation.
  • Choose Langfuse if you want a broader LLM engineering workflow (prompts + evals + traces) and you’re okay running a heavier self-host stack.
  • Choose Helicone if you want a gateway-first architecture for routing/fallbacks and unified observability — and you’re comfortable with the AI Gateway’s beta status.
