End-to-End Tracing for LLM Applications: What Belongs in a Span

Production LLM apps span multiple model calls, tool invocations, retrieval steps, and retries. A complete trace makes them debuggable; a sparse one leaves you guessing.

By Priya Anand · 8 min read

A user clicks a button. Three seconds later they get an answer. In between: 4 model calls, 2 vector-store retrievals, 1 web fetch, 1 re-rank step, 2 retries, and a structured-output validation. If you can see all of that as a unified trace, you can debug latency, cost, or correctness regressions in 5 minutes. If you can only see the final response, you cannot.

This is the structure of a complete LLM trace and what to put in each span.

The waterfall

A typical RAG-with-tool-use request decomposes into spans like:

[parent: handle_request]
  ├─ [retrieve_context]
  │   ├─ [embed_query]
  │   └─ [vector_search]
  ├─ [first_pass_llm]   <- planning step
  ├─ [tool_call: search_docs]
  │   └─ [http_request]
  ├─ [second_pass_llm]  <- synthesis step
  └─ [validate_output]

Each child span is a measurable unit: latency, cost, error class. When the children run sequentially, the parent's latency is the sum of child latencies plus orchestration overhead (which should itself be a span if non-trivial); when children run in parallel, it's the latency of the critical path.
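
A minimal sketch of that hierarchy with the OpenTelemetry Python SDK. The span names mirror the waterfall above; embed, search_index, and call_llm are hypothetical application functions:

from opentelemetry import trace

tracer = trace.get_tracer("llm-app")  # instrumentation scope name

def handle_request(user_query):
    # Parent span; everything opened inside nests under it automatically
    with tracer.start_as_current_span("handle_request"):
        with tracer.start_as_current_span("retrieve_context"):
            with tracer.start_as_current_span("embed_query"):
                vector = embed(user_query)        # hypothetical
            with tracer.start_as_current_span("vector_search"):
                chunks = search_index(vector)     # hypothetical
        with tracer.start_as_current_span("first_pass_llm"):
            plan = call_llm(user_query, chunks)   # hypothetical planning call
        with tracer.start_as_current_span("second_pass_llm"):
            answer = call_llm(plan, chunks)       # hypothetical synthesis call
        return answer

Each with block becomes one bar in the waterfall; the context manager records start time, end time, and any uncaught exception automatically.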

What belongs in an LLM span

A complete LLM-call span captures, at minimum:

  - The requested model name and the model that actually responded
  - Input and output token counts
  - Latency (and time-to-first-token, if streaming)
  - Computed cost
  - Key request parameters (temperature, max output tokens)
  - Finish reason and error class

The OpenTelemetry GenAI semantic conventions standardize most of these. Use the conventions; vendor-specific names are technical debt.

For privacy, two attributes need careful handling: the prompt content and the completion content. Both can carry user-supplied PII.

A reasonable default: capture both at a 1% sample rate in production, in full in staging. Override per feature based on data class.
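
A sketch of what that looks like in code, reusing the tracer from the first sketch. The attribute names follow the GenAI semantic conventions, which are still evolving, so check the current spec; call_openai and should_capture_content are hypothetical:

with tracer.start_as_current_span("first_pass_llm") as span:
    span.set_attribute("gen_ai.request.model", "gpt-4o-mini")
    response = call_openai(messages)  # hypothetical wrapper around the SDK
    span.set_attribute("gen_ai.response.model", response.model)
    span.set_attribute("gen_ai.usage.input_tokens", response.usage.prompt_tokens)
    span.set_attribute("gen_ai.usage.output_tokens", response.usage.completion_tokens)
    # Raw prompt/completion text only at the sampled rate discussed above
    if should_capture_content():  # hypothetical sampling gate
        span.set_attribute("gen_ai.prompt", str(messages))
        span.set_attribute("gen_ai.completion", response.choices[0].message.content)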

What belongs in a tool-call span

When the model calls a tool, capture:

  - Tool name and (sanitized) invocation arguments
  - Result size, plus a truncated preview if the data class allows it
  - Latency and error class
  - Whether the tool's output actually appears in the final response (an "ignored tool output" flag)

The “ignored tool output” attribute is one of the most useful debugging signals. If a tool fires but its output never appears in the response, either the prompt isn’t using the result, or the model is hallucinating around it. Both are bugs worth catching.
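
A sketch of a tool-call span. The llm.tool.* attribute names are ours, not a standard convention, and search_docs and answer_mentions are hypothetical:

import json

with tracer.start_as_current_span("tool_call: search_docs") as span:
    span.set_attribute("llm.tool.name", "search_docs")
    span.set_attribute("llm.tool.args", json.dumps(args))  # args from the model's tool call; sanitize first
    result = search_docs(**args)
    span.set_attribute("llm.tool.result_size", len(result))

# The "was the output used?" check needs the final answer, so record it
# later, e.g. on the validate_output span:
with tracer.start_as_current_span("validate_output") as span:
    span.set_attribute("llm.tool.output_used", answer_mentions(answer, result))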

What belongs in a retrieval span

For RAG retrievals:

  - Index or collection name
  - Requested top-k and the number of results actually returned
  - Score of the top result (or a bucketed score distribution)
  - Latency, split between embedding and search if both happen in this step

If your retrieval system has a tiered cache, capture the tier each result came from. This is where most retrieval-latency regressions hide.
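
A sketch of a retrieval span with these attributes plus the cache tier. The retrieval.* names are ours, and index is a hypothetical vector-store client:

with tracer.start_as_current_span("vector_search") as span:
    span.set_attribute("retrieval.index", "docs-v3")
    span.set_attribute("retrieval.top_k", 8)
    results = index.search(vector, top_k=8)
    span.set_attribute("retrieval.result_count", len(results))
    if results:
        span.set_attribute("retrieval.top_score", results[0].score)
    # One bounded-cardinality attribute for tiers, not one attribute per result
    span.set_attribute("retrieval.cache_tiers", sorted({r.tier for r in results}))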

What does NOT belong in a span

Some things should never be recorded, regardless of sampling:

  - Secrets and credentials (API keys, auth headers on http_request spans)
  - Raw PII outside the sampled prompt/completion capture described above
  - Full retrieved documents (record document IDs and offsets instead)

Cardinality discipline

OpenTelemetry traces are stored in a backend with tag cardinality limits. High-cardinality attributes (raw user IDs, free-text titles, full URLs with query strings) blow up your bill or your backend.

Strategies:

  - Hash or bucket user IDs rather than recording them raw
  - Strip query strings and fragments from URLs
  - Use a closed set of error classes instead of free-text error messages
  - Allowlist span attributes at the SDK level so new high-cardinality fields can't sneak in

Without cardinality discipline, your trace backend’s monthly bill outpaces your AI compute spend within a few months.
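
Two stdlib helpers that make the first two strategies concrete; the function names and attribute keys are illustrative, and span, user_id, and request_url come from the surrounding context:

import hashlib
from urllib.parse import urlsplit

def user_bucket(user_id, buckets=1024):
    # Stable, low-cardinality bucket instead of the raw user ID
    digest = int(hashlib.sha256(user_id.encode()).hexdigest(), 16)
    return f"bucket-{digest % buckets}"

def scrub_url(url):
    # Keep scheme/host/path; drop the query string and fragment
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"

span.set_attribute("app.user_bucket", user_bucket(user_id))
span.set_attribute("http.url.scrubbed", scrub_url(request_url))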

Sampling

For production at scale, you cannot trace 100% of requests. Reasonable sampling:

  - Keep 100% of traces that contain an error
  - Keep 100% of traces above a latency threshold (say, your current p99)
  - Keep a small baseline sample (1-5%) of everything else

This requires “tail-based sampling” — the sampling decision is made after the trace completes. Most managed observability vendors support it; if you self-host, OpenTelemetry Collector’s tail-sampling processor is the standard option.
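
A sketch of those policies in the Collector's tail_sampling processor config; the thresholds are illustrative:

processors:
  tail_sampling:
    decision_wait: 10s        # buffer spans until the trace is complete
    policies:
      - name: keep-errors
        type: status_code
        status_code: {status_codes: [ERROR]}
      - name: keep-slow
        type: latency
        latency: {threshold_ms: 3000}
      - name: baseline
        type: probabilistic
        probabilistic: {sampling_percentage: 5}

Traces matching any policy are kept; decision_wait is how long the Collector holds spans before deciding.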

Attribute propagation

Across service boundaries (your app → vector store → tool service → external API), trace context must propagate via the traceparent HTTP header (W3C Trace Context). Most language SDKs do this automatically; verify in your stack.

When trace context is dropped, you get fragmented traces — the parent span shows in your backend, but the child doesn’t link. Debugging across the gap requires manual ID-stitching, which is error-prone.
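
Where auto-instrumentation doesn't cover a client, you can inject the header manually. A sketch with the requests library (the URL is hypothetical):

import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # adds the W3C traceparent (and tracestate) headers
resp = requests.get("https://tools.internal/search_docs", headers=headers)

If the downstream service runs an OTel SDK, its spans attach to the same trace.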

Tooling

OpenLLMetry for instrumentation; vendor-neutral, instruments most LLM SDKs. Sends to any OTel-compatible backend.

Phoenix (Arize) as backend if you want LLM-specific UI; runs locally or as a service. Strong on trace exploration.

Helicone as an LLM-specific gateway-and-observability product. Open-source self-hosting is available.

LangSmith if you’re already in the LangChain ecosystem; otherwise OTel + Phoenix is a more portable stack.

What the trace lets you do

Once instrumented, common debugging questions become 5-minute trace queries:

  - Why did this request cost 10x the median? Read token counts off the LLM spans.
  - Where did the extra two seconds go? Read the waterfall.
  - Did the tool fire, and did the model actually use its output? Check the tool-call span.
  - Did retrieval return anything relevant? Check result counts and scores on the retrieval span.

Without traces: spend 2 hours theorizing. With traces: see the answer immediately.

What to instrument first

If you’re starting from zero, instrumentation order:

  1. The outermost request boundary (handle_request span)
  2. Each LLM call (with the conventions above)
  3. Each tool call
  4. Each retrieval
  5. Errors (caught exceptions become span events; see the sketch after this list)
  6. Cache lookups (separate span, even if cheap)
  7. Output validation / classification
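
Item 5 is mostly one call per catch site. A sketch, reusing the tracer from earlier; search_docs is the hypothetical tool from above:

from opentelemetry.trace import Status, StatusCode

with tracer.start_as_current_span("tool_call: search_docs") as span:
    try:
        result = search_docs(**args)
    except TimeoutError as exc:
        span.record_exception(exc)  # stored as a span event with the stack trace
        span.set_status(Status(StatusCode.ERROR, "search_docs timed out"))
        raise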

You can ship #1-3 in a day with OpenLLMetry. The remaining items roll out over a sprint. Within two weeks you have full coverage and the debugging-time payback starts immediately.

Cross-references

For cost observability specifically, see llmops.report on token-cost observability. For drift detection with the same trace data, see mlmonitoring.report on silent quality decay. For the security side of trace data (what NOT to log), see guardml.io on output classification.

The traces are the foundation; everything else builds on them.

Sources

  1. OpenLLMetry
  2. OpenTelemetry GenAI Semantic Conventions
  3. Phoenix (Arize)
#observability #tracing #opentelemetry #llm-ops #debugging