LLM Cost & Latency Observability with OpenTelemetry

Token spend and tail latency are the two metrics that decide whether an LLM feature ships or gets killed. How to instrument both with OpenTelemetry so you can answer 'why did this cost double?' in a query, not a war room.

By Priya Anand · 8 min read

Two numbers decide the fate of most LLM features: what a request costs and how slow the slow requests are. Leadership kills features that are quietly expensive, and users abandon features whose p99 is five seconds. The good news is that both are observable from the same place, your LLM-call spans, if you instrument them deliberately. The standard vocabulary for doing this is OpenTelemetry’s GenAI semantic conventions, and using them rather than a vendor’s bespoke attributes is what keeps the data portable and the bill legible.

Compute cost yourself; don’t wait for the invoice

The first principle of LLM cost observability: derive cost at request time from token counts; do not reconstruct it from the provider’s monthly bill. The bill arrives too late to debug a regression and is aggregated past the point of usefulness. Every LLM-call span should carry the token usage and a cost you compute from a local price table:

  • gen_ai.usage.input_tokens and gen_ai.usage.output_tokens — the raw counts from the provider response.
  • gen_ai.request.model and gen_ai.response.model — request vs actually-served model; providers sometimes route or downgrade, and you pay for what served.
  • A derived cost_usd attribute = (input_tokens × input_price + output_tokens × output_price) for the served model, from a price table you version in code.

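Pricing your own spend has a second benefit: when a provider changes prices, you update one versioned table instead of waiting a billing cycle to see the impact. Because each span keeps the raw token counts and the served model, historical usage can be re-priced consistently at query time. The cost_usd already stored on a span remains a snapshot of the price in effect when the request ran.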
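A minimal sketch of such a price table and the cost formula from the list above. The model names and prices are placeholders invented for illustration, not real provider rates, and quoting prices per million tokens is an assumption about how the table is kept.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Price:
    input_per_mtok: Decimal   # USD per million input tokens
    output_per_mtok: Decimal  # USD per million output tokens


# Hypothetical model names and placeholder prices; replace them with the
# rates your provider publishes and keep this table under version control.
PRICE_TABLE = {
    "example-large": Price(Decimal("10.00"), Decimal("30.00")),
    "example-small": Price(Decimal("1.00"), Decimal("4.00")),
}

TOKENS_PER_MTOK = Decimal(1_000_000)


def cost_usd(served_model: str, input_tokens: int, output_tokens: int) -> Decimal:
    """Price a single call using the model that actually served it."""
    # An unknown model raises KeyError on purpose: a silent zero would hide spend.
    price = PRICE_TABLE[served_model]
    total = input_tokens * price.input_per_mtok + output_tokens * price.output_per_mtok
    return total / TOKENS_PER_MTOK
```

Decimal avoids float rounding drift when many small costs are summed; convert with float() only at the point where the value is written to a span attribute.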

The latency breakdown that actually helps

“The request was slow” is useless; “the second LLM call’s time-to-first-token spiked” is actionable. Capture latency in parts:

  • Time to first token (TTFT) for streaming responses — the number users actually feel. Track it separately from total duration.
  • Total generation time and, where you can derive it, inter-token latency — distinguishes a slow start from a slow stream.
  • Queue/scheduling time vs provider time — if you have a gateway or rate-limiter in front, time spent waiting in your queue is a different problem than the provider being slow.
  • For multi-step requests, per-span latency across the waterfall (retrieval, first LLM pass, tool call, synthesis) so you can see which step regressed. (For the full span structure, see our piece on what belongs in an LLM trace.)

The single highest-value latency signal is TTFT at the tail, broken down per step. A p99 regression almost always localizes to one span, and the breakdown turns a two-hour investigation into a one-query lookup.
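As a rough sketch of how TTFT, total duration, and a mean inter-chunk latency can be taken from a streaming response: the wrapper below times any iterable of chunks. StreamTimer and open_stream are hypothetical names, and the sketch assumes the provider SDK exposes the stream as a plain Python iterable, which varies by SDK.

```python
import time
from typing import Callable, Iterable, Iterator, Optional


class StreamTimer:
    """Times a chunk stream: TTFT, total duration, mean inter-chunk latency."""

    def __init__(
        self,
        open_stream: Callable[[], Iterable[str]],
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        # open_stream is called inside __iter__ so the clock starts before
        # the request is sent, not after the first bytes are already in flight.
        self._open_stream = open_stream
        self._clock = clock
        self.ttft_ms: Optional[float] = None
        self.total_ms: Optional[float] = None
        self.chunk_count = 0

    def __iter__(self) -> Iterator[str]:
        start = self._clock()
        for chunk in self._open_stream():
            if self.ttft_ms is None:
                self.ttft_ms = (self._clock() - start) * 1000.0
            self.chunk_count += 1
            yield chunk
        # Only reached when the consumer exhausts the stream.
        self.total_ms = (self._clock() - start) * 1000.0

    @property
    def mean_inter_chunk_ms(self) -> Optional[float]:
        """Average gap between chunks after the first; None if not measurable."""
        if self.ttft_ms is None or self.total_ms is None or self.chunk_count < 2:
            return None
        return (self.total_ms - self.ttft_ms) / (self.chunk_count - 1)
```

Once the stream is consumed, ttft_ms and total_ms can be set as span attributes. Chunks are not tokens, so the inter-chunk figure is only a proxy for inter-token latency.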

Use the conventions, not vendor attributes

OpenTelemetry’s GenAI semantic conventions standardize most of these attribute names; cost_usd and ttft_ms in the example below are custom attributes, not part of the conventions. The reason to use the standard names religiously is portability: any OTel-compatible backend can ingest gen_ai.* attributes (gen_ai.usage.* included), so you can switch backends, run two in parallel, or self-host without re-instrumenting. Vendor-specific attribute names are technical debt the day you write them. OpenLLMetry instruments most LLM SDKs to these conventions with minimal code, emitting to any OTel endpoint.

# illustrative span attributes on an LLM call
span.set_attribute("gen_ai.system", "anthropic")
span.set_attribute("gen_ai.request.model", "claude-opus-4-7")
span.set_attribute("gen_ai.response.model", resp.model)        # what actually served
span.set_attribute("gen_ai.usage.input_tokens", resp.usage.input_tokens)
span.set_attribute("gen_ai.usage.output_tokens", resp.usage.output_tokens)
span.set_attribute("cost_usd", price(resp.model, resp.usage))  # computed locally
span.set_attribute("ttft_ms", time_to_first_token_ms)          # streaming

Slice cost and latency by dimensions that mean something

Aggregate numbers hide the story. The dimensions worth slicing by — while respecting cardinality limits — are:

  • Per feature/route — which product surface is expensive or slow. Use a low-cardinality feature_id, never a raw URL.
  • Per model — cost and latency profile differs sharply across models; routing decisions live here.
  • Per cache outcome — prompt-cache hit vs miss is often the dominant cost lever; a falling hit rate is a cost regression in disguise.
  • Per finish reason — a spike in length/truncation finishes inflates output tokens and signals a prompt problem.

Keep these low-cardinality. Raw user IDs, full prompts, and free-text titles as attributes will blow up your trace backend’s bill faster than the LLM spend you are trying to control — hash or bucket them.
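One possible way to bound cardinality before a value becomes a span attribute, sketched with the standard library only. The function names, the 64-bucket default, and the token thresholds are illustrative assumptions. Note that hashing alone hides the raw value but keeps one distinct value per user; it is the modulo step that caps the number of distinct attribute values.

```python
import hashlib


def hash_bucket(value: str, buckets: int = 64) -> str:
    """Map an unbounded identifier (e.g. a user ID) onto a fixed set of labels."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    # The first 8 bytes are plenty for an even spread across a few dozen buckets.
    index = int.from_bytes(digest[:8], "big") % buckets
    return f"b{index:02d}"


def token_bucket(tokens: int) -> str:
    """Turn a raw token count into a coarse size class for grouping."""
    for upper in (256, 1024, 4096, 16384):
        if tokens <= upper:
            return f"<={upper}"
    return ">16384"
```

The bucket label is stable across processes because SHA-256 is deterministic, unlike Python's built-in hash() for strings, which is salted per process by default.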

Sample without losing the expensive tail

You cannot afford to keep 100% of traces at scale, but naive head sampling discards exactly the requests you most need: the slow and costly ones. Use tail-based sampling, where the keep/drop decision is made after the trace completes:

  • 100% of errors
  • 100% of slow requests (above your p95 latency)
  • 100% of high-cost requests (top decile of cost_usd)
  • A baseline sample (e.g., 10%) of normal successful requests
  • 100% for the first 24h of any newly deployed feature, then decay

The OpenTelemetry Collector tail-sampling processor implements this if you self-host; most managed backends support it too. The point is that your aggregate cost and latency metrics come from counters on every request (cheap, complete), while your detailed traces are sampled toward the interesting tail (expensive, selective). Don’t conflate the two: emit metrics on 100%, sample spans on the tail.
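To make the keep/drop rules above concrete, here is a small Python sketch of the decision logic. It is not the Collector processor's implementation or configuration; TraceSummary, SamplingPolicy, and keep_trace are hypothetical names, and the thresholds are assumed to be recomputed periodically from your own p95 latency and cost distribution. The "then decay" step is left out: after the new-feature window, traffic falls straight to the baseline rate.

```python
import random
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class TraceSummary:
    has_error: bool
    duration_ms: float
    cost_usd: float
    feature_age_hours: float  # hours since the feature was deployed


@dataclass(frozen=True)
class SamplingPolicy:
    slow_ms: float                   # e.g. your current p95 latency
    expensive_usd: float             # e.g. the 90th percentile of cost_usd
    baseline_rate: float = 0.10      # share of ordinary traces to keep
    new_feature_hours: float = 24.0  # keep everything during this window


def keep_trace(
    trace: TraceSummary,
    policy: SamplingPolicy,
    rng: Callable[[], float] = random.random,
) -> Tuple[bool, str]:
    """Decide after the trace completes; returns (keep, reason)."""
    if trace.has_error:
        return True, "error"
    if trace.duration_ms > policy.slow_ms:
        return True, "slow"
    if trace.cost_usd >= policy.expensive_usd:
        return True, "expensive"
    if trace.feature_age_hours < policy.new_feature_hours:
        return True, "new_feature"
    if rng() < policy.baseline_rate:
        return True, "baseline"
    return False, "dropped"
```

Recording the reason alongside the kept trace helps later, because a sampled set weighted toward the tail should not be used to estimate averages.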

Alert on the two numbers that kill features

Close the loop with alerts tied to the business reality:

  • Cost per request, per feature, with a hard ceiling and a trend alert. A doubling almost always traces to one of four causes: a prompt that grew, a cache hit-rate drop, output-length creep, or a model swap. Your slices answer “which” in seconds.
  • Tail latency (p95/p99 TTFT), per feature, against an SLO. Page on sustained breach, not a single spike.

These two alerts catch the regressions that otherwise surface as an end-of-month invoice surprise or a quiet drop in feature usage — both of which are far more expensive to discover late.
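A sketch of the "sustained breach, not a single spike" rule for the latency alert, assuming you already compute one p99 TTFT value per evaluation window (for example, per five minutes). The nearest-rank percentile, the function names, and the three-window default are illustrative choices, not a recommendation from any particular alerting tool.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Nearest-rank percentile for q in (0, 100]."""
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    ordered = sorted(values)
    rank = math.ceil(q * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def sustained_breach(
    window_values: Sequence[float], slo: float, min_consecutive: int = 3
) -> bool:
    """True only when the most recent `min_consecutive` windows all exceed the SLO."""
    recent = list(window_values)[-min_consecutive:]
    return len(recent) == min_consecutive and all(v > slo for v in recent)
```

For instance, with an SLO of 2000 ms, the per-window series 900, 2100, 950 does not page, while a series ending 2200, 2300, 2400 does.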

The payoff

Instrument cost and latency to the OTel GenAI conventions and the questions that used to mean a war room become queries: “why did this feature’s cost double last Tuesday?” → compare token-per-call and cache-hit-rate across the two days. “Why is the assistant slow for some users?” → TTFT-by-step on the slow-tail traces. The instrumentation is a few attributes per span and a price table you maintain. The return is that the two numbers most likely to get your LLM feature killed are the two you can see coming.
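The "cost doubled last Tuesday" comparison could look roughly like the following if the span data were exported as plain records. The record keys (day, input_tokens, output_tokens, cache_hit) are assumed names for this sketch; in practice the same grouping would usually be a query in your trace or metrics backend.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Mapping


def daily_profile(calls: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, float]]:
    """Per-day tokens per call and cache hit rate from exported LLM-call records."""
    totals: Dict[str, Dict[str, int]] = defaultdict(
        lambda: {"calls": 0, "input": 0, "output": 0, "hits": 0}
    )
    for call in calls:
        day = totals[call["day"]]
        day["calls"] += 1
        day["input"] += call["input_tokens"]
        day["output"] += call["output_tokens"]
        day["hits"] += 1 if call["cache_hit"] else 0
    return {
        name: {
            "input_tokens_per_call": t["input"] / t["calls"],
            "output_tokens_per_call": t["output"] / t["calls"],
            "cache_hit_rate": t["hits"] / t["calls"],
        }
        for name, t in totals.items()
    }
```

Comparing two days side by side then shows whether the change came from larger prompts, longer outputs, or a lower cache hit rate.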

Sources

  1. OpenTelemetry GenAI semantic conventions
  2. OpenTelemetry semantic-conventions repo (gen-ai)
  3. OpenLLMetry instrumentation
  4. OpenTelemetry Collector tail-sampling processor