How to Monitor LLMs in Production: Metrics, Drift, and Alerting

Knowing how to monitor an LLM in production is not the same as knowing how to monitor a REST API. A request can succeed with HTTP 200 and normal latency while the model quietly gives wrong answers to, say, 15% of users. Traditional APM catches none of that. For the broader monitoring picture, browse our topic index. Production LLM observability splits into three distinct layers that must be instrumented separately: infrastructure performance, output quality, and distribution shift. Miss any one of them and you are flying blind. To plan exactly which attributes each layer should emit per span, our Trace Span Designer builds the schema before you write the instrumentation.

Layer 1: Infrastructure Metrics — Latency, Throughput, and Cost

The first layer is the one that looks most familiar. You want p50/p95/p99 latency histograms, error rates, and request throughput — but the specific metrics that matter for LLMs differ from a typical service.

Time-to-first-token (TTFT) is the number that users feel. If TTFT exceeds ~2 seconds the interaction feels broken even if total generation time is fast. Track TTFT separately from end-to-end latency. For a vLLM or Ray Serve deployment, TTFT is the duration from when the request enters the scheduler to when the first output token is emitted.
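As a rough illustration of tracking TTFT separately from end-to-end latency, the sketch below times a token stream from the client side. The names `StreamTiming` and `measure_stream` are hypothetical, and the measurement starts when the caller begins consuming the stream, which is an approximation of the scheduler-side definition above rather than the same thing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional


@dataclass
class StreamTiming:
    tokens: List[str]
    ttft_s: Optional[float]  # None if the stream produced no tokens
    total_s: float


def measure_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and time the first token and the full response."""
    start = clock()
    ttft = None
    collected: List[str] = []
    for token in tokens:
        if ttft is None:
            # First token arrived: this is the latency the user perceives.
            ttft = clock() - start
        collected.append(token)
    total = clock() - start
    return StreamTiming(tokens=collected, ttft_s=ttft, total_s=total)
```

Passing the clock as a parameter keeps the function testable with a fake clock; in a real service the two durations would typically be recorded into separate histograms.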

Tokens per second (throughput) tells you whether your GPU is saturated. A healthy batch-inference setup will have stable tokens/sec under load; a drop usually signals KV cache pressure, tensor parallelism bottlenecks, or queue starvation. Ray Serve exports Prometheus histogram and counter metrics out of the box, including queue depth, replica states, and autoscaling decisions.

Cost per request should be derived at span time, not reconciled from a provider invoice. Tag each LLM call span with input_tokens, output_tokens, and model ID. Compute cost inline from a local price table. This gives you cost broken down by user, endpoint, or prompt template — the invoice never will.
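A minimal sketch of computing cost at span time might look like the following. The model names and prices in `PRICE_TABLE` are made up for illustration, and `app.llm.cost_usd` is a hypothetical custom attribute name; only the `gen_ai.*` keys follow the GenAI semantic conventions.

```python
from typing import Dict

# Hypothetical prices in USD per 1M tokens; replace with your provider's rates.
PRICE_TABLE: Dict[str, Dict[str, float]] = {
    "example-small": {"input": 0.50, "output": 1.50},
    "example-large": {"input": 5.00, "output": 15.00},
}


def request_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost of one call, from token counts and the local price table."""
    rates = PRICE_TABLE[model]  # KeyError on an unknown model: fail loudly
    return (
        input_tokens * rates["input"] + output_tokens * rates["output"]
    ) / 1_000_000


def span_attributes(model: str, input_tokens: int, output_tokens: int) -> Dict[str, object]:
    """Attributes to attach to the LLM call span."""
    return {
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
        # Custom attribute (assumed name, not part of the conventions).
        "app.llm.cost_usd": request_cost_usd(model, input_tokens, output_tokens),
    }
```

Because the cost lives on the span next to user, endpoint, and template attributes, any of those dimensions can be used to group it later.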

A minimal Prometheus scrape config for a Ray Serve deployment:

scrape_configs:
  - job_name: "ray-serve"
    static_configs:
      - targets: ["localhost:8080"]
    metrics_path: "/metrics"
    scrape_interval: 15s

Instrument your LLM call spans with OpenTelemetry’s Python SDK using the GenAI semantic conventions (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.request.model). The attribute names are provider-neutral, which keeps your telemetry portable when you switch models.

Set SLOs before you deploy. A concrete starting point:

TTFT p95 < 1.5 seconds
End-to-end latency p99 < 8 seconds
Error rate < 0.5%
Cost per 1000 requests < $X (derive from your budget)

Alert when any of these is breached for two consecutive 5-minute windows. Prometheus alerting rules written in PromQL, with Alertmanager routing the notifications, handle this without custom code.
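To make the "two consecutive windows" rule concrete, here is a small sketch of the logic in Python. The function name `sustained_breach` is hypothetical; in Prometheus the same idea is normally expressed with the `for` clause of an alerting rule rather than in application code.

```python
from typing import Sequence


def sustained_breach(
    window_values: Sequence[float],
    threshold: float,
    consecutive: int = 2,
) -> bool:
    """True if the most recent `consecutive` windows all exceed the threshold.

    `window_values` holds one aggregated value per evaluation window
    (for example TTFT p95 per 5 minutes), oldest first.
    """
    if consecutive <= 0:
        raise ValueError("consecutive must be positive")
    if len(window_values) < consecutive:
        return False  # not enough history to call it sustained
    return all(value > threshold for value in window_values[-consecutive:])
```

A single bad window does not fire; the alert triggers only once the breach persists across the required number of windows.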

Layer 2: Output Quality and Distribution Drift

Infrastructure metrics tell you the model answered; they say nothing about whether it answered correctly. This is where most teams underinvest.

A 2023 study by researchers at Stanford and UC Berkeley compared the March and June 2023 versions of GPT-3.5 and GPT-4 and found that GPT-4’s accuracy on prime number identification dropped from 84% to 51%. Instruction-following behavior changed, and code generation outputs shifted. The underlying model changed under the teams relying on it. The paper (arXiv:2307.09009) is concrete evidence that LLM behavior drifts even when the model name you call stays the same, so monitoring output quality is not optional.

Eval sets and golden sets are the baseline. Maintain a fixed set of inputs with known-good outputs. Run this set on every deploy and on a daily schedule against the live endpoint. A regression on the golden set is your early warning: it should show up before production traffic visibly degrades. Aim for at least 200 examples covering the core user intents your model serves.
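A golden-set check can be sketched in a few lines. Everything here is illustrative: `model_fn` stands in for whatever calls your endpoint, exact string match is only one possible comparison, and the 2-point tolerance is an assumed value to tune.

```python
from typing import Callable, Sequence, Tuple


def golden_pass_rate(
    model_fn: Callable[[str], str],
    golden_set: Sequence[Tuple[str, str]],
    matches: Callable[[str, str], bool] = lambda got, want: got.strip() == want.strip(),
) -> float:
    """Fraction of golden examples whose output matches the known-good answer."""
    if not golden_set:
        raise ValueError("golden set is empty")
    passed = sum(1 for prompt, want in golden_set if matches(model_fn(prompt), want))
    return passed / len(golden_set)


def has_regressed(pass_rate: float, baseline: float, tolerance: float = 0.02) -> bool:
    """True if the pass rate fell more than `tolerance` below the baseline."""
    return pass_rate < baseline - tolerance
```

For open-ended outputs the `matches` callable could be swapped for a rubric or similarity check; the regression comparison stays the same.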

Online quality signals are harder but necessary. For tasks where ground truth exists (classification, extraction, structured output), compute accuracy directly. For open-ended generation, use LLM-as-judge patterns: a smaller, faster model that scores the primary model’s outputs on a rubric. This is imperfect but useful at scale: individual scores are noisy, yet a shift in the judge score distribution is detectable and worth investigating.

Input distribution drift catches problems before outputs degrade. If your users start submitting prompts that are statistically different from the inputs you tuned and evaluated against, quality can drop before your eval set catches it. Use the Population Stability Index (PSI) or the Kolmogorov-Smirnov (KS) test on feature distributions (prompt length, embedding distance from centroid, vocabulary shift). Evidently AI provides drift detection for text data including PSI and KS tests, with dashboard integration that can alert when drift exceeds a threshold.
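For readers who want to see what PSI computes, the following is a minimal standard-library sketch over a scalar feature such as prompt length. The equal-width binning, the bin count, and the `eps` floor for empty bins are assumptions of this sketch; library implementations may bin differently.

```python
import math
from typing import List, Sequence


def psi(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population Stability Index between a reference and a current sample."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has zero range")
    width = (hi - lo) / bins  # equal-width bins defined on the reference

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            index = min(max(index, 0), bins - 1)  # clamp outliers into edge bins
            counts[index] += 1
        # Floor at eps so empty bins do not produce log(0) or division by zero.
        return [max(count / len(values), eps) for count in counts]

    ref_p = proportions(reference)
    cur_p = proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))
```

A commonly cited rule of thumb treats PSI below 0.1 as stable and above 0.25 as a significant shift, but those cutoffs are conventions and should be validated against your own traffic.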

For RAG pipelines, add retrieval quality metrics: recall@k (what fraction of relevant chunks are in the top-k results), mean reciprocal rank (MRR), and context relevance scores. A drop in recall@k often precedes answer quality problems and is easier to compute than downstream accuracy.
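Both retrieval metrics are short enough to compute directly. The sketch below assumes each query comes with a ranked list of retrieved chunk IDs and a labeled set of relevant chunk IDs; the function names are illustrative.

```python
from typing import Hashable, Iterable, Sequence, Tuple


def recall_at_k(retrieved: Sequence[Hashable], relevant: Iterable[Hashable], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k retrieved results."""
    relevant_set = set(relevant)
    if not relevant_set:
        raise ValueError("no relevant chunks labeled for this query")
    return len(set(retrieved[:k]) & relevant_set) / len(relevant_set)


def mean_reciprocal_rank(
    queries: Sequence[Tuple[Sequence[Hashable], Iterable[Hashable]]],
) -> float:
    """Average of 1/rank of the first relevant result; 0 when none is retrieved."""
    if not queries:
        raise ValueError("no queries given")
    total = 0.0
    for retrieved, relevant in queries:
        relevant_set = set(relevant)
        for rank, chunk_id in enumerate(retrieved, start=1):
            if chunk_id in relevant_set:
                total += 1.0 / rank
                break
    return total / len(queries)
```

Tracked per day on a labeled query set, a falling recall@k gives an earlier signal than waiting for answer quality to drop.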

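A sample Evidently drift report configuration, written against the legacy Report API (the evidently.report module):

from evidently import ColumnMapping
from evidently.report import Report
from evidently.metric_preset import TextOverviewPreset
from evidently.metrics import ColumnDriftMetric

# reference_df and production_df are pandas DataFrames with the same columns:
# a numeric "prompt_length" column and a raw-text "response_text" column.
column_mapping = ColumnMapping(
    numerical_features=["prompt_length"],
    text_features=["response_text"],
)

report = Report(metrics=[
    # KS needs a scalar column; embedding vectors need a separate
    # embeddings-drift method rather than a per-column KS test.
    ColumnDriftMetric(column_name="prompt_length", stattest="ks"),
    # Text descriptors for the responses, compared across both datasets.
    TextOverviewPreset(column_name="response_text"),
])

report.run(
    reference_data=reference_df,
    current_data=production_df,
    column_mapping=column_mapping,
)
report.save_html("drift_report.html")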

On the security side of output monitoring, prompt injection attacks and jailbreak attempts are a distribution shift in attacker-controlled inputs. See aisec.blog for detection patterns that overlap with standard drift detection. For deeper coverage of MLOps-side monitoring architecture, sentryml.com covers model debugging and observability infrastructure in detail.

Layer 3: Alerting, Canary Deploys, and Incident Response

Alerting on LLMs requires slightly different logic than standard services. The noisiest false-alarm source is short bursts of degraded quality from a single bad input type. Use a 5-minute or 10-minute evaluation window before paging anyone. Alert on sustained degradation, not single-request failures.

Canary deploys before full rollout are the safest way to ship model updates to live traffic. Route 5-10% of traffic to the new version, then compare TTFT, cost, and online quality signals against the baseline for 30 minutes. Automate the promotion decision if metrics stay within bounds; automate rollback if they don’t.
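The automated promote-or-rollback decision could be sketched as below. The `WindowMetrics` fields, the `canary_decision` name, and the tolerance values are all assumptions chosen for illustration; real bounds should come from your SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WindowMetrics:
    ttft_p95_s: float
    cost_per_request_usd: float
    quality_score: float  # higher is better, e.g. mean judge score


def canary_decision(
    baseline: WindowMetrics,
    canary: WindowMetrics,
    max_ttft_increase: float = 0.10,   # allow up to +10% TTFT p95
    max_cost_increase: float = 0.10,   # allow up to +10% cost per request
    max_quality_drop: float = 0.02,    # allow up to -0.02 absolute quality
) -> str:
    """Return "promote" if the canary stays within bounds, else "rollback"."""
    within_bounds = [
        canary.ttft_p95_s <= baseline.ttft_p95_s * (1 + max_ttft_increase),
        canary.cost_per_request_usd
        <= baseline.cost_per_request_usd * (1 + max_cost_increase),
        canary.quality_score >= baseline.quality_score - max_quality_drop,
    ]
    return "promote" if all(within_bounds) else "rollback"
```

Both sets of metrics should be aggregated over the same 30-minute window so that traffic mix and load are comparable.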

Shadow deploys are useful when you cannot afford any user-facing risk. Run the new model in parallel on all requests, log outputs without serving them, and compare offline. This costs double the inference compute but gives you a clean A/B comparison with zero user impact.

When an incident fires, the first three queries you should run:

Is it latency or quality? (infrastructure vs output layer)
Is it model-wide or subset-specific? (slice by user segment, endpoint, prompt template)
Did anything deploy in the last 24 hours? (model update, dependency change, prompt modification)

Instrument your spans with enough context — user segment, prompt template ID, model version, serving replica — so these queries take seconds, not hours.
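With that context on every span, the "model-wide or subset-specific" question reduces to a group-by. The sketch below assumes spans have been exported as plain dictionaries with a boolean `failed` field; both the shape and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping


def failure_rate_by(spans: Iterable[Mapping[str, object]], dimension: str) -> Dict[str, float]:
    """Failure rate per value of one span attribute (e.g. prompt template ID)."""
    totals: Dict[str, int] = defaultdict(int)
    failures: Dict[str, int] = defaultdict(int)
    for span in spans:
        key = str(span.get(dimension, "unknown"))
        totals[key] += 1
        if span.get("failed"):
            failures[key] += 1
    return {key: failures[key] / totals[key] for key in totals}
```

If one template or model version stands out while the others look normal, the incident is subset-specific and the investigation can narrow accordingly.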

Sources

Ray Serve Monitoring and Observability. Official Ray Serve docs covering Prometheus metric exports for latency histograms, queue depth, and autoscaling events.
Prometheus: Systems Monitoring & Alerting Toolkit. CNCF reference for scrape configuration, PromQL, and alert rule definition.
How Is ChatGPT’s Behavior Changing over Time? (arXiv:2307.09009). Study documenting behavioral drift in GPT-3.5 and GPT-4 across months; motivates continuous output quality monitoring.
OpenTelemetry Python Instrumentation. Standard SDK for adding traces, metrics, and logs to Python LLM services using GenAI semantic conventions.
Evidently AI: ML Monitoring Framework. Open-source and managed platform for drift detection, text quality metrics, hallucination scoring, and PII detection in production ML systems.
