ML Observe

Alerting for ML Model Drift: A Practical Setup

Drift alerting fails in one of two ways — it never fires, or it fires constantly until everyone mutes it. A concrete setup for alerts that fire when performance is actually at risk, and stay quiet when it isn't.

By Priya Anand · 8 min read

Most drift alerting is broken in one of two predictable ways. Either it never fires — a quarterly drift report nobody reads — or it fires on every feature that wobbles, until the channel is muted and the one alert that mattered scrolls past unseen. The fix is not a better statistical test. It is deciding, deliberately, what you alert on and what an alert is allowed to mean. Here is a setup that holds up in production.

The core principle: alert on consequence, not on movement

Input distributions move all the time. A feature drifting is not, by itself, a problem — it is a problem only if it degrades the model’s predictions. The single most important design decision is to push detection up the stack: alert on estimated or measured performance loss first, and treat raw input drift as a diagnostic you consult after a performance alert, not as a pager trigger on its own.

This is exactly the gap NannyML was built for: estimating a metric like ROC AUC on unlabeled production data (via methods such as CBPE) before ground-truth labels arrive, so you can alert on “the model is probably less accurate now” weeks before the labels confirm it. Evidently complements it with detailed feature-level drift and data-quality checks, packaged as pass/fail test suites you can run in a pipeline. The division of labor: performance estimate triggers the page; feature drift report explains it.
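To make the idea of estimating performance without labels concrete, here is a minimal sketch of the confidence-based intuition behind methods like CBPE. It is not NannyML's implementation, and it estimates accuracy rather than ROC AUC because accuracy is the simplest case. It assumes the model's probabilities are well calibrated; the function name and the sample windows are hypothetical.

```python
def estimate_accuracy(calibrated_probs, threshold=0.5):
    """Expected accuracy of thresholded binary predictions, without labels.

    Assumption: the probabilities are well calibrated and there is no
    concept drift, so a score of 0.9 is right about 90% of the time.
    """
    if not calibrated_probs:
        raise ValueError("need at least one prediction")
    expected_correct = 0.0
    for p in calibrated_probs:
        predicted_positive = p >= threshold
        # A positive prediction is correct with probability p,
        # a negative prediction with probability 1 - p.
        expected_correct += p if predicted_positive else 1.0 - p
    return expected_correct / len(calibrated_probs)


# Hypothetical windows: confident scores at reference time,
# scores bunched near the threshold in production.
reference_window = [0.95, 0.05, 0.90, 0.10]
production_window = [0.60, 0.45, 0.55, 0.40]

reference_estimate = estimate_accuracy(reference_window)
production_estimate = estimate_accuracy(production_window)
```

When the production scores crowd toward the decision threshold, the estimate falls even though no label has arrived yet, which is the signal worth alerting on.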

A three-tier alert design

Not every signal deserves a page. Split alerts into three tiers with different routing:

  • Tier 1: Page (wake someone up). Estimated performance has dropped below a hard threshold, or a data-integrity break has occurred (a feature is suddenly all-null, a schema changed, an upstream join is empty). Either means the model may be producing harmful output now.
  • Tier 2: Ticket (handle this week). Sustained moderate drift on a performance-relevant feature, a slow degradation trend, or rising prediction-confidence skew. Real, not urgent.
  • Tier 3: Dashboard only (no notification). Individual feature drift, distribution shifts on low-importance features, expected seasonal movement. Visible when you go looking; silent otherwise.

The mistake teams make is wiring everything to Tier 1. The discipline is that most drift belongs in Tier 3, and earning a promotion to Tier 1 requires evidence of consequence.
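The tiering can be written down as a small routing function, which makes the default explicit: anything without evidence of consequence falls through to the dashboard. This is a sketch under assumptions; the Signal fields, the route function, and the 0.78 floor are illustrative names and values, not part of any library.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    PAGE = 1
    TICKET = 2
    DASHBOARD = 3


@dataclass
class Signal:
    """Hypothetical summary of one monitoring window."""
    estimated_auc: float
    integrity_violation: bool = False
    sustained_drift_on_important_feature: bool = False
    degradation_trend: bool = False


AUC_FLOOR = 0.78  # illustrative hard floor, chosen per model


def route(signal, auc_floor=AUC_FLOOR):
    # Tier 1: the model may be producing harmful output right now.
    if signal.integrity_violation or signal.estimated_auc < auc_floor:
        return Tier.PAGE
    # Tier 2: real, but it can wait for working hours.
    if signal.sustained_drift_on_important_feature or signal.degradation_trend:
        return Tier.TICKET
    # Tier 3: everything else stays on the dashboard.
    return Tier.DASHBOARD
```

Note that raw feature drift never appears in the Tier 1 branch; it can only earn a ticket, and only when it is sustained and on a feature that matters.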

Thresholds that don’t cry wolf

A single statistical-test p-value as a pager threshold is a false-positive machine — with enough features and enough traffic, something is always “significant.” Better practices:

  • Tie thresholds to performance impact, not to a generic significance level. “AUC estimate dropped more than X below the validation baseline” beats “PSI exceeded 0.2 on some feature.”
  • Require persistence. Fire only when the condition holds for N consecutive windows (e.g., 3 hourly windows), not on a single noisy sample. This alone removes most spurious pages.
  • Weight by feature importance. Drift on a top-5 SHAP feature matters; drift on a feature the model barely uses does not. Gate Tier 1/2 on importance-weighted drift.
  • Set seasonal baselines. Compare against the same window last week or last cycle, not a frozen training-time reference, so predictable daily and weekly patterns do not page you.
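Three of these practices are a few lines each. The sketch below shows one possible form of a persistence gate, an importance-weighted drift score, and a seasonal comparison. All function names, feature names, and numbers are hypothetical, and the weighting scheme (a weighted mean of per-feature drift scores) is one reasonable choice among several.

```python
def breached_for(history, threshold, n_windows):
    """True only if the last n_windows values are all below threshold."""
    if len(history) < n_windows:
        return False
    return all(value < threshold for value in history[-n_windows:])


def importance_weighted_drift(drift_scores, importances):
    """Weighted mean of per-feature drift, weighted by feature importance."""
    total_weight = sum(importances.get(f, 0.0) for f in drift_scores)
    if total_weight == 0:
        return 0.0
    weighted = sum(score * importances.get(f, 0.0)
                   for f, score in drift_scores.items())
    return weighted / total_weight


def seasonal_delta(series, period):
    """Change of the latest value versus the same point one period ago."""
    if len(series) <= period:
        raise ValueError("need more than one full period of history")
    return series[-1] - series[-1 - period]


# One noisy dip does not page; three consecutive breaches do.
noisy = breached_for([0.81, 0.77, 0.80, 0.76], threshold=0.78, n_windows=3)
sustained = breached_for([0.81, 0.77, 0.76, 0.75], threshold=0.78, n_windows=3)

# Heavy drift on a barely used feature contributes little to the score.
score = importance_weighted_drift(
    {"income": 0.05, "zip_code": 0.40},
    {"income": 0.9, "zip_code": 0.1},
)

# Daily values compared against the same weekday last week (period=7).
delta = seasonal_delta([10, 12, 12, 11, 12, 20, 21, 10.5], period=7)
```

In this example the weekend peak of 20 and 21 never enters the comparison, because the latest value is measured against the same weekday one week earlier rather than against yesterday.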

Respect the assumptions, or the alert lies

Performance estimation is powerful, but it rests on assumptions you must honor or it will mislead. The big one: estimators like CBPE assume no concept drift within the estimation window. They infer expected performance from the model’s calibrated confidence scores, so they cannot detect a fundamentally changed input-output relationship. So pair the estimate with two guards:

  • A realized-vs-estimated reconciliation alert that fires when labels eventually arrive and disagree with the estimate. Persistent disagreement means your estimator’s assumptions are breaking, which is itself a Tier 2 signal.
  • Out-of-distribution / novel-category detection on inputs, because a brand-new category the model never trained on is exactly the case estimation handles poorly.
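Both guards reduce to simple comparisons once the data is in hand. The following is a sketch, assuming you keep per-window estimated and realized metric values side by side and know the category vocabulary seen at training time. The function names, the 0.05 tolerance, and the payment-method values are hypothetical.

```python
def persistent_disagreement(estimated, realized, tolerance, n_windows):
    """True if the estimate missed the realized metric by more than
    tolerance in each of the last n_windows labeled windows."""
    gaps = [abs(e - r) for e, r in zip(estimated, realized)]
    if len(gaps) < n_windows:
        return False
    return all(gap > tolerance for gap in gaps[-n_windows:])


def novel_categories(seen_in_training, production_values):
    """Categories present in production that training never saw."""
    return set(production_values) - set(seen_in_training)


def novel_rate(seen_in_training, production_values):
    """Share of production rows carrying an unseen category."""
    if not production_values:
        return 0.0
    seen = set(seen_in_training)
    unseen_rows = sum(1 for value in production_values if value not in seen)
    return unseen_rows / len(production_values)


# Labels arrive late and show the estimate was too optimistic three times running.
estimator_suspect = persistent_disagreement(
    estimated=[0.82, 0.81, 0.82, 0.81],
    realized=[0.81, 0.74, 0.73, 0.72],
    tolerance=0.05,
    n_windows=3,
)

# A payment method launched after training shows up in half the traffic.
new_values = novel_categories({"card", "bank_transfer"},
                              ["card", "wallet", "card", "wallet"])
new_share = novel_rate({"card", "bank_transfer"},
                       ["card", "wallet", "card", "wallet"])
```

The reconciliation check reuses the persistence idea: one window of disagreement is noise, several in a row suggest the estimator's assumptions no longer hold.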

Wiring it into your existing alerting

You do not need an ML-specific alerting product. Emit drift and performance metrics as time series and route them through the alerting system you already run. With a Prometheus-style stack: your monitoring job (Evidently test suite, NannyML estimation) writes metrics like model_estimated_auc, feature_drift_score{feature=...,importance=...}, and data_integrity_violation; alert rules encode the tiering (as labels) and the persistence (for: 3h); Alertmanager routes on those labels and handles deduplication, grouping, and silencing. This reuses on-call infrastructure your team already trusts instead of standing up a parallel one.

# illustrative Prometheus alert rule (Tier 1); the group name is a placeholder
groups:
  - name: ml-model-drift
    rules:
      - alert: ModelPerformanceDegraded
        expr: model_estimated_auc < 0.78   # hard floor; derive the value from your validation baseline
        for: 3h                            # persistence: must hold for 3h (three hourly windows)
        labels:
          severity: page
          tier: "1"
        annotations:
          summary: "Estimated AUC below floor for 3h on {{ $labels.model }}"
          runbook: "https://runbooks.internal/ml-drift"
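On the producing side, the monitoring job has to expose those series somehow. In a real job the prometheus_client library is the usual route; the sketch below uses only the standard library and renders the Prometheus text exposition format directly, for example to write a file that node_exporter's textfile collector picks up. The helper names, the model label value, and the sample numbers are hypothetical.

```python
def escape_label_value(value):
    """Escape backslash, double quote, and newline per the text format."""
    return (str(value)
            .replace("\\", "\\\\")
            .replace('"', '\\"')
            .replace("\n", "\\n"))


def format_sample(name, labels, value):
    """Render one sample line: name{label="value",...} number."""
    if labels:
        rendered = ",".join(
            f'{key}="{escape_label_value(val)}"'
            for key, val in sorted(labels.items())
        )
        return f"{name}{{{rendered}}} {float(value)}"
    return f"{name} {float(value)}"


def render_metrics(samples):
    """Render (name, labels, value) tuples as gauges, grouped by metric name."""
    lines = []
    declared = set()
    # Sorting by name keeps all samples of one metric family together.
    for name, labels, value in sorted(samples, key=lambda s: s[0]):
        if name not in declared:
            lines.append(f"# TYPE {name} gauge")
            declared.add(name)
        lines.append(format_sample(name, labels, value))
    return "\n".join(lines) + "\n"


samples = [
    ("model_estimated_auc", {"model": "churn"}, 0.81),
    ("feature_drift_score",
     {"model": "churn", "feature": "income", "importance": "high"}, 0.12),
    ("data_integrity_violation", {"model": "churn"}, 0),
]
exposition = render_metrics(samples)
```

Keeping a model label on every series is what lets a single rule cover many models and fill in the model name in the alert summary.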

Don’t forget the LLM case

For generative systems, “drift” is less about feature distributions and more about output-quality decay — rising refusal rates, falling eval scores from an LLM-as-judge, retrieval-relevance slipping, or cost/latency creeping up. Capture those as metrics on your traces (instrumented to the OpenTelemetry GenAI conventions so they stay portable) and apply the same three-tier discipline: page on a quality-eval floor breach or a hard cost spike, ticket on slow eval-score decline, dashboard everything else.
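For the eval-score part, the page and ticket conditions differ only in what they look at: the latest value versus the trend. A minimal sketch, assuming you already aggregate an LLM-as-judge score into one value per day; the function names, the 0.75 floor, and the 0.01-per-day decline limit are hypothetical, and in practice the page condition would also be persistence-gated.

```python
def slope(values):
    """Least-squares slope of values against their index (change per step)."""
    n = len(values)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    denominator = sum((i - mean_x) ** 2 for i in range(n))
    return numerator / denominator


def llm_quality_tier(daily_scores, score_floor, max_decline_per_day):
    """Map a series of daily eval scores to page, ticket, or dashboard."""
    if not daily_scores:
        raise ValueError("need at least one daily score")
    # Page: the latest score is below the quality floor.
    if daily_scores[-1] < score_floor:
        return "page"
    # Ticket: still above the floor, but declining steadily.
    if slope(daily_scores) <= -max_decline_per_day:
        return "ticket"
    return "dashboard"


declining = llm_quality_tier([0.90, 0.88, 0.86, 0.84], 0.75, 0.01)
breached = llm_quality_tier([0.90, 0.88, 0.86, 0.70], 0.75, 0.01)
stable = llm_quality_tier([0.90, 0.90, 0.91, 0.90], 0.75, 0.01)
```

The same shape works for refusal rate or retrieval relevance; only the direction of "bad" and the thresholds change.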

Close the loop with a runbook

An alert that pages someone with no instruction is half an alert. Every Tier 1 rule should link a runbook that answers: which dashboard shows the contributing drift, who owns the model, what the rollback or retrain trigger is, and how to silence correctly if it is a known event. The first time a Tier 1 fires and the responder lands on a clear runbook instead of a panic, the setup has paid for itself.
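The runbook requirement is easy to enforce mechanically. Here is a sketch of a lint that could run in CI, assuming the alert rules have already been parsed into dictionaries shaped like the rule shown earlier, with a tier label and a runbook annotation. The function name and the second rule are hypothetical.

```python
def rules_missing_runbook(rules):
    """Return the names of Tier 1 rules that have no runbook link."""
    missing = []
    for rule in rules:
        labels = rule.get("labels", {})
        annotations = rule.get("annotations", {})
        is_tier_one = labels.get("tier") == "1"
        has_runbook = bool(annotations.get("runbook", "").strip())
        if is_tier_one and not has_runbook:
            missing.append(rule.get("alert", "<unnamed>"))
    return missing


rules = [
    {
        "alert": "ModelPerformanceDegraded",
        "labels": {"severity": "page", "tier": "1"},
        "annotations": {"runbook": "https://runbooks.internal/ml-drift"},
    },
    {
        # Hypothetical rule that pages without telling the responder what to do.
        "alert": "FeatureAllNull",
        "labels": {"severity": "page", "tier": "1"},
        "annotations": {"summary": "Feature is all-null"},
    },
]

offenders = rules_missing_runbook(rules)
```

Failing the build when the returned list is non-empty keeps a Tier 1 rule from shipping without instructions.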

What good looks like

A healthy drift-alerting setup is quiet most weeks, fires Tier 2 tickets a few times a month that turn into scheduled retrains, and pages on Tier 1 rarely — and when it does, it is right. If your channel is noisy, you are alerting on movement instead of consequence. If it is silent through an incident, you are running reports instead of alerts. The target is the narrow band between: consequence-driven, persistence-gated, importance-weighted, and wired into the on-call system you already trust.

Sources

  1. Evidently AI — data drift and test suites (docs)
  2. NannyML — performance estimation without labels (docs)
  3. Prometheus Alerting overview
  4. OpenTelemetry GenAI semantic conventions