How to Detect Data Drift: Statistical Tests, Thresholds, and Production Wiring

A practitioner's guide to detecting data drift: PSI, KS, Wasserstein, and Jensen-Shannon compared, with Evidently code, threshold guidance, and real production caveats.

By Mlobserve Editorial · 8 min read

Knowing how to detect data drift is the difference between catching a degraded model as it happens and finding out six weeks later from a support ticket. Data drift is a shift in the statistical distribution of model inputs: the data arriving at your serving endpoint no longer matches the distribution you trained on. The model’s weights are unchanged; the world around them has shifted.

This post covers the four main detection methods, how to choose between them, how to wire Evidently into a production pipeline, and the caveats that trip teams up in practice.

The four tests and when to use each

No single statistical test is universally best. The right choice depends on dataset size, feature type, and how sensitive the alert needs to be.

Kolmogorov-Smirnov (KS) test. The default for numerical features on small samples. The KS statistic measures the maximum gap between two empirical CDFs, and a p-value below 0.05 (the 95% confidence level) signals drift. The problem surfaces at scale: Evidently’s benchmark comparing five methods shows KS firing on shifts as small as 0.5% once you exceed 100,000 observations, which are changes too small to affect model performance. At production log volumes, KS generates so many alerts that teams mute the channel.
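
The sample-size effect can be reproduced with SciPy's `ks_2samp`. The sketch below is illustrative only: it uses evenly spaced quantiles of a standard normal instead of random draws so the result does not depend on a seed, and the 0.05 shift and the two sample sizes are assumptions chosen for the demonstration, not values from the Evidently benchmark.

```python
import numpy as np
from scipy.stats import ks_2samp, norm

SHIFT = 0.05  # hypothetical drift: 5% of one standard deviation


def shifted_pair(n, shift):
    """Return n evenly spaced standard-normal quantiles and a shifted copy."""
    reference = norm.ppf((np.arange(n) + 0.5) / n)
    return reference, reference + shift


for n in (500, 200_000):
    reference, current = shifted_pair(n, SHIFT)
    result = ks_2samp(reference, current)
    verdict = "drift" if result.pvalue < 0.05 else "no drift"
    print(f"n={n:>7}: D={result.statistic:.4f}  p={result.pvalue:.3g}  -> {verdict}")
```

The KS statistic is roughly the same in both runs; only the p-value changes, because the significance threshold tightens as the sample grows. The same shift should pass at 500 rows and be flagged at 200,000.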

Population Stability Index (PSI). The finance industry’s standard for distribution shift, and the most operationally predictable test. PSI bins both distributions and computes:

PSI = Σ (actual_pct - expected_pct) × ln(actual_pct / expected_pct)

Industry thresholds are well-established: PSI < 0.1 means no significant shift; 0.1–0.2 means investigate; > 0.2 means significant drift. Critically, PSI is sample-size-stable — the score does not increase just because you have more rows. Its weakness: it is not sensitive enough to catch subtle but consequential shifts.
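
As a rough illustration, the formula and thresholds above can be written in a few lines of NumPy. This is a minimal sketch, not Evidently's implementation: the quantile binning on the reference window, the `floor` parameter, and the helper names `psi` and `psi_band` are assumptions made for the example.

```python
import numpy as np


def psi(reference, current, n_bins=10, floor=1e-3):
    """Population Stability Index using quantile bins taken from the reference."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    # Interior edges only: values outside the reference range fall into the outer bins.
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1)[1:-1])
    ref_bins = np.searchsorted(inner_edges, reference, side="right")
    cur_bins = np.searchsorted(inner_edges, current, side="right")
    expected = np.bincount(ref_bins, minlength=n_bins) / reference.size
    actual = np.bincount(cur_bins, minlength=n_bins) / current.size
    # Floor the proportions so empty bins do not produce log(0) or division by zero.
    expected = np.clip(expected, floor, None)
    actual = np.clip(actual, floor, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))


def psi_band(score):
    """Map a PSI score onto the three bands described in the text."""
    if score < 0.1:
        return "no significant shift"
    if score <= 0.2:
        return "investigate"
    return "significant drift"


rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)
current = rng.normal(1.0, 1.0, 10_000)  # hypothetical one-sigma mean shift

print(psi_band(psi(reference, reference)))
print(psi_band(psi(reference, current)))
```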

Wasserstein distance (earth mover’s distance). The Evidently comparison describes it as “a good compromise between ‘way-too-sensitive’ KS and ‘notice-only-big-changes’ PSI.” Wasserstein measures the minimum transport cost to convert one distribution into another. Normalized by the feature’s standard deviation, it enables cross-feature comparison on a consistent scale and behaves stably as sample size grows.
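
A minimal sketch of the normalization, using SciPy's `wasserstein_distance`. Dividing by the reference window's standard deviation is an assumption about which standard deviation to use, and the helper name and the two example features are hypothetical.

```python
import numpy as np
from scipy.stats import wasserstein_distance


def normalized_wasserstein(reference, current):
    """Wasserstein distance expressed in units of the reference standard deviation."""
    reference = np.asarray(reference, dtype=float)
    current = np.asarray(current, dtype=float)
    scale = max(float(np.std(reference)), 1e-12)  # guard against a constant feature
    return wasserstein_distance(reference, current) / scale


rng = np.random.default_rng(0)
age = rng.normal(40, 10, 5_000)              # hypothetical feature in years
revenue = rng.normal(40_000, 10_000, 5_000)  # hypothetical feature in dollars

# Shift each feature by half of its own standard deviation.
age_score = normalized_wasserstein(age, age + 0.5 * age.std())
revenue_score = normalized_wasserstein(revenue, revenue + 0.5 * revenue.std())
print(age_score, revenue_score)
```

Both features score about 0.5 despite living on very different raw scales, which is what makes a single threshold usable across columns.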

Jensen-Shannon divergence (JSD). A symmetric, bounded variant of KL divergence. KL divergence is documented to produce unstable results when bin frequencies approach zero, and its asymmetry means swapping reference and current changes the score. JSD compares each distribution against the average of the two, which is nonzero wherever either input is, so it avoids the zero-frequency instability. With base-2 logarithms its values are bounded between 0 and 1, making thresholds intuitive and consistent.
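
The contrast with KL divergence is easy to see on a toy pair of binned distributions. The sketch below uses SciPy; the bin proportions are made up for illustration. Note that `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon distance, which is the square root of the divergence, so the example squares it.

```python
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import entropy

# Hypothetical binned proportions; the reference has one empty bin.
reference = np.array([0.5, 0.5, 0.0])
current = np.array([0.4, 0.4, 0.2])

# entropy(p, q) computes KL(p || q).
kl_current_vs_reference = entropy(current, reference)  # infinite: reference bin is empty
kl_reference_vs_current = entropy(reference, current)  # finite, and a different number

js_distance = jensenshannon(reference, current, base=2)
js_divergence = js_distance ** 2  # bounded in [0, 1] with base-2 logarithms

print(kl_current_vs_reference, kl_reference_vs_current, js_divergence)
```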

For categorical features, KS does not apply. Use chi-squared on smaller datasets and Jensen-Shannon or PSI on large production logs.
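
For the small-sample categorical case, a chi-squared test on the category counts might look like the sketch below. It uses SciPy's `chi2_contingency` on a two-row table (reference counts and current counts); the helper name and the payment-method data are hypothetical.

```python
from collections import Counter

from scipy.stats import chi2_contingency


def chi2_drift_pvalue(reference, current):
    """p-value of a chi-squared test on category counts from two windows."""
    ref_counts, cur_counts = Counter(reference), Counter(current)
    categories = sorted(set(ref_counts) | set(cur_counts))
    table = [
        [ref_counts[c] for c in categories],
        [cur_counts[c] for c in categories],
    ]
    _, p_value, _, _ = chi2_contingency(table)
    return p_value


reference = ["card"] * 500 + ["paypal"] * 300 + ["wire"] * 200
unchanged = list(reference)
shifted = ["card"] * 300 + ["paypal"] * 300 + ["wire"] * 400

p_unchanged = chi2_drift_pvalue(reference, unchanged)
p_shifted = chi2_drift_pvalue(reference, shifted)
print(p_unchanged, p_shifted)
```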

Wiring it up with Evidently

Evidently handles test selection automatically based on column type and row count, with full override support. Below 1,000 rows, it defaults to KS (numerical) and chi-squared (categorical) at 95% confidence. Above 1,000 rows, it switches to Wasserstein (numerical) and Jensen-Shannon (categorical).

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

reference = pd.read_parquet("reference.parquet")  # training or recent stable window
current   = pd.read_parquet("current.parquet")    # latest production window

report = Report(metrics=[
    DataDriftPreset(
        stattest="psi",           # override auto-select; use PSI for large logs
        stattest_threshold=0.2,   # PSI > 0.2 = drifted column
        drift_share=0.5,          # flag dataset if >50% of columns drift
    )
])

report.run(reference_data=reference, current_data=current)
report.save_html("drift_report.html")

# Programmatic access for CI pipelines and alerting
result = report.as_dict()
summary = result["metrics"][0]["result"]  # dataset-level summary from the preset
dataset_drift = summary["dataset_drift"]
n_drifted = summary["number_of_drifted_columns"]
print(f"Dataset drift: {dataset_drift} | Drifted columns: {n_drifted}")
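
To turn that summary into a CI gate, one option is a small helper that maps the dictionary to a process exit code. This is a sketch under assumptions: `drift_exit_code` and `max_drifted_columns` are hypothetical names, and `example_result` is a hand-built stand-in with the same two keys read above, not real Evidently output.

```python
def drift_exit_code(result, max_drifted_columns=0):
    """Return 1 (fail the job) if the dataset or too many columns drifted, else 0."""
    summary = result["metrics"][0]["result"]
    too_many = summary["number_of_drifted_columns"] > max_drifted_columns
    return 1 if summary["dataset_drift"] or too_many else 0


# Hand-built stand-in for report.as_dict(); only the keys used above are included.
example_result = {
    "metrics": [
        {"result": {"dataset_drift": False, "number_of_drifted_columns": 2}}
    ]
}

exit_code = drift_exit_code(example_result, max_drifted_columns=3)
print(exit_code)  # in a CI script this value would be passed to sys.exit()
```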

You can apply different tests per column, which matters when some features need a sensitive test and others would over-fire under the same one:

from evidently.calculations.stattests import psi_stat_test, ks_stat_test

report = Report(metrics=[
    DataDriftPreset(
        per_column_stattest={
            "revenue_usd":    psi_stat_test,  # high-volume numeric — PSI avoids over-fire
            "item_count":     ks_stat_test,   # low-volume count — KS sensitivity is useful
        }
    )
])

For a broader look at connecting input drift to model performance signals, sentryml.com covers the full MLOps observability stack, including performance estimation on unlabeled production data — which tells you whether the distribution shift is actually hurting your model.

What good and bad look like on the chart

A healthy drift report shows most features with PSI below 0.1, no column flagged, and distribution overlays that roughly align between reference and current windows. Some movement is always present — production distributions are never perfectly stationary — and a well-calibrated threshold ignores the noise.

Patterns that warrant action:

  • Single feature with PSI > 0.2, rest stable. Usually a data-pipeline change, a new upstream vendor, or a market-segment shift. Investigate the feature’s data source before retraining.
  • Broad simultaneous drift across correlated features (PSI 0.1–0.2 on 20+ features at once). Points to an upstream schema change or a broken join rather than genuine distribution shift. Fix the pipeline, not the model.
  • Monotonically increasing drift week over week. Not a spike but a slow build. This is gradual drift accumulating as production moves away from the reference, and it often accompanies concept drift, which input tests alone cannot confirm. It warrants scheduled retraining, not an incident response.
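
These three patterns can be roughly encoded as triage rules over per-feature PSI scores. The sketch below is one possible reading of the list, not a standard procedure: the function names, the `min_weeks` setting, and the catch-all branch for several drifted features are assumptions.

```python
def triage(psi_by_feature, broad_count=20):
    """Classify one drift report from a mapping of feature name to PSI score."""
    drifted = [name for name, score in psi_by_feature.items() if score > 0.2]
    moderate = [name for name, score in psi_by_feature.items() if 0.1 <= score <= 0.2]
    if len(drifted) + len(moderate) >= broad_count:
        return "broad drift: check upstream schema and joins"
    if len(drifted) == 1:
        return f"single-feature drift: inspect the source of {drifted[0]}"
    if drifted:
        return "multiple features drifted: investigate"  # catch-all, not in the list above
    return "healthy"


def is_slow_build(weekly_scores, min_weeks=4):
    """True if the last min_weeks scores are strictly increasing."""
    recent = weekly_scores[-min_weeks:]
    return len(recent) == min_weeks and all(a < b for a, b in zip(recent, recent[1:]))


print(triage({"revenue_usd": 0.27, "item_count": 0.03}))
print(triage({f"f{i}": 0.15 for i in range(25)}))
print(is_slow_build([0.02, 0.04, 0.07, 0.09]))
```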

Caveats

KS over-fires at scale. On datasets above 50,000 rows, switch to Wasserstein or PSI. Teams that leave KS as the default at production volumes tend to end up with a muted alert channel.

PSI is sensitive to binning. The number and placement of bins change the score, and empty or near-empty bins push PSI toward infinity or NaN. Use 10–20 bins for numeric features, and set a minimum frequency floor (0.001 is a common choice) so the ratio and its logarithm stay finite.

High-cardinality categoricals will always look drifted. User IDs, session tokens, and raw SKUs introduce new categories every day by design. Either group rare values into an “other” bucket or exclude them from drift monitoring and track them as cardinality count metrics instead.
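
A minimal sketch of the "other" bucket approach. The key detail is that the set of kept categories is learned from the reference window and then applied unchanged to the current window, so brand-new values collapse into one bucket. The function names, the `min_share` cutoff, and the SKU data are hypothetical.

```python
from collections import Counter


def frequent_categories(reference_values, min_share=0.01):
    """Categories whose share of the reference window is at least min_share."""
    counts = Counter(reference_values)
    total = len(reference_values)
    return {value for value, count in counts.items() if count / total >= min_share}


def bucket_rare(values, keep, other_label="other"):
    """Replace every value outside the kept set with a single label."""
    return [value if value in keep else other_label for value in values]


reference_skus = ["A"] * 60 + ["B"] * 38 + ["C", "D"]
current_skus = ["A"] * 55 + ["B"] * 40 + ["E", "F", "G", "H", "I"]

keep = frequent_categories(reference_skus, min_share=0.05)
bucketed_current = Counter(bucket_rare(current_skus, keep))
print(bucketed_current)

# Track raw cardinality separately as its own metric.
print(len(set(reference_skus)), len(set(current_skus)))
```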

Reference window choice is a tuning decision. Drift is relative. A reference anchored to your training set from two years ago will show permanent drift from day one of production. A rolling 30-day reference window detects genuine distributional change against a recent stable baseline.
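
One way to build a rolling reference with pandas is to slice the event log by timestamp. This sketch assumes a 7-day current window placed directly after the 30-day reference; the 7-day length, the function name, and the column names are assumptions for illustration.

```python
import pandas as pd


def split_windows(df, ts_col, as_of, reference_days=30, current_days=7):
    """Return (reference, current): the current window ends at as_of, the reference precedes it."""
    as_of = pd.Timestamp(as_of)
    current_start = as_of - pd.Timedelta(days=current_days)
    reference_start = current_start - pd.Timedelta(days=reference_days)
    ts = df[ts_col]
    reference = df[(ts >= reference_start) & (ts < current_start)]
    current = df[(ts >= current_start) & (ts < as_of)]
    return reference, current


# Hypothetical log with one event per day.
events = pd.DataFrame({
    "event_time": pd.date_range("2024-01-01", periods=60, freq="D"),
    "value": range(60),
})

reference, current = split_windows(events, "event_time", as_of="2024-03-01")
print(len(reference), len(current))
```

Keeping the two windows disjoint matters: if the reference overlaps the current window, any shift is partly compared against itself and the score is understated.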

Drift does not equal degradation. Input distribution shift does not guarantee your model performs worse, because the model may generalize to the new inputs. Drift detection identifies a risk signal; it takes outcome monitoring and performance estimation to confirm whether the risk materialized and whether a retrain or rollback is warranted.

Sources

  1. Evidently AI — Which test is the best? 5 methods compared on large datasets
  2. Evidently AI — Data drift metrics documentation
  3. Detecting drifts in data streams using KL divergence — Journal of Data, Information and Management (Springer, 2024)