ML Observe

The Open-Source ML Observability Stack: Evidently to Phoenix

An honest breakdown of the three open-source tools most teams reach for — what problem each was built for, where they overlap, where they don't, and how to assemble them without buying a platform you don't need yet.

By ML Observe Editorial · 8 min read

No single open-source tool does ML observability, and the most common mistake is treating one of them as if it did. Evidently, NannyML, and Phoenix are the three that teams reach for most often. They sound similar in their marketing, but they were built to solve genuinely different problems. Knowing which problem each owns is what lets you assemble a working stack without prematurely buying a commercial platform.

Evidently: distribution and quality reports for tabular and LLM data

Evidently’s core competence is computing and presenting drift and quality reports. Point it at a reference dataset and a current dataset and it produces statistical comparisons — data drift per feature, target drift, data-quality checks — and, more recently, LLM evaluation metrics. It has both an interactive report mode and a “test suite” mode that returns pass/fail, which is what you wire into a pipeline.
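To make the reference-versus-current comparison concrete, here is a minimal standard-library sketch of a per-feature drift check that returns pass/fail. It is not Evidently's API. The statistic (Population Stability Index), the 0.2 threshold (a common rule of thumb), and the names `psi` and `drift_gate` are all assumptions chosen for illustration; Evidently offers several drift methods and its own defaults.

```python
import math


def psi(reference, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples."""
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant features

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        total = len(values) + eps * bins
        return [(c + eps) / total for c in counts]  # smooth empty bins

    ref, cur = fractions(reference), fractions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def drift_gate(reference, current, threshold=0.2):
    """Return per-feature PSI with a pass/fail flag, plus an overall verdict."""
    results = {}
    for feature, ref_values in reference.items():
        score = psi(ref_values, current[feature])
        results[feature] = {"psi": score, "passed": score < threshold}
    return results, all(r["passed"] for r in results.values())


# Hypothetical data: "age" is unchanged, "income" has shifted upward.
reference = {"age": [20 + i % 40 for i in range(400)],
             "income": [30_000 + 100 * i for i in range(400)]}
current = {"age": [20 + i % 40 for i in range(400)],
           "income": [50_000 + 100 * i for i in range(400)]}

report, ok = drift_gate(reference, current)
```

The per-feature dictionary plays the role of the report a human reads, and the single boolean is what a pipeline step would branch on.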

What it is good at: structured tabular drift and data-quality monitoring, and producing artifacts a human will actually read. Where it stops: it tells you the input distribution moved; it does not, on its own, tell you what that did to model performance when labels are delayed, and it is not a tracing system for multi-step LLM apps.

NannyML: estimating performance when labels are late or absent

NannyML solves the problem the other two mostly do not: what is my model’s performance right now when ground-truth labels won’t arrive for weeks? Its central capability is performance estimation. CBPE (confidence-based performance estimation) estimates classification metrics such as ROC AUC on unlabeled production data, and DLE (direct loss estimation) does the same for regression metrics; once labels land, the estimates can be reconciled against realized performance. It also does multivariate drift with an emphasis on drift that actually matters for performance, rather than every feature that moved.
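The intuition behind confidence-based estimation can be sketched in a few lines. If the model's scores are well calibrated, a positive prediction with score p is correct with probability p, so an expected confusion matrix follows from the scores alone. The sketch below is a simplified illustration, not NannyML's implementation or API: it estimates accuracy, precision, and recall rather than ROC AUC, skips calibration and chunking, and the function names are hypothetical.

```python
def expected_confusion(probabilities, threshold=0.5):
    """Expected confusion-matrix counts, assuming calibrated scores."""
    tp = fp = tn = fn = 0.0
    for p in probabilities:
        if p >= threshold:  # model predicts the positive class
            tp += p         # correct with probability p
            fp += 1.0 - p
        else:               # model predicts the negative class
            tn += 1.0 - p   # correct with probability 1 - p
            fn += p
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


def estimate_metrics(probabilities, threshold=0.5):
    """Estimate metrics from expected counts, without any labels."""
    c = expected_confusion(probabilities, threshold)
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / len(probabilities),
        # Ratios of expected counts are an approximation of the expected ratio.
        "precision": c["tp"] / predicted_pos if predicted_pos else float("nan"),
        "recall": c["tp"] / actual_pos if actual_pos else float("nan"),
    }


# Hypothetical calibrated scores from unlabeled production data.
scores = [0.9, 0.8, 0.2, 0.1]
estimate = estimate_metrics(scores)
```

The whole estimate depends on the scores meaning what they say: if calibration degrades or the relationship between features and target changes, the numbers stay confident while becoming wrong.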

What it is good at: answering “is the model still accurate?” before labels exist, and cutting through drift noise to the drift that predicts performance loss. Where it stops: it is a tabular-model tool. It is not built for LLM tracing or generative-output evaluation, and performance estimation rests on assumptions (for CBPE, well-calibrated probabilities and no concept drift in the estimation window) that you must respect, or the estimate misleads.

Phoenix: tracing and evaluation for LLM and RAG applications

Phoenix is the odd one out and the one to use for generative systems. It ingests OpenTelemetry traces of LLM applications — spans for model calls, retrieval, tools — and layers evaluation (including LLM-as-judge) and embedding-drift analysis on top. Its job is to make a multi-step LLM/RAG request debuggable and continuously evaluable.
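What "spans for model calls, retrieval, tools" means structurally is easier to see in code. The sketch below is a toy in-memory tracer using only the standard library. It is not Phoenix's or OpenTelemetry's API, and the `Tracer` class, span names, and attribute keys are hypothetical. It only shows the shape of the data: one trace ID per request, and spans linked by parent IDs.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: dict = field(default_factory=dict)
    start: float = 0.0
    end: float = 0.0


class Tracer:
    """Records nested spans for a single request (one trace)."""

    def __init__(self):
        self.trace_id = uuid.uuid4().hex
        self.spans = []   # finished spans, in completion order
        self._stack = []  # currently open spans

    @contextmanager
    def span(self, name, **attributes):
        parent_id = self._stack[-1].span_id if self._stack else None
        current = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id,
                       dict(attributes), start=time.perf_counter())
        self._stack.append(current)
        try:
            yield current
        finally:
            current.end = time.perf_counter()
            self._stack.pop()
            self.spans.append(current)


def answer(question, tracer):
    """Stand-in RAG request: one retrieval step, then one model call."""
    with tracer.span("rag.request", question=question):
        with tracer.span("retrieval", top_k=2) as retrieval:
            docs = ["doc-a", "doc-b"]  # placeholder for a vector-store lookup
            retrieval.attributes["document_ids"] = docs
        with tracer.span("llm.call", model="hypothetical-model") as call:
            text = f"Answer based on {len(docs)} documents"
            call.attributes["output_chars"] = len(text)
    return text


tracer = Tracer()
result = answer("What changed in the churn model?", tracer)
```

With the retrieved document IDs attached to the retrieval span and the output attached to the model-call span, a bad answer can be traced back to the step that caused it, which is the debugging workflow a tracing backend is built around.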

What it is good at: end-to-end LLM/RAG tracing, retrieval debugging, embedding-drift visualization, and online evaluation of generative output. Where it stops: it is not a tabular performance-monitoring tool. It will not estimate a churn model’s AUC under label delay; that is not its problem.

How they actually compose

These tools are complementary far more than competing:

  • Classical tabular model in production with delayed labels: NannyML for performance estimation and performance-relevant drift; Evidently for detailed feature-level drift and data-quality reporting and pipeline test gates. They answer different questions about the same model.
  • LLM or RAG application: Phoenix for tracing, retrieval debugging, and online evaluation; Evidently can still contribute LLM-output quality reports if you want batch report artifacts. NannyML is generally not in this picture.
  • Common backbone: instrument with OpenTelemetry GenAI conventions regardless. Standard traces keep you portable across Phoenix and most commercial backends and prevent lock-in to any one tool’s SDK.
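As a rough illustration of the common-backbone point, the helper below builds the name and attributes for a model-call span using `gen_ai.*` keys. The attribute names and the "operation model" span-name pattern reflect the GenAI semantic conventions as understood at the time of writing; the conventions are still evolving, so treat them as assumptions and check the current specification. The helper `genai_span` and the model name are hypothetical.

```python
def genai_span(operation, model, input_tokens, output_tokens):
    """Build a span name and attribute dict for one model call."""
    name = f"{operation} {model}"
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return name, attributes


span_name, span_attributes = genai_span("chat", "hypothetical-model", 412, 96)
```

Because the keys are standard rather than tool-specific, any backend that understands the conventions can read the same spans without re-instrumenting the application.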

When open source is enough — and when it isn’t

For a single team running a handful of models or one LLM app, this stack is genuinely sufficient: NannyML or Evidently for tabular, Phoenix for generative, OpenTelemetry as the wire format, your existing metrics system for alerting. The real engineering cost is in the glue: running the services, storing traces and reference windows, building the alerting integration, and keeping evaluators calibrated. The honest trigger for evaluating a commercial platform is not “we need observability”; it is “we have many models and teams, need access control and managed retention, and the glue has become its own maintenance burden.” Until then, the open-source stack is not a downgrade. It is the correct amount of tool for the problem, and it keeps the question “what is each component actually doing for me?” answerable, which a bundled platform tends to obscure.

Sources

  1. Evidently AI — open-source ML and LLM evaluation/monitoring (docs)
  2. NannyML — performance estimation without labels (docs)
  3. Phoenix by Arize — open-source LLM tracing and evaluation (docs)
  4. OpenTelemetry GenAI semantic conventions