
How Evaluation Works

Once your data is in Deepchecks, the platform evaluates every interaction automatically. This section explains how evaluation works and what the results mean.

In the Get Started section you tried Deepchecks with sample data, and in Integrate Your Data you connected your real pipeline. Now that your data is flowing into Deepchecks, this section covers what the platform does with it: how it measures quality on every interaction using properties and annotations.


The evaluation pipeline

When data arrives in Deepchecks (whether from auto-instrumentation, the Python SDK, or a CSV upload), the platform runs an automatic evaluation pipeline. This is the same pipeline described under "What happens after upload" in the Integrate Your Data section:

  1. Interaction types are assigned - each interaction (or span, for agentic data) is mapped to an interaction type (Q&A, Root, Agent, Chain, Tool, LLM, Retrieval, etc.) that determines which evaluation rules apply.
  2. System metrics are computed - latency, token usage, and cost are calculated per interaction and aggregated per session, so you can assess operational efficiency alongside quality.
  3. Spans are grouped into traces - for agentic data, the parent-child hierarchy is reconstructed so you can inspect full executions in the Sessions view.
  4. Properties are calculated on every interaction - individual quality scores like Grounded in Context, Avoided Answer, Toxicity, or Plan Efficiency. Each interaction type has its own set of relevant properties. Properties are the foundation of evaluation in Deepchecks.
  5. Automatic annotations are assigned based on property scores and other signals - configurable rules label each interaction as Good, Bad, or Unknown. Annotations give you a single quality verdict per interaction.
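Steps 2 and 3 above can be pictured with a small sketch. This is not Deepchecks code: the `Span` record, its field names, and both helper functions are hypothetical, chosen only to illustrate what "reconstructing the parent-child hierarchy" and "aggregating per session" might look like. It assumes each span reports only its own token usage and cost, and that a root span's latency already covers its children.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    # Hypothetical record shape for illustration; not the Deepchecks schema.
    span_id: str
    parent_id: Optional[str]
    session_id: str
    latency_s: float
    tokens: int
    cost_usd: float
    children: List["Span"] = field(default_factory=list)


def build_traces(spans: List[Span]) -> List[Span]:
    """Attach each span to its parent and return the root spans."""
    by_id = {s.span_id: s for s in spans}
    roots = []
    for s in spans:
        parent = by_id.get(s.parent_id) if s.parent_id is not None else None
        if parent is None:
            roots.append(s)  # no known parent: treat as the top of a trace
        else:
            parent.children.append(s)
    return roots


def aggregate_by_session(spans: List[Span]) -> Dict[str, Dict[str, float]]:
    """Roll per-span system metrics up to per-session totals."""
    totals = defaultdict(lambda: {"latency_s": 0.0, "tokens": 0, "cost_usd": 0.0})
    for s in spans:
        t = totals[s.session_id]
        # Assumption: tokens and cost are reported per span, so they can be summed.
        t["tokens"] += s.tokens
        t["cost_usd"] += s.cost_usd
        # Assumption: a root span's latency includes its children, so only
        # root spans are counted to avoid double counting.
        if s.parent_id is None:
            t["latency_s"] += s.latency_s
    return dict(totals)
```

Calling `build_traces` on a flat list of spans yields a tree per execution, which is the kind of structure a trace view needs; `aggregate_by_session` gives one row of operational totals per session.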

Once the pipeline completes, you can explore your results on the analysis screens (Overview, Sessions, Interactions) and drill into individual interactions, sessions, and properties.


What is in this section

Properties

Start here to understand how Deepchecks measures quality.

Annotations

Properties produce scores. Annotations turn those scores into actionable quality verdicts.
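As a rough illustration of that relationship, the sketch below maps a few property scores to a Good, Bad, or Unknown label. Everything here is an assumption made for the example: the rule table, the 0-to-1 score range, the thresholds, and the choice to return Unknown when a score is missing. The actual rules in Deepchecks are configurable and may also use other signals.

```python
from typing import Dict, Optional, Tuple

# Hypothetical rule table: property name -> (kind, threshold).
#   "min": a score below the threshold makes the interaction Bad
#   "max": a score above the threshold makes the interaction Bad
RULES: Dict[str, Tuple[str, float]] = {
    "Grounded in Context": ("min", 0.5),
    "Toxicity": ("max", 0.5),
}


def annotate(scores: Dict[str, Optional[float]]) -> str:
    """Return "Good", "Bad", or "Unknown" for one interaction's scores."""
    evaluated = 0
    for name, (kind, threshold) in RULES.items():
        score = scores.get(name)
        if score is None:
            continue  # property was not calculated for this interaction
        evaluated += 1
        if kind == "min" and score < threshold:
            return "Bad"
        if kind == "max" and score > threshold:
            return "Bad"
    # Good only when every rule was checked and none failed;
    # otherwise there is not enough evidence for a verdict.
    return "Good" if evaluated == len(RULES) else "Unknown"
```

For example, `annotate({"Grounded in Context": 0.9, "Toxicity": 0.1})` returns `"Good"`, while dropping the Toxicity score from that input returns `"Unknown"` under these assumed rules.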