How Evaluation Works
In the Get Started section you tried Deepchecks with sample data, and in Integrate Your Data you connected your real pipeline. Now that your data is flowing into Deepchecks, the platform evaluates every interaction automatically. This section explains what the platform does with that data - how it measures quality on every interaction using properties and annotations - and what the results mean.
The evaluation pipeline
When data arrives in Deepchecks - whether from auto-instrumentation, the Python SDK, or a CSV upload - the platform runs an automatic evaluation pipeline. This is the same pipeline described in the Integrate Your Data section's "What happens after upload":
- Interaction types are assigned - each interaction (or span, for agentic data) is mapped to an interaction type (Q&A, Root, Agent, Chain, Tool, LLM, Retrieval, etc.) that determines which evaluation rules apply.
- System metrics are computed - latency, token usage, and cost are calculated per interaction and aggregated per session, so you can assess operational efficiency alongside quality.
- Spans are grouped into traces - for agentic data, the parent-child hierarchy is reconstructed so you can inspect full executions in the Sessions view.
- Properties are calculated on every interaction - individual quality scores like Grounded in Context, Avoided Answer, Toxicity, or Plan Efficiency. Each interaction type has its own set of relevant properties. Properties are the foundation of evaluation in Deepchecks.
- Automatic annotations are assigned based on property scores and other signals - configurable rules label each interaction as Good, Bad, or Unknown. Annotations give you a single quality verdict per interaction.
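The trace-grouping step above can be illustrated with a minimal sketch. This is plain Python, not the Deepchecks API; the `id` and `parent_id` field names (with `None` marking the root span) are assumptions for illustration:

```python
from collections import defaultdict

def build_trace(spans):
    """Reconstruct a parent-child span tree from a flat list of spans.

    Each span is a dict with hypothetical 'id' and 'parent_id' keys;
    the root span has parent_id=None.
    """
    children = defaultdict(list)
    root = None
    for span in spans:
        if span["parent_id"] is None:
            root = span
        else:
            children[span["parent_id"]].append(span)

    def attach(node):
        # Recursively attach each span's children under it.
        node["children"] = [attach(c) for c in children[node["id"]]]
        return node

    return attach(root)

spans = [
    {"id": "1", "parent_id": None, "type": "Root"},
    {"id": "2", "parent_id": "1", "type": "Agent"},
    {"id": "3", "parent_id": "2", "type": "Tool"},
    {"id": "4", "parent_id": "2", "type": "LLM"},
]
trace = build_trace(spans)
```

Once the hierarchy is rebuilt this way, a full agent execution can be rendered as a tree, which is what the Sessions view shows.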
Once the pipeline completes, you can explore your results on the analysis screens (Overview, Sessions, Interactions) and drill into individual interactions, sessions, and properties.
What is in this section
Properties
Start here to understand how Deepchecks measures quality:
- Interaction-Level Properties - How properties work, the different types (built-in, prompt, user-value, session-level), and how to manage them
- Built-in Properties - The full catalog of Deepchecks' proprietary quality models
- Agent Use-Case Properties - Properties designed for evaluating agentic workflows (Plan Efficiency, Tool Completeness, etc.)
- RAG Use-Case Properties - Properties for retrieval-augmented generation (Grounded in Context, Retrieval Relevance, etc.)
- Prompt Properties - Custom LLM-as-a-judge properties you define with natural language guidelines
- User-Value Properties - Your own numeric or categorical metrics, sent alongside interaction data
- Session-Level Properties - Properties that evaluate entire multi-turn sessions rather than individual interactions
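As a concrete illustration of user-value properties, a custom numeric metric can simply travel as an extra column alongside the interaction data, for example in a CSV upload. The column and field names below are illustrative, not the exact Deepchecks schema:

```python
import csv
import io

# Two interactions with a custom numeric metric attached
# ('resolution_time_sec' is a hypothetical user-value property).
rows = [
    {"input": "What is your refund policy?",
     "output": "Refunds are available within 30 days.",
     "resolution_time_sec": 42},
    {"input": "Cancel my subscription.",
     "output": "Done - your plan is cancelled.",
     "resolution_time_sec": 12},
]

# Serialize to CSV in memory, one row per interaction.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=rows[0].keys())
writer.writeheader()
writer.writerows(rows)
csv_payload = buf.getvalue()
```

The point is only the data shape: a user-value property is a per-interaction value you compute yourself, sent in with the rest of the interaction fields, and then available for filtering and annotation rules like any built-in property.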
Annotations
Properties produce scores. Annotations turn those scores into actionable quality verdicts:
- Automatic Annotations - How the rule-based annotation pipeline works and how to configure it
- Manual Annotations - Human review workflows for labeling interactions as Good or Bad
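The rule-based idea behind automatic annotations can be sketched as a toy thresholding function. The property names, thresholds, and directions below are illustrative assumptions, not Deepchecks' actual configuration:

```python
def annotate(properties, thresholds=None):
    """Toy rule-based annotation: map property scores to Good/Bad/Unknown.

    'properties' maps property names to scores in [0, 1].
    Thresholds and directions here are hypothetical defaults.
    """
    thresholds = thresholds or {"Grounded in Context": 0.7, "Toxicity": 0.2}
    missing = False
    for name, limit in thresholds.items():
        score = properties.get(name)
        if score is None:
            missing = True  # can't judge this rule
            continue
        # Assumption: Toxicity is bad when high; other scores bad when low.
        if name == "Toxicity":
            if score > limit:
                return "Bad"
        elif score < limit:
            return "Bad"
    return "Unknown" if missing else "Good"
```

For example, an interaction scoring 0.9 on groundedness and 0.05 on toxicity would be labeled Good, while one scoring 0.3 on groundedness would be labeled Bad. In the platform these rules are configurable per interaction type rather than hard-coded.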