
Key Concepts

A reference guide to every core concept in Deepchecks - the data model, evaluation building blocks, and how they fit together.

This page explains the core building blocks of Deepchecks. Read it once to understand every term you will encounter in the product and the docs, or use it as a reference when something is unclear.


Running example

To make these concepts concrete, we'll use a single example throughout this page:

Customer Support Bot - A RAG-based chatbot that answers questions about company policies. Your team is testing whether switching from GPT-4o to Claude improves answer quality and reduces hallucinations. The bot handles multi-turn conversations, and you want to monitor it in production after deployment.


The Data Hierarchy

Deepchecks organizes your data in a hierarchy: Organization > Application (project) > Version > Environment (evaluation, production, or pentesting) > Session > Interaction (span). Understanding this structure is essential for uploading data correctly and interpreting results.
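
To see where each level shows up in practice, here is a minimal sketch of logging one interaction from Python. The import path, client class, method, and parameter names (DeepchecksLLMClient, log_interaction, app_name, version_name, env_type) are assumptions for illustration - check the SDK reference for the exact API.

```python
# Hypothetical SDK setup - the import path, client class, and call
# signature below are illustrative assumptions, not a verified API.
from deepchecks_llm_client.client import DeepchecksLLMClient

client = DeepchecksLLMClient(api_token="YOUR_API_TOKEN")  # the Organization is implied by the token

client.log_interaction(
    app_name="customer-support-bot",  # Application
    version_name="v1",                # Version: GPT-4o with the original prompt
    env_type="EVAL",                  # Environment: evaluation, production, or pentesting
    session_id="conv-0042",           # Session: groups the turns of one conversation
    user_interaction_id="turn-1",     # Interaction: the unit being evaluated
    input="What is your refund policy?",
    output="You can request a refund within 30 days of purchase.",
)
```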

Organization

An organization is the top-level workspace in Deepchecks. It contains all your applications, users, and settings. If you are on a team, everyone on your team works within the same organization.

In our example: your company's Deepchecks workspace is the organization.

Application

An application is a complete LLM-powered task - for example, the Customer Support Bot as a whole. Each application has its own set of configured interaction types, properties, and annotation rules.

Deepchecks does not support cross-application comparison. If you need to compare two different pipeline strategies against each other, upload them as different versions within the same application rather than creating separate applications.

You create and manage applications from the Manage Applications screen.

Version

A version represents a specific implementation of your LLM pipeline. In our example:

  • v1 - GPT-4o with the original prompt template
  • v2 - Claude Sonnet 4.5 with a revised prompt template

Version comparison is one of Deepchecks' core workflows - you upload the same evaluation dataset to multiple versions and compare quality scores, annotation distributions, and property averages to decide which version performs best.
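
Sketched with the same assumed client and call names as above, replaying one evaluation set against both versions might look like this (run_pipeline is a stand-in for your own pipeline code):

```python
from deepchecks_llm_client.client import DeepchecksLLMClient  # assumed import path

client = DeepchecksLLMClient(api_token="YOUR_API_TOKEN")

def run_pipeline(question: str, version: str) -> str:
    """Stand-in for your own pipeline (GPT-4o for v1, Claude for v2)."""
    return f"[{version}] answer to: {question}"

eval_set = [
    {"id": "q-001", "input": "What is your refund policy?"},
    {"id": "q-002", "input": "Do you ship internationally?"},
]

# Upload the same evaluation set to both versions so quality scores,
# annotation distributions, and property averages are directly comparable.
for version in ["v1", "v2"]:
    for sample in eval_set:
        client.log_interaction(
            app_name="customer-support-bot",
            version_name=version,
            env_type="EVAL",
            user_interaction_id=sample["id"],  # same id in both versions pairs the samples
            input=sample["input"],
            output=run_pipeline(sample["input"], version=version),
        )
```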

Environment

Every interaction belongs to one of three environments:

  • Evaluation - for benchmarking and version comparison. Use a consistent evaluation set so comparisons are apples-to-apples. Also used for CI/CD and pre-deployment testing.
  • Production - for live traffic from deployed applications. Used for monitoring score trends over time and detecting degradation.
  • Pentesting - a separate environment for safety and adversarial testing. Interactions here are isolated from quality evaluation workflows.

Pentesting access: The Pentesting environment is designed for security and red-teaming workflows. Contact us to discuss whether it fits your use case.

Session

A session is a group of related interactions that belong to the same user flow. In our example, a single customer conversation is a session - it might span three or four turns before the user's question is resolved.

Each session has a session_id. Interactions with the same session_id are grouped together and displayed in the Sessions view, where Deepchecks can evaluate session-level quality (e.g., did the full conversation fulfill the user's intent?). If you do not provide a session_id, one is auto-generated - each interaction ends up in its own session.

For simple single-interaction pipelines, you can ignore sessions - each interaction is implicitly its own session.

For agentic workflows: a session can contain one or more traces. A trace is the full execution of a single agent run. When you log an agentic trace, all spans from that trace are grouped into the same session automatically. Multiple pipeline runs in the same user session are linked via the session_id.
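
For example, a three-turn conversation could be logged under one shared session_id. As before, the client and call names are assumptions for illustration:

```python
# Sketch: three turns of one conversation share a session_id, so
# Deepchecks groups them in the Sessions view. Client and call names
# are illustrative assumptions, as in the earlier sketches.
from deepchecks_llm_client.client import DeepchecksLLMClient

client = DeepchecksLLMClient(api_token="YOUR_API_TOKEN")

turns = [
    ("turn-1", "Hi, I was double-charged this month.",
     "Sorry to hear that - can you share the invoice number?"),
    ("turn-2", "It's INV-1234.",
     "Thanks. I found the duplicate charge and issued a refund."),
    ("turn-3", "Great, thank you!",
     "Happy to help - anything else?"),
]

for turn_id, user_msg, bot_msg in turns:
    client.log_interaction(
        app_name="customer-support-bot",
        version_name="v1",
        env_type="PROD",
        session_id="conv-0042",        # the same id links all three turns into one session
        user_interaction_id=turn_id,
        input=user_msg,
        output=bot_msg,
    )
```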

Interaction

An interaction is the minimal unit of evaluation in Deepchecks. It represents a single step in your pipeline - whether that is a full end-to-end exchange, an individual LLM call, a tool invocation, or any other component that Deepchecks evaluates independently.

Each interaction has:

  • Data fields (input, output, information_retrieval, history, full_prompt, expected_output, steps) - the actual text content, used for property calculations
  • Metadata fields (user_interaction_id, session_id, started_at, finished_at, interaction_type, user_annotation) - used for organization, comparison, latency tracking, and observability

At minimum, every interaction needs at least one of input or output. All other fields are optional but unlock additional evaluation and observability capabilities.
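
Here is a sketch of a single Q&A interaction that uses both field groups. The field names follow the lists above; the call itself is the same assumed API as in the earlier sketches:

```python
from datetime import datetime, timezone

from deepchecks_llm_client.client import DeepchecksLLMClient  # assumed import path

client = DeepchecksLLMClient(api_token="YOUR_API_TOKEN")

# Sketch: one RAG interaction with data fields and metadata fields.
# Field names mirror the lists above; the signature is an assumption.
client.log_interaction(
    app_name="customer-support-bot",
    version_name="v2",
    env_type="EVAL",
    # Metadata fields
    user_interaction_id="q-001",
    session_id="eval-run-07",
    interaction_type="Q&A",
    started_at=datetime(2025, 1, 15, 10, 0, 0, tzinfo=timezone.utc),
    finished_at=datetime(2025, 1, 15, 10, 0, 2, tzinfo=timezone.utc),
    user_annotation="Good",
    # Data fields
    input="What is your refund policy?",
    output="You can request a refund within 30 days of purchase.",
    information_retrieval="Policy doc: refunds are accepted within 30 days of purchase.",
    expected_output="Refunds are available within 30 days of purchase.",
)
```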

For agentic workflows: each span in a trace is converted into a single interaction in Deepchecks, preserving the parent-child hierarchy. The full trace corresponds to a session (or part of one). So when you think about agentic data: spans → interactions, traces → sessions (or sub-sessions).


Interaction Types

An interaction type defines the logical nature of an interaction. It determines which built-in properties are available by default, what auto-annotation rules apply, and how the interaction is displayed in the UI.

Every interaction belongs to exactly one interaction type. An application can contain multiple interaction types - an agent application might have Chain, Agent, Tool, LLM, and Retrieval types all within the same application.

Classic interaction types

For single-step or simple multi-step LLM tasks:

| Interaction Type | Use case |
| --- | --- |
| Q&A | Question answering, RAG pipelines |
| Summarization | Condensing documents or transcripts |
| Generation | Creative or structured content generation |
| Classification | Multi-class or multi-label LLM classification |
| Feature Extraction | Extracting structured data from free text |
| Chat | Multi-turn conversational assistants |
| Retrieval | Evaluating the retrieval step of a RAG pipeline independently |
| Other | General-purpose fallback for anything not covered above |
| Custom | User-defined type, built from scratch |

Agentic interaction types

For multi-agent and agentic workflows, where each span in a trace is captured as a separate interaction:

| Interaction Type | Use case |
| --- | --- |
| Root | The top-level span of a trace - the full end-to-end execution |
| Agent | An AI agent that plans and delegates to tools or sub-agents |
| Chain | A multi-step chain that is not the root span |
| Tool | A tool called by an agent (web search, code execution, API call) |
| LLM | A direct LLM call within an agentic workflow |
| Retrieval | A retrieval step within an agentic trace |

When using framework integrations (LangGraph, CrewAI, Google ADK), each span is automatically assigned an interaction type based on its span kind. You can also override interaction types manually in the UI.

See Supported Use Cases for a detailed breakdown of each type's properties and data requirements.


Properties (Quality Metrics)

Properties are one-dimensional metrics calculated on each interaction. They are the quantitative backbone of evaluation in Deepchecks - turning every interaction into a set of measurable scores you can analyze, filter, and track over time.

A property can be as simple as "how many words is the output?" or as complex as "does the output faithfully reflect the retrieved documents?" Properties produce either numeric or categorical values.
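
As a toy illustration of those two value kinds (these stand-ins are not how Deepchecks implements its built-in properties):

```python
# Toy stand-ins for the two kinds of property values - numeric and
# categorical. Not Deepchecks' actual implementations.
def word_count(output: str) -> int:
    """A numeric property: how many words is the output?"""
    return len(output.split())

def length_bucket(output: str) -> str:
    """A categorical property: bucket the output by length."""
    n = len(output.split())
    if n < 20:
        return "short"
    return "medium" if n < 100 else "long"

answer = "You can request a refund within 30 days of purchase."
print(word_count(answer))     # -> 10
print(length_bucket(answer))  # -> "short"
```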

Built-in properties

Built-in properties are created and maintained by Deepchecks and are automatically available for every new application. Examples include:

  • Grounded in Context - is the output supported by the retrieved context?
  • Retrieval Relevance - are the retrieved documents relevant to the input?
  • Avoided Answer - did the model dodge the question?
  • Toxicity - does the output contain harmful or offensive content?
  • Fluency - is the text grammatically well-formed?
  • Sentiment - is the tone positive, negative, or neutral?
  • Completeness - does the output fully address what was asked?
  • Instruction Following - does the output follow the provided instructions?

Some built-in properties use simple text analysis (fast, no LLM cost). Others use LLM-as-judge evaluation.

See Built-in Properties for the full catalog.

Prompt properties

Prompt properties are custom evaluators that you define using a natural-language prompt. Deepchecks runs the prompt against each interaction using your configured LLM and returns a score.

This lets you evaluate any aspect of your interactions that built-in properties don't cover - for example, "Does the response maintain the brand's professional tone?" or "Is the extracted JSON schema valid for our API?". See Prompt Properties for details.

User-value properties

User-value properties are metrics you calculate yourself and send to Deepchecks alongside your interaction data. They can be numeric (e.g., a custom similarity score) or categorical (e.g., a topic label from your own classifier).

User-value properties appear in the UI alongside built-in and prompt properties and can be used in filtering, root cause analysis, and auto-annotation rules.
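
A sketch of attaching user-value properties at logging time - the custom_props parameter name (like the rest of the call) is an assumption for illustration:

```python
from deepchecks_llm_client.client import DeepchecksLLMClient  # assumed import path

client = DeepchecksLLMClient(api_token="YOUR_API_TOKEN")

# Sketch: send metrics you calculated yourself alongside the data.
# `custom_props` is an assumed parameter name, not a verified API.
client.log_interaction(
    app_name="customer-support-bot",
    version_name="v2",
    env_type="EVAL",
    user_interaction_id="q-001",
    input="What is your refund policy?",
    output="You can request a refund within 30 days of purchase.",
    custom_props={
        "embedding_similarity": 0.91,  # numeric: your own similarity score
        "topic_label": "billing",      # categorical: from your own classifier
    },
)
```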

Session-level properties

Session-level properties evaluate an entire session holistically rather than individual interactions. For example, Intent Fulfillment asks whether the full conversation fulfilled the user's goal across all turns - something interaction-level evaluation alone cannot answer.

Session-level properties support the same three subtypes as interaction-level properties: built-in, prompt, and user-value. You configure them in the same way.

See Session-Level Properties for details.

How properties are used

Beyond evaluation, properties also drive observability workflows:

  • Automatic annotation - property scores feed into rules that classify interactions as Good, Bad, or Unknown
  • Root cause analysis - filter and segment interactions by property scores to identify what is causing failures
  • Version comparison - compare property score distributions across versions to understand what changed
  • Monitoring - track property trends over time in production to detect drift or degradation


System Metrics

System metrics capture the operational characteristics of each interaction - not quality, but performance and reliability. They give you the observability layer that sits alongside quality evaluation:

  • Latency - automatically calculated from started_at and finished_at timestamps
  • Input tokens / Output tokens / Total tokens - token usage per interaction
  • Cost - computed from token counts and your configured model pricing (see Cost Tracking)
  • Run status - whether the span executed successfully or failed
  • Custom metadata - any additional key-value data you attach via the steps field or span metadata

In agentic pipelines, system metrics are also aggregated across child spans - so you can see total token usage and cost for an entire agent trace, not just individual LLM calls.

Framework integrations (LangGraph, CrewAI, Google ADK, LangChain) capture system metrics automatically. For manual uploads, include started_at and finished_at timestamps and token counts to get a complete operational picture.
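
A manual-upload sketch along these lines captures the operational fields (the token-count parameter names are assumptions):

```python
from datetime import datetime, timezone

from deepchecks_llm_client.client import DeepchecksLLMClient  # assumed import path

client = DeepchecksLLMClient(api_token="YOUR_API_TOKEN")

# Sketch: timestamps and token counts let Deepchecks derive latency
# and cost. `input_tokens`/`output_tokens` are assumed parameter names.
started = datetime(2025, 1, 15, 10, 0, 0, tzinfo=timezone.utc)
finished = datetime(2025, 1, 15, 10, 0, 2, 350000, tzinfo=timezone.utc)

client.log_interaction(
    app_name="customer-support-bot",
    version_name="v2",
    env_type="PROD",
    user_interaction_id="turn-9",
    input="Do you ship internationally?",
    output="Yes, we ship to over 40 countries.",
    started_at=started,
    finished_at=finished,  # latency = finished_at - started_at = 2.35 s
    input_tokens=412,
    output_tokens=38,
)
```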

System metrics are available throughout the UI: in the interactions table (for filtering and sorting), in the single session view, and in version comparison dashboards.


Annotations

An annotation is a Good/Bad/Unknown label on an interaction. Deepchecks uses two types of annotations that coexist:

Estimated annotations

Estimated annotations are automatically computed by Deepchecks using its automatic annotation pipeline. They are based on property scores combined through configurable rules - for example, "mark as Bad if Grounded in Context is below 3 and Avoided Answer is True."

Estimated annotations appear in the UI as outlined badges (border only, no fill). They run on every interaction automatically, so you always have a quality signal even with no human review.

The annotation configuration is fully customizable. You choose which properties to use, set thresholds, and define the logic. See Automatic Annotations for details.

Manual annotations

Manual annotations are labels provided by humans - either uploaded with your data (via the annotation column in a CSV or the SDK), or added directly in the UI by a reviewer.

Manual annotations appear as filled badges in the UI. They take precedence over estimated annotations for analysis purposes and are used to evaluate and improve the accuracy of the automatic annotation system.

Possible values: Good, Bad, or Unknown. Unknown means a human reviewer examined the interaction and could not make a clear determination - it is distinct from having no annotation at all, and is used to represent genuine ambiguity.
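
As a sketch, a CSV upload carrying manual annotations might look like this - the annotation column name follows the description above, but verify it against the documented upload format:

```python
import csv

# Sketch: write a CSV with a manual annotation column for upload.
# Column names mirror the fields described above; verify against the
# documented upload format before using.
rows = [
    {"user_interaction_id": "q-001",
     "input": "What is your refund policy?",
     "output": "Refunds are available within 30 days.",
     "annotation": "Good"},
    {"user_interaction_id": "q-002",
     "input": "Do you price-match competitors?",
     "output": "I'm not sure about that.",
     "annotation": "Unknown"},  # reviewer examined it but could not decide
]

with open("eval_with_annotations.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
```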


Automatic Annotation Pipeline

The automatic annotation pipeline ties together properties and annotations into a scalable quality signal. Here is how it works:

  1. Properties are calculated on every new interaction (text-based properties run immediately; LLM-based properties run asynchronously)
  2. Annotation rules are evaluated - your configured rules combine property scores using logic you define (e.g., AND/OR conditions on thresholds; see the sketch after this list)
  3. An estimated annotation is assigned - Good, Bad, or Unknown, based on the rules
  4. The annotation appears in the UI as an outlined badge alongside any manual annotation
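
To make step 2 concrete, the rule from the earlier example ("mark as Bad if Grounded in Context is below 3 and Avoided Answer is True") behaves like this plain-Python sketch - in the product you configure such rules in the UI, not in code:

```python
# Plain-Python illustration of an annotation rule. Actual rules are
# configured in the Deepchecks UI; this only mirrors the logic.
def estimate_annotation(props: dict) -> str:
    grounded = props.get("Grounded in Context")  # numeric score
    avoided = props.get("Avoided Answer")        # boolean-like value

    if grounded is None or avoided is None:
        return "Unknown"  # not enough signal for a determination
    if grounded < 3 and avoided:
        return "Bad"
    return "Good"

print(estimate_annotation({"Grounded in Context": 2.1, "Avoided Answer": True}))   # -> Bad
print(estimate_annotation({"Grounded in Context": 4.5, "Avoided Answer": False}))  # -> Good
```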

This pipeline runs automatically on every data upload. You can configure and refine it at any time, and recalculate annotations retroactively when you change the configuration.

See Automatic Annotations for the full configuration guide.


Datasets

A dataset in Deepchecks is a curated, named collection of test samples used for systematic evaluation. Datasets are distinct from production data - they are managed, versioned, and reused.

The primary use case is evaluation set management: build a representative set of test cases, run each new version of your pipeline against the same set, and compare results. This gives you a stable benchmark for measuring improvement over time.

Datasets can be created:

  • By manually uploading CSV or JSONL files (see the sketch after this list)
  • By cloning interactions from production or evaluation data
  • By generating synthetic test cases using Deepchecks' AI data generation feature
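
For the manual-upload route (first bullet above), a small JSONL evaluation set could be produced like this - the field names follow the interaction fields described earlier, but confirm them against the documented upload format:

```python
import json

# Sketch: write a small JSONL evaluation set. Field names mirror the
# interaction data fields described earlier; verify the expected
# schema in the upload documentation.
samples = [
    {"input": "What is your refund policy?",
     "expected_output": "Refunds are available within 30 days of purchase."},
    {"input": "Do you ship internationally?",
     "expected_output": "Yes, we ship to over 40 countries."},
]

with open("support_bot_eval_set.jsonl", "w") as f:
    for sample in samples:
        f.write(json.dumps(sample) + "\n")
```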

See Dataset Management for details.


Topics

Topics are automatically generated category labels that group your interactions by semantic theme. Deepchecks uses LLMs to analyze your data, identify recurring topics (e.g., "billing questions," "technical support," "product features"), and assign each interaction to a topic.

Topics are useful for:

  • Root cause analysis - understanding if failures are concentrated in a particular topic area
  • Filtering - drilling into a specific topic to examine performance in detail
  • Coverage analysis - ensuring your evaluation set covers the topics your users actually care about

Topic detection runs automatically and can be disabled per application in Workspace Settings if not needed.