AI-Assisted Annotations 🖌️

Deepchecks allow in-system annotations within the Interactions page and in the sample view by clicking and editing the suggested estimated annotation. Unlike the estimated annotation, manual annotations are used as a source of truth for advanced features such as Deepchecks Evaluator[link] and can be tracked separately from estimated annotations.

AI-Assisted Annotations can be done by utilizing the properties and auto annotations in the system. These can be incorporated both when annotating with an external tool (by downloading the data, or consuming it via SDK), or within the Deepchecks UI. To learn more about this flow see AI-Assisted Annotations.

Evaluation Set Creation & Management 🌲

To effectively evaluate a version before releasing it to production or to compare it with other versions it is crucial to have a high quality and representative evaluation set. Deepchecks can assist in this process via two main functionalities:

Interaction Generation

The interaction generation is an LLM-based process that utilizes provided context (either documents or webpages*) as well as information about the application to generate diverse high-quality inputs that can be appended to the evaluation set.

*Deepchecks scraping process also looks at additional webpages under the domain. For example, supplying the https://deepchecks.com/ URL will automatically parse all web pages on the website.

Copy Production Samples

Once you have identified types of interactions from the production environment that are unrepresented in your evaluation set, you can copy them to your evaluation dataset. This will ensure that your next rounds of testing will show a more realistic score. You can also download these examples into a CSV, for further analysis.

For more information on how to use this functionality see Evaluation Dataset Management.

Pentest 🛑

LLMs are able to tackle an amazing range of textual tasks, but this capability comes with a price - the end users are often able to input free text to your system that will end up directly as part of the prompt sent to your LLM. There are several known possible "attacks", or vulnerabilities, of LLM-based systems. The Pentest environment is built to test your system for a wide range of known vulnerabilities. Many of these are based on the Garak OSS package. This testing is done by providing you with a set of malicious inputs which you can then run through your LLM pipeline. Deepchecks will then automatically test and score your LLM's app resilience to these attack attempts, by looking at the corresponding outputs and identifying if the pipeline has been able to block the attack.

For more information on how to use this functionality see Pentesting Your LLM-Based App.

Translation 🔤

Deepchecks support a wide range of languages including most of the European and South Asia languages using a translation module. As such, data is stored in Deepchecks in its original form alongside an English translation which is used by models that were trained on English data. We found that for multi-national companies the translation itself can be valuable and as such we made it available in the Deepchecks system.

A partial list of the languages Deepchecks clients use includes: Spanish, German, French, Dutch, Hebrew, Japanese, Mandarin, and Malay.

To select the translation model, go to the "Edit Application" screen and choose a model under "Advanced Settings."

Topic Detection 🫧

Deepchecks uses a multi-step approach to cluster interactions based on topics. The topics are useful to better understand your data—especially in the production environment—and are included in all of Deepchecks' RCA features.

In the first step, sessions are clustered via semantic similarity. These clusters are then filtered and assigned a topic through an LLM-based process. Topics are determined at the session level, meaning each session is assigned a single topic. Samples that do not fit into any of the identified categories are grouped under the "Other" category.

Topic categories are initialized once at least 50 unique samples are available in the application and can be recalculated when a significant amount of new data is ingested. Importantly, topics are generated separately for each environment—Evaluation and Production—allowing users to detect drift and changes in user behavior more effectively.

Custom Topics

Deepchecks offers the option to override the default topic detection by uploading your own topics. You can simply add a column named "topic" to your CSV file and specify your custom topic for each interaction. You can also add a "topic" argument when defining each interaction via SDK.

Data Sampling (in Production Environment)

To optimize performance and cost, you can enable session‐level data sampling in production environments. When sampling is active, only a configurable percentage of sessions undergo full evaluation—unevaluated sessions are simply stored as JSON without being processed. This drastically reduces usage: uploading an unevaluated session uses roughly one‑sixth of the resources compared to a fully evaluated one. The sampling ratio is set per application via the "Edit Application" flow in the Manage Application window.

Modify Production Data Sampling Ratio Through the "Edit Application" Flow

All unevaluated sessions remain accessible on the Storage screen in the UI, where you can search and filter them by basic attributes. If needed, you can manually push any session through the full evaluation pipeline at a later time. For best practices on choosing sampling rates and setups, check out our detailed “How to Use Sampling Properly” guide.

an Unevaluated Session in the Storage Screen

📘
Data Retention
Unevaluated data is subject to retention limits—it'll be automatically removed after a certain time period or once storage exceeds a defined size. If you wish to preserve this data, you can either evaluate it or contact the Deepchecks team to discuss extended retention options. You can view your data retention period by hovering over the tooltip next to the Storage tab:

Additional Features

AI-Assisted Annotations 🖌️

Evaluation Set Creation & Management 🌲

Interaction Generation

Copy Production Samples

Pentest 🛑

Translation 🔤

Topic Detection 🫧

Custom Topics

Data Sampling (in Production Environment)

📘
Data Retention

AI-Assisted Annotations 🖌️

Evaluation Set Creation & Management 🌲

Interaction Generation

Copy Production Samples

Pentest 🛑

Translation 🔤

Topic Detection 🫧

Custom Topics

Data Sampling (in Production Environment)

📘Data Retention

📘
Data Retention