
Automatic Annotations

How the automatic annotations work and how to configure them

Evaluating the quality of generative AI-based applications is a difficult task. For an output to be considered good, it must satisfy several requirements: it needs to provide an accurate solution (e.g., a relevant answer, or a summary with high coverage), it must not be harmful (e.g., no hallucinations, no PII), and it must adhere to specific product requirements.

Deepchecks provides a high-quality, multi-step pipeline that generates an automatic annotation per interaction. These annotations can be used to compare the performance of different versions, as well as to evaluate performance on production data, without relying on manual annotation.

The automatic annotations are based on three customizable components, described in the sections below: Properties, Similarity, and the Deepchecks Evaluator.

👩‍🎓

Customizing the Auto Annotation Configuration

All aspects of these rules can be customized by modifying the Auto Annotation YAML file via the UI. Such changes can include adding new steps, modifying thresholds, or introducing new properties, in particular Custom Properties.
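As a rough illustration of the kind of settings such a configuration expresses, the sketch below loads a small, purely hypothetical config and prints the rules it would apply. The field names, step names, and thresholds are assumptions for illustration only and do not reflect Deepchecks' actual YAML schema; consult the file shown in the UI for the real structure.

```python
# Hypothetical illustration only -- NOT the actual Deepchecks YAML schema.
import yaml  # requires PyYAML

EXAMPLE_CONFIG = """
steps:
  - name: grounded_in_context_check   # hypothetical step name
    property: Grounded in Context
    threshold: 0.6                    # hypothetical threshold
    annotate_below_threshold_as: bad
  - name: custom_tone_check           # hypothetical Custom Property step
    property: Formal Tone
    threshold: 0.8
    annotate_below_threshold_as: bad
"""

config = yaml.safe_load(EXAMPLE_CONFIG)
for step in config["steps"]:
    print(f'{step["name"]}: annotate as "bad" when {step["property"]} < {step["threshold"]}')
```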

📘

Recalculating Annotations

After uploading a new automatic annotation configuration, the pipeline has to be rerun to update the annotations. If additional annotations were added, we recommend also retraining the Deepchecks Evaluator for the application. Rerunning the annotation pipeline and retraining the Deepchecks Evaluator do not incur any token usage for your application.

Properties

Each property scores the interaction on a specific aspect. For example, if a sample has a low Grounded in Context score, its output is not actually based on the retrieved information and is likely a hallucination. The property scores are incorporated into the auto-annotation pipeline via rule-based clauses. Different properties are useful for different application types; see Supported Applications for recommendations per application type and Properties for the full catalog.
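The following is a minimal sketch of how such rule-based clauses over property scores can be combined into a single annotation. The property names, thresholds, and "unknown means leave unannotated" behavior are assumptions for illustration, not Deepchecks' built-in defaults.

```python
def auto_annotate(scores: dict[str, float]) -> str | None:
    """Return 'bad', 'good', or None (unknown) for one interaction."""
    # A low Grounded in Context score suggests the output is not based on
    # the retrieved information, i.e. a likely hallucination.
    if scores.get("Grounded in Context", 1.0) < 0.5:
        return "bad"
    # Only mark an interaction as good when every relevant property passes.
    if all(scores.get(p, 0.0) >= 0.8 for p in ("Grounded in Context", "Relevance")):
        return "good"
    return None  # leave for other pipeline steps or manual review


print(auto_annotate({"Grounded in Context": 0.3, "Relevance": 0.9}))  # -> bad
```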

Similarity

Deepchecks compares the interaction's output to previously annotated outputs (e.g., domain expert responses) via Deepchecks' similarity mechanism. Similarity is used only for auto-annotation of the evaluation set and is especially useful for regression testing.
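The sketch below illustrates the general idea: compare a new output against previously annotated outputs and reuse the annotation of a close-enough match. Deepchecks' actual similarity mechanism is not described here; the simple string-ratio measure and the 0.9 cutoff are assumptions chosen purely for illustration.

```python
from difflib import SequenceMatcher


def annotate_by_similarity(output: str, annotated: dict[str, str],
                           cutoff: float = 0.9) -> str | None:
    """Return the annotation of the most similar known output, if close enough."""
    best_text, best_score = None, 0.0
    for known_output in annotated:
        score = SequenceMatcher(None, output, known_output).ratio()
        if score > best_score:
            best_text, best_score = known_output, score
    return annotated[best_text] if best_score >= cutoff else None


# Hypothetical previously annotated (e.g., domain expert) response.
expert_answers = {"Our refund policy allows returns within 30 days.": "good"}
print(annotate_by_similarity("Our refund policy allows returns within 30 days!", expert_answers))  # -> good
```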

Deepchecks Evaluator

An LLM-based technique that learns from user-provided annotated interactions. It is especially useful for detecting use-case-specific problems that cannot be caught using the built-in properties. It requires at least 50 annotated samples, including at least 15 bad and 15 good annotations, in order to initialize. The more user-annotated interactions it sees, across more versions, the better it performs.
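As a quick sketch of the initialization requirement stated above (at least 50 annotated samples, including at least 15 good and 15 bad), the helper below checks a list of annotation labels. It is illustrative only and not part of the Deepchecks SDK.

```python
from collections import Counter


def evaluator_can_initialize(annotations: list[str]) -> bool:
    """Check the stated minimums: >= 50 total, >= 15 'good', >= 15 'bad'."""
    counts = Counter(annotations)
    return (
        len(annotations) >= 50
        and counts["good"] >= 15
        and counts["bad"] >= 15
    )


print(evaluator_can_initialize(["good"] * 30 + ["bad"] * 20))  # -> True
```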