Automatic Annotations

How the automatic annotations work and how to configure them

Evaluating the quality of generative AI-based workflows is difficult. For an output to be considered good, it must meet several requirements: it must provide an accurate solution (e.g., a relevant answer, a summary with high coverage), must not be harmful (e.g., no hallucinations, no PII), and must adhere to specific product requirements.

Deepchecks provides a multi-step pipeline that generates an automatic annotation for each interaction. These annotations can be used to compare performance across versions and to evaluate performance on production data without manual annotation.

The auto annotations are based on three customizable components:

👩‍🎓

Customizing the Auto Annotation Configuration

All aspects of the automatic annotations can be customized by modifying the Auto Annotation YAML. Possible changes include adding new steps, adjusting thresholds, and introducing new properties, including Custom Properties.

The first component is properties. Each property scores the interaction on a specific aspect. For example, a low Grounded in Context score means that the output is not grounded in the retrieved information and is likely a hallucination. Property scores are incorporated into the auto-annotation pipeline via rule-based clauses. Different properties are useful for different interaction types: see Supported Applications for recommendations per interaction type and Properties for the full catalog.
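As a rough illustration of how rule-based clauses over property scores might behave, the following Python sketch applies an ordered list of threshold rules and returns the annotation of the first rule that matches. The rule layout, the `Relevance` property name, the threshold values, and the `annotate` function are all hypothetical; the actual clauses are defined in the Auto Annotation YAML, whose schema is not reproduced here.

```python
from typing import Optional

# Hypothetical rules: (property name, comparison, threshold, annotation).
# The names and thresholds are illustrative only, not Deepchecks defaults.
RULES = [
    ("Grounded in Context", "lt", 0.3, "bad"),
    ("Relevance", "lt", 0.4, "bad"),
    ("Grounded in Context", "ge", 0.8, "good"),
]


def annotate(scores: dict[str, float]) -> Optional[str]:
    """Return the annotation of the first matching rule, or None if none match."""
    for prop, comparison, threshold, annotation in RULES:
        value = scores.get(prop)
        if value is None:
            continue  # this property was not computed for the interaction
        if comparison == "lt" and value < threshold:
            return annotation
        if comparison == "ge" and value >= threshold:
            return annotation
    return None  # left unannotated by this step
```

In this sketch the rules are evaluated in order, so a low score on any "bad" rule takes precedence over a later "good" rule. An interaction that matches no rule stays unannotated and could be handled by a later step.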

The second component is similarity. Deepchecks compares the interaction's output to previously annotated outputs (e.g., domain expert responses) using its similarity mechanism. Similarity is used only for auto-annotation of the evaluation set and is particularly useful for regression testing.
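The idea of annotating by similarity can be sketched as follows. This is not Deepchecks' similarity mechanism: it substitutes the character-based `difflib.SequenceMatcher` from the Python standard library as a stand-in. It also assumes that a new output inherits the annotation of its closest annotated reference once the similarity passes a threshold. The function name `annotate_by_similarity` and the `0.9` threshold are hypothetical.

```python
from difflib import SequenceMatcher
from typing import Optional


def annotate_by_similarity(
    output: str,
    annotated: list[tuple[str, str]],
    threshold: float = 0.9,  # hypothetical cut-off, not a Deepchecks default
) -> Optional[str]:
    """Return the annotation of the most similar reference output, if similar enough.

    `annotated` holds (reference_output, annotation) pairs, for example
    domain expert responses labelled "good".
    """
    best_score, best_label = 0.0, None
    for reference, label in annotated:
        # Stand-in similarity: ratio of matching characters, between 0 and 1.
        score = SequenceMatcher(None, output, reference).ratio()
        if score > best_score:
            best_score, best_label = score, label
    return best_label if best_score >= threshold else None
```

Under these assumptions, an output that reproduces an expert response inherits that response's annotation, which is why this kind of comparison suits regression testing across versions.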

The third component is an LLM-based technique that learns from user-provided annotated interactions. It is particularly useful for detecting use-case-specific problems that the built-in properties cannot catch. To initialize, it requires at least 50 annotated samples, including at least 15 bad and 15 good annotations. Its performance improves as more user-annotated interactions across more versions become available.
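The initialization requirement stated above can be checked before relying on this component. The sketch below assumes that user annotations are available as a list of the strings `"good"` and `"bad"`; the function name `can_initialize` and that representation are hypothetical and not part of any Deepchecks API.

```python
from collections import Counter

MIN_TOTAL = 50      # minimum number of annotated samples
MIN_PER_CLASS = 15  # minimum number of "good" and of "bad" annotations


def can_initialize(annotations: list[str]) -> bool:
    """Check whether the user annotations meet the stated minimums."""
    counts = Counter(annotations)
    return (
        len(annotations) >= MIN_TOTAL
        and counts["good"] >= MIN_PER_CLASS
        and counts["bad"] >= MIN_PER_CLASS
    )
```

For example, 35 good and 15 bad annotations satisfy the requirement, while 40 good and 10 bad do not, even though both sets contain 50 samples.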