Automatic Annotations
How the automatic annotations work and how to configure them
Evaluating the quality of generative AI-based applications is a difficult task. For an output to be considered good, it must satisfy several requirements: it needs to provide an accurate solution (e.g., a relevant answer, or a summary with high coverage), it must not be harmful (e.g., no hallucinations, no PII leakage), and it must adhere to specific product requirements.
Deepchecks provides a high-quality multi-step pipeline that generates an auto annotation per interaction. These annotations can be used to compare the performance of different versions, as well as to evaluate performance on production data without manual annotation.
The auto annotations are based on three customizable components: Properties, Similarity, and an LLM-based model, described below.
Customizing the Auto Annotation Configuration
All aspects of the automatic annotations can be customized by modifying the Auto Annotation YAML. Such changes can include adding new steps, modifying thresholds, or introducing new properties, in particular Custom Properties.
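As a rough sketch of what such a configuration can express (the key names below are illustrative assumptions, not the exact Deepchecks schema):

```yaml
# Illustrative sketch only -- key names are assumptions, not the exact Deepchecks schema.
steps:
  - name: property_rules
    clauses:
      - property: Grounded in Context
        operator: "<"
        threshold: 0.5
        annotation: bad
  - name: similarity
    threshold: 0.9          # minimum similarity to a previously annotated output
  - name: llm_certainty
    min_annotated_samples: 50
```

Each step in the pipeline can decide the annotation or leave the interaction for the next step.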
Properties: Each property scores the interaction on a specific aspect. For example, a low Grounded in Context score means the output is not actually based on the retrieved information and is likely a hallucination. The property scores are incorporated into the auto-annotation pipeline via rule-based clauses. Different properties are useful for different application types; see Supported Applications for recommendations per application type and Properties for the full catalog.
Similarity: Deepchecks compares the interaction's output to previously annotated outputs (e.g., domain expert responses) via Deepchecks' similarity mechanism. Similarity is used only for auto-annotation of the evaluation set and is particularly useful for regression testing.
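Conceptually, annotation by similarity can be sketched as below. The actual Deepchecks similarity mechanism is not spelled out here, so cosine similarity over embeddings and the 0.9 threshold are assumptions for illustration:

```python
import math

# Conceptual sketch: copy the annotation of the most similar previously
# annotated output if similarity is high enough. Cosine similarity over
# embeddings and the 0.9 threshold are illustrative assumptions.

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def annotate_by_similarity(output_emb, annotated, threshold=0.9):
    """Return the annotation of the closest annotated output, or None."""
    best = max(annotated, key=lambda item: cosine(output_emb, item["embedding"]))
    if cosine(output_emb, best["embedding"]) >= threshold:
        return best["annotation"]
    return None

annotated = [
    {"embedding": [1.0, 0.0], "annotation": "good"},
    {"embedding": [0.0, 1.0], "annotation": "bad"},
]
print(annotate_by_similarity([0.99, 0.05], annotated))  # good
```

This is why the mechanism suits regression testing: an output close enough to an already-approved response inherits its annotation.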
LLM-Based Model: An LLM-based technique that learns from user-provided annotated interactions. It is particularly useful for detecting use-case-specific problems that cannot be caught with the built-in properties. It requires at least 50 annotated samples, including at least 15 bad and 15 good annotations, in order to initialize. The more user-annotated interactions it sees across more versions, the better it performs.
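The initialization requirement above can be expressed as a simple check (a sketch; the function name is ours, not a Deepchecks API):

```python
# Sketch of the stated initialization requirement: at least 50 annotated
# samples overall, including at least 15 "good" and 15 "bad" annotations.
# The function name is illustrative, not a Deepchecks API.

def can_initialize(annotations: list[str]) -> bool:
    return (
        len(annotations) >= 50
        and annotations.count("good") >= 15
        and annotations.count("bad") >= 15
    )

print(can_initialize(["good"] * 30 + ["bad"] * 20))  # True
print(can_initialize(["good"] * 40 + ["bad"] * 10))  # False (only 10 bad)
```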