Customizing the Auto Annotation Configuration
Configuring the auto-annotation mechanism when using Deepchecks to evaluate your LLM-based app.
The auto-annotation process may differ between apps. Deepchecks provides a default auto-annotation process that can be customized to fit your needs and data. This page covers the technical aspects of the auto-annotation configuration: the first part focuses on the structure of the YAML file, while the second part provides a guide on updating the pipeline for your specific data.
Recalculating Annotations
After uploading a new auto-annotation configuration, the pipeline has to be rerun to update the annotations. If additional annotations were added, we also recommend retraining the Deepchecks Evaluator for the application. Rerunning the annotation pipeline and retraining the Deepchecks Evaluator do not incur any token usage for your application.
Configuration YAML Structure
The multi-step auto-annotation pipeline is defined in a YAML file, where each block represents a step in the process. These steps run sequentially, meaning each block only attempts to annotate samples that remain unannotated after the previous blocks. At each step, the specified annotation condition is applied to the unannotated samples, and those that meet the condition are labeled accordingly.
You can delete existing steps or add new ones by including additional sections in the file. Any samples that remain unannotated after the final step will receive the label "unknown."
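For orientation, the sketch below shows how a minimal configuration might assemble the block types described in this section. The property name, thresholds, and block order are illustrative placeholders, not defaults:
- type: property                 # step 1: rule-based labels from property scores
  annotation: bad
  relation_between_conditions: OR
  conditions:
    - property_name: toxicity    # illustrative property and threshold
      type: output
      operator: GE
      value: 0.96
- type: similarity               # step 2: copy labels from similar annotated samples
  annotation: copy
  condition:
    operator: GE
    value: 0.9
- type: Deepchecks Evaluator     # step 3: the learned annotator handles the rest
# Samples still unannotated after the final block are labeled "unknown".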
Below, you will find a general explanation, with examples, of each type of annotation block. To fully understand the YAML mechanism and customize it for your data, it's recommended to explore the UI, as described in the second part.
Property-based annotation
Properties' scores play a crucial role in evaluating interactions. The structure for property-based annotation includes the following key options:
- Annotation: The label assigned to samples that meet the specified conditions.
- Relation Between Conditions: Determines whether all conditions (AND) or any condition (OR) must be satisfied.
- Type: Indicates the source of the property score, with options such as input, output, LLM, or custom.
- Operator: Defines how conditions are evaluated. Options include greater than (GT), greater than or equal (GE), less than (LT), and less than or equal (LE) for numerical properties, as well as equality (EQ), inequality (NEQ), membership in a set (IN, NIN), and overlap between sets (OVERLAP) for categorical properties.
- Value: The specific value to which the operator is applied. For example, for the GE operator, the value can be seen as a threshold, while for the IN operator, you would provide a set or list of values.
For example, the block shown in the code snippet below labels samples as 'bad' if they meet at least one property condition. For instance, if the 'toxicity' score of the output is greater than or equal to 0.96, the interaction is annotated as 'bad'.
- type: property
  annotation: bad
  relation_between_conditions: OR
  conditions:
    - property_name: grounded_in_context
      type: output
      operator: LE
      value: 0.1
    - property_name: toxicity
      type: output
      operator: GE
      value: 0.96
    - property_name: PII Risk
      type: output
      operator: GT
      value: 0.5
As another example, the block shown in the code snippet below labels samples as 'bad' only if they meet both property conditions: the 'Text Quality' LLM property score is less than or equal to 2, and the custom property column_name is neither 'country' nor 'city'.
- type: property
  annotation: bad
  relation_between_conditions: AND
  conditions:
    - property_name: Text Quality
      type: llm
      operator: LE
      value: 2
    - property_name: column_name
      type: custom
      operator: NIN
      value: ['country', 'city']
Similarity-based annotation
Using the similarity mechanism is useful for auto-annotating an evaluation set during regression testing. The similarity score ranges from 0 to 1 (1 being identical outputs) and is calculated between the output of a new sample and the output of a previously annotated sample with the same user interaction id, if such a sample exists.
For example, in the code snippet below, if an output closely resembles a previously annotated response that shares the same user interaction id (with a similarity score of 0.9 or higher), the new sample copies that response's annotation. You can customize the threshold as needed.
- type: similarity
  annotation: copy
  condition:
    operator: GE
    value: 0.9
Another useful example is using similarity to identify regressions. The code snippet below illustrates the following mechanism: whenever an unannotated interaction has a low similarity score to an example annotated as 'good' (with the same user interaction id), the unannotated interaction is labeled 'bad.'
- type: similarity
  annotation: bad
  condition:
    operator: LE
    value: 0.1
Deepchecks Evaluator-based annotation
The last block is the Deepchecks Evaluator, our high-quality annotator that learns from your data.
- type: Deepchecks Evaluator
Updating the Configuration YAML
- First, download the existing YAML file. To do this, go to the 'Annotation Config' tab in the app and click 'Download Current Configuration.'
- Modify the YAML file locally as needed. You have the flexibility to add or remove properties (including your own custom ones), adjust thresholds, modify the condition logic, rearrange the order of the blocks, and more, tailoring the configuration to your specific needs (see the sketch after this list).
- Once you’re done, upload your new YAML file and click 'Save.'
- Next, choose the version and environment in the app for recalculating the annotations based on the modified YAML, and decide whether to also retrain the Deepchecks Evaluator. This step does not affect user-provided annotations and does not incur any token usage.
- Lastly, wait for the new estimated annotations to be completed. Check the application management screen to verify that your application's processing status changes from 'Pending' to 'Completed'.
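As an illustration of the modification step, the sketch below shows one possible local edit: loosening the grounded_in_context threshold from the earlier example and adding a condition on a hypothetical custom property named response_language. Both the new property and all values here are placeholders to adapt to your own data:
- type: property
  annotation: bad
  relation_between_conditions: OR
  conditions:
    - property_name: grounded_in_context
      type: output
      operator: LE
      value: 0.2                        # raised from 0.1 to flag more ungrounded answers
    - property_name: response_language  # hypothetical custom property
      type: custom
      operator: NEQ
      value: english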