
Customizing the Auto Annotation Configuration

Configuring the auto-annotation mechanism when using Deepchecks to evaluate your LLM-based app.

The auto-annotation process may differ between apps. Deepchecks ships with a default auto-annotation process that is customizable to fit your needs and data. This page covers the technical aspects of the auto-annotation configuration.

It includes two sections:

  • Configuration YAML Structure: the structure of the configuration YAML and each type of annotation block.
  • Updating the Configuration YAML: a guide on how to update the pipeline for your specific data.

📘

Recalculating Annotations

After uploading a new automatic annotation configuration, the pipeline has to be rerun to update the annotations. If additional annotations were added, we recommend also retraining the Deepchecks Evaluator for the application. Rerunning the annotation pipeline and retraining the Deepchecks Evaluator do not incur any token usage for your application.


Configuration YAML Structure

The multi-step auto-annotation pipeline is defined in a YAML file, where each block represents a step in the process. These steps run sequentially, meaning each block only attempts to annotate samples left unannotated by the previous blocks. At each step, the specified annotation condition is applied to the unannotated samples, and those that meet the condition are labeled accordingly.

You can delete existing steps or add new ones by including additional sections in the file. Any samples that remain unannotated after the final step will receive the label "unknown."
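
For orientation, here is a minimal sketch of a complete configuration that chains one block of each type described below. The property name, thresholds, and block order are illustrative placeholders, not defaults:

  - type: similarity
    annotation: copy
    condition:
      operator: GE
      value: 0.9
  - type: property
    annotation: bad
    relation_between_conditions: OR
    conditions:
    - property_name: toxicity
      type: output
      operator: GE
      value: 0.96
  - type: Deepchecks Evaluator

Any sample not annotated by the first two blocks falls through to the Deepchecks Evaluator, and anything still unannotated after the final block is labeled "unknown."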

Below, you will find a general explanation with examples for each type of annotation block. To fully understand the YAML mechanism and customize it for your data, it’s recommended to explore the UI, as described in the second section below.


Property-based annotation

Property scores play a crucial role in evaluating interactions. A property-based annotation block includes the following key options:

  • Annotation: The label assigned to samples that meet the specified conditions.
  • Relation Between Conditions: Determines whether all conditions (AND) or any condition (OR) must be satisfied.
  • Type: Indicates the source of the property score, with options such as input, output, LLM, or custom.
  • Operator: Defines how conditions are evaluated. For numerical properties, the options are greater than (GT), greater than or equal (GE), less than (LT), and less than or equal (LE).
  • Value: The specific threshold that each condition must meet.

For example, the block shown in the code snippet below labels a sample as "bad" if it meets at least one property condition. For instance, if the 'toxicity' score of the output is greater than or equal to 0.96, the interaction is annotated as bad.

  - type: property
    annotation: bad
    relation_between_conditions: OR
    conditions:
    - property_name: grounded_in_context
      type: output
      operator: LE
      value: 0.1
    - property_name: toxicity
      type: output
      operator: GE
      value: 0.96
    - property_name: PII Risk
      type: output
      operator: GT
      value: 0.5
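
A block can also require that all conditions hold by setting the relation to AND. The sketch below is hypothetical: it labels a sample as "good" only when both conditions are met, where 'my_custom_score' stands in for a custom property of your own and the thresholds are placeholders:

  - type: property
    annotation: good
    relation_between_conditions: AND
    conditions:
    - property_name: grounded_in_context
      type: output
      operator: GE
      value: 0.7
    - property_name: my_custom_score
      type: custom
      operator: GT
      value: 0.5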

Similarity-based annotation

The similarity mechanism is useful for auto-annotating an evaluation set during regression testing. The similarity score ranges from 0 to 1 (1 being identical outputs) and is calculated between the output of a new sample and the output of a previously annotated sample with the same user interaction id, if such a sample exists.

For example, in the code snippet below, if an output closely resembles a previously annotated response that shares the same user interaction id (with a similarity score of 0.9 or higher), the new sample will receive the same annotation. You can customize the threshold as needed.

  - type: similarity
    annotation: copy
    condition:
      operator: GE
      value: 0.9

Another useful example is using similarity to identify samples that should be annotated as bad. The code snippet below illustrates the following mechanism: whenever an unannotated interaction has a low similarity score to a 'good' annotated example (with the same user interaction id), the unannotated interaction will be labeled 'bad.'

  - type: similarity
    annotation: bad
    condition:
      operator: LE
      value: 0.1

Deepchecks Evaluator-based annotation

The last block is the Deepchecks Evaluator, our high-quality annotator that learns from your data.

  - type: Deepchecks Evaluator

Updating the Configuration YAML

  1. First, download the existing YAML file. To do this, go to the 'Annotation Config' tab in the app and click on 'Download Current Configuration.'
  2. Modify the YAML file locally as needed. You have the flexibility to add or remove properties, including your own custom ones, adjust thresholds, modify the condition logic, rearrange the order of the blocks, and more (see the example after this list). This allows you to tailor the configuration to better fit your specific needs.
  3. Once you’re done, upload your new YAML file and click 'Save.'
  4. Next, choose the version and environment in the app for recalculating the annotations based on the modified YAML, and decide whether to also retrain the Deepchecks Evaluator. This step does not affect user-provided annotations and does not incur any token usage.
  5. Lastly, wait for the new estimated annotations to be completed. Check the application management screen to see that the 'Pending' processing status of your application changes to 'Completed.'
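
As an example of step 2, a minimal edit could raise the similarity threshold of the 'copy' block shown earlier from 0.9 to 0.95, so annotations are copied only for near-identical outputs. The value is illustrative; choose a threshold that fits your data:

  - type: similarity
    annotation: copy
    condition:
      operator: GE
      value: 0.95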