Auto Annotation Design

After Selecting the Right Properties and Adjusting the Thresholds, you're ready to build your end-to-end auto-annotation pipeline. The goal is to replicate domain expert annotations as closely as possible. The most effective way to start is with responses from a base version, ideally on the evaluation set, that have been manually annotated by an expert. Roughly 50 samples, balanced between good and bad outputs, are often enough, depending on the complexity of your data. Aim to include at least 3 examples of each quality issue you want to detect.
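
If it helps to sanity-check that starting set, a short script can verify the balance and per-issue coverage described above. This is a standalone sketch, not a Deepchecks API; the sample structure and the imbalance heuristic are assumptions.

```python
from collections import Counter

# Standalone sketch for sanity-checking an expert-annotated starting set.
# Each sample is assumed to look like:
#   {"annotation": "good" | "bad", "issues": ["hallucination", ...]}
def check_coverage(samples, min_total=50, min_per_issue=3):
    labels = Counter(s["annotation"] for s in samples)
    issues = Counter(issue for s in samples for issue in s.get("issues", []))

    warnings = []
    if len(samples) < min_total:
        warnings.append(f"only {len(samples)} samples; aim for roughly {min_total}")
    # Rough balance heuristic (an assumption, not a hard rule).
    if min(labels.get("good", 0), labels.get("bad", 0)) < min_total // 4:
        warnings.append(f"labels look imbalanced: {dict(labels)}")
    for issue, count in issues.items():
        if count < min_per_issue:
            warnings.append(f"only {count} example(s) of quality issue '{issue}'")
    return warnings
```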

Pro Tip: Upload expert annotations as a custom property, together with their reasoning. Since manual labels override estimated ones, keeping the expert labels in a custom property lets you compare the system's auto-annotation against ground truth, making gaps easier to spot.

Recommended Starting Point

We recommend beginning with a simple auto-annotation pipeline that uses a clear, rule-based hierarchy:

  1. Reference-based annotation: Annotate as "good" if the output matches an expected reference output, measured by the Expected Output Similarity property
  2. Property-based filtering: Annotate as "bad" if any key property aspect scores below the defined threshold
  3. Edge case handling: Annotate as "unknown" for complex cases, such as when:
    • The model avoided answering the question
    • A sample received borderline scores on critical property aspects
  4. Default classification: Annotate as "good" if all property scores are above their respective thresholds

This hierarchical approach ensures that clear-cut cases are handled first, while ambiguous samples are flagged for manual review rather than being misclassified.
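
To make the hierarchy concrete, here is the same decision flow written as plain Python. This is only a sketch: the property names, the 1-5 score scale, the similarity cut-off, and the threshold values are assumptions, and it mirrors the logic you would express in the auto-annotation YAML rather than its actual syntax.

```python
# Sketch of the recommended hierarchy. Property names, the 1-5 score scale,
# and the thresholds below are illustrative assumptions.
GOOD, BAD, UNKNOWN = "good", "bad", "unknown"

THRESHOLDS = {"Grounded in Context": 3, "Relevance": 3, "Safety": 4}
BORDERLINE_MARGIN = 0.5  # scores this close to a threshold count as borderline


def auto_annotate(props: dict) -> str:
    # 1. Reference-based: the output matches the expected reference output.
    if props.get("Expected Output Similarity", 0.0) >= 0.8:
        return GOOD

    # 2. Property-based filtering: any key property below its threshold.
    for name, threshold in THRESHOLDS.items():
        if props.get(name, 5) < threshold:
            return BAD

    # 3. Edge cases: avoided answers and borderline scores go to manual review.
    if props.get("Avoided Answer", False):
        return UNKNOWN
    for name, threshold in THRESHOLDS.items():
        if props.get(name, 5) < threshold + BORDERLINE_MARGIN:
            return UNKNOWN

    # 4. Default: every property cleared its threshold comfortably.
    return GOOD


print(auto_annotate({"Expected Output Similarity": 0.2, "Relevance": 2}))  # -> "bad"
```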

Analyzing & Improving Performance

After you run your initial auto-annotation configuration—whether the Deepchecks default or a custom YAML—review the results and iterate based on what you learn:

Improving Precision

  1. Open the Data page and filter for samples whose estimated annotation is "good" but whose expert annotation is "bad".
  2. Examine the experts' reasoning for these false-positive samples.
  3. Identify missing properties or thresholds that are too lenient, then adjust them accordingly.

Improving Recall

  1. In the Data page, filter for samples where the expert annotation is "good" yet the estimated annotation is "bad".
  2. Click the Reason column to see a frequency breakdown of why each sample was marked bad.
  3. Focus on the most common reasons—these properties may need looser thresholds or refined guidelines.
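
Both filters can also be reproduced offline if you export the interactions, for example as a CSV from the Data page. The sketch below assumes hypothetical column names for the estimated annotation, the expert label stored as a custom property, and the annotation reason; adjust them to match your actual export.

```python
import pandas as pd

# Offline version of the two filters above. The file name and column names are
# assumptions about your export, not a fixed Deepchecks schema.
df = pd.read_csv("interactions_export.csv")

# Precision gaps: estimated "good" but the expert marked it "bad".
false_positives = df[(df["estimated_annotation"] == "good") & (df["expert_annotation"] == "bad")]

# Recall gaps: expert "good" but the pipeline marked it "bad".
false_negatives = df[(df["expert_annotation"] == "good") & (df["estimated_annotation"] == "bad")]

# Frequency breakdown of why the recall-gap samples were marked "bad".
print(false_negatives["annotation_reason"].value_counts())
print(f"{len(false_positives)} precision gaps, {len(false_negatives)} recall gaps")
```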

Handling Edge Cases

If you notice a recurring scenario that the pipeline mislabels, add a dedicated block to handle it. For example, by combining Avoided Answer with Input Safety or Instruction Fulfillment, you can explicitly label a response as "good" when the model correctly refuses to answer because of detected prompt injection, off-topic content, or another policy violation.
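
Such a dedicated rule might look like the sketch below, written as plain logic rather than the YAML syntax. The property names follow the text above, while the score scale and cut-offs are assumptions.

```python
# Illustrative edge-case block: a deliberate, policy-driven refusal is "good".
# The 1-5 score scale and the cut-offs below are assumptions.
def is_correct_refusal(props: dict) -> bool:
    refused = props.get("Avoided Answer", False)
    unsafe_input = props.get("Input Safety", 5) <= 2                      # e.g., prompt injection detected
    followed_instructions = props.get("Instruction Fulfillment", 0) >= 4  # refusal matched the instructions
    return refused and (unsafe_input or followed_instructions)


def annotate_refusals_first(props: dict):
    if is_correct_refusal(props):
        return "good"  # the dedicated block handles it before the base hierarchy
    return None        # fall through to the base rules described earlier
```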

Examples

Meeting Scheduling Agent

Consider an agent designed to schedule meetings. It uses tools to check calendar availability, create new events, and manage invites. When evaluating this agent, you might find that the default auto-annotation configuration incorrectly flags certain interactions as "bad," leading to a recall problem.

For instance, let's say expert annotators have marked interactions as "good" where the agent correctly identifies that it cannot access a user's calendar. However, the auto-annotation pipeline, relying on the "Tool Completeness" property, labels these as "bad" because a "no access" response from the calendar tool receives a low score (e.g., 3 out of 5).

Since the agent is designed to handle this "no access" scenario appropriately (e.g., by informing the user), the low score from "Tool Completeness" is misleading. To fix this, you can adjust the auto-annotation YAML. By lowering the failure threshold for the "Tool Completeness" property to 2, you can ensure that these valid "no access" responses are no longer penalized, thus resolving the recall issue and aligning the automated annotations with expert judgment.
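
In terms of the earlier threshold sketch, the fix is a one-line change. The numbers are illustrative, and the snippet mirrors the intent of the YAML edit rather than its syntax.

```python
# Assumed starting point: scores below 4 on "Tool Completeness" are failures,
# so a correct "no access" reply scoring 3/5 gets the interaction marked "bad".
thresholds = {"Tool Completeness": 4}

# Lower the failure threshold so only scores below 2 count as failures;
# the 3/5 "no access" response now passes and the recall gap closes.
thresholds["Tool Completeness"] = 2
```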

E-commerce Scraping Bot

Imagine a bot that scrapes product pages from e-commerce websites and extracts information into a structured JSON format with predefined keys. During evaluation, you might find that interactions where the bot correctly returns an empty message for an unavailable page are wrongly annotated as "bad." This happens because the absence of a structured JSON output is seen as a failure by the default annotation rules.

To address this, you can add a dedicated block to your auto-annotation YAML. This block can use a combination of properties to identify these specific cases and annotate them as "good." For example, you can check whether the Avoided Answer property is True (indicating the bot deliberately avoided generating JSON) and Input Text Length is below a certain threshold (indicating the scraped page is indeed empty or near-empty). This ensures the bot is rewarded for correctly handling unavailable pages, improving the accuracy of your auto-annotations.
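
A sketch of that dedicated block as plain logic follows; the Input Text Length cut-off (here in characters) is an assumption, and the syntax in your YAML configuration will differ.

```python
# Illustrative dedicated block for unavailable pages. The 50-character cut-off
# for "empty" input is an assumption; tune it to your scraped pages.
def correctly_handled_unavailable_page(props: dict) -> bool:
    avoided = props.get("Avoided Answer", False)               # bot deliberately produced no JSON
    near_empty_input = props.get("Input Text Length", 0) < 50  # scraped page had little or no content
    return avoided and near_empty_input


def annotate_scraper_output(props: dict):
    if correctly_handled_unavailable_page(props):
        return "good"  # reward correct handling of unavailable pages
    return None        # fall through to the default annotation rules
```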