Evaluation Dataset Management

Use Deepchecks to maintain a broad set of inputs that represents your LLM-based app's production traffic, and that enables scoring each new version of the app.

The evaluation set is a pivotal component in the lifecycle of LLM-based apps. It serves as the benchmark for assessing the performance, quality, and behavior of each version. Deepchecks offers a suite of functionalities designed to streamline the management of your evaluation set, building on automated scoring, property calculation, and the other capabilities described throughout the different sections of the documentation.

Should All Versions Share the Same Evaluation Set?

The core idea of an evaluation set is to have a constant, well-defined set of inputs that is used systematically across all versions of an LLM-based app. This means that each time there is a new version of the app (e.g., the system prompt was modified), the evaluation set is "fed" into the new version, and the outputs of the latest version can be compared to those of all previous versions. In other words - the ideal comparison setting has identical inputs but different outputs.
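This workflow can be sketched in plain Python. The example below is a minimal illustration, not Deepchecks SDK code: the app functions, input strings, and `run_version` helper are all hypothetical stand-ins for two versions of an app (e.g., with different system prompts) being run over one fixed evaluation set.

```python
# Hypothetical evaluation inputs, held constant across versions.
eval_set = [
    "Summarize our refund policy.",
    "What plans support SSO?",
]

# Stand-ins for two app versions (e.g., differing only in system prompt).
def app_v1(question: str) -> str:
    return f"v1 answer to: {question}"

def app_v2(question: str) -> str:
    return f"v2 answer to: {question}"

def run_version(app, inputs):
    """Feed the identical evaluation inputs to a version and collect outputs."""
    return {q: app(q) for q in inputs}

results = {name: run_version(app, eval_set)
           for name, app in [("v1", app_v1), ("v2", app_v2)]}

# Because the inputs are identical, outputs can be compared input-by-input.
for q in eval_set:
    print(q, "->", results["v1"][q], "|", results["v2"][q])
```

The key property is that every version answers exactly the same inputs, so any difference in outputs (and in the scores computed over them) is attributable to the version change rather than to a shift in the inputs.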

It's worth noting that in practice, when the evaluation set gets modified over time, teams don't always ensure "backward compatibility", thus creating situations in which versions being compared may have only a partial overlap between the inputs of their evaluation sets. This setting isn't ideal from an evaluation point of view but is necessary in many cases due to limited resources.

For this reason, Deepchecks doesn't enforce that all versions use the exact same evaluation set, but we highly recommend ensuring that the most recent versions (or leading "candidate" versions) do share the exact same evaluation set inputs.
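When two versions were evaluated on sets that only partially overlap, one reasonable approach is to restrict the comparison to the shared inputs. The sketch below uses hypothetical input identifiers to show how the overlap (and how much of the combined set it covers) can be computed:

```python
# Hypothetical evaluation inputs used for each of two versions.
v1_inputs = {"q1", "q2", "q3"}
v2_inputs = {"q2", "q3", "q4"}

# Only the shared inputs allow an apples-to-apples output comparison.
shared = v1_inputs & v2_inputs
coverage = len(shared) / len(v1_inputs | v2_inputs)

print(sorted(shared))      # inputs both versions actually saw
print(round(coverage, 2))  # fraction of the combined set that overlaps
```

A low coverage value is a signal that the comparison between the two versions rests on a narrow common ground, which is exactly the situation the recommendation above aims to avoid.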

Core Functionalities: Managing Your Evaluation Set with Deepchecks

Managing your evaluation set combines several distinct workflows. For convenience, we've split them into separate pages:

  1. Generating an Initial Evaluation Set (RAG Use Cases Only)
  2. Sending an Existing Evaluation Set via Deepchecks SDK
  3. Uploading an Existing Evaluation Set via Drag & Drop UI
  4. Excluding Undesired Samples from the Evaluation Set
  5. Cloning Samples from Production into the Evaluation Set
  6. Adding New Samples to the Evaluation Set from an External Source
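As an illustration of the kind of selection behind workflow 5 above, the sketch below filters production interactions down to candidates worth cloning into the evaluation set. The records, the `score` field, and the threshold are all assumed for the example; the actual cloning is done within Deepchecks as described on the linked page.

```python
# Hypothetical production interactions with an automated score per sample.
production = [
    {"input": "How do I reset my password?", "score": 0.92},
    {"input": "Can I export data to CSV?",   "score": 0.41},
    {"input": "What is your SLA?",           "score": 0.35},
]

SCORE_THRESHOLD = 0.5  # assumed cutoff: clone samples the app handled poorly

# Low-scoring production samples often expose gaps the evaluation set misses.
to_clone = [s["input"] for s in production if s["score"] < SCORE_THRESHOLD]
print(to_clone)
```

Selecting by low score is only one possible policy; teams may instead sample by topic coverage, user feedback, or recency, depending on what the evaluation set is meant to represent.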

Aside from these technical workflows within Deepchecks, deciding which samples make up the evaluation set, and defining the policy for updating it, are significant steps in setting up the policies for your LLM-based app. It's highly recommended that multiple team members collaborate on these decisions to make sure they are made optimally and take nuances into account.