Build & Maintain a High-Quality Evaluation Set

Using Deepchecks to build a high-quality evaluation set and use it effectively

The evaluation set is a critical component in the development lifecycle of LLM-based applications. It acts as the benchmark for assessing performance, identifying strengths and weaknesses, comparing versions, and conducting pre-production root cause analysis. Because of its central role, it's essential to invest in building a high-quality, production-representative evaluation set.

As your understanding of real-world usage deepens and new edge cases emerge, the evaluation set must evolve. Deepchecks supports this ongoing process with tools to assess evaluation quality—such as Built-in Properties and Automatic Annotations—and features that make it easy to manage, update, and scale your evaluation set over time.

Should All Versions Share the Same Evaluation Set?

The core idea of an evaluation set is to have a constant, well-defined set of inputs that is used systematically across all versions of an LLM-based app. This means that each time there is a new version of the app (e.g. the system prompt was modified), the evaluation set is "fed" into the new version, and the outputs of the latest version can be compared to the outputs of all previous versions. This ensures fair, apples-to-apples comparisons during Version Comparison.
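As a rough illustration of this workflow (plain Python, not the Deepchecks SDK; the app version functions, `EVAL_SET`, and `run_eval_set` are hypothetical stand-ins for your own application code), the sketch below feeds one fixed list of evaluation inputs through every app version and collects the outputs keyed by version, so that later comparison always covers the exact same inputs.

```python
from typing import Callable, Dict, List

# Hypothetical stand-ins for two versions of an LLM app that differ only in
# their system prompt. In a real setup these would call your model/pipeline.
def app_v1(user_input: str) -> str:
    return f"[v1, terse prompt] answer to: {user_input}"

def app_v2(user_input: str) -> str:
    return f"[v2, detailed prompt] answer to: {user_input}"

# A constant, well-defined evaluation set: the same inputs go to every version.
EVAL_SET: List[str] = [
    "How do I reset my password?",
    "Summarize the refund policy in one sentence.",
    "Which plans include SSO support?",
]

APP_VERSIONS: Dict[str, Callable[[str], str]] = {"v1": app_v1, "v2": app_v2}

def run_eval_set(versions: Dict[str, Callable[[str], str]],
                 eval_set: List[str]) -> Dict[str, Dict[str, str]]:
    """Run every version on the exact same inputs; return {version: {input: output}}."""
    return {
        name: {user_input: app(user_input) for user_input in eval_set}
        for name, app in versions.items()
    }

if __name__ == "__main__":
    outputs = run_eval_set(APP_VERSIONS, EVAL_SET)
    # Apples-to-apples comparison: for each input, line up outputs across versions.
    for user_input in EVAL_SET:
        print(user_input)
        for name in APP_VERSIONS:
            print(f"  {name}: {outputs[name][user_input]}")
```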

In practice, evaluation sets often evolve, and teams don’t always maintain full backward compatibility—leading to version comparisons based on only partially overlapping inputs. While not ideal for evaluation, this is often a practical tradeoff due to resource constraints.

Deepchecks doesn’t enforce identical evaluation sets across versions, but we strongly recommend that the latest or leading candidate versions use the exact same inputs for reliable comparison.
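When evaluation sets do drift between versions, one practical safeguard is to compare only the inputs that both versions actually saw, and to note how much of each set falls outside that overlap. The sketch below is plain Python, not a Deepchecks feature; the `outputs_by_version` dictionaries follow the hypothetical structure produced in the previous example.

```python
from typing import Dict, Set, Tuple

def comparable_inputs(outputs_a: Dict[str, str],
                      outputs_b: Dict[str, str]) -> Tuple[Set[str], float, float]:
    """Return the shared inputs plus each version's coverage of that overlap."""
    shared = set(outputs_a) & set(outputs_b)
    coverage_a = len(shared) / len(outputs_a) if outputs_a else 0.0
    coverage_b = len(shared) / len(outputs_b) if outputs_b else 0.0
    return shared, coverage_a, coverage_b

# Example: v2's evaluation set dropped one input and added a new one.
v1_outputs = {"q1": "a", "q2": "b", "q3": "c"}
v2_outputs = {"q1": "a2", "q2": "b2", "q4": "d"}

shared, cov_v1, cov_v2 = comparable_inputs(v1_outputs, v2_outputs)
print(f"Comparing on {len(shared)} shared inputs "
      f"({cov_v1:.0%} of v1's set, {cov_v2:.0%} of v2's set)")
```

If the overlap coverage is low for either version, the comparison is based on only a partial picture, which is exactly why using identical inputs for the leading candidate versions is recommended.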

Core Components