Version Comparison

Use Deepchecks to compare versions of an LLM-based app. The versions may differ in their prompts, models, routing mechanism, or any other component of the application.

A common need in the lifecycle of LLM-based applications is to compare multiple versions across key factors, such as performance, cost, and safety, to choose the best option. This typically comes up during iterative development of a new app, or when an existing production system needs improvement due to poor performance, changing data, or shifts in user behavior.

Comparisons can be simple—between two or three versions, such as the current production version and a few new candidates—or more complex, involving dozens of alternatives. The latter is common when experimenting with different prompts or testing multiple base models to find the best trade-off between cost and performance.

🚧

Comparing Versions via Identical user_interaction_ids

  • To fairly compare different versions, you'd want to run them on similar input data and compare their properties and annotations. We recommend building an evolving evaluation dataset, which enables comparing 🍏 to 🍏 via the user_interaction_id field (see the minimal sketch after this note). For more info, see Evaluation Dataset Management
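For illustration, here is a minimal Python sketch of the idea. The `log_to_deepchecks` helper is hypothetical (replace it with whatever SDK or API call you actually use to log interactions); the point is that the same user_interaction_id from the evaluation dataset is reused for every version, so Deepchecks can pair the interactions across versions:

```python
# Hypothetical sketch: run the same evaluation dataset through two app
# versions, reusing each sample's user_interaction_id so Deepchecks can
# pair the interactions across versions ("apples to apples").

evaluation_dataset = [
    {"user_interaction_id": "eval-001", "input": "How do I reset my password?"},
    {"user_interaction_id": "eval-002", "input": "What is your refund policy?"},
]

def run_app(version_name: str, user_input: str) -> str:
    """Placeholder for your LLM pipeline (prompt, model, routing, etc.)."""
    return f"[{version_name}] answer to: {user_input}"

def log_to_deepchecks(version_name: str, user_interaction_id: str,
                      user_input: str, output: str) -> None:
    """Hypothetical helper: replace with your actual logging call (SDK or API)."""
    print(f"version={version_name} id={user_interaction_id} output={output!r}")

for version_name in ["prod-v1", "candidate-v2"]:
    for sample in evaluation_dataset:
        output = run_app(version_name, sample["input"])
        log_to_deepchecks(version_name, sample["user_interaction_id"],
                          sample["input"], output)
```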

Deepchecks allows you to compare versions both at a high level and with fine-grained detail.

Deepchecks' High-Level Comparison

You can decide on the desired version by comparing high-level metrics such as the overall score, selected property scores, and tracing metrics. This is made possible by the built-in, user-value, and prompt properties, along with the auto-annotation pipeline that calculates the overall scores.

All of this is available on Deepchecks' Versions page. A practical starting point is to sort the versions by "Score w. Est"—which reflects the average auto annotation score—and then examine the key property averages and average latency and tokens of the top candidates.

For example, one version might consistently produce responses with decent completeness and coherence—enough for a "good" annotation—while another version generates highly complete and coherent responses but ends up with a slightly lower overall score. Deepchecks makes it easy to spot and interpret these subtle differences. Notice that the comparison of two versions is always done on the Interaction Type level.

To streamline comparison, you can click the up-arrow icon next to any version to “stick” it to the top of the list. This is especially useful when focusing on a few specific versions—allowing you to sort, filter, and explore them specifically.


Version Comparison CSV Export

After reviewing the high-level comparison, you can dive deeper by exporting the results into a CSV file.

This option is useful when:

  • You want to review comparisons without the UI’s display constraints.
  • Your team needs to continue analysis using external tools.

The exported CSV includes all selected versions, shown side by side, with details such as:

  • Overall performance assessment
  • System metrics (latency, token usage, etc.)
  • Score breakdowns by main properties
  • And more

Because the output is a CSV, you can extend the analysis in whichever way fits your workflow—filtering, aggregating, or combining results with other data sources.
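For instance, a minimal sketch using pandas. The file name and column names below are assumptions; inspect the headers of your actual export (which may be laid out differently, e.g. with versions side by side) and adjust accordingly:

```python
import pandas as pd

# Load the exported comparison (the file name here is an assumption).
df = pd.read_csv("version_comparison_export.csv")

# Inspect the actual layout first; the columns depend on the versions and
# properties you selected when exporting.
print(df.columns.tolist())

# Illustrative analysis, assuming one row per version with columns such as
# "Version", "Score w. Est", "Avg Latency", and "Avg Tokens":
summary = df.sort_values("Score w. Est", ascending=False)
print(summary[["Version", "Score w. Est", "Avg Latency", "Avg Tokens"]].head())
```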

How to use it:

  1. Go to the Versions screen.
  2. Select any number of versions you’d like to compare.
  3. Click Export to CSV.
  4. The file will be automatically downloaded within a few seconds.

🚧

Version Comparison CSV Export Notes

  • The property score list in the CSV does not include every property on the platform, but focuses on the numerical properties highlighted in the Applications Overview dashboard—these are typically the most relevant for comparison.
  • Unlike the granular comparison, the CSV export does not enforce that all selected versions come from the same dataset. Keep this in mind when interpreting results.

Deepchecks' Granular Comparison

Selecting two versions lets you drill down into the specific differences between them. The comparison can highlight interactions from the evaluation set whose outputs are most dissimilar across the two versions, or interactions that received different annotations in each version.

You can choose what to compare between the versions, e.g., view the interactions where they differ most on a specific property, or where they differ on latency and tokens.


Identifying the Better Version and Interaction

The version comparison view also highlights which version and which individual interactions perform "better", based on a numerical property analysis (a sketch of this logic follows the list below):

  • Interaction-Level Star Indicator
    Each interaction pair is evaluated based on the average of the first three pinned numerical properties (typically representing the most important metrics). The interaction with the higher average receives a star. In cases where the averages are equal, both interactions receive a star.
  • Version-Level Star Indicator
    A star icon is displayed next to the version that contains a majority of “better” interactions. This provides a quick visual indication of which version performs more favorably overall.
  • Explanation on Hover
    Hovering over a star—at either the version or interaction level—reveals a tooltip explaining the rationale behind the selection.
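The selection logic described above can be illustrated with a short sketch. This is an assumption-laden illustration, not Deepchecks' actual implementation; the pinned property names and scores are made up, and the tie handling mirrors the description above:

```python
# Illustrative sketch (not Deepchecks' code): award stars based on the
# average of the first three pinned numerical properties, as described above.

pinned_properties = ["Completeness", "Coherence", "Relevance"]  # assumed names

def interaction_stars(scores_a: dict, scores_b: dict) -> tuple[bool, bool]:
    """Return (star_for_a, star_for_b) for a single interaction pair."""
    avg_a = sum(scores_a[p] for p in pinned_properties) / len(pinned_properties)
    avg_b = sum(scores_b[p] for p in pinned_properties) / len(pinned_properties)
    if avg_a == avg_b:          # equal averages: both interactions get a star
        return True, True
    return avg_a > avg_b, avg_b > avg_a

def version_star(pairs: list[tuple[dict, dict]]) -> str:
    """Give the version-level star to the version with more 'better' interactions."""
    wins_a = wins_b = 0
    for scores_a, scores_b in pairs:
        star_a, star_b = interaction_stars(scores_a, scores_b)
        wins_a += star_a
        wins_b += star_b
    return "version A" if wins_a > wins_b else "version B" if wins_b > wins_a else "tie"

# Example usage with made-up scores for two interaction pairs:
pairs = [
    ({"Completeness": 0.9, "Coherence": 0.8, "Relevance": 0.7},
     {"Completeness": 0.6, "Coherence": 0.9, "Relevance": 0.7}),
    ({"Completeness": 0.5, "Coherence": 0.5, "Relevance": 0.5},
     {"Completeness": 0.4, "Coherence": 0.5, "Relevance": 0.6}),
]
print(version_star(pairs))  # -> "version A" in this example
```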