Version Comparison
A common scenario in the lifecycle of LLM-based apps is the need to compare a few alternative versions across various parameters (e.g. performance, cost, safety) and choose the preferred one. This usually arises either during the iterative development cycle, when working on a new application, or when an application already in production needs to be improved or updated.
Comparing Versions: Supplying Identical `user_interaction_id`s
- To fairly compare different versions, you'd want to run them on similar input data and compare their properties and annotations. We recommend building an evolving evaluation dataset, enabling an 🍏-to-🍏 comparison. For more info, see Evaluation Dataset Management.
- Some of Deepchecks' version comparison features require that the interactions are the same across versions. To enable those features, give similar interactions the same `user_interaction_id` value when uploading them to the different versions, allowing Deepchecks to identify that they are similar interactions (see the sketch below).
- For more info about the recommended data fields, see the Supported Applications page.
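To illustrate the idea, here is a minimal Python sketch. The `run_pipeline` and `upload_interaction` helpers are hypothetical stand-ins for your own application code and for whatever upload mechanism you use (e.g. the Deepchecks SDK or a CSV upload); the key point is that the same `user_interaction_id` is reused for the same input across both versions.

```python
# Illustrative sketch only: the helper names below are hypothetical stand-ins,
# not the actual Deepchecks SDK API.

def run_pipeline(version: str, prompt: str) -> str:
    """Stand-in for your application's LLM pipeline for a given version."""
    return f"[{version}] answer to: {prompt}"

def upload_interaction(version: str, user_interaction_id: str,
                       input_text: str, output_text: str) -> None:
    """Hypothetical upload call; replace with your actual SDK or CSV upload."""
    print(f"{version}: uploading interaction {user_interaction_id}")

# Evaluation set: each input gets a stable ID that is reused for every version.
eval_set = [
    {"id": "question-001", "input": "What is your refund policy?"},
    {"id": "question-002", "input": "How do I reset my password?"},
]

for version in ("v1-baseline", "v2-new-prompt"):
    for item in eval_set:
        output = run_pipeline(version, item["input"])
        upload_interaction(
            version=version,
            user_interaction_id=item["id"],  # identical ID across versions
            input_text=item["input"],
            output_text=output,
        )
```

Because `question-001` and `question-002` are uploaded to both versions with the same IDs, Deepchecks can pair the corresponding interactions when comparing the versions.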
Deepchecks enables you to compare different versions at a high level as well as at a granular level.
Deepchecks' High-Level Comparison
The Versions screen allows you to see all available versions with your selected key comparison criteria. Those criteria can include the auto annotation score, key properties, similarity to expert responses, and cost and latency metrics.
Deepchecks' Granular Comparison
Selecting two versions allows you to drill down to the specific differences between them. It can highlight specific interactions from the evaluation set whose outputs in the different versions are most dissimilar or interactions that received different annotations between the versions.
You can choose what to compare between the versions, e.g. view the interactions according to where they differ on a specific property.