Compare Between Versions
A foundational step in improving an LLM-based application is comparing its outputs across versions to identify the best-performing configuration.
This can be done using two comparison options available in the Deepchecks system: Single Interaction Comparison and Version Comparison. We will explain how to use each method, applying them to our GVHD demo use case.
Single Interaction Comparison
The Single Interaction Comparison feature lets you compare interactions that share the same “user_interaction_id” across different versions. Each interaction in one version of an application can be compared to its corresponding interaction in the other versions.
Note: The versions do not need to have the same number of interactions to compare samples with identical “user_interaction_id”.
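The matching rule described above can be sketched in plain Python. This is only an illustration of pairing by “user_interaction_id”, not the Deepchecks implementation or SDK: the function name `match_interactions`, the dictionary layout, the placeholder outputs, and the ID `qa-medical-gvhd-90` are all assumptions made for the example.

```python
def match_interactions(version_a, version_b):
    """Pair interactions from two versions that share a user_interaction_id.

    Each version is a list of dicts (a hypothetical layout). Interactions
    that exist in only one version are skipped, so the versions do not
    need to contain the same number of interactions.
    """
    b_by_id = {row["user_interaction_id"]: row for row in version_b}
    return [
        (row, b_by_id[row["user_interaction_id"]])
        for row in version_a
        if row["user_interaction_id"] in b_by_id
    ]


# Placeholder data: three interactions in one version, two in the other.
v1 = [
    {"user_interaction_id": "qa-medical-gvhd-29", "output": "v1 output (placeholder)"},
    {"user_interaction_id": "qa-medical-gvhd-83", "output": "v1 output (placeholder)"},
    {"user_interaction_id": "qa-medical-gvhd-90", "output": "only in v1 (hypothetical ID)"},
]
v2 = [
    {"user_interaction_id": "qa-medical-gvhd-83", "output": "v2 output (placeholder)"},
    {"user_interaction_id": "qa-medical-gvhd-29", "output": "v2 output (placeholder)"},
]

pairs = match_interactions(v1, v2)
matched_ids = [a["user_interaction_id"] for a, _ in pairs]
print(matched_ids)
```

Only the two IDs present in both lists are paired; the interaction that exists in just one version is left out rather than causing an error.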
To access the single interaction comparison panel:
- Go to the Data page.
- Click on one of the interactions in the table.
- Click the “Compare to Other Version” button on the bottom left side.
In our GVHD demo use case, both versions contain the same input interactions but differ in the information retrieval provided to the LLM pipeline, resulting in different outputs. Therefore, you can compare each sample in one version to its equivalent in the other version.
To search for a specific interaction, use the “Search by Interaction ID” bar located on the right side above the samples table (see GIF below).
Let’s compare 2 samples in our GVHD application:
- qa-medical-gvhd-83: In this sample, the application was asked for immediate steps due to GVHD-related pain. While both versions provide several steps, the answer in version v2_improved_IR offers a more comprehensive and contextually relevant range of actions. This interaction is currently annotated as “good” by the Deepchecks Evaluator.
- qa-medical-gvhd-29: In this example, version v2_improved_IR received a “bad” auto-annotation due to high similarity with the output in version v1_gpt3.5. Despite the different contexts provided to the LLM, as shown in the “Information Retrieval” tab, the outputs were similar enough to receive the same annotation.
Version Comparison
The Deepchecks Version Comparison screen is used to evaluate performance differences between versions. It lets you identify overlapping interactions by matching samples that have the same user_interaction_id, and determine where one version outperforms another.
For our GVHD demo use case, we will compare versions v1_gpt3.5 and v2_improved_IR.
Navigate to the Versions page, select both versions, and click “Compare 2 versions” to start the analysis.
For more information, see the Version Comparison page.
You can see the three types of comparisons available in Deepchecks' Version Comparison panel in the GIF below:
Note: In all comparison panels, interactions with the most divergent scores or annotations are positioned on the left side of the interaction bar, while those with the most similar outputs are on the right.
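The ordering in the note above can be illustrated with a short sketch. It assumes each paired interaction carries one numeric score per version and ranks pairs by the absolute score difference; how Deepchecks actually measures divergence is not specified here, so the field names `score_a` and `score_b`, the IDs, and the scores themselves are hypothetical.

```python
def order_by_divergence(pairs):
    """Sort paired interactions so the largest score gap comes first."""
    return sorted(
        pairs,
        key=lambda pair: abs(pair["score_a"] - pair["score_b"]),
        reverse=True,
    )


# Hypothetical scores for three interactions present in both versions.
paired_scores = [
    {"user_interaction_id": "example-1", "score_a": 0.90, "score_b": 0.85},
    {"user_interaction_id": "example-2", "score_a": 0.20, "score_b": 0.95},
    {"user_interaction_id": "example-3", "score_a": 0.60, "score_b": 0.60},
]

ordered = order_by_divergence(paired_scores)
ordered_ids = [pair["user_interaction_id"] for pair in ordered]
print(ordered_ids)  # ['example-2', 'example-1', 'example-3']
```

The first element corresponds to the leftmost position on the interaction bar (most divergent), and the last element to the rightmost (most similar).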