
Compare Between Versions

In the E-Commerce Summarization application, the goal is to compare prompts and identify the one that best balances two key properties: Attractiveness and Grounded in Context. We also assess two other Deepchecks properties relevant to summarization: Coverage and Text Quality.

Application Versions

The Deepchecks Application Versions screen provides an overview of key metrics and their differences across three versions.

Here we can see the first insight about the different prompts: there is an inherent tradeoff between the three metrics that matter most for our use case, "Attractiveness", "Grounded in Context" and "Coverage". In this scenario, the user selected the Balanced version, which achieves satisfactory scores on all the key metrics even though it is not the top performer on any of them.

Note

User preferences are reflected in the estimated score calculated from the auto-annotation YAML: the selected version received a significantly higher score than the others.

Comparing Interactions Between Versions

After gaining a high-level understanding of the version differences, the user might wish to compare two specific versions.

To perform the comparison, select the versions named Attractive and Balanced, then click "Compare 2 Versions".

Next, we can look at how the two versions differ across samples, with Balanced on the left and Attractive on the right:

Remember: We created an auto-annotation YAML, which helps us estimate the 'Good' and 'Bad' annotations.

In the sample above, we can see that the Balanced version (left) was estimated as good based on its property scores: Coverage (0.88), Text Quality (5), Attractiveness (5) and Grounded in Context (0.99). The Attractive version (right) was estimated as bad because of its low Grounded in Context score (0.5).
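One way to picture how property scores turn into a 'good' or 'bad' estimate is as a set of minimum-score checks. The following is a minimal sketch under stated assumptions, not the Deepchecks implementation: `MIN_SCORES` and `estimate_annotation` are hypothetical names, and the threshold values are invented for illustration, since the actual rules are defined in the auto-annotation YAML and are not listed here. The scores passed in are the ones quoted for the sample above.

```python
from typing import Dict, List, Tuple

# Hypothetical minimum scores (assumption): the real thresholds are
# defined in the auto-annotation YAML and are not given in this guide.
MIN_SCORES: Dict[str, float] = {
    "Coverage": 0.7,
    "Text Quality": 3,
    "Attractiveness": 3,
    "Grounded in Context": 0.7,
}


def estimate_annotation(scores: Dict[str, float]) -> Tuple[str, List[str]]:
    """Return the estimated label and the properties that fell below their minimum.

    Properties missing from `scores` are skipped rather than treated as failures.
    """
    failing = [
        name
        for name, minimum in MIN_SCORES.items()
        if name in scores and scores[name] < minimum
    ]
    return ("bad" if failing else "good", failing)


# Scores quoted in the text for the compared sample.
balanced = {
    "Coverage": 0.88,
    "Text Quality": 5,
    "Attractiveness": 5,
    "Grounded in Context": 0.99,
}
# Only the Grounded in Context score is reported for the Attractive version.
attractive = {"Grounded in Context": 0.5}

print(estimate_annotation(balanced))
print(estimate_annotation(attractive))
```

Returning the list of failing properties alongside the label mirrors how the comparison view attributes a 'bad' estimate to a specific property, here Grounded in Context.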

By marking the differences between the two versions and reading the Consistency explanation, we can see that the Attractive version (right) contains a hallucination:

"It also features DTS X: Ultra technology for immersive gaming audio. Support: This fact is not explicitly mentioned in the input."
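Outside the UI, the idea of marking what one version says that the other does not can be approximated with a sentence-level comparison. This is a rough sketch only: `split_sentences` and `sentences_only_in` are hypothetical helpers, the splitter is deliberately naive, and the two outputs are placeholders except for the flagged sentence, which is the one quoted above. It does not reproduce how Deepchecks computes or displays differences.

```python
import re
from typing import List


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def sentences_only_in(candidate: str, reference: str) -> List[str]:
    """Return the sentences of `candidate` that do not appear verbatim in `reference`."""
    reference_sentences = set(split_sentences(reference))
    return [s for s in split_sentences(candidate) if s not in reference_sentences]


# Placeholder outputs (assumption): the full summaries are not reproduced in
# this guide, so only the flagged sentence is taken from the text.
balanced_output = "Shared placeholder sentence one. Shared placeholder sentence two."
attractive_output = (
    "Shared placeholder sentence one. Shared placeholder sentence two. "
    "It also features DTS X: Ultra technology for immersive gaming audio."
)

for sentence in sentences_only_in(attractive_output, balanced_output):
    print("Only in Attractive:", sentence)
```

Sentences that appear in only one version are good candidates to check against the input, since an unsupported claim such as the one above will show up there.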

Single Interaction Comparison

Let’s explore interaction ID 46: a summary of the Galaxy M33 5G, a smartphone with a 5nm Exynos 1280 processor, FHD+ 120Hz display, 50MP quad camera, and a 6000 mAh battery.

One version lists all the features of the device in a dry manner, while the other phrases the same product description in a much more appealing way.

To search for this specific interaction, use the “Search by Interaction ID” bar located on the right side above the samples table:

Let’s compare 2 samples in our E-Commerce application:


interaction id: 46

In the screen below, we can see the differences between the two versions:

  • Base (left) - Covers the basics but lacks an engaging tone. Annotated as bad for uninspired delivery.
  • Balanced (right) - Polished and engaging. Estimated as good based on its property scores: Coverage (0.82), Consistency (5), Text Quality (5), Attractiveness (5), Grounded in Context (0.96).