Analyze & Improve Overview
Find problems, test solutions, and pick the best version - the complete improvement loop from investigation to production monitoring.
In the Evaluation section you learned how Deepchecks measures quality - properties score every interaction and annotations label each one as Good, Bad, or Unknown. Now you have data with quality signals. This section is about acting on those signals: finding what is wrong, testing improvements, and verifying that quality holds after deployment.
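To make the two signals concrete, here is a minimal sketch in plain Python of what "data with quality signals" looks like and how it can be summarized. The records and property names (`Relevance`, `Grounded in Context`) are illustrative placeholders, not a Deepchecks API:

```python
from collections import Counter

# Hypothetical interaction records: each one carries property scores
# plus an annotation of Good, Bad, or Unknown.
interactions = [
    {"annotation": "Good",    "properties": {"Relevance": 0.92, "Grounded in Context": 0.88}},
    {"annotation": "Bad",     "properties": {"Relevance": 0.41, "Grounded in Context": 0.35}},
    {"annotation": "Good",    "properties": {"Relevance": 0.85, "Grounded in Context": 0.90}},
    {"annotation": "Unknown", "properties": {"Relevance": 0.70, "Grounded in Context": 0.66}},
]

# Annotation breakdown: how many interactions fall under each label.
breakdown = Counter(i["annotation"] for i in interactions)

# Mean score for one property across all interactions.
def mean_property(name):
    scores = [i["properties"][name] for i in interactions]
    return sum(scores) / len(scores)

print(dict(breakdown))                     # {'Good': 2, 'Bad': 1, 'Unknown': 1}
print(round(mean_property("Relevance"), 2))  # 0.72
```

A low mean property score or a rising share of Bad annotations is exactly the kind of signal the rest of this section helps you act on.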
The improvement loop
Improving an LLM application is not a one-time task - it is a cycle. Deepchecks provides tools for each step:
- Find - The Interactions and Sessions screens are where you filter, search, and narrow down your data to the subset worth investigating - by annotation, property score, version, interaction type, or date range.
- Investigate - Root Cause Analysis tells you what is failing and why. Property explainability, annotation breakdowns, and AI-generated insights pinpoint the specific issues dragging quality down.
- Inspect - The Session View lets you walk through individual traces interaction by interaction, so you can see exactly what happened at each step of a pipeline run.
- Build test sets - Dataset Management lets you turn failure cases into curated evaluation sets. Every future version gets tested against the same inputs, so comparisons are fair.
- Compare - Version Comparison puts two or more implementations side by side on quality, latency, and token usage, down to the individual interaction level. You pick the winner.
- Weigh cost - Cost Tracking shows what each version costs to run, so you can make informed trade-offs between quality improvement and spending.
- Monitor - Once you deploy the best version, Production Monitoring tracks quality over time. When it detects degradation, you start the loop again.
This creates a feedback loop: production reveals problems, problems become test cases, new versions are tested against those cases, the best version is deployed, and production monitoring watches for the next issue.
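The loop above can be sketched in a few lines of plain Python. Everything here - `pick_best_version`, `grow_test_set`, the toy scorer - is a hypothetical placeholder for your own evaluation harness, not the Deepchecks SDK:

```python
def pick_best_version(candidates, test_set, evaluate):
    """Compare: score every candidate on the SAME curated test set,
    so the comparison is fair, and return the winner."""
    scored = {name: evaluate(version, test_set) for name, version in candidates.items()}
    return max(scored, key=scored.get)

def grow_test_set(test_set, failures):
    """Find -> Build test sets: production failure cases become part of
    the evaluation set every future version is tested against."""
    return test_set + [case for case in failures if case not in test_set]

# Toy demo: versions are answer tables; the evaluator counts correct answers.
test_set = [("2+2", "4"), ("3+3", "6")]
versions = {"v1": {"2+2": "4", "3+3": "7"}, "v2": {"2+2": "4", "3+3": "6"}}
score = lambda version, cases: sum(version.get(q) == a for q, a in cases)

best = pick_best_version(versions, test_set, score)  # "v2" wins 2-1
test_set = grow_test_set(test_set, [("5*5", "25")])  # a failure becomes a test case
```

The important property is that the test set only grows: once a production failure is captured, every future version is checked against it.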
What is in this section
- Interactions and Sessions Screens - Filter, search, and navigate your data to find the interactions worth investigating
- Root Cause Analysis - Property explainability, annotation breakdowns, automated insights, and failure mode analysis
- Navigating the Session View - Walk through individual traces interaction by interaction to understand what happened
- Dataset Management - Create and manage curated evaluation sets for reproducible testing
- Version Comparison - Compare implementations side by side on quality, cost, and latency
- Cost Tracking - Automatic LLM cost tracking based on token usage and configured model pricing
- Production Monitoring - Track performance trends over time in production with external integrations
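Of the items above, cost tracking is simple enough to show directly: cost per interaction is token counts multiplied by per-token pricing. The model names and rates below are illustrative placeholders, not real pricing:

```python
# USD per 1M tokens: (input rate, output rate). Placeholder values only -
# configure real model pricing in your own setup.
PRICING = {
    "model-a": (3.00, 15.00),
    "model-b": (0.50, 1.50),
}

def interaction_cost(model, input_tokens, output_tokens):
    """Cost of one LLM call: tokens consumed times the configured rates."""
    in_rate, out_rate = PRICING[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# 1,200 input tokens and 300 output tokens on "model-a":
cost = interaction_cost("model-a", 1200, 300)
print(f"${cost:.6f}")  # $0.008100
```

Summing this per-interaction figure across a version's traffic is what lets you weigh a quality improvement against its added spend.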