13 days ago

0.42.0 Release Notes

by Yaron Friedman

Deepchecks LLM Evaluation 0.42.0 Release

We're excited to announce version 0.42 of Deepchecks LLM Evaluation. This release focuses on deeper session-level insights, streamlined configuration, and enterprise-grade governance. Highlights include session-level properties that evaluate entire conversations, a visual auto-annotation editor that replaces manual YAML editing, multi-turn dataset support, and new audit and usage tracking tools for workspace administrators.

Deepchecks LLM Evaluation 0.42.0 Release:

🧵 Session-Level Properties
🎛️ Visual Auto-Annotation Editor
💬 Multi-Turn Datasets
📋 Audit Logs & Usage Export
🔔 In-App Release Notes

What's New and Improved?

Session-Level Properties

Evaluating individual interactions only tells part of the story. Session-level properties analyze entire multi-turn conversations to detect patterns that only emerge across multiple exchanges - like user frustration building over time, instruction drift, or whether all parts of a complex request were ultimately fulfilled.

Two built-in session properties are now available:

User Satisfaction - Scores how satisfied the user appears throughout the conversation by detecting signals like repeated corrections, resignation, or genuine enthusiasm
Intent Fulfillment - Evaluates whether the assistant addressed all user requests across the session, accounting for recovery from early mistakes and multi-step task completion

Each property reviews the full session transcript and returns a score along with reasoning that cites specific turns. You can set pass/fail thresholds, and filter and sort sessions by property scores.
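As a rough illustration of thresholding and sorting by a session property, the sketch below works on plain Python dictionaries. The field names (`session_id`, `user_satisfaction`), the score values, and the threshold are assumptions made for this example, not the Deepchecks data model or SDK.

```python
# Hypothetical session records; field names and score scale are assumptions.
sessions = [
    {"session_id": "s1", "user_satisfaction": 4.5},
    {"session_id": "s2", "user_satisfaction": 2.0},
    {"session_id": "s3", "user_satisfaction": 3.5},
]

PASS_THRESHOLD = 3.0  # assumed pass/fail cutoff


def failing_sessions(records, prop, threshold):
    """Return the sessions scoring below the threshold, lowest score first."""
    failed = [r for r in records if r[prop] < threshold]
    return sorted(failed, key=lambda r: r[prop])


# Sort all sessions from worst to best, then keep only the failing ones.
worst_first = sorted(sessions, key=lambda r: r["user_satisfaction"])
failed = failing_sessions(sessions, "user_satisfaction", PASS_THRESHOLD)
```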

Session-level properties complement interaction-level properties to give you a holistic view of conversation quality - not just whether individual responses were good, but whether the overall experience succeeded.

Read the full Session-Level Properties documentation →

Visual Auto-Annotation Editor

Configuring auto-annotation rules no longer requires editing YAML files by hand. The new visual editor provides a drag-and-drop interface for building and refining your annotation pipeline directly in the UI.

What's new:

Visual Block Builder - Create ordered blocks of conditional rules that determine whether interactions are annotated as Good, Bad, or Unknown
Drag-and-Drop Reordering - Rearrange block priority with simple drag-and-drop, since the first matching block determines the annotation
Built-in Distribution Insights - See histograms showing how your property values are distributed, with real-time previews of how many interactions each condition would match

The editor reads and writes the same YAML configuration used by the pipeline, so you can switch between the visual editor and the raw YAML at any time. Access it from Interaction Types → Choose Interaction Type → Edit YAML.
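The first-match behavior of ordered blocks can be sketched in a few lines of Python. This is only a model of the idea, not the pipeline's implementation or its YAML schema: the `Block` class, the property names (`toxicity`, `relevance`), and the fallback to Unknown when no block matches are all assumptions.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Block:
    annotation: str                    # "Good", "Bad", or "Unknown"
    condition: Callable[[dict], bool]  # rule evaluated on property values


def annotate(properties: dict, blocks: List[Block], default: str = "Unknown") -> str:
    """Return the annotation of the first block whose condition matches."""
    for block in blocks:
        if block.condition(properties):
            return block.annotation
    return default  # assumed fallback when nothing matches


# Hypothetical rules; block order is the priority order.
blocks = [
    Block("Bad", lambda p: p["toxicity"] > 0.8),
    Block("Good", lambda p: p["relevance"] >= 0.7),
]

interaction = {"toxicity": 0.9, "relevance": 0.9}
result = annotate(interaction, blocks)
reordered = annotate(interaction, list(reversed(blocks)))
```

Both conditions match this interaction, so swapping the two blocks changes the outcome, which is why reordering matters.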

Read the full UI Auto-Annotation Configuration documentation →

Multi-Turn Datasets

Datasets now support multi-turn conversations, enabling you to build test suites that evaluate how your application handles extended, back-and-forth interactions - not just isolated single-turn exchanges.

What's new:

Multi-Turn Dataset Type - Create datasets specifically designed for conversational testing, where each sample represents an entire conversation scenario
Simulated User Behavior - Each sample defines a user goal along with behavioral dimensions like persistence, clarity, frustration, and directness, creating realistic and diverse conversation dynamics
AI Generation for Multi-Turn - Describe your application and generate diverse multi-turn scenarios automatically, with AI creating varied user goals and combining behavioral dimensions for comprehensive coverage
AI-Generated Labeling - Samples created through AI generation are automatically marked with a sparkle icon. If you manually edit an AI-generated sample, the label updates to reflect that it was edited, so you always know the origin and modification history of each sample
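To give a feel for how combining behavioral dimensions produces coverage, here is a small sketch that crosses user goals with two of the dimensions named above. The `Scenario` class, the example goals, and the low/high levels are assumptions for illustration; the actual generation is AI-driven and is not shown here.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Scenario:
    goal: str
    persistence: str
    clarity: str


# Hypothetical goals and dimension levels.
goals = ["cancel a subscription", "dispute a charge"]
levels = ["low", "high"]

# Every goal paired with every combination of persistence and clarity.
scenarios = [Scenario(g, p, c) for g, p, c in product(goals, levels, levels)]
```

Two goals and two levels per dimension already yield eight distinct scenarios; adding frustration and directness would multiply that further.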

Audit Logs & Usage Export

Two new tools give workspace administrators better visibility into platform activity and resource consumption.

Audit Logs

Every create, update, and delete action across the platform is now automatically logged. Administrators can download audit logs as a CSV file for any date range, covering who performed each action, what was changed, and when. Sensitive data like API keys and credentials are automatically redacted. Access audit log downloads from the Workspace Settings.

Granular Usage Export

A new Download Usage button in Workspace Settings → Usage lets administrators export token consumption data as a CSV. The export breaks down usage by application, version, environment, service type, and model, with separate input and output token counts for precise cost analysis. Select any date range to generate a detailed usage report.
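Once exported, the CSV can be summarized with the standard library. The sketch below totals input and output tokens per model; the column headers and sample rows are assumptions, so check them against the header row of your actual export.

```python
import csv
import io
from collections import defaultdict

# Stand-in for a downloaded export; column names are hypothetical.
SAMPLE_EXPORT = """application,model,input_tokens,output_tokens
support-bot,model-a,1200,300
support-bot,model-b,500,100
search,model-a,800,200
"""


def tokens_by_model(csv_text):
    """Sum input and output tokens for each model in the export."""
    totals = defaultdict(lambda: {"input": 0, "output": 0})
    for row in csv.DictReader(io.StringIO(csv_text)):
        totals[row["model"]]["input"] += int(row["input_tokens"])
        totals[row["model"]]["output"] += int(row["output_tokens"])
    return dict(totals)


totals = tokens_by_model(SAMPLE_EXPORT)
```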

In-App Release Notes

You'll now see a release notes popup when your Deepchecks instance is upgraded to a new version. The popup highlights what's new and links to the full changelog, so your team stays informed about new features without needing to check external documentation. You can dismiss it with a "don't show again" option, and it will only reappear when the next version is deployed.

about 1 month ago

0.41.0 Release Notes

by Yaron Friedman

Deepchecks LLM Evaluation 0.41.0 Release

We're excited to announce version 0.41 of Deepchecks LLM Evaluation. This release brings comprehensive cost visibility, smart property refinement, and powerful dataset generation capabilities. Highlights include automatic cost tracking across all interactions, AI-generated test data for agentic systems, and human-in-the-loop property feedback that makes evaluations more accurate over time.

Deepchecks LLM Evaluation 0.41.0 Release:

📊 Dataset Management
🤖 Agentic Dataset Generation
💰 Cost Tracking with Token-Level Visibility
🎯 Property Refinement with User Feedback

What's New and Improved?

Dataset Management

Datasets are now a first-class feature in Deepchecks, providing a structured way to create, manage, and run curated test collections for systematic LLM evaluation.

What datasets enable:

Reproducible Testing - Run the same test suite across versions to catch regressions and measure improvements
Controlled Evaluation - Move beyond random production sampling to intentional test scenarios
Systematic Coverage - Ensure your application handles edge cases, error scenarios, and diverse inputs
Benchmark Tracking - Compare performance, latency, and cost across versions with consistent test data

Core capabilities:

Create & Organize - Build datasets with up to 500 samples, each containing input, optional reference output, and optional metadata
Flexible Input - Add samples manually, upload CSV files, use AI generation (more on this below), or copy samples from your production data
Sample Management - Edit samples directly in the UI, update metadata, delete unwanted entries

Datasets work seamlessly with both SDK and UI workflows. Create them programmatically, populate via CSV, or use the AI generation tools to build comprehensive test suites in minutes.
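The sample shape described above (a required input, an optional reference output, optional metadata, and a 500-sample cap) can be checked before upload. The sketch below is a local pre-flight check only; the `Sample` class and `validate` function are hypothetical and are not part of the Deepchecks SDK.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_SAMPLES = 500  # limit stated in these release notes


@dataclass
class Sample:
    input: str
    reference_output: Optional[str] = None
    metadata: dict = field(default_factory=dict)


def validate(samples: List[Sample]) -> List[Sample]:
    """Raise ValueError if the dataset is too large or a sample has no input."""
    if len(samples) > MAX_SAMPLES:
        raise ValueError(f"{len(samples)} samples exceeds the limit of {MAX_SAMPLES}")
    for index, sample in enumerate(samples):
        if not sample.input.strip():
            raise ValueError(f"sample {index} has an empty input")
    return samples


dataset = validate([
    Sample("What is your refund policy?", reference_output="30 days."),
    Sample("Reset my password", metadata={"source": "manual"}),
])
```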

Read the full Dataset Management documentation →

Agentic Dataset Generation

Creating comprehensive test datasets for agentic systems is now as simple as describing what your agent does. The new Agents generation mode uses dimensional analysis to automatically create diverse, challenging scenarios that stress-test your agent across complexity levels, ambiguity, multi-step reasoning, and edge cases.

What's new:

No Data Source Required - Generate purely from your agent's description
Dimensional Coverage - Automatically tests simple, medium, and hard scenarios across multiple complexity axes
Agent-Specific Challenges - Creates situations involving multi-step workflows, ambiguous instructions, and constraint conflicts

This joins existing generation methods (RAG for document-based apps and Pentest for security testing) to give you the right tool for every evaluation need.

Read the full AI Data Generation documentation →

Cost Tracking with Token-Level Visibility

Understanding LLM costs is now effortless. Deepchecks automatically tracks spending across every interaction by calculating costs based on input and output token usage.

What this gives you:

Automatic Cost Calculation - Configure model pricing once at the organization level, and costs are computed automatically for every interaction
Token-Level Tracking - Monitor input tokens and output tokens separately to understand where costs come from
Aggregated Insights - View session-level and version-level cost totals to identify expensive patterns and compare cost-efficiency across versions
Filter by Cost - Find your most expensive queries instantly to optimize prompts or switch to more efficient models

Cost tracking works seamlessly with both interactions and spans. For multi-step agentic workflows, costs from nested LLM calls automatically roll up to parent interactions, giving you accurate end-to-end spending visibility.

Simply configure your model pricing once, and every interaction automatically shows its cost. No additional integration required beyond logging the model, input_tokens, and output_tokens fields you're likely already tracking.
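The arithmetic behind token-based costs and the roll-up to parent interactions can be sketched as follows. The `Span` class, the price table, and the per-1,000-token unit are assumptions for illustration; only the `model`, `input_tokens`, and `output_tokens` field names come from the text above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical prices in USD per 1,000 tokens.
PRICING = {"model-a": {"input": 0.50, "output": 1.50}}


@dataclass
class Span:
    model: Optional[str] = None  # None for spans that make no LLM call
    input_tokens: int = 0
    output_tokens: int = 0
    children: List["Span"] = field(default_factory=list)


def own_cost(span: Span) -> float:
    """Cost of this span's own LLM call, or 0.0 if it made none."""
    if span.model is None:
        return 0.0
    price = PRICING[span.model]
    return (span.input_tokens * price["input"]
            + span.output_tokens * price["output"]) / 1000


def total_cost(span: Span) -> float:
    """Own cost plus the rolled-up cost of all nested spans."""
    return own_cost(span) + sum(total_cost(child) for child in span.children)


# A parent interaction with two nested LLM calls.
root = Span(children=[
    Span(model="model-a", input_tokens=1000, output_tokens=200),
    Span(model="model-a", input_tokens=2000, output_tokens=0),
])
```

Here the parent makes no LLM call itself, so its total is just the sum of its children's costs.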

Read the full Cost Tracking documentation →

Property Refinement with User Feedback

Custom LLM properties now learn from your corrections. When a property's evaluation doesn't match your judgment, provide feedback explaining the correct score - and the property uses it as a training example for future evaluations.

What this enables:

Continuous Improvement - Properties become more accurate and aligned with your quality standards over time
Domain Adaptation - Teach properties your industry terminology, edge cases, and subjective quality thresholds
No Code Required - Refine evaluations through simple UI feedback, no prompt engineering needed

How it works:

For any custom LLM property evaluation, click the feedback icon and provide:

Your corrected score (1-5 stars for numerical properties or category selection for categorical properties)
Reasoning explaining why your score is more accurate

Deepchecks incorporates your feedback as in-context learning examples, so future evaluations reference your corrections when judging similar cases. Currently, the property-enhancing feedback is available for prompt properties only.
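In-context learning of this kind generally means placing past corrections into the evaluation prompt as worked examples. The sketch below shows that general pattern; the `Feedback` class, the prompt layout, and `build_prompt` are assumptions, and the prompt Deepchecks actually builds is not documented here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Feedback:
    interaction_text: str
    corrected_score: int
    reasoning: str


def build_prompt(instructions: str, feedback: List[Feedback], new_interaction: str) -> str:
    """Prepend corrected examples to the prompt for a new evaluation."""
    parts = [instructions, ""]
    for number, item in enumerate(feedback, start=1):
        parts += [
            f"Example {number}:",
            f"Interaction: {item.interaction_text}",
            f"Score: {item.corrected_score}",
            f"Reasoning: {item.reasoning}",
            "",
        ]
    parts += ["Now evaluate:", f"Interaction: {new_interaction}"]
    return "\n".join(parts)


prompt = build_prompt(
    "Rate the helpfulness of the answer from 1 to 5.",
    [Feedback("Q: refund? A: See the website.", 2, "Deflects without answering.")],
    "Q: shipping time? A: 3-5 business days.",
)
```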

This transforms properties from static evaluation rules into adaptive quality models that evolve with your needs.

Read the full Property Refinement documentation →

about 2 months ago

0.40.0 Release Notes

by Yaron Friedman

Deepchecks LLM Evaluation 0.40.0 Release

We’re excited to announce version 0.40 of Deepchecks LLM Evaluation. This release strengthens self-hosted deployments, improves how agentic systems are evaluated end-to-end, and continues to simplify platform governance. Highlights include a new self-hosted deployment guide, children-based annotation aggregation for agentic workflows, and centralized model management for self-hosted environments.

Deepchecks LLM Evaluation 0.40.0 Release:

🏗️ Self-Hosted Deployment Guide
🧠 Model Management for Self-Hosted Deployments
🧩 Children Annotation Aggregation for Agentic Workflows
🔐 RBAC Safeguards for Owners
⚠️ Tool Use Interaction Deprecation

What’s New and Improved?

Self-Hosted Deployment Guide

Deepchecks now includes a dedicated deployment guide for Self-Hosted Enterprise, making it easier than ever to run Deepchecks entirely in your own infrastructure.

What this gives you:

Deploy Anywhere - Run Deepchecks inside your own environment with full control over networking, security, and data boundaries
Production-Ready by Design - Built for Kubernetes to support reliability, scalability, and real-world workloads
Clear Deployment Path - A structured walkthrough of the required components and how they fit together

This guide focuses on helping teams understand what’s required and why, without forcing them to become Deepchecks experts on day one. Whether you’re deploying on AWS or adapting to another environment, the core idea is simple: Deepchecks is designed to fit cleanly into your existing infrastructure.

Model Management for Self-Hosted Deployments

Self-hosted Deepchecks deployments now include first-class model management.

Unlike SaaS deployments, self-hosted environments don’t rely on Deepchecks-managed models. Instead, you explicitly configure the models you own and operate - and Deepchecks makes them available everywhere they’re needed.

What this enables:

Centralized Configuration - Manage all models in one place at the organization level
Broad Provider Support - OpenAI, Azure OpenAI, AWS Bedrock, and self-hosted endpoints via LiteLLM
Immediate Availability - Once added, models appear instantly across evaluations, applications, and preferences

Key details:

Models are validated and connectivity-tested before being saved
No models are configured by default - you stay fully in control

This ensures self-hosted users get the same smooth evaluation experience, without compromising ownership or security.

Children Annotation Aggregation (Agentic Evaluation)

Evaluating agentic systems requires more than looking at a single interaction in isolation. With Children Annotation Aggregation, parent interactions are now automatically annotated based on the quality of their child interactions.

Why this matters:

In agentic workflows, structure is hierarchical:

An Agent may invoke multiple tools and LLM calls
A Chain coordinates a sequence of steps
A Root represents the full end-to-end execution

A parent interaction might look fine on its own - but if its children fail, hallucinate, or misuse tools, the overall outcome isn’t truly successful.

What’s new:

Automatic Upward Propagation - Parent interactions inherit annotations based on their children
Configurable Rules - Define thresholds (e.g. “mark bad if any child is bad” or “if >50% fail”)
Type Filtering - Apply aggregation only to specific child interaction types (LLM, tool, etc.)

The result is a more honest, system-level view of agentic behavior - one that reflects what actually happened beneath the surface.

Example of children annotation aggregation where the interaction would get a Bad auto-annotation if any of its direct Tool or LLM children is annotated Bad
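The two example rules ("any child is bad" and "more than 50% fail") plus type filtering can be modeled with a single fraction threshold. This is a sketch of the logic only; the `Child` class, the `aggregate` function, and its parameters are assumptions, not the product's configuration format.

```python
from dataclasses import dataclass
from typing import List, Optional, Set


@dataclass
class Child:
    interaction_type: str  # e.g. "LLM" or "Tool"
    annotation: str        # "Good", "Bad", or "Unknown"


def aggregate(children: List[Child],
              bad_fraction_threshold: float = 0.0,
              types: Optional[Set[str]] = None) -> Optional[str]:
    """Return "Bad" if the share of Bad children exceeds the threshold.

    A threshold of 0.0 means "any bad child"; 0.5 means "more than 50%".
    Returns None when the rule does not fire.
    """
    relevant = [c for c in children if types is None or c.interaction_type in types]
    if not relevant:
        return None
    bad = sum(c.annotation == "Bad" for c in relevant)
    return "Bad" if bad / len(relevant) > bad_fraction_threshold else None


children = [Child("Tool", "Bad"), Child("LLM", "Good"), Child("LLM", "Good")]

any_bad = aggregate(children)                                  # one bad child is enough
majority_bad = aggregate(children, bad_fraction_threshold=0.5)  # 1 of 3 is not > 50%
llm_only = aggregate(children, types={"LLM"})                   # the bad Tool is ignored
```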

RBAC Safeguards for Owners

To protect organizational stability, we’ve added an important safeguard to role management.

What’s changed:

Every organization must always have at least one Owner
An Owner can't downgrade their own role unless another Owner already exists

This prevents accidental lockouts and ensures there’s always someone with the permissions required to manage configuration, users, and critical settings.
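The safeguard amounts to a check that a role change never leaves the organization without an Owner. A minimal sketch, assuming roles are held in a simple user-to-role mapping (the function name and the non-Owner role names are hypothetical):

```python
def can_change_role(roles: dict, user: str, new_role: str) -> bool:
    """Allow the change unless it would remove the last remaining Owner."""
    if roles.get(user) != "Owner" or new_role == "Owner":
        return True  # not a downgrade of an Owner, nothing to protect
    other_owners = [u for u, r in roles.items() if r == "Owner" and u != user]
    return len(other_owners) > 0


solo = {"dana": "Owner", "eli": "Admin"}
shared = {"dana": "Owner", "eli": "Owner"}
```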

Tool Use Interaction Deprecation

The legacy Tool Use interaction type has been deprecated.

What to know:

Tool Use is no longer a standalone interaction type
All agentic workflows now use the unified agentic interaction model: Root, Agent, Chain, Tool, and LLM

This change simplifies the interaction model and better reflects how modern agentic systems actually operate, while enabling more consistent evaluation and aggregation across hierarchies.

3 months ago

0.39.0 Release Notes

by Yaron Friedman

Deepchecks LLM Evaluation 0.39.0 Release

We're excited to announce version 0.39 of Deepchecks LLM Evaluation - introducing per-application access control, enhanced prompt properties with descendant data access, Google ADK integration, and CloudWatch metrics for SageMaker deployments. This release strengthens access control granularity, expands framework support, and provides deeper insights into complex agentic workflows.

Deepchecks LLM Evaluation 0.39.0 Release:

🔐 Per-Application Access Control (New RBAC Tier)
🌳 Prompt Properties: Access Descendant Data
🤖 Google ADK Integration
💡 AI Assistant for Documentation
📊 CloudWatch Integration for SageMaker Deployments
⚠️ Deprecations & API Changes

What's New and Improved?

Per-Application Access Control (New RBAC Tier)

Deepchecks now supports a second tier of access control that enables fine-grained application-level permissions. When enabled, users can be granted access to specific applications rather than having automatic access to all applications in the organization.

What this means:

Granular Access Control - Admins and Owners can specify which applications each user can access when inviting users or updating permissions
Enhanced Security - Users only see and can access applications they've been explicitly granted access to

This feature is particularly valuable for organizations with multiple teams or applications that require stricter data access boundaries.

To learn more about Deepchecks RBAC and the new per-application access tier, click here.

Prompt Properties: Access Descendant Data

Prompt properties can now access data from descendant (child) spans in hierarchical workflows, enabling more comprehensive evaluation of complex agentic use-cases.

What's new:

Descendant Scopes - Add scopes to your prompt properties to access data from child spans at different levels
Granular Control - Specify depth levels, filter by interaction types, and select which data fields to include from descendant spans
Flexible Configuration - Each scope can be configured independently, allowing you to gather different types of information from different parts of the hierarchy

Value:

This enhancement enables you to create highly tailored properties for agentic workflows. For example, you can evaluate whether all tool calls in an agentic interaction were appropriate by accessing all descendant Tool spans, or evaluate the quality of responses from an agent's direct LLM children by looking one level down at LLM spans only.

The Data component provides full flexibility over which data fields are sent to the LLM, ensuring token efficiency by only including what's needed for evaluation.
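The depth and type options of a descendant scope can be pictured as a filtered tree walk. The sketch below is a conceptual model with hypothetical names (`Span`, `descendants`, `max_depth`, `kinds`); it does not reflect how scopes are configured in the product.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Span:
    name: str
    kind: str  # e.g. "Agent", "LLM", "Tool"
    children: List["Span"] = field(default_factory=list)


def descendants(span: Span,
                max_depth: Optional[int] = None,
                kinds: Optional[Set[str]] = None,
                _depth: int = 1) -> List[Span]:
    """Collect descendant spans, optionally limited by depth and kind."""
    found = []
    for child in span.children:
        if kinds is None or child.kind in kinds:
            found.append(child)
        if max_depth is None or _depth < max_depth:
            found.extend(descendants(child, max_depth, kinds, _depth + 1))
    return found


agent = Span("agent", "Agent", [
    Span("plan", "LLM"),
    Span("search", "Tool", [Span("summarize", "LLM")]),
])

all_tools = descendants(agent, kinds={"Tool"})               # every Tool below the agent
direct_llms = descendants(agent, max_depth=1, kinds={"LLM"})  # one level down, LLM only
```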

For detailed information on configuring descendant data access, click here.

Google ADK Integration

Deepchecks now integrates seamlessly with Google ADK, enabling automatic trace collection and evaluation for Google ADK workflows.

What you get:

Automatic Instrumentation - Uses OTEL + OpenInference to automatically instrument Google ADK interactions
Rich Traces - Captures LLM calls, tool invocations, and agent-level spans within the graph
Simple Setup - Register with Deepchecks through a single register_dc_exporter call
Full Evaluation - Benefit from Deepchecks' observability, evaluation, and monitoring capabilities

Google ADK joins the popular frameworks we're already integrated with - CrewAI and LangGraph - expanding Deepchecks' support for the most widely-used agentic frameworks.

For installation and setup instructions, click here.

AI Assistant for Documentation

You can now get instant help directly inside our documentation with the new AI Assistant.

Just click “Ask AI” next to the search field on the documentation site and ask anything you need - from feature explanations and setup guides to “how do I…” questions and best practices.

Why it’s useful:

No more digging - Ask questions in plain language instead of searching through pages
Context-aware answers - Get relevant guidance across features, guides, and workflows
Faster onboarding & troubleshooting - Find what you need in seconds

Whether you’re exploring new capabilities or looking for a specific how-to, the AI Assistant helps you get answers instantly, right where you need them.

CloudWatch Integration for SageMaker Deployments

SageMaker customers can now send Deepchecks monitoring data and LLM evaluation metrics directly to AWS CloudWatch, consolidating model-monitoring data in one place.

What this enables:

Unified Monitoring - View all your Deepchecks metrics alongside your existing AWS dashboards and alarms
Quick Setup - Start getting metrics in CloudWatch within seconds after configuration
Seamless Integration - Use your existing AWS monitoring infrastructure without additional tooling

How it works:

Add the required IAM permissions to your SageMaker execution role
Enable CloudWatch metrics in Workspace Settings (Owner role required)
Metrics automatically flow to CloudWatch under the DeepChecksLLM namespace

This integration allows you to leverage your existing AWS monitoring workflows while benefiting from Deepchecks' comprehensive LLM evaluation capabilities.

For setup instructions and IAM policy details, click here.

Deprecations & API Changes (Public Endpoints)

As part of ongoing improvements to our API surface, we’re beginning the deprecation process for several public API endpoints. These endpoints will be removed after three upcoming milestones, following the same policy used for SDK deprecations.

Endpoints being deprecated:

GET /reference/listappversions
POST /reference/createappversion
POST /reference/publiccreateinteractions

SDK upgrade strongly recommended - The newer SDK versions include improvements aligned with the new endpoints and permissions model, and provide clearer, more helpful messaging.

Improved permission handling - With the updated access control model, API responses now return 403 in cases where:

the application does not exist, or
the user does not have permission to access the application.

The latest SDK improves this experience by explicitly surfacing “application not found” when applicable (the most common case).

3 months ago

0.38.0 Release Notes

by Yaron Friedman

We’re excited to announce version 0.38 of Deepchecks LLM Evaluation - introducing framework-agnostic data ingestion for agentic workflows, a more expressive Avoidance evaluation property, and a new Metric Viewer role. This release expands who can use Deepchecks, improves failure signal quality, and strengthens access control and platform clarity.

Deepchecks LLM Evaluation 0.38.0 Release:

🧩 Framework-Agnostic Agentic Data Ingestion
🚫 Avoided Answer → Avoidance (Enhanced Property)
🔐 New RBAC Role: Metric Viewer
🤖 New Models Available for LLM-Based Features
⚠️ SDK Deprecation Notice: send_spans()

What's New and Improved?

Framework-Agnostic Agentic Data Ingestion

Deepchecks now supports uploading agentic and complex workflow data via the SDK, without relying on automatic tracing from a supported framework. This enables full observability and evaluation for teams using custom frameworks, in-house orchestration layers, or unsupported agent runtimes.

You can manually structure and send sessions, traces and spans to Deepchecks while still benefiting from the full evaluation, observability, and root-cause analysis capabilities.

For a step-by-step guide, click here.

Avoided Answer → Avoidance (Enhanced Property)

The existing Avoided Answer property has been upgraded to Avoidance, providing richer and more actionable signals.

What changed:

Previously: a binary (0/1) score indicating whether an answer was avoided.
Now: a categorical property that distinguishes between:
- valid — the input was not avoided
- Specific avoidance modes (e.g. policy-based, lack of knowledge, safety constraints, and more)

This enables clearer diagnosis of why an answer was avoided and supports more meaningful aggregation and analysis across versions.
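A categorical property lends itself to a per-category breakdown rather than a single rate. The sketch below computes the share of each category for a version; apart from `valid`, the category labels are placeholders, since the exact label strings are not listed here.

```python
from collections import Counter
from typing import Dict, List


def avoidance_breakdown(values: List[str]) -> Dict[str, float]:
    """Return the fraction of interactions in each Avoidance category."""
    if not values:
        return {}
    counts = Counter(values)
    return {label: count / len(values) for label, count in counts.items()}


# Hypothetical per-interaction values for one version.
version_values = ["valid", "valid", "policy", "valid"]
breakdown = avoidance_breakdown(version_values)
```

Comparing these breakdowns across versions shows not only whether avoidance went up, but which mode drove the change.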

📌 Deprecation note: The legacy Avoided Answer property is deprecated but will continue to function for existing applications.

For full property definitions and migration details, click here.

New RBAC Role: Metric Viewer

We’ve added a new role to Deepchecks Role-Based Access Control: Metric Viewer.

This role is designed for stakeholders who need high-level insights without access to raw data.

Metric Viewer capabilities:

Read-only access to aggregated metrics and evaluation results
Access limited to the version level
❌ No access (via UI or SDK) to raw spans and traces
❌ No write permissions

This complements the existing Viewer role by enabling stricter data-access boundaries for security-sensitive environments.

To learn more about Deepchecks RBAC roles, click here.

New Models Available for LLM-Based Features

The following models are now supported:

GPT-5.1
Amazon Nova 2 Lite
Amazon Nova Pro

These models can be selected for evaluation, analysis, and automation features across the platform.

SDK Deprecation Notice: send_spans()

The SDK function send_spans() has been renamed to log_spans_file() to better reflect its behavior and usage.

📌 Deprecation notice: send_spans() is now deprecated and will remain supported for the next few releases. We recommend migrating to log_spans_file() to ensure forward compatibility.

Updated SDK documentation and examples reflect the new function name.

4 months ago

0.37.0 Release Notes

by Yaron Friedman

We’re excited to announce version 0.37 of Deepchecks LLM Evaluation — featuring enhanced Agent Execution Flow Graphs, flexible span-to-interaction mapping, comprehensive version-level failure mode analysis, CloudWatch integration, and improved hierarchical views. This release helps users navigate complex agentic workflows, consolidate failure insights, and monitor metrics with even greater clarity and control.

Deepchecks LLM Evaluation 0.37.0 Release:

🕸️ Enhanced Agent Execution Flow Graph
🔄 Map Spans to Custom Interaction Types
📊 Version-Level Failure Mode Summary
☁️ CloudWatch Integration for Metrics
📂 Collapsible Trees in Hierarchical Views

What's New and Improved?

Enhanced Agent Execution Flow Graph

The Agent Execution Graph now offers more interactivity and insights:

Clicking a node filters the Interactions screen to that specific node.
Hovering over nodes or edges shows metadata across all filtered runs.
Node and edge styles indicate consistency: solid lines mark nodes and edges that appear in all filtered runs, and dashed lines mark those that appear in only some.
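The solid/dashed distinction is a set computation over the filtered runs. A minimal sketch, assuming each run is represented as a set of `(source, target)` edges (the representation and function name are hypothetical):

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def edge_styles(runs: List[Set[Edge]]) -> Dict[Edge, str]:
    """Mark an edge solid if every run contains it, dashed otherwise."""
    all_edges = set().union(*runs)
    return {
        edge: "solid" if all(edge in run for run in runs) else "dashed"
        for edge in all_edges
    }


runs = [
    {("planner", "search"), ("search", "writer")},
    {("planner", "search")},
]
styles = edge_styles(runs)
```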

Map Spans to Custom Interaction Types

You can now assign specific span names to custom interaction types, overriding default mappings. This allows spans of the same kind to have distinct properties or auto-annotation rules. For example, you can map a “Reader” span and a “Writer” span to separate interaction types. For more details, click here.
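Conceptually, a name-based override is a lookup that takes precedence over the default kind-based mapping. The sketch below illustrates that precedence; both tables and the fallback of reusing the span kind are assumptions, not the product's configuration.

```python
# Hypothetical default mapping from span kind to interaction type.
DEFAULT_BY_KIND = {"LLM": "LLM", "TOOL": "Tool", "AGENT": "Agent"}

# Hypothetical overrides keyed by span name.
NAME_OVERRIDES = {"Reader": "Reader", "Writer": "Writer"}


def interaction_type(span_name: str, span_kind: str) -> str:
    """Use the name override if one exists, else the default for the kind."""
    if span_name in NAME_OVERRIDES:
        return NAME_OVERRIDES[span_name]
    return DEFAULT_BY_KIND.get(span_kind, span_kind)


reader_type = interaction_type("Reader", "LLM")
writer_type = interaction_type("Writer", "LLM")
other_type = interaction_type("Summarizer", "LLM")
```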

Version-Level Failure Mode Summary

In addition to property-level insights, Deepchecks now generates a version-level failure mode analysis report. It aggregates failures across all interaction types and selected properties, providing a consolidated view of dominant issues in the system. It is ideal for spotting cross-cutting problems, prioritizing improvements, and detecting regressions. Access this from the Version page via “Analyze version failures.” For more information, click here.

AWS CloudWatch Integration

Deepchecks can now send monitoring data and LLM evaluation metrics directly to AWS CloudWatch. SageMaker users benefit immediately, with metrics appearing in CloudWatch dashboards and alarms without any additional setup. For more details, click here.

Collapsible Trees in Hierarchical Views

Hierarchical use cases now include collapse buttons, making it easier to navigate and focus on relevant branches of your workflows.

5 months ago

0.36.0 Release Notes

by Yaron Friedman

We’re excited to announce version 0.36 of Deepchecks LLM Evaluation - featuring a powerful new Agent Graph visualization, enhanced filtering and analysis capabilities, and deeper AWS SageMaker configurability. This release helps users better understand complex agentic executions, gain sharper analytical control, and manage configurations with greater transparency.

Deepchecks LLM Evaluation 0.36.0 Release:

🕸️ New Agent Execution Graph
🎯 Advanced Filtering by Span Attributes
☁️ Dedicated SageMaker Owner Panel
⚡ New Processing Status: Partial

What's New and Improved?

Agent Execution Graph

Visualize how your agentic workflows actually execute - including branches, loops, and transitions - directly within Deepchecks.
The new Agent Execution Graph provides a dynamic graph-style view of your pipeline runs, built automatically from your existing span metadata (no instrumentation changes required). It’s highly useful for agentic frameworks like LangGraph and CrewAI, helping you understand real runtime behavior at a glance.

You can find the Agent Execution Graph on the Overview page, within the Sessions tab, whenever you’re viewing an agentic use case that includes relevant span data.

Advanced Filtering by Span Attributes

You can now apply granular filters based on span-level attributes, such as span name, metadata fields, and more, directly in the Overview screen.
This enhancement gives analysts more precise control over their investigations, allowing them to isolate behaviors, patterns, or anomalies tied to specific spans or framework metadata.

Dedicated SageMaker Owner Panel

In Deepchecks in SageMaker, the user designated as the Owner now has access to a dedicated Owner Panel with advanced permissions to configure organization- and application-level settings directly through the UI.
This empowers teams to fine-tune Deepchecks behavior across environments - without code changes or redeployment.

Using the Owner Panel on SageMaker →

New Processing Status: Partial

We’ve added a third processing status: Partial.
This new state appears when an interaction or session gets stuck or fails during a key processing phase, helping users distinguish between ongoing runs and those that will not progress without user action.
This improvement brings more transparency and reliability, especially in non-SaaS or self-managed deployments.

5 months ago

0.35.0 Release Notes

by Yaron Friedman

We’re excited to announce version 0.35 of Deepchecks LLM Evaluation - packed with new integrations, evaluation properties, and expanded documentation. This release strengthens our support for agentic workflows, improves evaluation flexibility, and deepens our collaboration with AWS SageMaker users.

Deepchecks LLM Evaluation 0.35.0 Release:

🚀 New LangGraph Integration
🧠 New Research-Backed Evals for Agentic Workflows
⚙️ Configurable Number of Judges for Custom Prompt Properties
🤖 New Model Support: Claude Sonnet 4.5
☁️ New AWS SageMaker Documentation

What's New and Improved?

New LangGraph Integration

Seamlessly connect LangGraph applications to Deepchecks for effortless data upload and evaluation. With this integration, you can automatically log traces, spans, and metadata from LangGraph workflows and visualize and evaluate them directly in Deepchecks. Learn more here.

New Research-Backed Evals for Agentic Workflows

We’ve added two new built-in evaluation properties tailored for agentic applications: Reasoning Integrity and Instruction Following. These properties are based on cutting-edge research and provide deeper insight into reasoning quality, task adherence, and logical consistency across agent runs. Learn more here.

Example of an Instruction Following score on an LLM span

Configurable Number of Judges for Custom Prompt Properties

You can now configure the number of judges (1, 3, or 5) used for custom prompt-based evaluations. This feature gives you more control over evaluation robustness and cost-performance trade-offs. Learn more here.

The "# of Judges" configuration on the "Create Prompt Property" screen
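How the judges' outputs are combined is not described here. Purely as an illustration of why odd judge counts are convenient, the sketch below takes a majority vote over pass/fail verdicts; the voting scheme, function name, and verdict labels are all assumptions.

```python
from collections import Counter
from typing import List

ALLOWED_JUDGE_COUNTS = (1, 3, 5)  # options named in these release notes


def majority_verdict(verdicts: List[str]) -> str:
    """Return the most common verdict (assumed aggregation, not documented)."""
    if len(verdicts) not in ALLOWED_JUDGE_COUNTS:
        raise ValueError(f"expected 1, 3, or 5 verdicts, got {len(verdicts)}")
    return Counter(verdicts).most_common(1)[0][0]


three_judges = majority_verdict(["pass", "fail", "pass"])
one_judge = majority_verdict(["fail"])
```

With two possible verdicts and an odd number of judges, a tie cannot occur; more judges cost more LLM calls per evaluation.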

New Model Support: Claude Sonnet 4.5

We’ve added support for Claude Sonnet 4.5 as a model option for your prompt properties. This enables you to leverage Anthropic’s latest model for more nuanced, high-quality evaluations within your existing Deepchecks workflows.

New AWS SageMaker Documentation

We’ve added dedicated documentation for users running Deepchecks on AWS SageMaker. The new guides explain how to effectively use LLM-based features and how to optimize your DPU utilization in SageMaker environments.
Using LLM Features on SageMaker →
Optimizing DPUs on SageMaker →

6 months ago

0.34.0 Release Notes

by Yaron Friedman

This release focuses on flexibility, observability, and deeper insights: evaluate multi-agent workflows, simplify trace logging, tailor performance to your setup, and keep control with new filters and organization-wide logs.

Deepchecks LLM Evaluation 0.34.0 Release:

🤖 Advanced support of agent evaluation (including nested interactions, new interaction types, child-aware properties and CrewAI integration)
📡 Trace logging with instrumentation
🎛️ Configurable processing speed for AWS SageMaker & On-Prem deployments
🔖 Save filter presets for quick navigation
📊 Version Comparison CSV Export
🗂️ Organization-level logs across applications

What's New and Improved?

Advanced support of agent evaluation

We’ve significantly expanded our observability and evaluation support for agentic workflows - across frameworks, tracing methods, and evaluation:

New interaction types (Root, Agent, LLM, Tool), each with research-backed built-in properties and auto-annotation configurations.
New built-in properties tailored for agents and tools, including Plan Efficiency, Tool Coverage, and Tool Completeness.
Nested spans & interactions, with properties that leverage child data for richer evaluation.
Enhanced UI for single-trace view, showing run status, logged system metrics, attributes, and events - making debugging and analysis more transparent.
Seamless framework support: thanks to our new OpenTelemetry tracing (see the following item), agent frameworks like CrewAI can now be logged and evaluated directly.

Example of the new single Agent span view
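To make the idea of nested spans and child-aware evaluation concrete, here is a minimal sketch of a span tree using the four interaction types named above. The `Span` class and the `tool_call_count` metric are hypothetical illustrations; they are not the Deepchecks data model or any of its built-in properties.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Span:
    name: str
    kind: str  # one of "Root", "Agent", "LLM", "Tool"
    children: List["Span"] = field(default_factory=list)


def descendants(span: Span) -> Iterator[Span]:
    """Yield every span nested under `span`, depth first."""
    for child in span.children:
        yield child
        yield from descendants(child)


def tool_call_count(span: Span) -> int:
    """Hypothetical child-aware metric: number of Tool spans under `span`."""
    return sum(1 for d in descendants(span) if d.kind == "Tool")


planner = Span("planner", "Agent", [
    Span("draft plan", "LLM"),
    Span("search", "Tool"),
    Span("fetch page", "Tool"),
])
root = Span("run", "Root", [planner, Span("final answer", "LLM")])

tools_under_root = tool_call_count(root)
tools_under_planner = tool_call_count(planner)
```

The point of the sketch is that a value computed on a parent (here the Agent or Root span) depends on data that only exists on its children.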

Trace logging with instrumentation

We’ve just introduced native support for trace logging via OpenTelemetry and OpenInference 🎉.

Now, you can automatically capture and centralize traces and spans from your LLM and agentic frameworks into Deepchecks - no more manual logging required. If you’re already using frameworks with built-in instrumentors (like CrewAI, LangGraph, and others), setup is seamless and requires only a few lines of code for configuration.

This makes it easy to collect rich, structured trace data from your agents and pipelines, and immediately make it available for:

Evaluation: Run properties and analyses directly on trace-level data.
Monitoring: Keep track of performance across workflows.
Debugging: Quickly drill down into problematic spans, traces and versions.

This ensures you get the deepest possible visibility with minimal effort.
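To illustrate what instrumentation automates compared with manual logging, here is a toy, standard-library-only stand-in. It is not the OpenTelemetry, OpenInference, or Deepchecks API; the `traced` decorator and the `RECORDED` list are hypothetical names that mimic how an instrumentor wraps calls and links each span to its parent.

```python
import functools
import itertools
from contextvars import ContextVar

# Tracks the id of the span that is currently open (None at top level).
_current_span = ContextVar("current_span", default=None)
_ids = itertools.count(1)

# Stand-in for an exporter: finished spans are appended here.
RECORDED = []


def traced(kind):
    """Wrap a function so each call is recorded as a span with a parent link."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span = {
                "id": next(_ids),
                "parent": _current_span.get(),
                "name": func.__name__,
                "kind": kind,
            }
            token = _current_span.set(span["id"])
            try:
                return func(*args, **kwargs)
            finally:
                _current_span.reset(token)
                RECORDED.append(span)  # recorded when the span ends
        return wrapper
    return decorator


@traced("Tool")
def lookup(city):
    return f"Sunny in {city}"


@traced("Agent")
def weather_agent(city):
    return lookup(city)


weather_agent("Paris")
```

The application code (`lookup`, `weather_agent`) contains no logging calls; the parent-child structure is captured by the wrapper, which is the role a real instrumentor plays for a supported framework.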

Configurable Processing Speed for Deepchecks On-Prem & SageMaker products

For our non-SaaS deployments, evaluation speed depends on your own LLM capacity. To reduce rate limits and bottlenecks, you can now choose between three processing modes - fast, balanced, or reduced load - so evaluations always complete smoothly at the pace that fits your setup.

Changing the processing speed on the Workspace Settings screen

Save Filter Presets for Quick Navigation

Tired of reapplying the same filters and sorts? You can now save your setup as a preset on the Interactions screen, then reload it with a single click. This makes it effortless to jump back to your most useful views and keep analyses consistent across sessions.

Loading a saved preset on the Interactions screen

Version Comparison CSV Export

Need to take your version comparisons beyond the UI? You can now export comparison results directly to CSV for deeper analysis.

Review all selected versions side by side.
Access details like overall performance, system metrics, and property-level breakdowns.
Extend analysis using your own tools - filter, aggregate, or merge with external data sources.

This makes it simple to share results across teams and continue working in the format that best fits your workflow.
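As an example of extending the analysis with your own tools, here is a minimal sketch that computes per-property score changes between two versions. The column names (`version`, `property`, `avg_score`) and the sample rows are hypothetical; check the header of your actual export and adjust accordingly.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for an exported file; the real columns may differ.
SAMPLE_EXPORT = """version,property,avg_score
v1,Relevance,0.82
v1,Fluency,0.90
v2,Relevance,0.88
v2,Fluency,0.86
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(text)))


def score_deltas(rows, base, candidate):
    """Return {property: candidate score - base score} for shared properties."""
    by_property = defaultdict(dict)
    for row in rows:
        by_property[row["property"]][row["version"]] = float(row["avg_score"])
    return {
        prop: round(scores[candidate] - scores[base], 4)
        for prop, scores in by_property.items()
        if base in scores and candidate in scores
    }


deltas = score_deltas(load_rows(SAMPLE_EXPORT), base="v1", candidate="v2")
```

For a real export, replace `io.StringIO(text)` with an opened file, for example `open(path, newline="")`.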

Export to CSV option on the Versions screen

Organization-level logs across applications

Previously, logs were only available per application. Now, you can access a centralized log view for your entire organization - covering all applications - in the Workspace Settings screen (top-right).

"View Logs" on the organization level on the Workspace Settings screen

7 months ago

0.33.0 Release Notes

by Yaron Friedman

This release focuses on clarity, speed, and smarter insights: visualize property performance over time, optimize evaluation guidelines with AI, track processing status at a glance, and jump straight to relevant data with new hyperlinks and filters.

Deepchecks LLM Evaluation 0.33.0 Release:

🤖 AI-assisted optimization for property guidelines
📈 Property graphs view in both Evaluation and Production
⏳ Processing status indicators for interactions and sessions
🔍 Filter-by-click from score-breakdown component
❓ Reasoning explanations for N/A properties
🔗 Hyperlinked examples in Property Failure Mode Analysis

What's New and Improved?

AI-Assisted Optimization of Property Guidelines

Writing robust prompt guidelines can be hard - especially without prompt engineering experience. Now, after you fill in essential fields (property name, guidelines, interaction steps), an Optimize button appears. Clicking it opens an expansion panel:

Your current input is pre-filled as “Additional Guidelines.”
All relevant context (name, description, categories, examples, steps) is sent to a research-backed LLM, which returns polished, AI-generated Suggested Guidelines—fully editable before saving.

You can save to overwrite your draft, or cancel to retain it. And if you adjust your draft, Optimize becomes available again for further refinement.

Why it matters: More context means smarter suggestions—so the richer your original details, the better the AI helps refine them.

See more details here: https://llmdocs.deepchecks.com/docs/improve-guidelines-with-ai

Property Graphs View in Evaluation & Production

We’ve added a versatile graphs option to the Overview screen:

Evaluation environment: Visualize property score distributions, helping you spot outliers or skewed metrics at a glance.
Production environment: Track average property scores over time. Compare these alongside the overall production score to pinpoint which properties most influence trends.

This gives you a clearer, data-driven view into what’s driving performance.
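For intuition about the Production graph, the sketch below shows the kind of aggregation it implies: grouping property scores by day and averaging each group. The record shape (date string, score) and the daily granularity are assumptions made for illustration, not a description of how Deepchecks computes its graphs.

```python
from collections import defaultdict
from statistics import mean


def daily_average(records):
    """Average scores per day from (ISO date string, score) pairs."""
    buckets = defaultdict(list)
    for day, score in records:
        buckets[day].append(score)
    # ISO dates sort correctly as plain strings.
    return {day: mean(scores) for day, scores in sorted(buckets.items())}


# Hypothetical scores for a single property.
records = [
    ("2025-01-02", 0.25),
    ("2025-01-01", 0.5),
    ("2025-01-01", 1.0),
    ("2025-01-02", 0.75),
]
trend = daily_average(records)
```

Running the same aggregation for each property and for the overall score gives the series you would compare to see which property tracks a change in the overall trend.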

Processing Status Indicators for Interactions & Sessions

Keep tabs on what’s done and what’s still running:

In Progress: Analysis steps (property calculations, annotations, topic inference, similarity checks, etc.) are still underway.
Completed: Everything’s finished, and results are ready.

Where to see it:

Single Interaction View: Status icon at the top denotes real-time progress.

An interaction with a "completed" processing status (can be seen on the right of the screen)

Interactions List: Each row shows an icon (with hover text) to quickly assess readiness.
Sessions List: Each session displays a summary status—completed only when all interactions are done.

This way, you always know exactly what’s ready to review.
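The session rule above (completed only when all interactions are done) can be sketched as follows. The `Status` enum and `session_status` function are hypothetical names, and treating an empty session as still in progress is an assumption.

```python
from enum import Enum


class Status(Enum):
    IN_PROGRESS = "In Progress"
    COMPLETED = "Completed"


def session_status(interaction_statuses):
    """A session is Completed only when every interaction is Completed."""
    statuses = list(interaction_statuses)
    # Assumption: a session with no interactions yet counts as In Progress.
    if statuses and all(s is Status.COMPLETED for s in statuses):
        return Status.COMPLETED
    return Status.IN_PROGRESS


done = session_status([Status.COMPLETED, Status.COMPLETED])
pending = session_status([Status.COMPLETED, Status.IN_PROGRESS])
```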

See more details here: https://llmdocs.deepchecks.com/docs/interaction-and-session-completion-status

Click-to-Filter from Score Breakdown

In the Score Breakdown component, clicking any property or annotation reason now instantly filters the Interactions screen to show only the relevant items. This makes digging into causes intuitive and fast.

Reasoning for N/A Properties

When a property is marked N/A, you’ll now see a brief explanation of why it couldn’t be calculated. Over the next few weeks, this reasoning will be extended to cover more property types, offering transparency and aiding debugging.

Hyperlinked Examples in Failure Mode Analysis

Failure Mode Analysis now outputs interactive examples - every example includes a hyperlink that opens the specific interaction in a new window. This makes deep-dives from summaries directly actionable.

Failure mode analysis example with a hyperlink to the interaction