Retrieval Use-Case Properties

Overview of the classification of documents and built-in retrieval properties in the Deepchecks app

Introduction

In a RAG (Retrieval-Augmented Generation) pipeline, the retrieval step is pivotal, providing essential context that shapes the subsequent stages and ultimately the final output. To ensure the effectiveness of this step, we have introduced specialized properties for thorough evaluation. The process for each interaction is divided into two key stages:

  1. Document Classification: An LLM is used to classify retrieved documents into distinct quality categories—Platinum, Gold, or Irrelevant—based on their relevance and completeness.
  2. Retrieval Property Calculation: Using the classifications from the first stage, various properties are computed for each interaction to assess performance and identify potential areas for enhancement.

📘

Enabling Classification and Property Calculation

If you don't see document classification taking place, it means both the classification process and property calculations are disabled to prevent unintended use. To enable these features, navigate to the "Edit Application" flow in the "Manage Applications" screen.

Enabling/Disabling the Classification on the "Edit Application" Window


Document Classification

Deepchecks evaluates the quality of documents by looking at how well they cover the information needed to answer the user’s input. Every input usually requires several pieces of information to be fully addressed. Based on how much of this information a document provides, it is classified into one of three categories:

  • Platinum: The document includes all the necessary information to fully and confidently answer the user’s input. No important details are missing, and the document alone is enough to provide a complete response.
  • Gold: The document includes some relevant information that helps answer the input but does not cover everything. It contributes useful parts of the answer but would need to be combined with other documents for a complete response.
  • Irrelevant: The document does not provide any of the information needed to answer the input. It does not help in understanding or addressing the user’s question.
Example of Document Classification for a Single Interaction

The bottom section of each document, in its expanded form, states the reasons for its classification.
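
As a rough illustration of the logic behind these categories (in the product the judgement is made by an LLM; the "facts" below are purely hypothetical stand-ins for the information a document covers), the classification can be thought of as follows:

```python
# Illustrative sketch only: Deepchecks performs this classification with an LLM,
# not with exact fact matching. "required_facts" and "facts_in_document" are
# hypothetical stand-ins for the information needed to answer the input.
def classify(required_facts: set, facts_in_document: set) -> str:
    covered = required_facts & facts_in_document
    if covered == required_facts:
        return "Platinum"    # everything needed for a complete answer
    if covered:
        return "Gold"        # part of the answer; needs other documents
    return "Irrelevant"      # none of the needed information

print(classify({"pricing", "refund policy"}, {"refund policy", "shipping times"}))  # Gold
```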

❗️

"Unknown" Documents

If classification fails or is disabled, documents are marked as "unknown" and shown in the default interaction color.

This classification forms the basis for calculating retrieval properties, allowing the quality of the retrieval process to be measured both at the individual interaction level and across the entire version.


Retrieval Properties

Retrieval properties are based on the Platinum, Gold, and Irrelevant classifications. They offer insights into the effectiveness of retrieval, helping to assess performance and spot areas for improvement in the RAG pipeline.

Normalized DCG (nDCG)

Normalized Discounted Cumulative Gain (nDCG) is a score ranging from 0 to 1. It evaluates how well the ranking of the retrieved documents matches an ideal ordering by relevance, normalizing the discounted gain against that ideal. Higher scores indicate that the most relevant documents appear earlier in the results.
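
A minimal sketch of the calculation, assuming Platinum, Gold, and Irrelevant map to relevance grades of 2, 1, and 0 respectively (an illustrative choice, not necessarily the grading Deepchecks uses):

```python
from math import log2

# Illustrative relevance grades for each classification.
GRADE = {"Platinum": 2, "Gold": 1, "Irrelevant": 0}

def dcg(grades):
    # Each grade is discounted by the log of its rank (rank 1 -> log2(2)).
    return sum(g / log2(rank + 2) for rank, g in enumerate(grades))

def ndcg(labels):
    grades = [GRADE[label] for label in labels]
    ideal = sorted(grades, reverse=True)  # the best possible ordering
    return dcg(grades) / dcg(ideal) if dcg(ideal) > 0 else 0.0

# A Platinum document ranked below an Irrelevant one lowers the score (~0.76).
print(ndcg(["Gold", "Irrelevant", "Platinum"]))
```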

Retrieval Precision

Retrieval precision is a score ranging from 0 to 1. This property calculates the proportion of Gold and Platinum documents among all retrieved documents. High retrieval precision indicates that a significant portion of the retrieved content is relevant, showcasing the quality of the retrieval step.
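
Interpreting that definition directly, a sketch of the calculation for a single interaction (document labels are illustrative):

```python
# Minimal sketch: the share of Gold and Platinum documents among all
# retrieved documents for a single interaction.
def retrieval_precision(labels):
    if not labels:
        return 0.0
    relevant = sum(label in ("Gold", "Platinum") for label in labels)
    return relevant / len(labels)

print(retrieval_precision(["Platinum", "Irrelevant", "Gold", "Irrelevant"]))  # 0.5
```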

# Distracting

This metric identifies the number of irrelevant documents that are mistakenly ranked above relevant ones (Gold or Platinum). A high number of distracting documents suggests potential issues in ranking relevance, which may negatively impact the pipeline's output quality.
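
A sketch of how such a count could be computed, assuming a document is distracting when it is Irrelevant yet ranked above at least one Gold or Platinum document:

```python
# Minimal sketch: count Irrelevant documents ranked above at least one
# relevant (Gold or Platinum) document. Labels are given in ranking order.
def distracting_count(labels):
    relevant_positions = [i for i, label in enumerate(labels) if label in ("Gold", "Platinum")]
    if not relevant_positions:
        return 0
    last_relevant = relevant_positions[-1]
    return sum(label == "Irrelevant" for label in labels[:last_relevant])

# Two Irrelevant documents appear above relevant ones -> 2 distracting.
print(distracting_count(["Irrelevant", "Gold", "Irrelevant", "Platinum"]))
```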

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank (MRR) is a score ranging from 0 to 1. It is the reciprocal of the rank of the first relevant (Gold or Platinum) document in the results. A higher MRR indicates that relevant content appears earlier in the retrieval order, contributing to a more efficient retrieval pipeline.
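
A sketch of the calculation for a single interaction, following the definition above (document labels are illustrative):

```python
# Minimal sketch: the reciprocal of the rank of the first Gold or Platinum
# document (0 if nothing relevant was retrieved).
def reciprocal_rank(labels):
    for rank, label in enumerate(labels, start=1):
        if label in ("Gold", "Platinum"):
            return 1 / rank
    return 0.0

# First relevant document sits at rank 3 -> 1/3.
print(reciprocal_rank(["Irrelevant", "Irrelevant", "Gold"]))
```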