Retrieval Use-Case Properties

Overview of the classification of documents and built-in retrieval properties on the Deepchecks app

Introduction

In a RAG (Retrieval-Augmented Generation) pipeline, the retrieval step is pivotal, providing essential context that shapes the subsequent stages and ultimately the final output. To ensure the effectiveness of this step, we have introduced specialized properties for thorough evaluation. The process for each interaction is divided into two key stages:

  1. Document Classification: An LLM is used to classify retrieved documents into distinct quality categories—Platinum, Gold, or Irrelevant—based on their relevance and completeness.
  2. Retrieval Property Calculation: Using the classifications from the first stage, various properties are computed for each interaction to assess performance and identify potential areas for enhancement.

📘

Enabling Classification and Property Calculation

If you don't see document classification taking place, it means both the classification process and property calculations are disabled to prevent unintended use. To enable these features, navigate to the "Edit Application" flow in the "Manage Applications" screen.

Enabling/Disabling the Classification on the "Edit Application" Window


Document Classification

Deepchecks evaluates the quality of documents by looking at how well they cover the information needed to answer the user’s input. Every input usually requires several pieces of information to be fully addressed. Based on how much of this information a document provides, it is classified into one of three categories:

  • Platinum: The document includes all the necessary information to fully and confidently answer the user’s input. No important details are missing, and the document alone is enough to provide a complete response.
  • Gold: The document includes some relevant information that helps answer the input but does not cover everything. It contributes useful parts of the answer but would need to be combined with other documents for a complete response.
  • Irrelevant: The document does not provide any of the information needed to answer the input. It does not help in understanding or addressing the user’s question.
Example of Document Classification for a Single Interaction

The bottom section of each document, in its expanded form, states the reasons for its classification.

❗️

"Unknown" Documents

Note: If classification fails or is disabled, documents will be marked as "unknown" and shown in the default interaction color.

This classification forms the basis for calculating retrieval properties, allowing the quality of the retrieval process to be measured both at the individual interaction level and across the entire version.
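To make this concrete, the intermediate output of the classification stage can be pictured as a relevance label attached to each retrieved document, which the property calculations then consume. The following Python sketch is purely illustrative; the class and field names are hypothetical and are not part of the Deepchecks SDK:

```python
from dataclasses import dataclass
from enum import Enum


class DocLabel(Enum):
    PLATINUM = "platinum"      # fully answers the input on its own
    GOLD = "gold"              # contributes part of the answer
    IRRELEVANT = "irrelevant"  # contributes nothing to the answer
    UNKNOWN = "unknown"        # classification failed or was disabled


@dataclass
class RetrievedDoc:
    text: str
    rank: int        # 1-based position in the retrieval results
    label: DocLabel  # assigned by the LLM classification stage


def is_relevant(doc: RetrievedDoc) -> bool:
    """Gold and Platinum documents count as relevant for the retrieval properties."""
    return doc.label in (DocLabel.GOLD, DocLabel.PLATINUM)
```

The property sketches in the next section work directly on ordered lists of such labels.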


Retrieval Properties

Retrieval properties are based on the Platinum, Gold, and Irrelevant classifications. They offer insights into the effectiveness of retrieval, helping to assess performance and spot areas for improvement in the RAG pipeline.

Normalized DCG (nDCG)

Normalized Discounted Cumulative Gain (nDCG) is a score ranging from 0 to 1. It evaluates the relevance of ranked retrieval results while normalizing scores to account for the ideal ranking. This metric helps determine how effectively the retrieved documents maintain their relevance order, with higher scores representing better alignment with an ideal ordering.
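To illustrate the idea, here is a minimal nDCG computation over a ranked list of document labels. The gain values (2 for Platinum, 1 for Gold, 0 for Irrelevant) are an assumption made for the sketch, not necessarily the exact weights Deepchecks uses:

```python
import math

GAIN = {"platinum": 2, "gold": 1, "irrelevant": 0}  # assumed gain per label


def dcg(labels):
    """Discounted cumulative gain: each gain is discounted by log2(rank + 1)."""
    return sum(GAIN[label] / math.log2(rank + 1) for rank, label in enumerate(labels, start=1))


def ndcg(labels):
    """Normalize by the DCG of the ideal ordering (most relevant documents first)."""
    ideal_dcg = dcg(sorted(labels, key=GAIN.get, reverse=True))
    return dcg(labels) / ideal_dcg if ideal_dcg > 0 else 0.0


# A Platinum document ranked below an Irrelevant one lowers the score (~0.67 here).
print(ndcg(["irrelevant", "platinum", "gold"]))
```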

Retrieval Coverage

Retrieval coverage is a score between 0 and 1. It evaluates whether the retrieved gold and platinum documents provide all the necessary information to fully address the input query. Since a query may require multiple distinct pieces of information, the evaluation considers each part individually. A high retrieval coverage score indicates that most of the required information is present in the retrieved documents.
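The scoring itself is LLM-driven, but the aggregation can be sketched as follows, assuming (purely for illustration) that a fully addressed piece of information counts as 1, a partially addressed one as 0.5, and an unaddressed one as 0:

```python
# Assumed weights for how well each required piece of information is covered
# by the retrieved Gold/Platinum documents; the real assessment is LLM-based.
COVERAGE_WEIGHT = {"fully": 1.0, "partially": 0.5, "not": 0.0}


def retrieval_coverage(sub_query_statuses):
    """Average coverage across the distinct pieces of information the input requires."""
    if not sub_query_statuses:
        return 0.0
    return sum(COVERAGE_WEIGHT[s] for s in sub_query_statuses) / len(sub_query_statuses)


# The input needed three pieces of information: two fully covered, one missing.
print(retrieval_coverage(["fully", "fully", "not"]))  # ~0.67
```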

Explainability

The Retrieval Coverage property includes attached reasoning, showing a breakdown of the input into sub-queries and assessing whether each is fully, partially, or not addressed by the retrieved information. You can also review the Gold and Platinum documents to see what information is covered or missing.


Retrieval Precision

Retrieval precision is a score ranging from 0 to 1. This property calculates the proportion of Gold and Platinum documents among all retrieved documents. High retrieval precision indicates that a significant portion of the retrieved content is relevant, showcasing the quality of the retrieval step.
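This definition maps directly to a short computation; a minimal sketch, with labels represented as plain strings:

```python
def retrieval_precision(labels):
    """Fraction of retrieved documents classified as Gold or Platinum."""
    if not labels:
        return 0.0
    relevant = sum(1 for label in labels if label in ("gold", "platinum"))
    return relevant / len(labels)


# Two of the four retrieved documents are relevant.
print(retrieval_precision(["platinum", "irrelevant", "gold", "irrelevant"]))  # 0.5
```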

#Distracting

This metric identifies the number of irrelevant documents that are mistakenly ranked above relevant ones (Gold or Platinum). A high number of distracting documents suggests potential issues in ranking relevance, which may negatively impact the pipeline's output quality.
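One plausible way to compute this count is to tally the Irrelevant documents that appear above the lowest-ranked Gold or Platinum document; the sketch below follows that reading and, as an assumption, returns 0 when no relevant documents were retrieved:

```python
def num_distracting(labels):
    """Count Irrelevant documents ranked above at least one Gold/Platinum document."""
    relevant_positions = [i for i, label in enumerate(labels) if label in ("gold", "platinum")]
    if not relevant_positions:
        return 0  # assumption: nothing counts as distracting if no relevant docs exist
    last_relevant = max(relevant_positions)
    return sum(1 for label in labels[:last_relevant] if label == "irrelevant")


# The Irrelevant documents at ranks 1 and 3 sit above the Platinum document at rank 4.
print(num_distracting(["irrelevant", "gold", "irrelevant", "platinum", "irrelevant"]))  # 2
```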

Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank (MRR) is a score ranging from 0 to 1. It is based on the rank of the first relevant document, be it Gold or Platinum, and reflects how early relevant content appears in the results. A higher MRR indicates that relevant content is retrieved earlier in the process, contributing to a more efficient retrieval pipeline.
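Per interaction, the property reduces to the reciprocal rank of the first Gold or Platinum document; averaging it across interactions yields the mean. A minimal sketch, which (as an assumption) returns 0 when no relevant document was retrieved:

```python
def reciprocal_rank(labels):
    """Reciprocal of the 1-based rank of the first Gold or Platinum document."""
    for rank, label in enumerate(labels, start=1):
        if label in ("gold", "platinum"):
            return 1.0 / rank
    return 0.0  # assumption: no relevant document retrieved


# The first relevant document appears at rank 3.
print(reciprocal_rank(["irrelevant", "irrelevant", "gold"]))  # ~0.33
```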

#Retrieved Docs

This property shows the total number of documents retrieved in the information retrieval process (i.e., the K in top-K). It is primarily useful for filtering and root cause analysis (RCA) within the system.


Interpreting Retrieval Properties

🔍 Understanding Low Normalized DCG (nDCG)

A low normalized DCG score typically indicates that the ranking of the retrieved documents is suboptimal. This could reflect a problem with the ranking model’s ability to order documents by relevance. However, the cause can vary depending on other metrics. Below are key patterns and how to interpret them:

📉 Case 1: Low nDCG + High #Distracting

This combination suggests that relevant documents are present in the retrieval, but the ranking model failed to prioritize them correctly. Distracting documents (those unrelated but still ranked high) are taking precedence.

🔎 What to do:

  • Investigate your ranking or re-ranking strategy.
  • Check if the query is ambiguous.
  • Explore methods to better prioritize Gold and Platinum documents.


📉 Case 2: Low nDCG + Low #Distracting

Here, few documents are distracting, but the ranking is still poor.

This usually means:

  • Either relevant documents don’t exist in the database, or
  • The retrieval depth (i.e., #Retrieved Docs) is too shallow to reach them.

🔎 What to do:

  • If #Retrieved Docs is low, increase it and check if relevant documents emerge.
  • If #Retrieved Docs is already high, consider whether the information simply doesn’t exist in the database.
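The two patterns above can be folded into a simple triage rule over an interaction's properties. The sketch below is illustrative only, and the thresholds are placeholders you would tune to your own data:

```python
def triage_low_ndcg(ndcg, num_distracting, num_retrieved,
                    ndcg_threshold=0.5, distracting_threshold=2, shallow_k=5):
    """Rough triage of a low-nDCG interaction; all thresholds are illustrative."""
    if ndcg >= ndcg_threshold:
        return "nDCG looks acceptable"
    if num_distracting >= distracting_threshold:
        return "Case 1: relevant docs were retrieved but ranked poorly - revisit (re-)ranking"
    if num_retrieved <= shallow_k:
        return "Case 2: retrieval may be too shallow - try increasing #Retrieved Docs"
    return "Case 2: the required information may be missing from the database"


print(triage_low_ndcg(ndcg=0.3, num_distracting=4, num_retrieved=10))  # Case 1
```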

📚 Detecting Retrieval Overload

Retrieval overload occurs when the information retrieval process successfully fetches enough relevant information to answer the query, but also retrieves too many irrelevant documents. This can be detected using two properties: retrieval precision and retrieval coverage.

When retrieval precision is low (i.e., few of the retrieved documents are relevant), but retrieval coverage is high (i.e., enough relevant information was retrieved to fully address the query), it indicates retrieval overload.

🔎 Common causes:

  • Choosing too large a value for K. Consider reducing K to retrieve fewer, more relevant documents.
  • A poor ranking mechanism. Consider improving the ranking or reranking strategy.
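A minimal sketch of this check, with placeholder thresholds that you would adjust to your own data:

```python
def is_retrieval_overload(retrieval_precision, retrieval_coverage,
                          precision_threshold=0.3, coverage_threshold=0.8):
    """Flag interactions where the needed information was retrieved (high coverage)
    but buried among many irrelevant documents (low precision). Thresholds are illustrative."""
    return retrieval_precision < precision_threshold and retrieval_coverage >= coverage_threshold


print(is_retrieval_overload(retrieval_precision=0.2, retrieval_coverage=0.9))  # True
```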

❌ Detecting Wrongly Avoided Answers

When building a Q&A system, it’s not always obvious whether a bot’s decision to avoid answering a question is intentional or problematic. In some cases, if the relevant information was not retrieved, it’s desirable for the bot to refrain from answering rather than hallucinate a response. In other cases, if the information was available but the bot still avoided the question, this may signal an issue.

To address this, we use two key properties:

  • Avoided Answer: Indicates whether the bot chose not to answer.
  • Retrieval Coverage: Measures whether the retrieved documents contain the full information needed to answer the query.

By combining these properties, we can distinguish between correctly and wrongly avoided answers—enabling better debugging and refinement of the system’s behavior.
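As a rough illustration, the combination can be expressed as a single condition; the coverage threshold below is a placeholder, and the property values would come from the evaluated interaction:

```python
def is_wrongly_avoided(avoided_answer, retrieval_coverage, coverage_threshold=0.8):
    """An avoided answer is suspicious when the retrieved documents already
    contained the information needed to answer (high coverage). Threshold is illustrative."""
    return avoided_answer and retrieval_coverage >= coverage_threshold


# Coverage was high, yet the bot refused to answer - worth investigating.
print(is_wrongly_avoided(avoided_answer=True, retrieval_coverage=0.95))  # True
```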