Quality & Accuracy Properties
Built-in properties that measure whether your LLM output correctly and fully addresses the input - relevance, completeness, instruction fulfillment, and more.
These properties answer the core question: did the output correctly and fully address the input? They measure task-level quality - whether the model was relevant, complete, accurate relative to a reference, and whether it followed the instructions it was given.
Relevance
The Relevance property measures how relevant the LLM output is to the input, ranging from 0 (not relevant) to 1 (very relevant). It is mainly useful for use cases such as Question Answering, where the LLM is expected to answer a given question.
The property is calculated by passing the user input and the LLM output to a model trained on the GLUE QNLI dataset. Unlike the Grounded in Context property, which compares the LLM output to the provided context, the Relevance property compares the LLM output to the user input given to the LLM.
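As an illustration, a QNLI-based relevance score can be computed with an off-the-shelf cross-encoder. This is a minimal sketch, not Deepchecks' actual implementation; the model name below is an assumption (one publicly available QNLI cross-encoder):

```python
# A minimal sketch of a QNLI-style relevance score - not Deepchecks' internal
# implementation. "cross-encoder/qnli-electra-base" is one publicly available
# cross-encoder trained on GLUE QNLI; Deepchecks may use a different model.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/qnli-electra-base")

def relevance(llm_input: str, llm_output: str) -> float:
    """Score how relevant the output is to the input, from 0 to 1."""
    # QNLI models predict whether the second text answers the first.
    return float(model.predict([(llm_input, llm_output)])[0])

print(relevance("What is the color of the sky?", "The sky is blue"))    # high
print(relevance("What is the color of the sky?", "The sky is pretty"))  # low
```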
Examples
| LLM Input | LLM Output | Relevance |
|---|---|---|
| What is the color of the sky? | The sky is blue | 0.99 |
| What is the color of the sky? | The sky is red | 0.99 |
| What is the color of the sky? | The sky is pretty | 0 |
Note that Relevance measures whether the output addresses the input, not whether it is factually correct - an incorrect but on-topic answer such as "The sky is red" still scores high.
Expected Output Similarity
The Expected Output Similarity metric assesses the similarity of a generated output to a reference output, providing a direct evaluation of accuracy against a ground truth. This metric, scored from 1 to 5, determines whether the generated output includes the main arguments of the reference output.
Several LLM judges break both the generated and reference outputs into self-contained propositions, evaluating the precision and recall of the generated content. The judges' evaluations are then aggregated into a unified score.
This property is evaluated only if an `expected_output` is supplied (see Expected Output in Deepchecks for how to provide one); it serves as an additional metric alongside Deepchecks' intrinsic evaluation.
This property uses LLM calls for calculation.
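To make the aggregation concrete, here is a minimal pure-Python sketch of the proposition-based idea. It assumes an LLM judge has already split both texts into propositions and decided which are mutually supported; the F1-style combination and the 1-5 mapping are illustrative assumptions, not Deepchecks' actual formula:

```python
# A minimal sketch of proposition-based scoring, assuming an LLM judge has
# already (a) split each text into self-contained propositions and (b) decided
# which propositions from one text are supported by the other.

def similarity_score(supported_output_props: int, total_output_props: int,
                     covered_reference_props: int, total_reference_props: int) -> float:
    """Map proposition-level precision/recall onto a 1-5 scale."""
    precision = supported_output_props / total_output_props
    recall = covered_reference_props / total_reference_props
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return 1.0 + 4.0 * f1  # no overlap -> 1.0, full overlap -> 5.0

# Output restates all 3 reference propositions and adds nothing unsupported:
print(similarity_score(3, 3, 3, 3))  # 5.0
# Output contradicts the reference, so no propositions overlap:
print(similarity_score(0, 2, 0, 3))  # 1.0
```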
Examples
| Output | Expected Output | Score |
|---|---|---|
| Many fans regard Payton Pritchard as the GOAT because of his clutch plays and exceptional shooting skills. | Payton Pritchard is considered by some fans as the greatest of all time (GOAT) due to his clutch performances and incredible shooting ability. | 5.0 |
| Payton Pritchard is a solid role player for the Boston Celtics but is not in the conversation for being the GOAT. | Payton Pritchard is considered by some fans as the greatest of all time (GOAT) due to his clutch performances and incredible shooting ability. | 1.0 |
Handling low scores
Low Expected Output Similarity scores indicate minimal overlap between generated outputs and reference ground truth outputs. Common approaches:
- Multiple valid solutions - Especially for Generation tasks, various outputs may correctly fulfill the instructions. Evaluate whether strict adherence to reference outputs is necessary.
- Prompt adjustment - Identify pattern differences between generated and reference outputs, then modify prompts to guide the model toward reference-like responses. Use Version Comparison to iteratively refine.
- In-context learning - Provide the model with example input-output pairs to demonstrate expected behavior (few-shot prompting); see the sketch after this list.
- Fine-tuning - Consider this more resource-intensive option only when other approaches prove insufficient.
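As an illustration of few-shot prompting, here is a minimal sketch using an OpenAI-style chat format; the model name and example pairs are assumptions for demonstration:

```python
# A minimal few-shot prompting sketch in an OpenAI-style chat format. The
# example pairs and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [
    {"role": "system", "content": "Answer in one short, factual sentence."},
    # Few-shot examples demonstrating the expected style of answer:
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is the boiling point of water at sea level?"},
    {"role": "assistant", "content": "Water boils at 100 degrees Celsius at sea level."},
    # The actual query:
    {"role": "user", "content": "What is the color of the sky?"},
]

response = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(response.choices[0].message.content)
```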
Completeness
The Completeness property evaluates whether the output fully addresses all components of the original request, providing a comprehensive solution. An output is considered complete if it eliminates the need for follow-up questions. Scoring ranges from 1 to 5, with 1 indicating low completeness and 5 indicating a thorough and comprehensive response.
This property uses LLM calls for calculation.
Intent Fulfillment
The Intent Fulfillment property is closely related to Completeness, but is specifically designed for multi-turn settings like chat. It evaluates how accurately the output follows the instructions provided by the user, and also reviews the entire conversation history to identify any previous user instructions that remain relevant to the current turn.
Scoring ranges from 1 to 5, with 1 indicating low adherence to user instructions and 5 representing precise and thorough fulfillment.
This property uses LLM calls for calculation.
Instruction Fulfillment
The Instruction Fulfillment property assesses how accurately the output adheres to the specified instructions or requirements in the input. It is especially valuable for evaluating how effectively an AI assistant follows system instructions in multi-turn scenarios, such as a tool-using chatbot.
Scoring ranges from 1 to 5, where 1 indicates low adherence and 5 signifies precise and thorough alignment with the provided guidelines.
This property uses LLM calls for calculation.
Handling low Completeness / Instruction Fulfillment scores
Low scores often result from application design and architectural choices. Common mitigation strategies:
- Simplicity - LLMs perform better when complex tasks are broken down into simpler steps. Defining a series of simpler tasks can significantly enhance performance; see the sketch after this list.
- Explicitness - Ensure all instructions are explicitly stated. Avoid implicit requirements and clearly specify expectations.
- Model choice - Some models are better at instruction following than others. Experiment with different models to find a cost-effective solution for your use case.
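Here is a minimal sketch of the Simplicity strategy: one complex request split into two chained, explicit calls. The client, model name, and prompts are illustrative assumptions, not a prescribed implementation:

```python
# A minimal sketch of the "Simplicity" strategy: one complex request broken
# into two simpler chained calls, each with a single explicit instruction.
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

def summarize_then_translate(document: str) -> str:
    # Step 1: one explicit, narrow instruction per call...
    summary = ask("Summarize the following document in 3 bullet points:\n" + document)
    # Step 2: ...instead of a single prompt asking for summary, translation,
    # and formatting all at once.
    return ask("Translate the following text to French, keeping the bullets:\n" + summary)
```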
Coverage
Coverage measures how effectively a language model preserves essential information when generating summaries or condensed outputs. The Coverage Score quantifies how comprehensively the output captures the key topics in the input text, scored from 0 (low coverage) to 1 (high coverage).
A low score means the summary covers only a small fraction of the main topics in the input text.
This property uses LLM calls for calculation.
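As an illustration of the idea (not Deepchecks' implementation), a coverage-style score can be sketched with an LLM judge that lists the input's key topics and marks which ones the output covers; the model, prompt, and JSON schema below are assumptions:

```python
# A minimal sketch of a coverage-style check: an LLM judge lists the key
# topics of the input, marks which appear in the output, and coverage is the
# covered ratio. Model, prompt, and schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()

def coverage(llm_input: str, llm_output: str) -> float:
    prompt = (
        "List the key topics of the INPUT, then for each topic say whether "
        "the OUTPUT covers it.\n"
        f"INPUT: {llm_input}\nOUTPUT: {llm_output}\n"
        'Respond as JSON: {"topics": [{"topic": str, "covered": bool}, ...]}'
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    topics = json.loads(resp.choices[0].message.content)["topics"]
    if not topics:
        return 1.0  # nothing to cover
    return sum(t["covered"] for t in topics) / len(topics)
```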
Examples
| LLM Input | LLM Output | Coverage | Uncovered Information |
|---|---|---|---|
| The Renaissance began in Italy during the 14th century and lasted until the 17th century. It was marked by renewed interest in classical art and learning, scientific discoveries, and technological innovations. | The Renaissance was a cultural movement in Italy from the 14th to 17th centuries, featuring revival of classical learning. | 0.7 | Scientific discoveries and technological innovations |
| Our story deals with a bright young man, living in a wooden cabin in the mountains. He wanted nothing but to read books and bathe in the wind. | The young man lives in the mountains and reads books. | 0.3 | That the man is bright, lives in a wooden cabin, and loves bathing in the wind |
Handling low scores
- Prompting - Explicitly instruct the model to extract and summarize the most important details, and adjust the balance between Coverage and Conciseness; Deepchecks also provides a Conciseness property to ensure improvements in Coverage don't come at the expense of brevity.
- Model choice - Some models are better suited than others for accurately identifying and summarizing key arguments.