Built-in Properties
Learn what built-in properties are calculated in Deepchecks LLM Evaluation and how they are defined
The built-in properties in Deepchecks LLM Eval are properties that are calculated by the system using built-in algorithms and NLP models trained to perform specific tasks. They are useful for assessing and validating specific characteristics of LLM interactions, ranging from the length of the sample to whether it contains hallucinations.
Some of the built-in properties are calculated on the input and the output independently (such as Toxicity), and you'll see separate values for the LLM input and the LLM output. Others, such as Relevance, measure a relation between the different components of an LLM interaction, so you'll see only a single value, defined on the LLM output.
Invalid Links
The Invalid Links property is the ratio of invalid links in the text to the total number of links. A valid link is one that returns a 200 OK HTTP status when sent an HTTP HEAD request. For text without links, the property always returns 0 (all links valid).
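As an illustration of the definition above, a minimal sketch of such a check might look like the following (this is not Deepchecks' exact implementation; the URL regex and timeout are assumptions):

```python
import re
import requests

URL_PATTERN = re.compile(r"https?://[^\s)\"'>]+")  # simplistic URL matcher (assumption)

def invalid_links_ratio(text: str, timeout: float = 5.0) -> float:
    """Ratio of links that do not return 200 OK to an HTTP HEAD request."""
    links = URL_PATTERN.findall(text)
    if not links:
        return 0.0  # no links -> property returns 0 (all links valid)
    invalid = 0
    for url in links:
        try:
            response = requests.head(url, timeout=timeout)
            if response.status_code != 200:
                invalid += 1
        except requests.RequestException:
            invalid += 1
    return invalid / len(links)
```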
Reading Ease
A score based on the Flesch reading-ease formula, calculated for each text sample. The score typically ranges from 0 (very hard to read, requires intense concentration) to 100 (very easy to read) for English text, though in theory it can range from negative infinity to 206.835 for arbitrary strings.
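For reference, the standard Flesch reading-ease formula (which yields the 206.835 upper bound mentioned above) can be computed with an off-the-shelf library such as textstat; this is an illustration, not necessarily the implementation used here:

```python
# pip install textstat
import textstat

text = "Natural language processing is an interdisciplinary subfield of linguistics."

# Flesch reading-ease: 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
score = textstat.flesch_reading_ease(text)
print(score)
```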
Toxicity
The Toxicity property is a measure of how harmful or offensive a text is. It uses a RoBERTa model trained on the Jigsaw Toxic Comment Classification Challenge datasets, and produces scores ranging from 0 (not toxic) to 1 (very toxic).
Examples
Text | Toxicity |
---|---|
Hello! How can I help you today? | 0 |
You have been a bad user! | 0.09 |
I hate you! | 1 |
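The table values come from the property's internal model, but comparable scores can be reproduced with a publicly available RoBERTa classifier trained on the Jigsaw data; the checkpoint named below is one such public model, not necessarily the one used internally:

```python
# pip install transformers torch
from transformers import pipeline

# Public RoBERTa toxicity classifier trained on Jigsaw data (illustrative choice)
classifier = pipeline("text-classification", model="s-nlp/roberta_toxicity_classifier")

for text in ["Hello! How can I help you today?", "I hate you!"]:
    print(text, classifier(text))
```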
Fluency
The Fluency property is a score between 0 and 1 representing how “well” the input text is written, or how close it is to being a sample of fluent English text. A value of 0 represents very poorly written text, while 1 represents perfectly written English. The property uses a BERT-based model trained on a corpus containing both fluent and non-fluent samples.
Examples
Text | Fluency |
---|---|
Natural language processing is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence. | 0.97 |
Pass on what you have learned. Strength, mastery, hmm… but weakness, folly, failure, also. Yes, failure, most of all. The greatest teacher, failure is. | 0.75 |
Whispering dreams, forgotten desires, chaotic thoughts, dance with words, meaning elusive, swirling amidst. | 0.2 |
Formality
The Formality property returns a measure of how formal the input text is. The model was trained to predict whether English sentences are formal or informal, where a score of 0 represents very informal text and a score of 1 very formal text. The model uses the roberta-base architecture, and was trained on GYAFC (Rao and Tetreault, 2018) and the online formality corpus from Pavlick and Tetreault, 2016.
Examples
Text | Formality |
---|---|
I hope this email finds you well | 0.79 |
I hope this email find you swell | 0.28 |
What’s up doc? | 0.14 |
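A publicly available roberta-base checkpoint trained on the same corpora (GYAFC and the Pavlick and Tetreault data) can produce comparable scores; the model name below is an illustrative public stand-in, not necessarily the internal model:

```python
# pip install transformers torch
from transformers import pipeline

# Public roberta-base formality ranker trained on GYAFC / Pavlick & Tetreault (illustrative)
scorer = pipeline("text-classification", model="s-nlp/roberta-base-formality-ranker")

print(scorer("I hope this email finds you well"))
print(scorer("What's up doc?"))
```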
Sentiment
The Sentiment property, ranging from -1 (negative) to 1 (positive), measures the emotion expressed in a given text, as computed by the TextBlob sentiment analysis model.
Examples
Text | Sentiment |
---|---|
Today was great | 0.8 |
Today was ordinary | -0.25 |
Today was not great | -0. |
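Since the property is based on TextBlob, the scores can be reproduced directly:

```python
# pip install textblob
from textblob import TextBlob

for text in ["Today was great", "Today was ordinary", "Today was not great"]:
    print(text, TextBlob(text).sentiment.polarity)  # polarity in [-1, 1]
```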
Avoided Answer
The Avoided Answer property estimates the probability (0 to 1) that the LLM explicitly, or “on purpose”, avoided answering the question or user request. The property returns high probabilities for answers in which the model avoids answering the question for some reason, says that it can’t answer, or says that it doesn’t know the answer.
We tested this property on the Do-Not-Answer dataset. We extracted 193 examples where the model provided answers (negatives) and 193 examples where the model avoided answering (positives). Our model achieved a ROC-AUC score of 0.99 on this dataset.
Examples
Text | Avoided Answer Score |
---|---|
As a Large Language Model trained by Open AI, I can not answer this question | 0.901 |
Based on the provided documents, there is no information regarding your issue | 0.896 |
The answer is 42 | 0.147 |
No, it is not possible to take out a motorcycle insurance policy for your car. | 0.134 |
Grounded in Context
One of the most concerning problems in generative AI is hallucination. To avoid this kind of mistake, we want to make sure that all factual claims made by the LLM are entailed by trusted sources of truth, in most cases documentation about the relevant topic.
The Grounded in Context Score is a measure of how well the LLM output is grounded in the provided context, ranging from 0 (not grounded) to 1 (fully grounded). Specifically, this property separately validates that each factual statement in the output is entailed by the provided context.
Examples
Context | LLM Output | Grounded in Context Score |
---|---|---|
Michael Jordan (1963) is an American former professional basketball player and businessman. In what year was he born? | He was born in 1963. | 0.97 |
Michael Jordan (1963) is an American former professional basketball player and businessman. When was Michael born? | Michael Jeffrey Jordan was born in 1963 | 0.87 |
Michael Jordan (1963) is an American former professional basketball player and businessman. What did he achieve? | He won many NBA championships with the Cleveland Cavaliers | 0.07 |
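The per-statement entailment check described above can be approximated with an off-the-shelf NLI model. This is only a rough sketch of the idea (the sentence splitting, the model choice, and the averaging are all assumptions), not Deepchecks' implementation:

```python
# pip install transformers torch
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Public NLI model used for illustration; label index 2 is "entailment" for this checkpoint
model_name = "facebook/bart-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

def grounded_score(context: str, output: str) -> float:
    """Average entailment probability of each output statement against the context."""
    statements = [s.strip() for s in output.split(".") if s.strip()]  # naive sentence split
    probs = []
    for statement in statements:
        inputs = tokenizer(context, statement, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        probs.append(torch.softmax(logits, dim=-1)[0, 2].item())  # P(entailment)
    return sum(probs) / len(probs) if probs else 0.0
```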
Relevance
The Relevance property is a measure of how relevant the LLM output is to the input given to it, ranging from 0 (not relevant) to 1 (very relevant). It is useful mainly for evaluating use-cases such as Question Answering, where the LLM is expected to answer given questions.
The property is calculated by passing the user input and the LLM output to a model trained on the GLUE QNLI dataset. Unlike the Grounded in Context property, which compares the LLM output to the provided context, the Relevance property compares the LLM output to the user input given to the LLM.
Examples
LLM Input | LLM Output | Relevance |
---|---|---|
What is the color of the sky? | The sky is blue | 0.99 |
What is the color of the sky? | The sky is red | 0.99 |
What is the color of the sky? | The sky is pretty | 0 |
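As a rough approximation of the QNLI-based calculation, a public cross-encoder trained on GLUE QNLI can score question/answer pairs; the model name below is an illustrative assumption, not necessarily the internal model:

```python
# pip install sentence-transformers
from sentence_transformers import CrossEncoder

# Public cross-encoder trained on GLUE QNLI (illustrative choice)
model = CrossEncoder("cross-encoder/qnli-electra-base")

pairs = [
    ("What is the color of the sky?", "The sky is blue"),
    ("What is the color of the sky?", "The sky is pretty"),
]
print(model.predict(pairs))  # higher score -> the output answers the input
```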
Retrieval Relevance
The Retrieval Relevance property is a way of measuring the quality of the Information Retrieval (IR) performed as part of a RAG pipeline. Specifically, it measures the relevance of a retrieved document to the user input, with a score ranging from 0 (not relevant) to 1 (very relevant).
This property is based on several models and steps. In the first step, the retrieved document is divided into meaningful chunks; then each chunk's relevance is evaluated using both a semantic evaluator and a lexical evaluator. In the final step, the per-chunk scores are aggregated into a final score.
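A minimal sketch of that chunk-score-aggregate flow might look as follows; the chunking rule, the specific embedding model, the BM25 lexical scorer, and the max-aggregation are all assumptions for illustration:

```python
# pip install sentence-transformers rank_bm25
from rank_bm25 import BM25Okapi
from sentence_transformers import SentenceTransformer, util

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative embedding model

def retrieval_relevance(user_input: str, document: str) -> float:
    # Step 1: split the retrieved document into chunks (naive paragraph split)
    chunks = [c.strip() for c in document.split("\n\n") if c.strip()]
    if not chunks:
        return 0.0

    # Step 2a: semantic relevance per chunk (cosine similarity of embeddings)
    query_emb = embedder.encode(user_input, convert_to_tensor=True)
    chunk_embs = embedder.encode(chunks, convert_to_tensor=True)
    semantic = util.cos_sim(query_emb, chunk_embs)[0].tolist()

    # Step 2b: lexical relevance per chunk (BM25 scores, clamped and rescaled to [0, 1])
    bm25 = BM25Okapi([c.split() for c in chunks])
    raw = bm25.get_scores(user_input.split())
    max_lex = float(max(raw)) if max(raw) > 0 else 1.0
    lexical = [max(s, 0.0) / max_lex for s in raw]

    # Step 3: aggregate per-chunk scores into a single document score
    per_chunk = [(s + l) / 2 for s, l in zip(semantic, lexical)]
    return max(0.0, min(1.0, max(per_chunk)))
```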
Coverage
Coverage is a metric for evaluating how effectively a language model preserves essential information when generating summaries or condensed outputs. While the goal of summarization is to create shorter, more digestible content, maintaining the core information from the source material is paramount.
The Coverage Score quantifies how comprehensively an LLM's output captures the key topics in the input text, scored on a scale from 0 (low coverage) to 1 (high coverage). It is calculated by extracting main topics from the source text, identifying which of these elements are present in the output, and finally computing the ratio of preserved information to total essential information.
Examples
LLM Input | LLM Output | Coverage | Uncovered Information |
---|---|---|---|
The Renaissance began in Italy during the 14th century and lasted until the 17th century. It was marked by renewed interest in classical art and learning, scientific discoveries, and technological innovations. | The Renaissance was a cultural movement in Italy from the 14th to 17th centuries, featuring revival of classical learning. | 0.7 | 1. Scientific discoveries were a significant aspect of the Renaissance. 2. Technological innovations also played a key role during this time. |
Our story deals with a bright young man, living in a wooden cabin in the mountains. He wanted nothing but reading books and bathe in the wind. | The young man lives in the mountains and reads books. | 0.3 | 1. The story centers around a bright young man. 2. He lives in a wooden cabin in the mountains. The setting emphasizes solitude and a connection to nature. |
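The ratio-based calculation described above can be sketched as follows; extract_key_topics and topic_is_covered are hypothetical helpers (in practice they would be backed by an LLM or an entailment model), shown only to make the arithmetic concrete:

```python
from typing import List

def extract_key_topics(text: str) -> List[str]:
    """Hypothetical helper: return the main topics/claims in the source text."""
    raise NotImplementedError  # e.g. backed by an LLM prompt in practice

def topic_is_covered(topic: str, output: str) -> bool:
    """Hypothetical helper: decide whether the output preserves this topic."""
    raise NotImplementedError  # e.g. backed by an entailment model in practice

def coverage_score(source: str, output: str) -> float:
    """Ratio of preserved essential information to total essential information."""
    topics = extract_key_topics(source)
    if not topics:
        return 1.0  # nothing essential to preserve (assumption)
    covered = sum(topic_is_covered(t, output) for t in topics)
    return covered / len(topics)
```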
Information Density
In various LLM use-cases, such as text generation, question-answering, and summarization, it's crucial to assess whether the model's outputs actually convey information, and how concisely they do it. For instance, we'd want to measure how often our QA agent requests additional information or provides verbose or indirect answers.
An information-dense output typically consists of readable factual sentences, instructions, or suggestions (illustrated by the first two examples below). In contrast, low information-density is characterized by requests for additional information, incomprehensible text, or evasive responses (illustrated by the last two examples below).
The Information Density Score is a measure of how information-dense the LLM output is, ranging from 0 (low density) to 1 (high density). It is calculated by evaluating each statement in the output individually. These individual scores are then averaged, representing the overall information density of the output.
Examples
Text | Information Density Score |
---|---|
To purchase the Stormlight Archive books, enter the kindle store | 0.898 |
The Stormlight Archive is a series of epic fantasy novels written by Brandon Sanderson | 0.867 |
Wow, so many things can be said about The Stormlight Archive books | 0.336 |
Can you elaborate about the reason you ask so I can provide a better answer? | 0.196 |
PII Risk
The PII Risk property indicates the presence of "Personally Identifiable Information" (PII) in the output text. This property ranges from 0 to 1, where 0 signifies no risk and 1 indicates high risk. The property utilizes a trained Named Entity Recognition model to identify risky entities and multiplies each entity's detection confidence by that entity's risk factor. If multiple entities are found in the text, the highest resulting score is taken.
Risky Entities
Below is a list of the entities we examine during property calculation, along with the assigned risk factor for each entity.
Entity | Risk Factor |
---|---|
US_ITIN | 1.0 |
US_SSN | 1.0 |
US_PASSPORT | 1.0 |
US_DRIVER_LICENSE | 1.0 |
US_BANK_NUMBER | 1.0 |
CREDIT_CARD | 1.0 |
IBAN_CODE | 1.0 |
CRYPTO | 1.0 |
MEDICAL_LICENSE | 1.0 |
IP_ADDRESS | 1.0 |
PHONE_NUMBER | 0.5 |
EMAIL_ADDRESS | 0.5 |
NRP | 0.5 |
LOCATION | 0.3 |
PERSON | 0.3 |
Examples
Input | PII Risk |
---|---|
Hey Alexa, there seems to be overage charges on 3230168272026619. | 1.0 |
Nestor, your heart check-ups payments have been processed through the IBAN PS4940D5069200782031700810721. | 1.0 |
Nothing too sensitive in this sentence. | 0.0 |
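The entity names above follow the convention used by open-source PII analyzers such as Microsoft Presidio; a minimal sketch of the score computation, using Presidio as an illustrative stand-in for the internal NER model, might look like this:

```python
# pip install presidio-analyzer
# also requires a spaCy model, e.g. python -m spacy download en_core_web_lg
from presidio_analyzer import AnalyzerEngine

RISK_FACTORS = {
    "US_SSN": 1.0, "CREDIT_CARD": 1.0, "IBAN_CODE": 1.0, "IP_ADDRESS": 1.0,
    "PHONE_NUMBER": 0.5, "EMAIL_ADDRESS": 0.5, "LOCATION": 0.3, "PERSON": 0.3,
    # ... remaining entities from the table above
}

analyzer = AnalyzerEngine()

def pii_risk(text: str) -> float:
    """Max over detected entities of (detection confidence * risk factor)."""
    results = analyzer.analyze(text=text, language="en")
    scores = [r.score * RISK_FACTORS.get(r.entity_type, 0.0) for r in results]
    return max(scores, default=0.0)

print(pii_risk("Nothing too sensitive in this sentence."))  # expected: 0.0 (no PII detected)
```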
Content Type
The Content Type property indicates the type of the output text. It can be 'json', 'sql', or 'other'.
The types 'json' and 'sql' mean the output is valid according to the indicated type.
'other' is any text that is neither valid 'json' nor valid 'sql'.
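Validity here can be checked with standard parsers; a sketch of such a check, using Python's json module and the sqlglot parser as illustrative choices, might be:

```python
# pip install sqlglot
import json

import sqlglot
from sqlglot.errors import SqlglotError

def content_type(text: str) -> str:
    """Classify output as 'json', 'sql', or 'other' based on parseability."""
    try:
        json.loads(text)
        return "json"
    except ValueError:
        pass
    try:
        sqlglot.parse_one(text)
        return "sql"
    except SqlglotError:
        pass
    return "other"

print(content_type('{"a": 1}'))              # 'json'
print(content_type("SELECT id FROM users"))  # 'sql'
```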
Compression Ratio
The Compression Ratio property measures how much smaller the output is compared to the input. It's calculated by dividing the size of the input by the size of the output, so a ratio greater than 1 means the output is shorter than the input.
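For clarity, with character counts as the size measure (the exact size definition, characters vs. tokens, is an assumption here), the calculation is simply:

```python
def compression_ratio(input_text: str, output_text: str) -> float:
    """Input size divided by output size; higher values mean stronger compression."""
    return len(input_text) / len(output_text) if output_text else 0.0  # empty-output guard is an assumption

print(compression_ratio("A long and detailed source document about a topic.", "A short summary"))
```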