Built-in Properties
Learn what built-in properties are calculated in Deepchecks LLM Evaluation and how they are defined
The built-in properties in Deepchecks LLM Eval are properties calculated by the system using built-in algorithms and NLP models trained to perform specific tasks. They are useful for assessing and validating specific characteristics of LLM interactions, ranging from the length of the sample to whether it contains hallucinations.
Some of the built-in properties are calculated on the input and the output independently (such as Toxicity), so you'll see separate values for the LLM input and the LLM output. Others, such as Relevance, measure a relation between the different components of an LLM interaction, so you'll see only a single value, defined on the LLM output.
Invalid Links
The Invalid Links property is the ratio of invalid links to the total number of links in the text. A valid link is one that returns a 200 OK HTTP status in response to an HTTP HEAD request. For text without links, the property always returns 0 (all links valid).
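A minimal sketch of how such a ratio could be computed, using the `requests` library; the helper name and the regex-based link extraction are illustrative assumptions, not Deepchecks' actual implementation:

```python
import re
import requests

URL_PATTERN = re.compile(r"https?://\S+")

def invalid_link_ratio(text: str, timeout: float = 5.0) -> float:
    """Fraction of links in `text` that do not return 200 OK to an HTTP HEAD request."""
    links = URL_PATTERN.findall(text)
    if not links:
        return 0.0  # no links -> treated as "all links valid"
    invalid = 0
    for link in links:
        try:
            resp = requests.head(link, timeout=timeout, allow_redirects=True)
            if resp.status_code != 200:
                invalid += 1
        except requests.RequestException:
            invalid += 1  # unreachable links count as invalid
    return invalid / len(links)
```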
Reading Ease
A score based on the Flesch reading-ease formula, calculated for each text sample. The score typically ranges from 0 (very hard to read, requires intense concentration) to 100 (very easy to read) for English text, though in theory it can range from -inf to 206.835 for arbitrary strings.
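For reference, the Flesch reading-ease formula is 206.835 − 1.015 × (words ÷ sentences) − 84.6 × (syllables ÷ words), which is where the 206.835 upper bound comes from. A similar score can be reproduced with the open-source `textstat` package (an assumption for illustration, not necessarily the library Deepchecks uses):

```python
import textstat

text = "Natural language processing is an interdisciplinary subfield of linguistics."
# Flesch reading-ease: 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
score = textstat.flesch_reading_ease(text)
print(score)  # ~100 is very easy to read, ~0 is very hard
```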
Toxicity
The Toxicity property is a measure of how harmful or offensive a text is. It uses a RoBERTa model trained on the Jigsaw Toxic Comment Classification Challenge datasets. The model produces scores ranging from 0 (not toxic) to 1 (very toxic).
Examples
| Text | Toxicity |
|---|---|
| Hello! How can I help you today? | 0 |
| You have been a bad user! | 0.09 |
| I hate you! | 1 |
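A sketch of how a RoBERTa-based toxicity classifier can be applied with the Hugging Face `transformers` library; the checkpoint name below is a placeholder for a Jigsaw-trained model, not the specific model Deepchecks ships:

```python
from transformers import pipeline

# Hypothetical checkpoint; substitute any RoBERTa model fine-tuned on the Jigsaw toxicity data.
toxicity = pipeline("text-classification", model="some-org/roberta-jigsaw-toxicity")

result = toxicity("You have been a bad user!")[0]
print(result["label"], result["score"])  # toxicity score between 0 and 1
```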
Fluency
The Fluency property is a score between 0 and 1 representing how “well” the input text is written, or how close it is to being a sample of fluent English text. A value of 0 represents very poorly written text, while 1 represents perfectly written English. The property uses a BERT-based model trained on a corpus containing examples of fluent and non-fluent text.
Examples
| Text | Fluency |
|---|---|
| Natural language processing is an interdisciplinary subfield of linguistics, computer science, and artificial intelligence. | 0.97 |
| Pass on what you have learned. Strength, mastery, hmm… but weakness, folly, failure, also. Yes, failure, most of all. The greatest teacher, failure is. | 0.75 |
| Whispering dreams, forgotten desires, chaotic thoughts, dance with words, meaning elusive, swirling amidst. | 0.2 |
Formality
The Formality property returns a measure of how formal the input text is. The underlying model was trained to predict whether English sentences are formal or informal, where a score of 0 represents very informal text and a score of 1 very formal text. The model uses the roberta-base architecture and was trained on GYAFC from Rao and Tetreault, 2018 and the online formality corpus from Pavlick and Tetreault, 2016.
Examples
| Text | Formality |
|---|---|
| I hope this email finds you well | 0.79 |
| I hope this email find you swell | 0.28 |
| What’s up doc? | 0.14 |
Sentiment
The Sentiment property, ranging from -1 (negative) to 1 (positive), measures the emotion expressed in a given text, as computed by the TextBlob sentiment analysis model.
Examples
| Text | Sentiment |
|---|---|
| Today was great | 0.8 |
| Today was ordinary | -0.25 |
| Today was not great | -0. |
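Because the property is backed by TextBlob, a comparable polarity can be reproduced directly with the `textblob` package (exact values may differ slightly across TextBlob versions):

```python
from textblob import TextBlob

for text in ["Today was great", "Today was ordinary", "Today was not great"]:
    polarity = TextBlob(text).sentiment.polarity  # ranges from -1 (negative) to 1 (positive)
    print(text, polarity)
```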
Avoided Answer
The Avoided Answer property estimates the probability (0 to 1) that the LLM explicitly, or “on purpose”, avoided answering the question or user request. The property returns high probabilities for answers in which the model avoids answering the question for some reason, says that it can’t answer, or says it doesn’t know the answer.
Examples
| Text | Avoided Answer |
|---|---|
| The answer is 42 | 0.001 |
| You should ask the appropriate authorities | 0.681 |
| As a Large Language Model trained by Open AI, I can not answer this question | 0.994 |
| I cannot answer that question based on the context provided | 0.997 |
| I’m Sorry, but the answer for your question does not appear in the document. | 0.941 |
Grounded in Context
The Grounded in Context score is a measure of how well the LLM output is grounded in the context of the conversation, ranging from 0 (not grounded) to 1 (fully grounded). In the definition of this property, grounding means that the LLM output is based on the context given to it as part of the input, and not on external knowledge, for example knowledge that was present in the LLM training data. A lack of grounding can be thought of as a kind of hallucination, as hallucinations are outputs that are neither grounded in the context nor true to the real world.
The property is especially useful for evaluating use-cases such as Question Answering, where the LLM is expected to answer questions based on the context given to it as part of the input rather than on external knowledge. An example is Question Answering over internal company knowledge, where introducing external knowledge (which may, for example, be stale) into the answers is not desired: if an LLM is asked about the company’s revenue, the answer should be based on the company’s internal financial reports, not on external sources such as the company’s Wikipedia page. In the context of Question Answering, any answer that is not grounded in the context can be considered a hallucination.
The property is calculated using a BERT-based model trained on grounding datasets generated with weak supervision from three pre-existing, weaker grounding-detection methods.
Examples
| LLM Input | LLM Output | Grounded in Context |
|---|---|---|
| Michael Jordan (1963) is an American former professional basketball player and businessman. In what year was he born? | He was born in 1963. | 0.91 |
| Michael Jordan (1963) is an American former professional basketball player and businessman. When was Michael born? | Michael Jeffrey Jordan was born in 1963 | 0.78 |
| Michael Jordan (1963) is an American former professional basketball player and businessman. What did he achieve? | He won many NBA championships with the Cleveland Cavaliers | 0.06 |
Relevance
The Relevance property is a measure of how relevant the LLM output is to the input given to it, ranging from 0 (not relevant) to 1 (very relevant). It is useful mainly for evaluating use-cases such as Question Answering, where the LLM is expected to answer given questions.
The property is calculated by passing the user input and the LLM output to a model trained on the GLUE QNLI dataset. Unlike the Grounded in Context property, which compares the LLM output to the provided context, the Relevance property compares the LLM output to the user input given to the LLM.
Examples
| LLM Input | LLM Output | Relevance |
|---|---|---|
| What is the color of the sky? | The sky is blue | 0.99 |
| What is the color of the sky? | The sky is red | 0.99 |
| What is the color of the sky? | The sky is pretty | 0 |
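A sketch of how a QNLI-style model can score input/output pairs with `transformers`; the checkpoint name and helper function are placeholders for whichever GLUE QNLI fine-tune is configured, not Deepchecks' exact implementation:

```python
from transformers import pipeline

# Hypothetical GLUE QNLI fine-tune; QNLI models score whether a sentence answers a question.
_qnli = pipeline("text-classification", model="some-org/roberta-qnli", top_k=None)

def relevance_score(user_input: str, llm_output: str) -> float:
    """Probability that `llm_output` answers `user_input`, per a QNLI-style model."""
    scores = _qnli({"text": user_input, "text_pair": llm_output})
    # Label names vary by checkpoint; QNLI models typically expose "entailment" / "not_entailment".
    return next(s["score"] for s in scores if s["label"].lower() == "entailment")

print(relevance_score("What is the color of the sky?", "The sky is blue"))
```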
Retrieval Relevance
The Retrieval Relevance property is a way of measuring the quality of the Information Retrieval (IR) performed as part of the LLM pipeline. This property measures how relevant the retrieved documents are to the user input, with a score ranging from 0 (not relevant) to 1 (very relevant). It is useful mainly for evaluating use-cases such as Question Answering, where the LLM is expected to answer given questions based on the retrieved documents.
The backbone model for this property is the same as for the Relevance property, only applied to the retrieved documents instead of the LLM output.
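A minimal sketch of that idea: score each retrieved document against the user input with a single-pair relevance scorer (such as the `relevance_score` sketch above) and aggregate. The mean aggregation here is an assumption for illustration; this page does not specify how per-document scores are combined.

```python
from typing import Callable, List

def retrieval_relevance(user_input: str, retrieved_docs: List[str],
                        score_fn: Callable[[str, str], float]) -> float:
    """Apply a per-document relevance scorer to each retrieved document and aggregate."""
    if not retrieved_docs:
        return 0.0
    scores = [score_fn(user_input, doc) for doc in retrieved_docs]
    return sum(scores) / len(scores)  # illustrative aggregation
```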