Prompt properties are a specialized type of property that leverage LLM calls, using the model as a judge to evaluate various aspects of your data. There are two types of prompt properties available:

Numerical Prompt Property: Produces a score from 1 to 5.
Categorical Prompt Property: Classifies interactions into predefined categories.

Prompt Properties can be initialized either from a template or from scratch. Once you initialize a property, you'll have the ability to modify the LLM guidelines to suit your particular use case, providing a high degree of customization.

The specific model to which these properties are sent is determined by the model choice you've configured on the organization level or application level, ensuring flexibility and adaptability within different operational contexts.

Prompt Properties Model Choice on the Organization Level

Prompt Properties Model Choice on the Application Level

Creating a New Prompt Property

Main Screen for Adding a New Prompt Property

Numerical Prompt Property

A prompt property functions as an LLM call which returns a numerical score and a reasoning. You can shape properties by designing the following features:

Editable guidelines, depending on the property's requirements.
Interaction steps, defining which fields in the interaction will be provided for evaluation. For instance, if the input and output alone are necessary, you may not include the full prompt, information retrieval, or other fields.
[Optional] Few-shot attempts, described more elaborately below.

Additionally, there is an option to "test" the property on a single interaction, allowing for validation and refinement before broader application.

Categorical Prompt Property

A categorical prompt property functions as an LLM call which returns a classification to one more categories and a reasoning. You can shape properties to fit your use-case by designing the following features:

Define categories with detailed names and descriptions. You can specify an unlimited number of categories.
Choose if you "Allow classification to multiple categories", defining if a single interaction may be classified to several categories.
Choose if you "Allow classification to unlisted categories", defining if a model can classify interactions into the user-defined categories or create new ones based on the additional guidelines provided.
Specify additional guidelines to provide additional categorization guidelines. It is essential for explaining category domains, especially when "Allow classification to unlisted categories" is enabled.

📘
Allow classification to unlisted categories
Unlike the "topics" feature, which limits classification to a closed set, checking "allow classification to unlisted categories" allows the creation of new categories when necessary.

❗️
"Other" Classification
If "Allow classification to unlisted categories" is unchecked and no suitable classification is found by the LLM, interactions will be classified as "Other."

Few Shot Prompting

You can enhance the evaluation of new interactions by providing a CSV file with annotated examples (using the bottom-left button in the above screenshot). These examples should demonstrate exemplary reasoning and scoring. The provided CSV will be used for few shot prompting the LLM, a technique which guides the LLM in applying consistent formatting and reasoning.

The CSV file should include:

Columns for each field in the User Input Structure ("Input" and "Output" in the screenshot above).
Reason column: This free-text column should capture the reasoning or explanation for the provided score.
Score column:
1. For numeric properties: A numeric score between 1 and 5 that reflects the quality of the interaction for numerical properties,
2. For categorical properties: A list of the correctly predicted category / categories, separated by ", ". For example: "Blue, Red".

For instance, here's a CSV with a 3 exemplars few-shot that fits the above "Completeness" (numerical prompt property) example:

Number of Judges Configuration

When calculating an LLM property, the model’s response can occasionally vary due to the inherent randomness of language models. At Deepchecks, we already apply multiple mechanisms to ensure high accuracy and consistency despite this variability, but for properties that are especially important, you can now take this one step further by controlling how many “judges” (LLM calls) are used when computing the property result.

Each “judge” independently evaluates the property prompt, and their outputs are then aggregated to form a final decision. This mechanism helps reduce variance and improve reliability, especially for properties that are highly important to your evaluation flow.

Configuration Options

You can configure the number of judges for each custom LLM property individually. The configuration is saved together with the property definition and applies whenever the property is used in evaluations or test interactions.

The available options are:

1 (default) - A single LLM call, identical to the current behavior.
3 - Three independent LLM calls, combined using a simple aggregation algorithm.
5 - Five independent LLM calls, combined using a simple aggregation algorithm.
3 + Orchestrator – Three independent judges evaluate the property prompt, and then an additional “orchestrator” judge reviews all three outputs to produce a refined final decision. This configuration often yields the most reliable and consistent results, especially for complex or highly subjective properties. However, it also comes with higher cost and latency, since it requires an extra call to a more capable model that processes multiple responses at once.

All configurations use the same base model as defined in the property, except the orchestrator, which runs on a stronger model for improved synthesis and reasoning.

When to Use Multiple Judges

Using multiple judges can significantly improve the stability of property results, especially in edge cases or where the prompt involves subjective or nuanced language. However, more judges also increase cost and latency, since multiple model calls are required.

For most properties, the default single-judge setting provides a good balance. For key properties that drive important metrics or business decisions, setting three or five judges can provide more dependable results.

The "# of Judges" configuration on the "Create Prompt Property" screen

Prompt Property Template

Here’s a detailed explanation of the structure sent to the LLM for evaluation, ensuring clarity and consistency.

Components of the Evaluation LLM Call
(Sent to the LLM in order, as configured by the user in the app)

1. Evaluator System Prompt

1a. For Numerical Properties

This is the core instruction guiding the LLM on evaluation criteria and scoring. An example of structuring for the "Appropriateness" property is as follows:

Role: System
You are a harsh evaluator. Your task is to evaluate whether the interaction steps provided meet the guidelines. Your guidelines are:

{Additional guidelines will go here:}
1. Assess if the content is suitable for all potential readers, considering factors like age sensitivity, cultural appropriateness, and general public consumption.

2. Evaluate the text for any elements that could be considered offensive, overly mature, or inappropriate in a general context.

3. High scores represent universally appropriate text and low scores are inappropriate.
{End of additional guidelines}

You should describe your reasoning in a few sentences and then give a final score for Appropriateness between 1 and 5, where 1 is the lowest and 5 is the highest.
Please provide the final score in the following format at the end of your reasoning: Final Score: [1-5]

1b. For Categorical Properties

This is the core instruction guiding the LLM on evaluation criteria and scoring. An example of structuring for the "Genre" property is as follows:

Role: System
Classify the given text into one or more relevant categories from the list below. Each category represents a distinct concept, and a text may belong to multiple categories.
Carefully analyze the content and context before selecting the most appropriate categories. Use the exact category names from the list.

Categories Descriptions: 
1. Action - Movies with a lot of action
2. Comedy - Movies that are funny
3. Drama - Movies that are serious
4. Horror - Movies that are scary
5. Romance - Movies that are romantic
6. Doco - Movies that are documentaries

Additional Guidelines: 
Notice that your must not annotate any tv-series or books, only movies.

📘
Evaluator System Prompt for Categorical Properties
The first part of the system prompt may vary depending on the property configuration. For instance, if "Allow classification to multiple categories" is unchecked, the above system prompt will instead instruct the judge to "Classify the given text into the most relevant category".

2. Few-Shot Attempts

Directly following the evaluation system prompt, these few-shot examples (formatted in user-assistant style) are included if uploaded. More details on constructing your few-shot CSV can be found above.

3. Interaction Steps for Property

The arrangement of the fields depends on the order the user chooses in the app (out of input, information retrieval, history, full prompt, output, expected output and even custom interaction steps). In the LLM evaluation call the elements will be preceded by their matching labels such as "input:", "output:", etc., for clarity. Note that the LLM will receive the fields in the exact sequence chosen by the user on the UI.

You can select which interaction steps to include and in what order they will be sent to the evaluator

Example structure (user chose input, information retrieval and output in that order):

**Interaction Steps - Custom Order:**
**input:** {user input will go here}
**information retrieval:** {retrieved info will go here}
**output:** {generated output will go here}

By understanding how these components compose the LLM call and how they're structured following the user-defined order, you'll be well-equipped to configure and assess new LLM-based properties effectively, ensuring they meet the specified criteria and standards.