Data Fields for Evaluation

When you create a prompt property, you choose which data fields the LLM judge receives. This controls exactly what information goes into the evaluation - and what stays out. Keeping the context focused makes evaluations more accurate and reduces token usage.

You can draw data from two sources:

  1. The current interaction - fields like input, output, full prompt, retrieved context, etc.
  2. Child interactions - for agentic workflows, data from nested child interactions (tool calls, LLM sub-calls, etc.) at any depth

Selecting interaction data fields

The following fields are available from the current interaction:

  • Input - the user's query or message
  • Output - the generated response
  • Full Prompt - the complete prompt sent to the LLM
  • Information Retrieval - retrieved context or knowledge base content
  • History - prior conversation turns
  • Expected Output - a reference or ground-truth response
  • Custom Fields - any custom data fields defined for your application

Select only the fields your property actually needs. For example, a property evaluating output tone needs only the Output field; a hallucination property needs both the Output and Information Retrieval fields.
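
To make the selection step concrete, here is a minimal Python sketch of what "passing only the selected fields to the judge" amounts to. The function name, the snake_case field keys, and the sample interaction are hypothetical illustrations, not part of the product's API; in practice you select fields in the property configuration.

```python
# Hypothetical illustration: none of these names come from the product's API.

def build_judge_context(interaction: dict, selected_fields: list) -> dict:
    """Keep only the selected fields that actually have a value."""
    return {
        name: interaction[name]
        for name in selected_fields
        if interaction.get(name) not in (None, "")
    }

# A made-up interaction with four populated fields.
interaction = {
    "input": "What is the refund window?",
    "output": "Refunds are accepted within 30 days.",
    "information_retrieval": "Policy: refunds within 30 days of purchase.",
    "history": "User greeted the assistant.",
}

# A tone property only needs the generated response.
tone_context = build_judge_context(interaction, ["output"])

# A hallucination property compares the response against retrieved context.
hallucination_context = build_judge_context(
    interaction, ["output", "information_retrieval"]
)
```

In this sketch the tone property never sees the input, history, or retrieved context, which is the point of selecting fields narrowly.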


Must vs Optional fields

Each selected field is marked as either Must or Optional.

Must - If this field is missing on a specific interaction, the property cannot be calculated. The score is returned as N/A, with an explanation of which field was absent. Use Must for fields that are essential to your evaluation logic.

Optional - If this field is missing, the property still calculates using the available fields. Use Optional for fields that add useful context but aren't strictly required.

Example: A "Response Completeness" property needs both Input and Output to determine whether the response fully addressed the question, so mark both as Must. A "Response Quality" property benefits from the full prompt but can work without it, so mark Full Prompt as Optional.

Tip: Only mark a field as Must if a missing value truly makes evaluation impossible. Over-using Must leads to N/A scores on interactions that could otherwise be evaluated.
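
The Must and Optional rules described above can be sketched as a small function. This is an illustrative model under stated assumptions, not the product's implementation: the function name, the field keys, and the exact wording of the N/A explanation are all hypothetical.

```python
# Hypothetical model of the Must / Optional rules; not the product's code.

def resolve_fields(interaction: dict, must: list, optional: list):
    """Return (context, status).

    If any Must field is missing, context is None and status explains why.
    Missing Optional fields are simply left out of the context.
    """
    def present(name):
        return interaction.get(name) not in (None, "")

    missing = [name for name in must if not present(name)]
    if missing:
        return None, "N/A: missing required field(s): " + ", ".join(missing)

    context = {name: interaction[name] for name in must + optional if present(name)}
    return context, "ok"


# Full Prompt is Optional, so its absence does not block the evaluation.
quality = resolve_fields(
    {"input": "q", "output": "a"}, must=["input", "output"], optional=["full_prompt"]
)

# Output is Must, so its absence yields N/A with an explanation.
completeness = resolve_fields(
    {"input": "q"}, must=["input", "output"], optional=[]
)
```

The second call shows why over-using Must is costly: a single absent field turns an otherwise usable interaction into an N/A result.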


Child interaction data (agentic workflows)

For agentic applications, an interaction can contain nested child interactions - tool calls, LLM sub-calls, retrieval steps, and so on. A prompt property can pull data from those children, not just the top-level interaction.

To access child data, add one or more Scopes to your property. Each scope defines:

  1. Depth - how far down the hierarchy to look
    • 1 level down - direct children only
    • 2 levels down - children and grandchildren
    • All levels - all descendants
  2. Interaction types - which child interaction types to include (e.g., only Tool interactions, only LLM interactions, or all types)
  3. Data fields - which fields to pull from those child interactions (Input, Output, Full Prompt, etc.)

All child interaction fields are treated as Optional - the property calculates even if some children are missing data.
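
One way to picture a scope is as a filter applied while walking the interaction tree. The sketch below models that under stated assumptions: the `Interaction` and `Scope` classes, their attribute names, and the `collect` function are hypothetical and exist only for illustration. Scopes are configured in the property settings, not in code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Interaction:  # hypothetical stand-in for one node in a trace
    kind: str                      # e.g. "LLM" or "Tool"
    data: Dict[str, str]           # field name -> value
    children: List["Interaction"] = field(default_factory=list)


@dataclass
class Scope:  # hypothetical model of the three scope settings
    max_depth: Optional[int]       # 1, 2, or None for "All levels"
    kinds: Optional[Set[str]]      # None means all interaction types
    fields: List[str]              # data fields to pull from each match


def collect(parent: Interaction, scope: Scope, depth: int = 1) -> List[dict]:
    """Gather the requested fields from descendants that match the scope."""
    if scope.max_depth is not None and depth > scope.max_depth:
        return []
    rows = []
    for child in parent.children:
        if scope.kinds is None or child.kind in scope.kinds:
            # Child fields behave as Optional: absent fields are skipped.
            rows.append({f: child.data[f] for f in scope.fields if f in child.data})
        rows.extend(collect(child, scope, depth + 1))
    return rows


tool_a = Interaction("Tool", {"input": "search x", "output": "result x"})
llm_1 = Interaction("LLM", {"output": "plan"}, children=[tool_a])
tool_b = Interaction("Tool", {"input": "calc"})  # no output recorded
root = Interaction("Root", {}, children=[llm_1, tool_b])

all_tools = collect(root, Scope(None, {"Tool"}, ["input", "output"]))
direct_llms = collect(root, Scope(1, {"LLM"}, ["output"]))
```

Here `all_tools` includes the nested tool call under the LLM step, while `direct_llms` stops at the first level. The tool call with no recorded output still contributes its input instead of blocking the calculation.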

Examples

Evaluate all tool calls in an agent trace

  • Depth: All levels
  • Types: Tool interactions only
  • Fields: Input, Output

Collects data from every tool call anywhere in the hierarchy, letting the property assess the agent's overall tool usage pattern.

Evaluate direct LLM sub-calls

  • Depth: 1 level down
  • Types: LLM interactions only
  • Fields: Output, Full Prompt

Looks only at LLM interactions that are direct children of the current interaction, ignoring deeper levels.

Multi-scope: combine LLM and tool data

  • Scope 1 - Depth: 1 level down; Types: LLM interactions only; Fields: Output
  • Scope 2 - Depth: All levels; Types: Tool interactions only; Fields: Input, Output

Uses two scopes to gather different data from different parts of the hierarchy in a single property evaluation.
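
The multi-scope example can be sketched as a single pass over the trace followed by one filter per scope. Everything here is a hypothetical illustration: the dictionary layout, the `walk` and `gather` helpers, and the sample trace are assumptions made for the sketch, not the product's data model.

```python
# Hypothetical sketch: traces and scopes are modeled as plain dictionaries.

def walk(node, depth=1):
    """Yield (depth, child) for every descendant; direct children are depth 1."""
    for child in node.get("children", []):
        yield depth, child
        yield from walk(child, depth + 1)


def gather(root, scopes):
    """Apply each scope to the same trace and keep the results separate."""
    descendants = list(walk(root))
    return {
        name: [
            {f: child[f] for f in scope["fields"] if f in child}
            for depth, child in descendants
            if (scope["depth"] is None or depth <= scope["depth"])
            and child["type"] in scope["types"]
        ]
        for name, scope in scopes.items()
    }


trace = {
    "type": "Root",
    "children": [
        {
            "type": "LLM",
            "output": "Call the search tool.",
            "children": [
                {"type": "Tool", "input": "search('refund policy')", "output": "30-day window"},
            ],
        },
        {"type": "Tool", "input": "format(answer)", "output": "Refunds: 30 days"},
    ],
}

scopes = {
    "scope_1": {"depth": 1, "types": {"LLM"}, "fields": ["output"]},              # direct LLM calls
    "scope_2": {"depth": None, "types": {"Tool"}, "fields": ["input", "output"]},  # tools at any level
}

gathered = gather(trace, scopes)
```

In this sketch, `gathered["scope_1"]` holds the output of the one direct LLM call, and `gathered["scope_2"]` holds input and output for both tool calls, including the one nested under the LLM step.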


Best practices

  1. Start with the current interaction fields, then add child interaction scopes only if the property genuinely needs hierarchical context
  2. Use Must sparingly - only for fields without which evaluation is truly meaningless
  3. Match your architecture - design scopes to reflect how your agent is structured (depth, interaction types)
  4. Test before saving - use the built-in Test feature to confirm that your field selection produces the scores you expect on real interactions