Data Model Cheat-Sheet

This is your Rosetta Stone for mapping any LLM pipeline to Deepchecks' data model. Whether you're building a simple Q&A bot or a complex multi-agent system, this guide will help you understand exactly how to structure your data for optimal evaluation.

Data Hierarchy Overview

The Big Picture

Deepchecks organizes your LLM evaluation data in a clear hierarchy. Think of it like organizing files on your computer:

Organization (Your Company)
└── Application (Your LLM Product - e.g., "Customer Support Bot")
    └── Environment (Evaluation | Production)
        └── Version (v1.0, v2.0, etc.)
            └── Session (A conversation or workflow)
                └── Interaction (Single input → output)
                    └── Steps (Internal pipeline component or field)
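
To make the hierarchy concrete, here is a minimal sketch of one interaction in context, written as plain Python data. The outer keys mirror the levels above an interaction for illustration only; they are conceptual labels, not literal Deepchecks field names.

# Illustrative nesting only -- the outer keys label hierarchy levels,
# not actual Deepchecks API fields.
record = {
    "application": "Customer Support Bot",
    "environment": "Evaluation",        # Evaluation | Production
    "version": "v1.0",
    "session_id": "chat_123",           # the session grouping this interaction
    "interaction": {
        "input": "How do I reset my password?",
        "output": "Click 'Forgot Password' on the login page...",
    },
}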

Real-World Analogies

| Deepchecks Level | Real-World Analogy | Example |
| --- | --- | --- |
| Application | A product or service | "HR Chatbot", "Document Summarizer" |
| Environment | Development stage | Evaluation (pre-release), Production (live) |
| Version | Software release | v1.0 (Sonnet-3.7), v2.0 (Sonnet-4.0), v3.0 (new prompts) |
| Session | A conversation or workflow | Single chat thread, document-processing batch |
| Interaction | One exchange | User asks → bot responds (chat), a tool-use LLM call |
| Steps | Internal pipeline component or field | LLM reasoning, retrieval reranking |

💡 Key Insight: Sessions group related interactions (like a chat conversation), while an Interaction represents a single exchange or LLM invocation within that session.

📖 Complete Details: For full explanations of each level, see Data Hierarchy Concepts.


Field Mapping Table

Your Pipeline → Deepchecks Fields

This table maps common LLM pipeline artifacts to the correct Deepchecks fields:

| Your Pipeline Artifact | Deepchecks Field | Example Value | Notes |
| --- | --- | --- | --- |
| User question/prompt | input | "What is the capital of France?" | The user's original request |
| Bot response | output | "The capital of France is Paris." | Your LLM's final answer |
| Retrieved documents | information_retrieval | ["Paris is the capital...", "France is a country..."] | RAG context (list of strings) |
| Chat history | history | ["Hi", "Hello! How can I help?"] | Previous conversation turns |
| Full LLM prompt | full_prompt | "You are a helpful assistant. User: What is..." | Complete prompt sent to the LLM |
| Ground-truth answer | expected_output | "Paris" | Reference answer for evaluation |
| Human rating | user_annotation | "Good" | Human judgment: Good/Bad/Unknown |
| Rating explanation | user_annotation_reason | "Accurate and concise" | Why the human rated it this way |
| Conversation ID | session_id | "chat_123" | Groups related interactions; must be unique within the version; randomly generated if not supplied |
| Unique request ID | user_interaction_id | "req_456" | Must be unique within the version; randomly generated if not supplied |
| Request start time | started_at | 1715000000 | Unix timestamp |
| Request end time | finished_at | 1715000001 | Unix timestamp; used for latency calculation |
| Request total tokens | tokens | 1005 | Total number of tokens processed during the request |
| Tool calls | action | search_knowledge_base("SELECT * FROM Capital WHERE Capital.country LIKE '%France%'") | Agent tool invocation |
| Tool response | tool_response | "[+33, Paris, France]" | Tool-call results |
| Pipeline type | interaction_type | "Q&A" | Q&A, Summarization, Generation, etc. |

📖 Complete Field Reference: For detailed descriptions of all fields, see Interaction Data Fields and Interaction Metadata Fields.
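
Putting the mapping into practice, a fully populated interaction record could look like the Python dict below. Every key comes straight from the table above; the values are illustrative, and the exact upload mechanism (SDK call or file format) is covered in the field references.

# A sketch of one complete interaction record, using only fields from the
# mapping table above. Values are illustrative.
interaction = {
    "user_interaction_id": "req_456",     # unique within the version
    "session_id": "chat_123",             # shared by all turns of one conversation
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "information_retrieval": [
        "Paris is the capital...",
        "France is a country...",
    ],
    "history": ["Hi", "Hello! How can I help?"],
    "full_prompt": "You are a helpful assistant. User: What is...",
    "expected_output": "Paris",
    "user_annotation": "Good",
    "user_annotation_reason": "Accurate and concise",
    "started_at": 1715000000,             # Unix seconds
    "finished_at": 1715000001,            # Unix seconds
    "tokens": 1005,
    "action": "search_knowledge_base(...)",
    "tool_response": "[+33, Paris, France]",
    "interaction_type": "Q&A",
}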

🎯 Evaluation Goals → Required Fields

| What You Want to Measure | Fields You Must Provide (in addition to input) |
| --- | --- |
| Hallucination/groundedness | information_retrieval & output |
| Accuracy given a ground truth | expected_output OR user_annotation |
| Response latency | started_at & finished_at |
| Agent traceability | action & tool_response |
| Multi-turn conversations | session_id & history |


Common Pipeline Patterns

1. Simple Q&A Bot

{
  "input": "How do I reset my password?",
  "output": "To reset your password, click 'Forgot Password' on the login page...",
  "interaction_type": "Q&A"
}

2. RAG (Retrieval-Augmented Generation)

{
  "input": "What are our vacation policies?",
  "information_retrieval": [
    "Employees are entitled to 15 days of vacation per year...",
    "Vacation requests must be submitted 2 weeks in advance..."
  ],
  "output": "Based on our policies, you get 15 vacation days per year...",
  "interaction_type": "Q&A"
}

3. Multi-Step Agent

{
  "input": "Book a flight to Paris for next week",
  "output": "I found several flights to Paris. The best option is...",
  "session_id": "booking_session_456",
  "action": "Search('Flights to Paris next week')",
  "tool_response": "Flight EF-456 12:00-15:00...",
  "interaction_type": "tool_use"
}

4. Multi-Turn Conversation

{
  "input": "What's the weather like?",
  "history": [
    "Hi there!",
    "Hello! How can I help you today?",
    "I'm planning a trip to London"
  ],
  "output": "I'd be happy to help with weather information! Could you specify which city you're asking about?",
  "session_id": "chat_789",
  "interaction_type": "Q&A"
}

Best Practices & Common Pitfalls

Do This

Unique IDs

  • Make sure each user_interaction_id is unique within a version. When you create two versions over the same evaluation set (the same user inputs), give identical inputs the same user_interaction_id in both versions to unlock the comparison features (see the ID sketch below).
  • Use a consistent session_id for all interactions in one conversation or workflow; session_ids must also be unique within each version.
  • If you don't provide IDs, Deepchecks will generate them
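
One way to keep IDs stable across versions is to derive the user_interaction_id deterministically from the input, so the same question always gets the same ID. This is just a sketch of one possible scheme, not a Deepchecks requirement; any IDs that are unique within a version will work.

import hashlib

def stable_interaction_id(user_input: str) -> str:
    # Same input -> same ID in every version, which enables
    # side-by-side comparison of versions on a shared evaluation set.
    digest = hashlib.sha256(user_input.encode("utf-8")).hexdigest()
    return "req_" + digest[:12]

stable_interaction_id("What is the capital of France?")  # deterministic hex ID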

Timestamps

  • Use Unix timestamps (seconds since epoch) for started_at and finished_at
  • Both timestamps are needed for latency calculation (see the snippet below)
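
For example, you can capture both timestamps around your LLM call with Python's standard library; call_llm below is a placeholder for your own invocation. Deepchecks derives latency from the two fields, so the subtraction is shown only to illustrate the relationship.

import time

def call_llm(prompt: str) -> str:
    # Placeholder for your actual LLM invocation.
    return "stub response"

started_at = int(time.time())     # Unix seconds, just before the call
response = call_llm("What is the capital of France?")
finished_at = int(time.time())    # Unix seconds, right after the response

latency_seconds = finished_at - started_at  # what the two fields enable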

Information Retrieval

  • Provide information_retrieval as a list of strings, not a single concatenated string
  • Each string should be a separate document/chunk
  • This enables proper groundedness evaluation (see the example below)
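
For instance, if your retriever returns scored chunk objects (the shape below is hypothetical), pass each chunk's text through as its own list item instead of joining them:

# Hypothetical retriever output: one object per retrieved chunk.
retrieved_chunks = [
    {"text": "Employees are entitled to 15 days of vacation per year...", "score": 0.91},
    {"text": "Vacation requests must be submitted 2 weeks in advance...", "score": 0.87},
]

# One string per document/chunk -- do NOT join them into a single string.
information_retrieval = [chunk["text"] for chunk in retrieved_chunks]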

Sessions vs Steps

  • Sessions: Group related interactions (horizontal grouping)
    • Example: Multiple Q&As in one chat conversation
  • Steps: Break down one interaction (vertical decomposition)
    • Example: Retrieval → Reranking → Generation within one Q&A interaction

Avoid These Mistakes

Missing Required Fields

// ❌ BAD: Missing both input and output
{
  "information_retrieval": ["Some context..."]
}

// ✅ GOOD: At least input OR output (usually both)
{
  "input": "What is AI?",
  "output": "AI stands for Artificial Intelligence..."
}

Wrong Information Retrieval Format

// ❌ BAD: Single concatenated string
{
  "information_retrieval": "Doc1: AI is... Doc2: Machine learning..."
}

// ✅ GOOD: List of separate documents
{
  "information_retrieval": [
    "AI is a field of computer science...",
    "Machine learning is a subset of AI..."
  ]
}

Inconsistent Session Usage

// ❌ BAD: Different session_id for same conversation
[
  {"input": "Hi", "session_id": "chat_1"},
  {"input": "What's 2+2?", "session_id": "chat_2"}  // Should be chat_1
]

// ✅ GOOD: Consistent session_id
[
  {"input": "Hi", "session_id": "chat_1"},
  {"input": "What's 2+2?", "session_id": "chat_1"}
]

Duplicate User Interaction IDs

// ❌ BAD: Same ID used twice in one version
[
  {"user_interaction_id": "q1", "input": "Question 1"},
  {"user_interaction_id": "q1", "input": "Question 2"}  // Duplicate!
]
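
A quick pre-upload check catches this mistake; the sketch below assumes your interactions are plain dicts shaped like the examples above.

from collections import Counter

def find_duplicate_ids(interactions: list[dict]) -> list[str]:
    # Return every user_interaction_id that appears more than once in the batch.
    counts = Counter(item["user_interaction_id"] for item in interactions)
    return [uid for uid, n in counts.items() if n > 1]

batch = [
    {"user_interaction_id": "q1", "input": "Question 1"},
    {"user_interaction_id": "q1", "input": "Question 2"},
]
assert find_duplicate_ids(batch) == ["q1"]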

Quick Reference Links