Data Model Cheat-Sheet

This is your Rosetta Stone for mapping any LLM pipeline to Deepchecks' data model. Whether you're building a simple Q&A bot or a complex multi-agent system, this guide will help you understand exactly how to structure your data for optimal evaluation.

Data Hierarchy Overview

The Big Picture

Deepchecks organizes your LLM evaluation data in a clear hierarchy. Think of it like organizing files on your computer:

Organization (Your Company)
└── Application (Your LLM Product - e.g., "Customer Support Bot")
    └── Environment (Evaluation | Production)
        └── Version (v1.0, v2.0, etc.)
            └── Session (A conversation or workflow)
                └── Interaction (Single input → output)
                    └── Steps (Internal pipeline component or field)
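
To make the hierarchy concrete, here is a minimal sketch of one interaction in context, written as plain Python data. The outer keys mirror the levels above an interaction for illustration only; they are conceptual labels, not literal Deepchecks field names.

# Illustrative nesting only -- the outer keys label hierarchy levels,
# not actual Deepchecks API fields.
record = {
    "application": "Customer Support Bot",
    "environment": "Evaluation",        # Evaluation | Production
    "version": "v1.0",
    "session_id": "chat_123",           # the session grouping this interaction
    "interaction": {
        "input": "How do I reset my password?",
        "output": "Click 'Forgot Password' on the login page...",
    },
}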

Real-World Analogies

| Deepchecks Level | Real-World Analogy | Example |
| --- | --- | --- |
| Application | A product or service | "HR Chatbot", "Document Summarizer" |
| Environment | Development stage | Evaluation (pre-release), Production (live) |
| Version | Software release | v1.0 (Sonnet-3.7), v2.0 (Sonnet-4.0), v3.0 (new prompts) |
| Session | A conversation or workflow | Single chat thread, document-processing batch |
| Interaction | One exchange | User asks → bot responds (chat), a tool-use LLM call |
| Steps | Internal pipeline component or field | LLM reasoning, retrieval reranking |

💡 Key Insight: Sessions group related interactions (like a chat conversation), while an Interaction represents a single exchange or LLM invocation within that session.

📖 Complete Details: For full explanations of each level, see Data Hierarchy Concepts.


Field Mapping Table

Your Pipeline → Deepchecks Fields

This table maps common LLM pipeline artifacts to the correct Deepchecks fields:

| Your Pipeline Artifact | Deepchecks Field | Example Value | Notes |
| --- | --- | --- | --- |
| User question/prompt | input | "What is the capital of France?" | The user's original request |
| Bot response | output | "The capital of France is Paris." | Your LLM's final answer |
| Retrieved documents | information_retrieval | ["Paris is the capital...", "France is a country..."] | RAG context (list of strings) |
| Chat history | history | ["Hi", "Hello! How can I help?"] | Previous conversation turns |
| Full LLM prompt | full_prompt | "You are a helpful assistant. User: What is..." | Complete prompt sent to the LLM |
| Ground-truth answer | expected_output | "Paris" | Reference answer for evaluation |
| Human rating | user_annotation | "Good" | Human judgment: Good/Bad/Unknown |
| Rating explanation | user_annotation_reason | "Accurate and concise" | Why the human rated it this way |
| Conversation ID | session_id | "chat_123" | Groups related interactions; must be unique within the version; randomly generated if not supplied |
| Unique request ID | user_interaction_id | "req_456" | Must be unique within the version; randomly generated if not supplied |
| Request start time | started_at | 1715000000 | Unix timestamp |
| Request end time | finished_at | 1715000001 | Unix timestamp; used for latency calculation |
| Request total tokens | tokens | 1005 | Total number of tokens processed during the request |
| Tool calls | action | search_knowledge_base("SELECT * FROM Capital WHERE Capital.country LIKE '%France%'") | Agent tool invocation |
| Tool response | tool_response | "[+33, Paris, France]" | Tool-call results |
| Pipeline type | interaction_type | "Q&A" | Q&A, Summarization, Generation, etc. |

📖 Complete Field Reference: For detailed descriptions of all fields, see Interaction Data Fields and Interaction Metadata Fields.
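
Putting the mapping into practice, a fully populated interaction record could look like the Python dict below. Every key comes straight from the table above; the values are illustrative, and the exact upload mechanism (SDK call or file format) is covered in the field references.

# A sketch of one complete interaction record, using only fields from the
# mapping table above. Values are illustrative.
interaction = {
    "user_interaction_id": "req_456",     # unique within the version
    "session_id": "chat_123",             # shared by all turns of one conversation
    "input": "What is the capital of France?",
    "output": "The capital of France is Paris.",
    "information_retrieval": [
        "Paris is the capital...",
        "France is a country...",
    ],
    "history": ["Hi", "Hello! How can I help?"],
    "full_prompt": "You are a helpful assistant. User: What is...",
    "expected_output": "Paris",
    "user_annotation": "Good",
    "user_annotation_reason": "Accurate and concise",
    "started_at": 1715000000,             # Unix seconds
    "finished_at": 1715000001,            # Unix seconds
    "tokens": 1005,
    "action": "search_knowledge_base(...)",
    "tool_response": "[+33, Paris, France]",
    "interaction_type": "Q&A",
}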

🎯 Evaluation Goals → Required Fields

| What You Want to Measure | Fields You Must Provide (in addition to input) |
| --- | --- |
| Hallucination/groundedness | information_retrieval & output |
| Accuracy given a ground truth | expected_output OR user_annotation |
| Response latency | started_at & finished_at |
| Agent traceability | action & tool_response |
| Multi-turn conversations | session_id & history |


Common Pipeline Patterns

1. Simple Q&A Bot

{
  "input": "How do I reset my password?",
  "output": "To reset your password, click 'Forgot Password' on the login page...",
  "interaction_type": "Q&A"
}

2. RAG (Retrieval-Augmented Generation)

{
  "input": "What are our vacation policies?",
  "information_retrieval": [
    "Employees are entitled to 15 days of vacation per year...",
    "Vacation requests must be submitted 2 weeks in advance..."
  ],
  "output": "Based on our policies, you get 15 vacation days per year...",
  "interaction_type": "Q&A"
}

3. Multi-Step Agent

{
  "input": "Book a flight to Paris for next week",
  "output": "I found several flights to Paris. The best option is...",
  "session_id": "booking_session_456",
  "action": "Search('Flights to Paris next week')",
  "tool_response": "Flight EF-456 12:00-15:00...",
  "interaction_type": "tool_use"
}

4. Multi-Turn Conversation

{
  "input": "What's the weather like?",
  "history": [
    "Hi there!",
    "Hello! How can I help you today?",
    "I'm planning a trip to London"
  ],
  "output": "I'd be happy to help with weather information! Could you specify which city you're asking about?",
  "session_id": "chat_789",
  "interaction_type": "Q&A"
}

Best Practices & Common Pitfalls

Do This

Unique IDs

  • Make sure each user_interaction_id is unique within a version. When you create two versions over the same evaluation set (the same user inputs), give identical inputs the same user_interaction_id in both versions to unlock the comparison features (see the ID sketch below).
  • Use a consistent session_id for all interactions in one conversation or workflow; session_ids must also be unique within each version.
  • If you don't provide IDs, Deepchecks will generate them
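
One way to keep IDs stable across versions is to derive the user_interaction_id deterministically from the input, so the same question always gets the same ID. This is just a sketch of one possible scheme, not a Deepchecks requirement; any IDs that are unique within a version will work.

import hashlib

def stable_interaction_id(user_input: str) -> str:
    # Same input -> same ID in every version, which enables
    # side-by-side comparison of versions on a shared evaluation set.
    digest = hashlib.sha256(user_input.encode("utf-8")).hexdigest()
    return "req_" + digest[:12]

stable_interaction_id("What is the capital of France?")  # deterministic hex ID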

Timestamps

  • Use Unix timestamps (seconds since epoch) for started_at and finished_at
  • Both timestamps are needed for latency calculation (see the snippet below)
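
For example, you can capture both timestamps around your LLM call with Python's standard library; call_llm below is a placeholder for your own invocation. Deepchecks derives latency from the two fields, so the subtraction is shown only to illustrate the relationship.

import time

def call_llm(prompt: str) -> str:
    # Placeholder for your actual LLM invocation.
    return "stub response"

started_at = int(time.time())     # Unix seconds, just before the call
response = call_llm("What is the capital of France?")
finished_at = int(time.time())    # Unix seconds, right after the response

latency_seconds = finished_at - started_at  # what the two fields enable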

Information Retrieval

  • Provide information_retrieval as a list of strings, not a single concatenated string
  • Each string should be a separate document/chunk
  • This enables proper groundedness evaluation (see the example below)
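
For instance, if your retriever returns scored chunk objects (the shape below is hypothetical), pass each chunk's text through as its own list item instead of joining them:

# Hypothetical retriever output: one object per retrieved chunk.
retrieved_chunks = [
    {"text": "Employees are entitled to 15 days of vacation per year...", "score": 0.91},
    {"text": "Vacation requests must be submitted 2 weeks in advance...", "score": 0.87},
]

# One string per document/chunk -- do NOT join them into a single string.
information_retrieval = [chunk["text"] for chunk in retrieved_chunks]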

Sessions vs Steps

  • Sessions: Group related interactions (horizontal grouping)
    • Example: Multiple Q&As in one chat conversation
  • Steps: Break down one interaction (vertical decomposition)
    • Example: Retrieval → Reranking → Generation within one Q&A interaction

Avoid These Mistakes

Missing Required Fields

// ❌ BAD: Missing both input and output
{
  "information_retrieval": ["Some context..."]
}

// ✅ GOOD: At least input OR output (usually both)
{
  "input": "What is AI?",
  "output": "AI stands for Artificial Intelligence..."
}

Wrong Information Retrieval Format

// ❌ BAD: Single concatenated string
{
  "information_retrieval": "Doc1: AI is... Doc2: Machine learning..."
}

// ✅ GOOD: List of separate documents
{
  "information_retrieval": [
    "AI is a field of computer science...",
    "Machine learning is a subset of AI..."
  ]
}

Inconsistent Session Usage

// ❌ BAD: Different session_id for same conversation
[
  {"input": "Hi", "session_id": "chat_1"},
  {"input": "What's 2+2?", "session_id": "chat_2"}  // Should be chat_1
]

// ✅ GOOD: Consistent session_id
[
  {"input": "Hi", "session_id": "chat_1"},
  {"input": "What's 2+2?", "session_id": "chat_1"}
]

Duplicate User Interaction IDs

// ❌ BAD: Same ID used twice in one version
[
  {"user_interaction_id": "q1", "input": "Question 1"},
  {"user_interaction_id": "q1", "input": "Question 2"}  // Duplicate!
]
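
A quick pre-upload check catches this mistake; the sketch below assumes your interactions are plain dicts shaped like the examples above.

from collections import Counter

def find_duplicate_ids(interactions: list[dict]) -> list[str]:
    # Return every user_interaction_id that appears more than once in the batch.
    counts = Counter(item["user_interaction_id"] for item in interactions)
    return [uid for uid, n in counts.items() if n > 1]

batch = [
    {"user_interaction_id": "q1", "input": "Question 1"},
    {"user_interaction_id": "q1", "input": "Question 2"},
]
assert find_duplicate_ids(batch) == ["q1"]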

Quick Reference Links