Data Model Cheat-Sheet
This is your Rosetta Stone for mapping any LLM pipeline to Deepchecks' data model. Whether you're building a simple Q&A bot or a complex multi-agent system, this guide will help you understand exactly how to structure your data for optimal evaluation.
Data Hierarchy Overview
The Big Picture
Deepchecks organizes your LLM evaluation data in a clear hierarchy. Think of it like organizing files on your computer:
Organization (Your Company)
└── Application (Your LLM Product - e.g., "Customer Support Bot")
└── Environment (Evaluation | Production)
└── Version (v1.0, v2.0, etc.)
└── Session (A conversation or workflow)
└── Interaction (Single input → output)
└── Steps (Internal pipeline component or field)
Real-World Analogies
Deepchecks Level | Real-World Analogy | Example |
---|---|---|
Application | A product or service | "HR Chatbot", "Document Summarizer" |
Environment | Development stage | Evaluation (pre-release), Production (live) |
Version | Software release | v1.0 (Sonnet-3.7), v2.0 (Sonnet-4.0), v3.0 (new prompts) |
Session | A conversation or workflow | Single chat thread, document processing batch |
Interaction | One exchange | User asks → Bot responds (chat), or a single tool-use LLM call |
Steps | Internal pipeline component or field | LLM Reasoning, Retrieval Reranking |
💡 Key Insight: Sessions group related interactions (like a chat conversation), while an Interaction is a single input → output exchange; Steps break down what happens within a single interaction or LLM invocation.
📖 Complete Details: For full explanations of each level, see Data Hierarchy Concepts.
Field Mapping Table
Your Pipeline → Deepchecks Fields
This table maps common LLM pipeline artifacts to the correct Deepchecks fields:
Your Pipeline Artifact | Deepchecks Field | Example Value | Notes |
---|---|---|---|
User question/prompt | input | "What is the capital of France?" | The user's original request |
Bot response | output | "The capital of France is Paris." | Your LLM's final answer |
Retrieved documents | information_retrieval | ["Paris is the capital...", "France is a country..."] | RAG context (list of strings) |
Chat history | history | ["Hi", "Hello! How can I help?"] | Previous conversation turns |
Full LLM prompt | full_prompt | "You are a helpful assistant. User: What is..." | Complete prompt sent to LLM |
Ground truth answer | expected_output | "Paris" | Reference answer for evaluation |
Human rating | user_annotation | "Good" | Human judgment: Good/Bad/Unknown |
Rating explanation | user_annotation_reason | "Accurate and concise" | Why the human rated it this way |
Conversation ID | session_id | "chat_123" | Groups related interactions; each session ID must be unique within a version and is generated automatically if not supplied |
Unique request ID | user_interaction_id | "req_456" | Must be unique within a version; generated automatically if not supplied |
Request start time | started_at | 1715000000 | Unix timestamp |
Request end time | finished_at | 1715000001 | Unix timestamp; For latency calculation |
Request total tokens | tokens | 1005 | Total number of tokens processed during the request |
Tool calls | action | "search_knowledge_base(\"SELECT * FROM Capital WHERE Capital.country LIKE '%France%'\")" | Agent tool invocation |
Tool responses | tool_response | "[+33, Paris, France]" | Tool call results |
Pipeline type | interaction_type | "Q&A" | Q&A, Summarization, Generation, etc. |
📖 Complete Field Reference: For detailed descriptions of all fields, see Interaction Data Fields and Interaction Metadata Fields.
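Putting it together, a single logged interaction that combines many of these fields might look like the sketch below. All values are illustrative and reuse the examples from the table; only a subset of fields is needed for any given interaction type, and agent fields such as action and tool_response would be added only for tool-use interactions.
{
  "user_interaction_id": "req_456",
  "session_id": "chat_123",
  "input": "What is the capital of France?",
  "full_prompt": "You are a helpful assistant. User: What is the capital of France?",
  "history": ["Hi", "Hello! How can I help?"],
  "information_retrieval": ["Paris is the capital...", "France is a country..."],
  "output": "The capital of France is Paris.",
  "expected_output": "Paris",
  "user_annotation": "Good",
  "user_annotation_reason": "Accurate and concise",
  "started_at": 1715000000,
  "finished_at": 1715000001,
  "tokens": 1005,
  "interaction_type": "Q&A"
}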
🎯 Evaluation Goals → Required Fields
What You Want to Measure | Fields You Must Provide (in addition to Input) |
---|---|
Hallucination/Groundedness | information_retrieval & output |
Accuracy Given a Ground Truth | expected_output OR user_annotation |
Response Latency | started_at & finished_at |
Agent Traceability | action & tool_response |
Multi-turn Conversations | session_id & history |
Common Pipeline Patterns
1. Simple Q&A Bot
{
"input": "How do I reset my password?",
"output": "To reset your password, click 'Forgot Password' on the login page...",
"interaction_type": "Q&A"
}
2. RAG (Retrieval-Augmented Generation)
{
"input": "What are our vacation policies?",
"information_retrieval": [
"Employees are entitled to 15 days of vacation per year...",
"Vacation requests must be submitted 2 weeks in advance..."
],
"output": "Based on our policies, you get 15 vacation days per year...",
"interaction_type": "Q&A"
}
3. Multi-Step Agent
{
"input": "Book a flight to Paris for next week",
"output": "I found several flights to Paris. The best option is...",
"session_id": "booking_session_456",
"action": "Search('Flights to Paris next week')",
"tool_response": "Flight EF-456 12:00-15:00...",
"interaction_type": "tool_use"
}
4. Multi-Turn Conversation
{
"input": "What's the weather like?",
"history": [
"Hi there!",
"Hello! How can I help you today?",
"I'm planning a trip to London"
],
"output": "I'd be happy to help with weather information! Could you specify which city you're asking about?",
"session_id": "chat_789",
"interaction_type": "Q&A"
}
Best Practices & Common Pitfalls
✅ Do This
Unique IDs
- Make `user_interaction_id`s unique within each version. When creating two versions with the same evaluation set (user inputs), giving the same inputs the same `user_interaction_id` lets you use version-comparison features (see the example after this list)
- Use consistent `session_id`s for related interactions; each session ID should be unique within a version
- If you don't provide IDs, Deepchecks will generate them
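For example, when the same evaluation question is logged to two versions with the same user_interaction_id, the versions can be compared question by question (IDs and values below are illustrative):
// Version 1.0 upload
{"user_interaction_id": "pw-reset-001", "input": "How do I reset my password?", "output": "Click 'Forgot Password' on the login page..."}
// Version 2.0 upload: same input, same user_interaction_id
{"user_interaction_id": "pw-reset-001", "input": "How do I reset my password?", "output": "Open Settings > Security and choose 'Reset password'..."}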
Timestamps
- Use Unix timestamps (seconds since epoch) for `started_at` and `finished_at`
- Both timestamps are needed for latency calculation (see the example after this list)
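For example, an interaction with the illustrative timestamps below would be measured with a latency of 2 seconds:
{
  "input": "What is the capital of France?",
  "output": "The capital of France is Paris.",
  "started_at": 1715000000,
  "finished_at": 1715000002  // latency = finished_at - started_at = 2 seconds
}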
Information Retrieval
- Provide `information_retrieval` as a list of strings, not a single concatenated string
- Each string should be a separate document/chunk
- This enables proper groundedness evaluation
Sessions vs Steps
- Sessions: Group related interactions (horizontal grouping)
  - Example: Multiple Q&As in one chat conversation
- Steps: Break down one interaction (vertical decomposition)
  - Example: Retrieval → Reranking → Generation within one Q&A interaction
❌ Avoid These Mistakes
Missing Required Fields
// ❌ BAD: Missing both input and output
{
"information_retrieval": ["Some context..."]
}
// ✅ GOOD: At least input OR output (usually both)
{
"input": "What is AI?",
"output": "AI stands for Artificial Intelligence..."
}
Wrong Information Retrieval Format
// ❌ BAD: Single concatenated string
{
"information_retrieval": "Doc1: AI is... Doc2: Machine learning..."
}
// ✅ GOOD: List of separate documents
{
"information_retrieval": [
"AI is a field of computer science...",
"Machine learning is a subset of AI..."
]
}
Inconsistent Session Usage
// ❌ BAD: Different session_id for same conversation
[
{"input": "Hi", "session_id": "chat_1"},
{"input": "What's 2+2?", "session_id": "chat_2"} // Should be chat_1
]
// ✅ GOOD: Consistent session_id
[
{"input": "Hi", "session_id": "chat_1"},
{"input": "What's 2+2?", "session_id": "chat_1"}
]
Duplicate User Interaction IDs
// ❌ BAD: Same ID used twice in one version
[
{"user_interaction_id": "q1", "input": "Question 1"},
{"user_interaction_id": "q1", "input": "Question 2"} // Duplicate!
]
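A corrected version of the same upload might look like this (IDs are illustrative):
// ✅ GOOD: Each interaction gets its own ID within the version
[
  {"user_interaction_id": "q1", "input": "Question 1"},
  {"user_interaction_id": "q2", "input": "Question 2"}
]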
Quick Reference Links
- Data Structure Hierarchy — Complete field reference and concepts
- Properties Guide — Available evaluation metrics
- SDK Data Upload — Code examples for logging data
- Supported Use Cases — Field requirements by interaction type
- Integration Step-by-Step — Complete integration walkthrough