Multi-Turn Agent Demo: Google ADK Academic Researcher
Evaluating a multi-turn, multi-agent application built on Google ADK - from auto-instrumentation to component-level scoring, multi-turn session quality, system metrics, and root cause analysis.
The Use Case
This demo walks through evaluating and debugging a multi-agent, multi-turn research assistant built on Google ADK. The application analyzes seminal academic papers, retrieves recent citing literature from the web, and proposes future research directions - and it does so across conversations that can span several back-and-forth turns with the user.
The purpose of this demo is to show how Deepchecks makes an agentic system observable and debuggable at every level: per span, per agent, per session (multi-turn), and per cost/latency dimension. For the backstory behind this specific example, see the blog post: Your AI Agent Is Failing - You Just Can't See Where.
The Workflow
The app is based on the Google ADK Academic Research sample. It uses a coordinator / sub-agent pattern - a common structure for multi-agent systems:
- Academic Coordinator - the root agent. Greets the user, asks for the seminal paper, and orchestrates the workflow across multiple turns. It delegates work to two specialized sub-agents.
- Academic Web Search Agent - searches the web (via a Serper API tool) for recent papers citing the seminal work.
- Academic New Research Agent - synthesizes the seminal paper and the citing papers into a set of suggested future research directions.
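The coordinator / sub-agent structure can be sketched in plain Python. This is a schematic only - the agent roles mirror the list above, but the `Agent` class, the search stub, and the synthesis stub are hypothetical stand-ins, not Google ADK's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """Minimal stand-in for an agent: a name plus a handler function."""
    name: str
    handle: Callable

def web_search(paper: str) -> list[str]:
    # Stand-in for the Serper-backed web search tool.
    return [f"Paper citing {paper} #{i}" for i in range(1, 4)]

def suggest_directions(paper: str, citing: list[str]) -> list[str]:
    # Stand-in for LLM synthesis of future research directions.
    return [f"Extend {paper} with ideas from: {c}" for c in citing]

websearch_agent = Agent("academic_websearch", web_search)
newresearch_agent = Agent("academic_newresearch",
                          lambda args: suggest_directions(*args))

def academic_coordinator(paper: str) -> list[str]:
    """Root agent: orchestrates by delegating to the two sub-agents."""
    citing = websearch_agent.handle(paper)
    return newresearch_agent.handle((paper, citing))

directions = academic_coordinator("BERT")
```

The key property of the pattern is that the coordinator holds no domain logic itself - it only routes work, which is exactly why per-component observability matters when something goes wrong.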
Each user session may run as a single-turn request ("Analyze BERT and find recent citing work") or a multi-turn conversation where the user gradually clarifies what they want, adds follow-up questions, or asks for deeper exploration of a specific direction.
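Concretely, a session is just an ordered list of turns under one session id; a multi-turn conversation accumulates more user/agent exchanges before the goal is met. The dict shape below is illustrative, not a Deepchecks or ADK schema:

```python
# Single-turn: the request and answer fit in one exchange.
single_turn = {
    "session_id": "s-001",
    "turns": [
        {"role": "user", "content": "Analyze BERT and find recent citing work"},
        {"role": "agent", "content": "Here are recent papers citing BERT ..."},
    ],
}

# Multi-turn: the user clarifies and follows up across several exchanges.
multi_turn = {
    "session_id": "s-002",
    "turns": [
        {"role": "user", "content": "I want to explore follow-ups to a paper"},
        {"role": "agent", "content": "Which seminal paper should I analyze?"},
        {"role": "user", "content": "Attention Is All You Need"},
        {"role": "agent", "content": "Found citing papers; want research directions?"},
        {"role": "user", "content": "Yes, focus on efficiency"},
        {"role": "agent", "content": "Suggested directions: ..."},
    ],
}

def n_user_turns(session: dict) -> int:
    """Count how many times the user spoke in a session."""
    return sum(t["role"] == "user" for t in session["turns"])
```

Session-level evaluation (covered below) scores the whole `turns` list at once, rather than each exchange in isolation.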
Why This Is Hard to Debug Without the Right Tools
A production agent like this can run 10-50+ steps in a single session: LLM calls, tool invocations, sub-agent delegations, and the glue logic between them. The classic "input went in, output came out" view is useless - you cannot tell which step failed, which agent made the wrong decision, or whether the issue was with the tool or the agent calling it.
And multi-step failure compounds: a 20-step workflow where every step is individually 95% correct still only succeeds end-to-end about 36% of the time (0.95^20 ≈ 0.36). To improve an agent, you have to pinpoint which step is the weak link.
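The compounding arithmetic is easy to verify, assuming step failures are independent:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, given independent per-step accuracy."""
    return per_step ** steps

p = end_to_end_success(0.95, 20)  # ~0.358, i.e. roughly one in three sessions
```

The same formula shows why fixing the weakest step pays off disproportionately: raising per-step accuracy from 0.95 to 0.99 lifts end-to-end success from ~36% to ~82%.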
What This Demo Covers
This demo shows how Deepchecks addresses exactly that, across five angles:
- Logging the Data - auto-instrument Google ADK with a few lines of code so every agent, LLM call, and tool invocation is captured automatically.
- Analyze Multi-Agent Performance - break down quality by interaction type (Root / Agent / LLM / Tool) to see which component of the system is dragging the overall score down.
- Evaluate Multi-Turn Sessions - score entire conversations, not just individual spans, using session-level properties that judge whether the user's intent was fulfilled across all turns.
- Observability and System Metrics - monitor latency, tokens, cost, and operational failures (stuck runs, zero-token calls) across the agent.
- Root Cause Analysis - drill from a failing component into concrete failure categories, clustered examples, and the underlying pattern to fix.
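To make the first two angles concrete: auto-instrumentation works by wrapping each component call in a span that records its type, latency, and I/O, which is what later lets you slice quality by Root / Agent / LLM / Tool. The decorator below is a hypothetical illustration of that mechanism - it is not the Deepchecks SDK, whose actual setup calls live in its documentation:

```python
import functools
import time

SPANS = []  # one recorded span per agent, LLM, or tool invocation

def traced(kind: str):
    """Hypothetical auto-instrumentation: capture a span around each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            SPANS.append({
                "name": fn.__name__,
                "kind": kind,  # "Root" / "Agent" / "LLM" / "Tool"
                "latency_s": time.perf_counter() - start,
                "input": args,
                "output": out,
            })
            return out
        return inner
    return wrap

@traced("Tool")
def serper_search(query: str) -> list[str]:
    return [f"result for {query}"]

@traced("Agent")
def websearch_agent(paper: str) -> list[str]:
    return serper_search(f"papers citing {paper}")

websearch_agent("BERT")  # records a Tool span, then an Agent span
```

Because every span carries its `kind`, aggregating scores per interaction type - and drilling from a weak component down to its concrete failing examples - becomes a simple group-by over the captured data.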