Multi-Turn Agent Demo: Google ADK Academic Researcher
Evaluating a multi-turn, multi-agent application built on Google ADK - from auto-instrumentation to component-level scoring, multi-turn session quality, system metrics, and root cause analysis.
The Use Case
This demo walks through evaluating and debugging a multi-agent, multi-turn research assistant built on Google ADK. The application analyzes seminal academic papers, retrieves recent citing literature from the web, and proposes future research directions - and it does so across conversations that can span several back-and-forth turns with the user.
The purpose of this demo is to show how Deepchecks makes an agentic system observable and debuggable at every level: per span, per agent, per session (multi-turn), and per cost/latency dimension. For the backstory behind this specific example, see the blog post: Your AI Agent Is Failing - You Just Can't See Where.
The Workflow
The app is based on the Google ADK Academic Research sample. It uses a coordinator / sub-agent pattern - a common structure for multi-agent systems:
- Academic Coordinator - the root agent. Greets the user, asks for the seminal paper, and orchestrates the workflow across multiple turns. It delegates work to two specialized sub-agents.
- Academic Web Search Agent - searches the web (via a Serper API tool) for recent papers citing the seminal work.
- Academic New Research Agent - synthesizes the seminal paper and the citing papers into a set of suggested future research directions.
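The coordinator / sub-agent structure can be sketched in plain Python. This is a schematic only - the agent roles mirror the list above, but the `Agent` class, the search stub, and the synthesis stub are hypothetical stand-ins, not Google ADK's actual API:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Agent:
    """Minimal stand-in for an agent: a name plus a handler function."""
    name: str
    handle: Callable

def web_search(paper: str) -> list[str]:
    # Stand-in for the Serper-backed web search tool.
    return [f"Paper citing {paper} #{i}" for i in range(1, 4)]

def suggest_directions(paper: str, citing: list[str]) -> list[str]:
    # Stand-in for LLM synthesis of future research directions.
    return [f"Extend {paper} with ideas from: {c}" for c in citing]

websearch_agent = Agent("academic_websearch", web_search)
newresearch_agent = Agent("academic_newresearch",
                          lambda args: suggest_directions(*args))

def academic_coordinator(paper: str) -> list[str]:
    """Root agent: orchestrates by delegating to the two sub-agents."""
    citing = websearch_agent.handle(paper)
    return newresearch_agent.handle((paper, citing))

directions = academic_coordinator("BERT")
```

The key property of the pattern is that the coordinator holds no domain logic itself - it only routes work, which is exactly why per-component observability matters when something goes wrong.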
Each user session may run as a single-turn request ("Analyze BERT and find recent citing work") or a multi-turn conversation where the user gradually clarifies what they want, adds follow-up questions, or asks for deeper exploration of a specific direction.
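Concretely, a session is just an ordered list of turns under one session id; a multi-turn conversation accumulates more user/agent exchanges before the goal is met. The dict shape below is illustrative, not a Deepchecks or ADK schema:

```python
# Single-turn: the request and answer fit in one exchange.
single_turn = {
    "session_id": "s-001",
    "turns": [
        {"role": "user", "content": "Analyze BERT and find recent citing work"},
        {"role": "agent", "content": "Here are recent papers citing BERT ..."},
    ],
}

# Multi-turn: the user clarifies and follows up across several exchanges.
multi_turn = {
    "session_id": "s-002",
    "turns": [
        {"role": "user", "content": "I want to explore follow-ups to a paper"},
        {"role": "agent", "content": "Which seminal paper should I analyze?"},
        {"role": "user", "content": "Attention Is All You Need"},
        {"role": "agent", "content": "Found citing papers; want research directions?"},
        {"role": "user", "content": "Yes, focus on efficiency"},
        {"role": "agent", "content": "Suggested directions: ..."},
    ],
}

def n_user_turns(session: dict) -> int:
    """Count how many times the user spoke in a session."""
    return sum(t["role"] == "user" for t in session["turns"])
```

Session-level evaluation (covered below) scores the whole `turns` list at once, rather than each exchange in isolation.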
Why This Is Hard to Debug Without the Right Tools
A production agent like this can run 10-50+ steps in a single session: LLM calls, tool invocations, sub-agent delegations, and the glue logic between them. The classic "input went in, output came out" view is useless - you cannot tell which step failed, which agent made the wrong decision, or whether the issue was with the tool or the agent calling it.
And multi-step failure compounds: a 20-step workflow where every step is individually 95% correct still only succeeds end-to-end about 36% of the time (0.95^20 ≈ 0.36). To improve an agent, you have to pinpoint which step is the weak link.
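The compounding arithmetic is easy to verify, assuming step failures are independent:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that every step succeeds, given independent per-step accuracy."""
    return per_step ** steps

p = end_to_end_success(0.95, 20)  # ~0.358, i.e. roughly one in three sessions
```

The same formula shows why fixing the weakest step pays off disproportionately: raising per-step accuracy from 0.95 to 0.99 lifts end-to-end success from ~36% to ~82%.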
What This Demo Covers
This demo shows how Deepchecks addresses exactly that, across five angles:
- Logging the Data - auto-instrument Google ADK with a few lines of code so every agent, LLM call, and tool invocation is captured automatically.
- Analyze Multi-Agent Performance - break down quality by interaction type (Root / Agent / LLM / Tool) to see which component of the system is dragging the overall score down.
- Evaluate Multi-Turn Sessions - score entire conversations, not just individual spans, using session-level properties that judge whether the user's intent was fulfilled across all turns.
- Observability and System Metrics - monitor latency, tokens, cost, and operational failures (stuck runs, zero-token calls) across the agent.
- Root Cause Analysis - drill from a failing component into concrete failure categories, clustered examples, and the underlying pattern to fix.
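To make the first two angles concrete: auto-instrumentation works by wrapping each component call in a span that records its type, latency, and I/O, which is what later lets you slice quality by Root / Agent / LLM / Tool. The decorator below is a hypothetical illustration of that mechanism - it is not the Deepchecks SDK, whose actual setup calls live in its documentation:

```python
import functools
import time

SPANS = []  # one recorded span per agent, LLM, or tool invocation

def traced(kind: str):
    """Hypothetical auto-instrumentation: capture a span around each call."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            SPANS.append({
                "name": fn.__name__,
                "kind": kind,  # "Root" / "Agent" / "LLM" / "Tool"
                "latency_s": time.perf_counter() - start,
                "input": args,
                "output": out,
            })
            return out
        return inner
    return wrap

@traced("Tool")
def serper_search(query: str) -> list[str]:
    return [f"result for {query}"]

@traced("Agent")
def websearch_agent(paper: str) -> list[str]:
    return serper_search(f"papers citing {paper}")

websearch_agent("BERT")  # records a Tool span, then an Agent span
```

Because every span carries its `kind`, aggregating scores per interaction type - and drilling from a weak component down to its concrete failing examples - becomes a simple group-by over the captured data.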