End-to-End RCA Examples
This page offers end-to-end examples that illustrate the complete workflows described in the previous two guides: first, identifying failures in an application, and then analyzing those failures.
1. Customer-Support Chatbot (Chat Task)
Suppose your application is a chatbot that can access external tools to retrieve relevant information. The evaluation dataset contains 20 multi-turn conversations.
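To make the setup concrete, a single conversation in such a dataset might look roughly like the sketch below. The field names are illustrative only and are not the actual Deepchecks upload schema; the exact format depends on how your data is ingested.

```python
# Illustrative shape of one multi-turn conversation in the evaluation set.
# Field names are hypothetical, not the actual Deepchecks upload schema.
conversation = {
    "user_interaction_id": "conv-0001",
    "interaction_type": "chat",  # this dataset also contains "tool_use" interactions
    "turns": [
        {"role": "user", "content": "List the refund steps as a markdown table."},
        {"role": "assistant", "content": "| Step | Action |\n|---|---|\n| 1 | Open the order page |"},
        {"role": "user", "content": "How long does each step take?"},
        # Format drift: the second reply abandons the requested markdown table.
        {"role": "assistant", "content": "The first step usually takes about a minute, then..."},
    ],
}
```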
After uploading these conversations to Deepchecks:
- The Overview page reveals two types of interactions: Chat and Tool Use.
- Tool Use interactions have a high pass rate (95%), while Chat interactions have a much lower pass rate (48%). You decide to focus on improving the Chat component.
- In the Score Breakdown for the Chat interaction type, Intent Fulfillment emerges as the primary reason for failures.
- Opening the High-Level Summary for Intent Fulfillment, the agent identifies a dominant failure pattern: “Format drift after first turn” – in 92% of failed examples, the assistant follows the user’s requested output format (e.g., markdown) in its first response but switches to free-form text in subsequent turns. The summary provides links to representative User Interaction IDs.
- By searching these IDs and reviewing the corresponding interactions on the Data page, you confirm the diagnosis: the assistant fails to maintain the specified format after the initial response.
- Mitigation: You modify the system prompt to remind the assistant to follow the user’s formatting requests across all turns (a sketch of one possible revision follows this list).
- After regenerating evaluation conversations and uploading them again, Intent Fulfillment failures drop to 6%, and the overall Chat pass rate rises to 93%.
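One possible phrasing of that reminder is sketched below. The exact wording of the production system prompt is up to you, so treat this as an illustration rather than the prompt used in this example.

```python
# Illustrative revision of the chatbot's system prompt; the wording is an assumption.
SYSTEM_PROMPT = (
    "You are a customer-support assistant with access to external tools.\n"
    "If the user asks for a specific output format (e.g., markdown, a table, JSON), "
    "keep using that format in every subsequent reply, not only in your first "
    "response, until the user explicitly asks for a different format."
)

def build_messages(history: list[dict]) -> list[dict]:
    """Prepend the revised system prompt to the prior turns before regenerating."""
    return [{"role": "system", "content": SYSTEM_PROMPT}, *history]
```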
2. Biomedical Literature Summarizer (Summarization Task)
Imagine you have an application that generates 200-word plain-English summaries of newly published biology papers.
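Generating the outputs for the evaluation dataset could look roughly like the following. The OpenAI client is used only as a stand-in for whichever model your application actually calls, and the model name and prompt wording are assumptions.

```python
from openai import OpenAI  # stand-in client; substitute your own model provider

client = OpenAI()

BASELINE_PROMPT = (
    "You are a science communicator. Summarize the following biology paper "
    "in plain English, in about 200 words."
)

def summarize(paper_text: str) -> str:
    # Single-turn call: one paper in, one plain-English summary out.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # model name is an assumption
        messages=[
            {"role": "system", "content": BASELINE_PROMPT},
            {"role": "user", "content": paper_text},
        ],
    )
    return response.choices[0].message.content
```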
After generating and uploading outputs for your evaluation dataset to Deepchecks:
- The Overview page shows a session pass rate of 64%. Since this application involves only a single turn, there is just one interaction type to analyze: Summarization.
- In the Score Breakdown for Summarization, the main failing property is Grounded in Context.
- The High-Level Summary for Grounded in Context shows the agent has clustered hallucinated segments into two domain-specific groups:
- Protein/Enzyme names that do not appear in the source (e.g., “KRAS kinase phosphorylation”, “BRAF-X variant”)
- Clinical-trial phase references missing from the article (e.g., “Phase-III pivotal study”)
  Each cluster includes several linked examples for review.
- Based on these patterns, you hypothesize that hallucinations increase when the model encounters unfamiliar biomedical terminology.
- You update the system prompt: “Quote scientific terms exactly as they appear in the source text. If a term is unknown, do not guess—return ‘NOT FOUND.’” (A sketch of the revised prompt follows this list.)
- After regenerating and re-uploading the summaries, the High-Level Summary indicates only one stray term in the “Protein/Enzyme names” cluster and none in the “Clinical-trial phase” cluster. Grounded in Context failure rates drop from 36% to 9%.
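Applied to the summarizer sketched above, the mitigation is just the grounding instruction appended to the system prompt; everything else stays the same.

```python
# Revised system prompt for the summarizer above; only the grounding instruction is new.
GROUNDED_PROMPT = (
    "You are a science communicator. Summarize the following biology paper "
    "in plain English, in about 200 words.\n"
    "Quote scientific terms exactly as they appear in the source text. "
    "If a term is unknown, do not guess—return 'NOT FOUND'."
)
```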