End-to-End RCA Examples

This page offers end-to-end examples that illustrate the complete workflows described in the previous two guides: first, identifying failures in an application, and then analyzing those failures.

1. Customer-Support Chatbot (Chat Task)

Suppose your application is a chatbot that can access external tools to retrieve relevant information. The evaluation dataset contains 20 multi-turn conversations.
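Before uploading, each conversation can be represented as a structured record of turns. The layout below is a minimal sketch assuming a hypothetical JSONL format; the exact schema and upload mechanism Deepchecks expects may differ, so treat the field names as placeholders.

```python
import json

# Hypothetical layout for one multi-turn conversation in the evaluation set.
# Field names and roles are illustrative placeholders, not the exact Deepchecks schema.
conversation = {
    "conversation_id": "conv-001",
    "turns": [
        {"role": "user", "content": "List the refund steps as a markdown table."},
        {"role": "assistant", "content": "| Step | Action |\n|------|--------|\n| 1 | Open the Orders page |"},
        {"role": "tool", "name": "order_lookup", "content": "{\"order_id\": \"A-1234\", \"status\": \"delivered\"}"},
        {"role": "user", "content": "What about exchanges?"},
        {"role": "assistant", "content": "For exchanges, contact support within 30 days of delivery..."},
    ],
}

# Serialize each of the 20 conversations (one per line) into a JSONL file for upload.
with open("eval_conversations.jsonl", "w") as f:
    f.write(json.dumps(conversation) + "\n")
```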

After uploading these conversations to Deepchecks:

  1. The Overview page reveals two types of interactions: Chat and Tool Use.
  2. Tool Use interactions have a high pass rate (95%), while Chat interactions have a much lower pass rate (48%). You decide to focus on improving the Chat component.
  3. In the Score Breakdown for the Chat interaction type, Intent Fulfillment emerges as the primary reason for failures.
  4. When you open the High-Level Summary for Intent Fulfillment, you see that the agent has identified a dominant failure pattern:
    “Format drift after first turn” – in 92% of failed examples, the assistant follows the user’s requested output format (e.g., markdown) in its first response but switches to free-form text in subsequent turns. The summary provides links to representative User Interaction IDs.
  5. By searching these IDs and reviewing the corresponding interactions on the Data page, you confirm the diagnosis: the assistant fails to maintain the specified format after the initial response.
  6. Mitigation: You modify the system prompt to remind the assistant to follow the user’s formatting requests across all turns (see the prompt sketch after this list).
  7. After regenerating evaluation conversations and uploading them again, Intent Fulfillment failures drop to 6%, and the overall Chat pass rate rises to 93%.
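A minimal sketch of the mitigation from step 6, assuming the application sends a system prompt with every turn; the wording of the reminder and the helper name are assumptions, not a prescribed fix:

```python
BASE_SYSTEM_PROMPT = "You are a customer-support assistant with access to retrieval tools."

# Assumed wording for the added reminder: keep the user's requested format on every turn.
FORMAT_REMINDER = (
    "If the user requests a specific output format (e.g., markdown, a table, or JSON), "
    "keep using that format in every subsequent response until the user asks otherwise."
)

def build_messages(history: list[dict]) -> list[dict]:
    """Prepend the updated system prompt to the running chat history before each model call."""
    system = {"role": "system", "content": f"{BASE_SYSTEM_PROMPT}\n\n{FORMAT_REMINDER}"}
    return [system] + history
```

Because the reminder travels with the system prompt on every turn, the requested format is reinforced throughout the conversation rather than only in the first response.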

2. Biomedical Literature Summarizer (Summarization Task)

Imagine you have an application that generates 200-word plain-English summaries of newly published biology papers.
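As a rough sketch of how such a summarizer might be prompted (the wording and message structure are assumptions for illustration; the model call itself is left to whichever LLM client you use):

```python
SYSTEM_PROMPT = (
    "You summarize newly published biology papers for a general audience. "
    "Write plain English, avoid unexplained jargon, and keep the summary to roughly 200 words."
)

def build_summary_request(paper_text: str) -> list[dict]:
    """Build the chat messages for one paper; pass the result to your LLM client of choice."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Summarize the following paper:\n\n{paper_text}"},
    ]
```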

After generating and uploading outputs for your evaluation dataset to Deepchecks:

  1. The Overview page shows a session pass rate of 64%. Since this application involves only a single turn, there is just one interaction type to analyze: Summarization.
  2. In the Score Breakdown for Summarization, the main failing property is Grounded in Context.
  3. The High-Level Summary for Grounded in Context shows the agent has clustered hallucinated segments into two domain-specific groups:
    • Protein/Enzyme names that do not appear in the source (e.g., “KRAS kinase phosphorylation”, “BRAF-X variant”)
    • Clinical-trial phase references missing from the article (e.g., “Phase-III pivotal study”)
      Each cluster includes several linked examples for review.
  4. Based on these patterns, you hypothesize that hallucinations increase when the model encounters unfamiliar biomedical terminology.
  5. You update the system prompt: “Quote scientific terms exactly as they appear in the source text. If a term is unknown, do not guess—return ‘NOT FOUND.’” (A concrete sketch of this change follows the list.)
  6. After regenerating and re-uploading the summaries, the High-Level Summary indicates only one stray term in the “Protein/Enzyme names” cluster and none in the “Clinical-trial phase” cluster. The Grounded in Context failure rate drops from 36% to 9%.
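Reusing the build_summary_request sketch from the start of this example, the change from step 5 only touches the system prompt. The quoted grounding rule comes from that step; the surrounding wording remains an assumption.

```python
# Grounding rule quoted from step 5, appended to the assumed base system prompt.
GROUNDING_RULE = (
    "Quote scientific terms exactly as they appear in the source text. "
    "If a term is unknown, do not guess—return 'NOT FOUND.'"
)

UPDATED_SYSTEM_PROMPT = (
    "You summarize newly published biology papers for a general audience. "
    "Write plain English, avoid unexplained jargon, and keep the summary to roughly 200 words.\n\n"
    + GROUNDING_RULE
)
```

Keeping the rule as a separate constant makes it easy to iterate on the grounding instruction between evaluation rounds without touching the rest of the prompt.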