
GVHD Use Case: Q&A Example

Evaluating and debugging a Q&A application, step by step

Use Case Background

The data in this tutorial originates from a classic Retrieval-Augmented Generation (RAG) bot that answers questions about GVHD (graft-versus-host disease).
We're evaluating a GPT-3.5-based app that uses FAISS for retrieving the embedding vectors; the two versions differ in their retrieval strategies and temperatures.
The knowledge base is built from a collection of online resources about the condition.
The two datasets used for this example can be downloaded from this link.

Structure of this Example

  1. Creating your first application, and uploading a baseline version and a new version to evaluate.
  2. Exploring a few flows for evaluation using the system:
    1. Similarity-based Comparison and Annotation
    2. Identifying Problems Using the Properties and Estimated Annotations
    3. Creating Custom LLM-Based Properties

Note: if you already have the data for the two versions uploaded and just want to see the value in the system, jump straight to the "Exploring the flows" sections.

Create Your Application and Upload the Data to the System

For our scenario, let's create the application through the UI by filling in the Application Name field as "GVHD" and the Version Name as "baseline".

Option 1: Upload the Data with Deepchecks' UI

  1. Upload the baseline version csv (baseline_data.csv) to the Evaluation environment.
  2. Add a new version, and upload the second csv (v2_new_ir_data.csv) to the Evaluation environment of the new version.
  3. You are all set! You can now check out your data in the Deepchecks Application!
    In the "Applications" page, you should now see the "GVHD" App.

🗒️

Note: Data Processing status

Some properties take a few minutes to calculate, so some of the data, such as properties and estimated annotations, will be updated over time. You'll see a "✅ Completed" Processing Status in the Applications page when processing is finished. In addition, you can subscribe to notifications in the "Workspace Settings" to get notified by email when processing completes.

Option 2: Upload the Data with Deepchecks' SDK

Set up the Deepchecks client to use the Python SDK. As a best practice, it's recommended to do so in a dedicated Python virtual environment.
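
For example, creating and activating a virtual environment on a Unix-like shell (a minimal sketch; the environment name deepchecks-env is arbitrary):

python -m venv deepchecks-env
source deepchecks-env/bin/activate

Then install the SDK: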

pip install deepchecks-llm-client

In your Python IDE / Jupyter environment, set up the relevant configurations:

# Retrieve your API key from the Deepchecks UI and place it here:
DC_API_KEY = "insert-your-token-here"

# Choose a name for the application:
APP_NAME = "GVHD-demo"
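
Alternatively, you can avoid hardcoding the token by reading it from an environment variable. A minimal sketch (the variable name DEEPCHECKS_API_KEY is just an example for this tutorial, not something the SDK requires):

import os

# Read the API key from an environment variable instead of hardcoding it.
# DEEPCHECKS_API_KEY is an arbitrary name chosen for this example.
DC_API_KEY = os.environ["DEEPCHECKS_API_KEY"]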

Initialize the Deepchecks Client and Upload the Data

from deepchecks_llm_client.client import DeepchecksLLMClient
from deepchecks_llm_client.data_types import ApplicationType

# Create the client once and reuse it for all uploads:
dc_client = DeepchecksLLMClient(
    api_token=DC_API_KEY
)

# Create a Q&A application (ApplicationType.QA matches our question-answering use case):
dc_client.create_application(APP_NAME, ApplicationType.QA)

Upload the baseline version

Download the dataset from here

import pandas as pd

from deepchecks_llm_client.data_types import (EnvType, AnnotationType, LogInteractionType)

df = pd.read_csv("baseline_data.csv")

dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name="baseline",
    env_type=EnvType.EVAL,
    interactions=[
        LogInteractionType(
            input=row["input"],
            information_retrieval=row["information_retrieval"],
            output=row["output"],
            # Map the csv's 'good'/'bad' labels to annotation types;
            # any other value is left unannotated (None).
            annotation=(
                AnnotationType.BAD if row["annotation"] == "bad"
                else AnnotationType.GOOD if row["annotation"] == "good"
                else None
            ),
            user_interaction_id=row["user_interaction_id"]
        ) for _, row in df.iterrows()
    ]
)

Once we have a new version, we'll want to test it on our golden set. To do that, we can use the get_data function to retrieve the golden set inputs, and then run them through our new version's pipeline.
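
A minimal sketch of that flow, assuming the baseline version serves as the golden set (the exact get_data signature may vary between SDK versions, and run_my_pipeline is a hypothetical stand-in for your own RAG pipeline):

# Hypothetical sketch: fetch the baseline (golden set) interactions,
# then re-run their inputs through the new pipeline.
golden_df = dc_client.get_data(
    app_name=APP_NAME,
    version_name="baseline",
    env_type=EnvType.EVAL,
)

# run_my_pipeline is a placeholder for your own RAG pipeline.
new_outputs = [run_my_pipeline(question) for question in golden_df["input"]]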

Upload a new version to evaluate

After we have defined our baseline version, we'll want to upload a version for evaluation.

We've changed the information-retrieval parameters in this pipeline, so we're naming the new version 'v2-IR'.

Download the dataset from here

df = pd.read_csv("v2_new_ir_data.csv")

dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name="v2-IR",
    env_type=EnvType.EVAL,
    interactions=[
        LogInteractionType(
            input=row["input"],
            information_retrieval=row["information_retrieval"],
            output=row["output"],
            # No manual annotations for this version; the system's estimated
            # annotations will be updated once processing completes.
            user_interaction_id=row["user_interaction_id"]
        ) for _, row in df.iterrows()
    ]
)

You are all set! You can now check out your data in the Deepchecks Application!

In the "Applications" page, you should now see the "GVHD" App.

🗒️

Note: Data Processing status

Some properties take a few minutes to calculate, so some of the data, such as properties and estimated annotations, will be updated over time. You'll see a "✅ Completed" Processing Status in the Applications page when processing is finished. In addition, you can subscribe to notifications in the "Workspace Settings" to get notified by email when processing completes.