GVHD Use Case: Q&A Example

Evaluating and debugging a Q&A application, step by step

Use Case Background

The data in this tutorial originates from a classic Retrieval-Augmented Generation (RAG) bot that answers questions about the GVHD medical condition.
We're evaluating a GPT-3.5-based app that uses FAISS for embedding-based retrieval; the two versions differ in their retrieval strategies and temperatures.
The knowledge base is built from a collection of online resources about the condition.
The two datasets used for this example can be downloaded from this link.

Structure of this Example

  1. Creating your first application, and uploading a baseline version and a new version to evaluate.
  2. Exploring a few evaluation flows in the system:
    1. Similarity-based Comparison and Annotation
    2. Identifying Problems Using the Properties and Estimated Annotations
    3. Creating Custom LLM-Based Properties

Note: If you already have the data for the two versions uploaded and just want to see the value in the system, jump straight to the "Exploring the flows" sections.

Create Your Application and Upload the Data to the System

πŸ“˜

app_name, app_type and version_name

Initializing the SDK via dc_client.init with an application or version name that doesn't already exist will create that application and/or version in the system.

If the application doesn't already exist, you must also explicitly set the app_type argument to define the type of the new application that will be created.
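For illustration, a minimal init call for a brand-new application might look like the following sketch (the full, working call appears in Option 2 below):

from deepchecks_llm_client.client import dc_client
from deepchecks_llm_client.data_types import ApplicationType, EnvType

# app_type is set explicitly because the "GVHD" application doesn't exist yet;
# both the application and the "baseline" version are created by this call
dc_client.init(host="https://app.llm.deepchecks.com", api_token="your-api-key",
               app_name="GVHD", app_type=ApplicationType.QA,
               version_name="baseline", env_type=EnvType.EVAL)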

For our scenario, let's create the application through the UI by setting the Application Name field to "GVHD" and the Version Name to "baseline".

Option 1: Upload the Data with Deepchecks' UI

  1. Upload the baseline version CSV (baseline_data) to the Evaluation environment.
  2. Add a new version, and upload the second CSV (v2_new_ir_data) to the Evaluation environment of the new version.
  3. You are all set! You can now check out your data in the Deepchecks Application!
    In the "Applications" page, you should now see the "GVHD" App.

πŸ—’οΈ

Note: Data Processing status

Some properties take a few minutes to calculate, so some of the data - such as properties and estimated annotations - will be updated over time. You'll see a "βœ… Completed" Processing Status on the Applications page when processing is finished. In addition, you can subscribe to notifications in the "Workspace Settings" to get notified by email upon processing completion.

Option 2: Upload the Data with Deepchecks' SDK

Set up the Deepchecks client to use the Python SDK (as a best practice, it is recommended to do so in a dedicated Python virtual environment):

pip install deepchecks-llm-client
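If you'd like a dedicated environment, one can be created and activated with standard Python tooling before installing (the environment name is arbitrary; commands assume a Unix-like shell):

python -m venv deepchecks-env
source deepchecks-env/bin/activate
pip install deepchecks-llm-client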

In your Python IDE / Jupyter environment, set up the relevant configurations:

# Choose an app name for the application (same as filled in UI):
APP_NAME = "GVHD"

# Choose a name for the first version you'll create (same as in UI):
BASE_VERSION = "baseline"

# Retrieve your API key from the Deepchecks UI:
DC_API_KEY = "insert-your-token-here"

Initialize the Deepchecks Client and Upload the Data

from deepchecks_llm_client.client import dc_client
from deepchecks_llm_client.data_types import (EnvType, AnnotationType, LogInteractionType,
                                              ApplicationType)

dc_client.init(host="https://app.llm.deepchecks.com", api_token=DC_API_KEY,
               app_name=APP_NAME, app_type=ApplicationType.QA, version_name=BASE_VERSION,
               env_type=EnvType.EVAL, auto_collect=False)

Upload the baseline version

import pandas as pd

df = pd.read_csv("baseline_data.csv")

dc_client.log_batch_interactions(
        interactions=[
            LogInteractionType(
                input=row["input"],
                information_retrieval=row["information_retrieval"],
                output=row["output"],
                annotation=(AnnotationType.BAD if row["annotation"] == "bad"
                            else AnnotationType.GOOD if row["annotation"] == "good"
                            else None),
                user_interaction_id=row["user_interaction_id"]
            ) for _, row in df.iterrows()
        ]
)

Once we have a new version, we'll want to test it on our golden set. To do that, we can use the get_data function to retrieve the golden set's inputs and then run them through our new version's pipeline.
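For example, a minimal sketch of that flow might look like the following (my_qa_pipeline is a hypothetical stand-in for your application's pipeline, and get_data's exact parameters may differ between SDK versions - check the SDK reference):

# Assumes dc_client was initialized as above; get_data is assumed to return
# a pandas DataFrame of the version's logged interactions
golden_df = dc_client.get_data(environment=EnvType.EVAL)  # parameter name assumed

new_outputs = []
for _, row in golden_df.iterrows():
    # my_qa_pipeline is a placeholder for your new version's pipeline
    new_outputs.append(my_qa_pipeline(row["input"]))
# the resulting outputs can then be logged to the new version, as shown below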

Upload a new version to evaluate

After we have defined our baseline version, we'll want to upload a version for evaluation.

We've changed the information-retrieval parameters in this pipeline, and thus we're naming the new version 'v2-IR'.

dc_client.version_name('v2-IR') # Set dc_client to upload to a new version

df = pd.read_csv("v2_new_ir_data.csv")
dc_client.log_batch_interactions(
        interactions=[
            LogInteractionType(
                input=row["input"],
                information_retrieval=row["information_retrieval"],
                output=row["output"],
                user_interaction_id=row["user_interaction_id"]
            ) for _, row in df.iterrows()
        ]
)

You are all set! You can now check out your data in the Deepchecks Application!

In the "Applications" page, you should now see the "GVHD" App.

πŸ—’οΈ

Note: Data Processing status

Some properties take a few minutes to calculate, so some of the data - such as properties and estimated annotations - will be updated over time. You'll see a "βœ… Completed" Processing Status on the Applications page when processing is finished. In addition, you can subscribe to notifications in the "Workspace Settings" to get notified by email upon processing completion.