GVHD Use Case: Q&A Example
Evaluating and debugging a Q&A application, step by step
Use Case Background
The data in this tutorial originates from a classic Retrieval Augmented Generation bot that answers questions about the GVHD medical condition.
We're evaluating a GPT-3.5-based app that uses FAISS to store and search the retrieval embedding vectors; the two versions differ in their retrieval strategy and temperature.
The knowledge base is built from a collection of online resources about the condition.
The two datasets used for this example can be downloaded from this link.
Structure of this Example
- Creating your first application, and uploading a baseline version and a new version to evaluate.
- Exploring a few flows for evaluating with the system.
Note: if you already have the data for the two versions uploaded, and just want to see the value in the system, jump straight to the Exploring the flows sections.
Create Your Application and Upload the Data to the System
For our scenario, let's create the application through the UI by filling in the Application Name field with "GVHD" and the Version Name with "baseline".
Option 1: Upload the Data with Deepchecks' UI
- Upload the baseline version CSV (baseline_data) to the Evaluation environment.
- Add a new version, and upload the second CSV (v2_new_ir_data) to the Evaluation environment of the new version.
- You are all set! You can now check out your data in the Deepchecks Application!
In the "Applications" page, you should now see the "GVHD" App.
Note: Data Processing status
Some properties take a few minutes to calculate, so some of the data, such as properties and estimated annotations, will be updated over time. You'll see a "Completed" Processing Status in the Applications page when processing is finished. In addition, you can subscribe to notifications in the "Workspace Settings" to be notified by email upon processing completion.
Option 2: Upload the Data with Deepchecks' SDK
Set up the Deepchecks client to use the Python SDK (as a best practice, it's recommended to do so in a dedicated Python virtual environment):
pip install deepchecks-llm-client
In your Python IDE / Jupyter environment, set up the relevant configurations:
# Retrieve your API key from the Deepchecks UI and paste it here:
DC_API_KEY = "insert-your-token-here"
# Choose an app name for the application (same as filled in UI):
APP_NAME = "GVHD"
Initialize the Deepchecks Client and Upload the Data
from deepchecks_llm_client.client import DeepchecksLLMClient
from deepchecks_llm_client.data_types import ApplicationType
dc_client = DeepchecksLLMClient(
    api_token=DC_API_KEY
)
dc_client.create_application(APP_NAME, ApplicationType.QA)
Upload the baseline version
Download the dataset from here
import pandas as pd
from deepchecks_llm_client.data_types import (EnvType, AnnotationType, LogInteractionType)
df = pd.read_csv("baseline_data.csv")
dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name="baseline",
    env_type=EnvType.EVAL,
    interactions=[
        LogInteractionType(
            input=row["input"],
            information_retrieval=row["information_retrieval"],
            output=row["output"],
            annotation=AnnotationType.BAD if row["annotation"] == "bad"
            else (AnnotationType.GOOD if row["annotation"] == "good" else None),
            user_interaction_id=row["user_interaction_id"],
        )
        for _, row in df.iterrows()
    ],
)
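The nested conditional that maps the CSV's annotation strings to `AnnotationType` values can be pulled into a small helper, which keeps the upload call readable and can be reused for later versions. This is just a sketch of that mapping: the `good`/`bad` arguments stand in for `AnnotationType.GOOD`/`AnnotationType.BAD`, and anything else (including empty cells) is treated as unannotated.

```python
def parse_annotation(raw, good, bad):
    """Map a raw CSV annotation value to an annotation type.

    `good` and `bad` are the values to return for 'good'/'bad' cells
    (e.g. AnnotationType.GOOD / AnnotationType.BAD from the SDK).
    Anything else, including NaN cells, is treated as unannotated.
    """
    if isinstance(raw, str):
        raw = raw.strip().lower()
    if raw == "good":
        return good
    if raw == "bad":
        return bad
    return None
```

With this helper, the `annotation=` argument above becomes `annotation=parse_annotation(row["annotation"], AnnotationType.GOOD, AnnotationType.BAD)`.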
Once we have a new version, we'll want to test it on our golden set. To do that, we can use the get_data function to retrieve the golden set's inputs and then run them through the new version's pipeline.
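That loop can be sketched as follows. Note the assumptions here: `answer_question` is a hypothetical stand-in for your own RAG pipeline, and the golden-set inputs are shown as a small in-memory DataFrame rather than an actual `get_data` call.

```python
import pandas as pd

def answer_question(question: str) -> str:
    # Hypothetical stand-in for your app's pipeline (retrieval + LLM call).
    return f"answer to: {question}"

# In practice, the golden-set inputs would come from dc_client.get_data(...);
# a small in-memory frame is used here for illustration.
golden_set = pd.DataFrame({
    "user_interaction_id": ["q1", "q2"],
    "input": ["What is GVHD?", "How is GVHD treated?"],
})

# Run each golden-set input through the new version's pipeline, keeping the
# same user_interaction_id so the versions can be compared interaction by
# interaction.
new_version_outputs = [
    {
        "user_interaction_id": row["user_interaction_id"],
        "input": row["input"],
        "output": answer_question(row["input"]),
    }
    for _, row in golden_set.iterrows()
]
```

The resulting records can then be uploaded to the new version with `log_batch_interactions`, as shown below.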
Upload a new version to evaluate
After defining our baseline version, we'll want to upload a new version for evaluation.
We've changed the information-retrieval parameters in this pipeline, so we're naming the new version 'v2-IR'.
Download the dataset from here
df = pd.read_csv("v2_new_ir_data.csv")
dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name="v2-IR",
    env_type=EnvType.EVAL,
    interactions=[
        LogInteractionType(
            input=row["input"],
            information_retrieval=row["information_retrieval"],
            output=row["output"],
            user_interaction_id=row["user_interaction_id"],
        )
        for _, row in df.iterrows()
    ],
)
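Since the value of the evaluation comes from comparing the two versions on the same inputs, it can be worth a quick sanity check that both CSVs cover the same `user_interaction_id`s, so every baseline interaction has a counterpart in the new version. A minimal sketch (column names assumed to match the CSVs above):

```python
import pandas as pd

def missing_interaction_ids(baseline: pd.DataFrame, candidate: pd.DataFrame) -> set:
    """Return the user_interaction_id values present in the baseline data
    but absent from the candidate version's data."""
    return set(baseline["user_interaction_id"]) - set(candidate["user_interaction_id"])

# e.g.:
# missing_interaction_ids(pd.read_csv("baseline_data.csv"),
#                         pd.read_csv("v2_new_ir_data.csv"))
```

An empty set means every baseline interaction was re-run in the new version.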
You are all set! You can now check out your data in the Deepchecks Application!
In the "Applications" page, you should now see the "GVHD" App.