Deepchecks' UI
Quickly onboard to the system using only a CSV file
This quickstart is the place to start if you are new to Deepchecks LLM Evaluation. We'll walk you briefly through uploading data to the system and using the system to automatically annotate the outputs of your LLM pipeline.
Demo LLM Application
As an example, we will use a simple Q&A bot based on the Blendle HR manual. This LangChain-based bot is intended to answer questions about Blendle's HR policy. It receives user questions as input, performs document retrieval to find the HR documents most likely to answer the question, and then feeds them to the LLM (in this case, gpt-3.5-turbo) alongside the user question to generate the final answer.
Uploading Data
Upon signing in to the system, you will be presented either with the Upload Data screen right away, or with the Dashboard (if you have already uploaded interactions). In the second case, click the "Upload Data" button at the bottom left to open the Upload Data screen.
To upload interactions, you first need to select or create the Application and the Application Version. An Application represents a specific task your LLM pipeline performs; in our case, answering HR-related questions from Blendle employees. An Application Version is simply a specific version of that pipeline, for example a certain prompt template fed to gpt-3.5-turbo.
Once the names are selected, you can upload your data. The data should be a CSV file reflecting individual interactions with your LLM pipeline. The CSV should have the following structure:
user_interaction_id | session_id | input | information_retrieval | full_prompt | output | annotation | interaction_type |
---|---|---|---|---|---|---|---|
Must be unique within a single version. Used for identifying interactions when updating annotations, and identifying the same interaction across different versions | (optional) The identifier for the session grouping related interactions. A unique session_id will be generated if not provided. | (mandatory) The input to the LLM pipeline | Data retrieved as context for the LLM in this interaction | The full prompt to the LLM used in this interaction | (mandatory) The pipeline output returned to the user | Was the pipeline response good enough? (Good/Bad/Unknown) | Specifies the type of interaction (e.g., Q&A, Summarization). Defaults to the application kind if not provided. |
You can download the data for the Blendle HR Bot, GPT-3.5-prompt-1 version here. This dataset contains 97 interactions with the bot, alongside human annotations for some of the interactions.
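The CSV structure from the table above can be produced with Python's standard `csv` module. The following is a minimal sketch, not an official Deepchecks utility: the helper name `write_interactions_csv` and the sample rows are hypothetical, and the checks only mirror the constraints stated in the table (mandatory `input` and `output`, unique `user_interaction_id` within a version).

```python
import csv
import io

# Column names taken from the table above.
COLUMNS = [
    "user_interaction_id", "session_id", "input", "information_retrieval",
    "full_prompt", "output", "annotation", "interaction_type",
]


def write_interactions_csv(rows, stream):
    """Write interaction dicts as CSV (hypothetical helper, not a Deepchecks API)."""
    seen_ids = set()
    writer = csv.DictWriter(stream, fieldnames=COLUMNS)
    writer.writeheader()
    for row in rows:
        # `input` and `output` are the columns marked mandatory in the table.
        for field in ("input", "output"):
            if not row.get(field):
                raise ValueError(f"missing mandatory field: {field}")
        # When provided, user_interaction_id must be unique within a version.
        uid = row.get("user_interaction_id")
        if uid:
            if uid in seen_ids:
                raise ValueError(f"duplicate user_interaction_id: {uid}")
            seen_ids.add(uid)
        # Columns that were not provided are written as empty cells.
        writer.writerow({col: row.get(col, "") for col in COLUMNS})


# Made-up sample rows, for illustration only.
rows = [
    {"user_interaction_id": "q-001",
     "input": "How many vacation days do I get?",
     "output": "Placeholder answer from the bot.",
     "annotation": "Good"},
    {"user_interaction_id": "q-002",
     "input": "Who approves sick leave?",
     "output": "Another placeholder answer.",
     "annotation": "Unknown"},
]

buffer = io.StringIO()
write_interactions_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")`, as the `csv` module documentation recommends.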
Evaluation and Production
You may notice the "Data Source" toggle on the upload page, which lets you upload data either to the Evaluation (golden set) environment or to the Production environment.
- The Evaluation data are the interactions you use to regularly evaluate your LLM pipeline and to compare its performance between versions. For that reason, the input field of the golden set interactions will often be _identical_ between versions, to enable an "apples to apples" comparison between them. The golden set will usually contain a relatively small but diverse set of samples representative of the interactions you've encountered in production.
- The Production data is the data your LLM pipeline encounters in the production environment. Logging production data to Deepchecks LLM Evaluation records the user inputs and the outputs your model generates in production, and automatically evaluates these interactions to give you a sense of how your pipeline is faring in production.
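Before uploading golden set data for a new version, it can be useful to confirm that its inputs really do match the previous version. The sketch below is a hypothetical local check, not a Deepchecks feature: it relies on `user_interaction_id` identifying the same interaction across versions (as the table above describes) and reports the ids whose `input` text differs. The function name and sample data are assumptions.

```python
def mismatched_golden_inputs(version_a, version_b):
    """Return ids present in both versions whose `input` differs (hypothetical helper)."""
    inputs_a = {r["user_interaction_id"]: r["input"] for r in version_a}
    inputs_b = {r["user_interaction_id"]: r["input"] for r in version_b}
    # Only ids that appear in both versions can be compared.
    shared_ids = inputs_a.keys() & inputs_b.keys()
    return sorted(uid for uid in shared_ids if inputs_a[uid] != inputs_b[uid])


# Made-up rows standing in for two versions of the same golden set.
prompt_1 = [
    {"user_interaction_id": "q-001", "input": "How many vacation days do I get?"},
    {"user_interaction_id": "q-002", "input": "Who approves sick leave?"},
]
prompt_2 = [
    {"user_interaction_id": "q-001", "input": "How many vacation days do I get?"},
    {"user_interaction_id": "q-002", "input": "Who approves my sick leave?"},
]

mismatches = mismatched_golden_inputs(prompt_1, prompt_2)
print(mismatches)
```

An empty result means every shared interaction has the same input in both versions, so their outputs can be compared directly.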
Dashboard and Properties
Once the data has been uploaded (and perhaps after a short wait for the properties to compute), you are directed to the Dashboard. The Dashboard gives you a high-level status of your system. The Score section shows what percentage of interactions are annotated, and what percentage of the annotated interactions are good.
Below the Score section is the Properties section. Properties are characteristics of the user input or model output that are calculated for each interaction. This section presents the average value of each property and flags anomalous values. For example, in the image above we can see that 14% of the model outputs were detected as "Avoided Answers", meaning that the model avoided answering the user question for some reason. You can read more about the different kinds of properties here.
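The two percentages in the Score section can be illustrated with a small calculation. This is only a sketch of the arithmetic under stated assumptions, not the system's actual implementation: it assumes that "Good" and "Bad" count as annotated, that "Unknown" or an empty value counts as not annotated, and the function name and sample values are made up.

```python
def score_summary(annotations):
    """Return (percent annotated, percent good among annotated) for a list of labels."""
    total = len(annotations)
    # Assumption: only "Good" and "Bad" count as annotated interactions.
    annotated = [a for a in annotations if a in ("Good", "Bad")]
    good = sum(1 for a in annotated if a == "Good")
    pct_annotated = 100 * len(annotated) / total if total else 0.0
    pct_good = 100 * good / len(annotated) if annotated else 0.0
    return pct_annotated, pct_good


# Made-up labels: 3 of 5 interactions are annotated, 2 of those 3 are good.
labels = ["Good", "Good", "Bad", "Unknown", ""]
pct_annotated, pct_good = score_summary(labels)
print(f"annotated: {pct_annotated:.1f}%, good: {pct_good:.1f}%")
```

Note that the second percentage is taken over the annotated interactions only, so unannotated interactions do not lower it.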
Data Page
The Data page is the place to dive deeper into individual interactions. You can use the filters (on topics, annotations, and properties) to slice and dice the dataset and look at the interactions relevant to you, or click on an individual interaction to inspect its components (user input, full prompt, and so on) and, most importantly, to view and modify its annotations.
Estimated Annotations
Fully colored annotations are the ones you provided (in the "annotation" column of the CSV), while estimated annotations (marked by colored outlines only) are estimated by the system. These estimates are produced by the Deepchecks Automatic Annotations pipeline, which can be configured and customized to your needs.