Configuring Nvidia's Guardrails with Deepchecks
Using Nvidia's Guardrails package combined with Deepchecks to monitor and improve the quality of your LLM-based app in production.
In this tutorial, we will use a subset of our GVHD dataset, originating from a classic Retrieval-Augmented Generation bot designed to answer questions about the GVHD medical condition. We have added some toxic inputs to the dataset and will demonstrate how to detect these issues using Deepchecks, as well as how to improve the model’s handling of toxic inputs with Nvidia's Guardrails package.
Click here to Download the Guardrails Configurations
There you'll find the two guardrails configurations used in this tutorial.
Click the badge below to open the Google Colab notebook, set your API token (see below), add the configuration files you've downloaded to the Colab, and you're ready to go!
Or Click Here to Download the Notebook locally.
Inference with Nvidia's Guardrails package and OpenAI
First, we create a basic custom configuration for Nvidia Guardrails, using an OpenAI model, with no extra rails. Click here to download the basic configuration files.
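For reference, the basic configuration essentially just points Nvidia Guardrails at an OpenAI model. The sketch below builds an equivalent configuration inline with RailsConfig.from_content; the exact model name in the downloaded files may differ, so treat it as an assumption:
from nemoguardrails import RailsConfig

# Minimal configuration content: a single main model, no extra rails.
# The model name here is an assumption and may differ from the downloaded config.yml.
basic_yaml = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

# Equivalent to loading the downloaded folder with RailsConfig.from_path(...),
# but built inline from the YAML content above.
basic_config = RailsConfig.from_content(yaml_content=basic_yaml)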
Next, we define the LLMRails client creation function and the generation function.
from nemoguardrails import RailsConfig, LLMRails
import os
import nest_asyncio
# Set the OpenAI API key
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"
# Allow for nested async calls
nest_asyncio.apply()
# Define the function to load the guardrails
def get_guardrails(config_folder=None):
    config = RailsConfig.from_path(config_folder)
    try:
        llm_rails = LLMRails(config)
    except Exception as e:
        print(f"Error loading guardrails: {e}")
        return None
    print(f"Guardrails loaded successfully with config folder: {config_folder}")
    return llm_rails
# Define the inference function
def generate_response(rails, input_text):
    response = rails.generate(messages=[{
        "role": "user",
        "content": input_text
    }])
    return response["content"]
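Before running on the full dataset, it's worth sanity-checking the two helpers on a single question. The question below is just an illustrative example:
# Quick sanity check of the helpers above (the question is an illustrative example)
rails = get_guardrails(config_folder="./configurations/config_basic")
if rails is not None:
    print(generate_response(rails, "What are the common symptoms of GVHD?"))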
Infer on the Data Using Nvidia Guardrails with the Basic Configuration
import pandas as pd
# Read production data
# The dataset is available here: https://figshare.com/articles/dataset/production_data/27324201
df_prod = pd.read_csv("https://figshare.com/ndownloader/files/50058579")
## Infer on the data using Nvidia guardrails
# Load the basic guardrails config (defines only the LLM to use, without any additional rails)
guardrails = get_guardrails(config_folder="./configurations/config_basic")
# Infer on the input data
df_prod['v1_output'] = df_prod['full_prompt'].apply(
    lambda full_prompt: generate_response(guardrails, full_prompt)
)
Using Deepchecks to Identify Issues in the Data
Uploading the data to Deepchecks
import os
from deepchecks_llm_client.client import DeepchecksLLMClient
from deepchecks_llm_client.data_types import (EnvType, ApplicationType, LogInteractionType)
# Set the Deepchecks API key
os.environ["DEEPCHECKS_API_KEY"] = "YOUR_DEEPCHECKS_API_KEY"
# Choose an app name for the application:
APP_NAME = "guardrails_demo"
# Create the client and new application in Deepchecks
dc_client = DeepchecksLLMClient(api_token=os.getenv('DEEPCHECKS_API_KEY'), host="https://staging.llmdev.deepchecks.com")
dc_client.create_application(APP_NAME, ApplicationType.QA)
## Upload data to Deepchecks
dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name="v1",
    env_type=EnvType.EVAL,
    interactions=[
        LogInteractionType(
            input=row["input"],
            output=row["v1_output"],
            information_retrieval=row["information_retrieval"],
            full_prompt=row["full_prompt"],
            user_interaction_id=str(row["user_interaction_id"]),
        ) for _, row in df_prod.iterrows()
    ]
)
Identifying Toxic Inputs and Outputs with Deepchecks
Go to the "v1" dashboard screen of the "guardrails_demo" app. Notice the red dot next to the "Toxicity" property, indicating that some examples are flagged as toxic. Click on it to view examples with high toxicity scores in the output. For further inspection, go to the data display screen and add the "Toxicity [IN]" and "Toxicity [OUT]" columns. You can then sort these columns to see the top toxic inputs.
For example, this input is identified as toxic, and it does indeed contain toxic language.
Inference with Added New Rails for Handling Toxic Inputs
Based on our findings with Deepchecks, we can inspect the type of toxicity we observe in the data and define our new rail conditions. Essentially, we add a new rail for avoiding toxic inputs and outputs, define prompts for identifying toxic inputs and outputs, and specify the required response for both. Click here to download the improved configuration files.
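As a rough sketch of what those three changes amount to (the flow choices, prompt wording, and refusal message below are assumptions and may differ from the downloaded files), the improved configuration can be expressed inline like this:
from nemoguardrails import RailsConfig

# Sketch of the improved configuration: self-check rails on input and output,
# prompts that classify toxicity, and a refusal message. All wording here is an
# assumption -- the downloaded config_improved files may differ.
improved_yaml = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct

rails:
  input:
    flows:
      - self check input
  output:
    flows:
      - self check output

prompts:
  - task: self_check_input
    content: |
      Does the following user message contain toxic, abusive, or harmful language?
      Answer only "yes" or "no".
      User message: "{{ user_input }}"

  - task: self_check_output
    content: |
      Does the following bot response contain toxic, abusive, or harmful language?
      Answer only "yes" or "no".
      Bot response: "{{ bot_response }}"
"""

improved_colang = """
define bot refuse to respond
  "I'm sorry, I can't respond to that."
"""

improved_config_sketch = RailsConfig.from_content(
    yaml_content=improved_yaml, colang_content=improved_colang
)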
Next, we load the new guardrails and perform inference again, uploading the new data to Deepchecks with a new version:
# Add toxicity handling to the guardrails configuration based on the examples above.
# 1. Add a new rail for avoiding toxic inputs and outputs
# 2. Define prompts for identifying toxic inputs and outputs in the prompts.yml file
# 3. Specify the required response for the toxic inputs and outputs in the rails.co file
# Load the new guardrails
improved_guardrails = get_guardrails(config_folder="./configurations/config_improved")
# Infer on the input data again using the new Nvidia guardrails
df_prod['v2_improved_output'] = df_prod['full_prompt'].apply(
    lambda full_prompt: generate_response(improved_guardrails, full_prompt)
)
# Upload data to Deepchecks
dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name="v2_improved",
    env_type=EnvType.EVAL,
    interactions=[
        LogInteractionType(
            input=row["input"],
            output=row["v2_improved_output"],
            information_retrieval=row["information_retrieval"],
            full_prompt=row["full_prompt"],
            user_interaction_id=str(row["user_interaction_id"]),
        ) for _, row in df_prod.iterrows()
    ]
)
Verifying the Improvement via Deepchecks
Now we can leverage Deepchecks to verify that the guardrails worked effectively. We go to version "v2_improved" in the app, open the data screen, and add the "Avoided Answer" property:
Here, we can see that the toxic inputs are indeed handled well in the new version, as the LLM avoids answering thanks to the added guardrails, and there are no toxic outputs.
We can also add a column for similarity to version "v1".
Here, we can see that the similarity score to "v1" is quite low, suggesting that the LLM handled the toxic inputs differently than it did without guardrails.
As an example, we can go to the version comparison screen and see the improvements in the new version:
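If you also want a quick look outside the Deepchecks UI, you can compare the two output columns locally. This is just a lightweight sanity check; the substring used to spot refusals is an assumption based on the default refusal message:
# Compare how v1 (no extra rails) and v2_improved (toxicity rails) answered the same prompts.
comparison = df_prod[["input", "v1_output", "v2_improved_output"]].copy()
# The refusal phrase below is an assumption and may need adjusting to your rails' wording.
comparison["v2_refused"] = comparison["v2_improved_output"].str.contains(
    "can't respond", case=False, na=False
)
print(comparison["v2_refused"].value_counts())
print(comparison[comparison["v2_refused"]].head())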