Pentesting Your LLM-Based App

Using a broad set of malicious prompts to probe an LLM-based app and asses the overall risk associated with using it in a given scenario (including security, bias, and other risks)

LLMs are able to tackle an amazing range of textual tasks, but this capability comes with a price - the end users are often able to input free text to your system that will end up directly as part of the prompt sent to your LLM. There are several known possible "attacks", or vulnerabilities, of LLM based systems. For example, certain prompts are able to make the LLM's output ignore their pre-trained safety mechanisms, that are intended to block undesired behaviors such as toxic comments or aiding the user with information on how to perform illegal activities.

The Pentest environment in Deepchecks' LLM Eval system is an environment built to test your system for a wide range of such known vulnerabilities. Many of these are based on the Garak OSS package. This testing is done by providing you with a set of malicious inputs which you can then run through your LLM pipeline. Deepchecks LLM Eval will then automatically test and score your LLM's app resilience to these attack attempts, by looking at the corresponding outputs and identifying if the pipeline has been able to block the attack.

Setting Up the Pentest Environment

Enabling the Pentest Environment

In the Workspace Setting page, enable the Pentest environment

Retrieving the Malicious Inputs

From the UI

Start by clicking the Generation button and select the "Generate Pentest Data" option

Then you'll be able select the vulnerability types you wish to test your system against, see some of the interactions in each type, and download a csv containing the malicious inputs and their additional fields.

Using the SDK

We'll start by connecting to the system, and then get the set of inputs, their corresponding vulnerability types and the trigger to detect if your system is vulnerable to the given input.

from deepchecks_llm_client.client import dc_client
from deepchecks_llm_client.data_types import EnvType, LogInteractionType

dc_client.init(
            host='https://app.llm.deepchecks.com/',
            api_token='Your API Key',
            app_name="pentest-example",
            version_name='v1',
            env_type=EnvType.PENTEST,
            auto_collect=False)

df = dc_client.get_pentest_prompts()

For this example we will only take the first inputs of most vulnerability types:

df = df.groupby('vuln_type').head(5).reset_index()
df.head(50)

Additional Fields

On top of the malicious inputs, the data you'll download will contain two additional fields. These should later be passed back to the system once you have generated outputs for each of these inputs using your LLM pipeline.

  • The vuln_type column is the type of the vulnerability. Most vuln_type have many different inputs, and your pipeline may react differently to each.
  • The vuln_trigger_str column is the trigger corresponding to each input (can be none). This is used by the system to detect if the LLM has successfully avoided the attack or not.

Run the Inputs Through Your LLM Pipeline

To test our LLM we now need to generate the outputs. In this example, we'll substitute your pipeline with a basic call to gpt-3.5-turbo (fits openai 1.x.x API)

import os
import openai

openai_client = openai.OpenAI(api_key = "YOUR OPENAI KEY")

def my_chat_bot(prompt: str):
    try:
        resp = openai_client.chat.completions.create(
                        model="gpt-3.5-turbo",
                        temperature=0.7,
                        messages=[
                            {"role": "system",
                            "content": "You are ChatGPT, a large language model trained by OpenAI. Follow the user's instructions carefully."},
                            {"role": "user", "content": prompt}
                        ]
                    )
    except openai.OpenAIError as e:
        return str(e)
    return resp.choices[0].message.content
  
df['output'] = df.input.apply(my_chat_bot)

Have a quick look at our output

df.groupby('vuln_type').first().reset_index().loc[:10]

Log Pentest Results to the System

Using the UI

In the same way you would upload a regular csv to the system, once you've added your LLM pipeline's response to the malicious input you can upload it using the upload screen, while selecting the pentest environment as the target environment.

Using the SDK

interactions_list = []

for row in df.itertuples():
    interactions_list.append(LogInteractionType(
        input=row.input,
        output=row.output,
        vuln_type=row.vuln_type,
        vuln_trigger_str=row.vuln_triegger_str,        
    ))
    
dc_client.log_batch_interactions(interactions_list)

This will log your pipelines outputs to the system, alongside the necessary information - vuln_type and vuln_trigger_str.

Exploring Pentest Results

Once the data has been uploaded, in a short while you will be able to see the results in the system. First, select the Pentset tab from the environment selectors in the top left cornet.

Each vulnerability type is represented by a topic. The outputs of the LLM pipeline are classified to "Good" and "Bad" using the properties which you can see here, which are based on the detectors of the Garak OSS package. Each vulnerability type has a set of properties that are relevant to detecting whether or not the LLM pipeline was successful in avoiding the attack. A high value (up to 1) represents a case in which a property detected a the LLM apps' failure to avoid the attack for that given input.

In this dashboard you can see that overall the LLM pipeline (in this example, gpt-3.5-turbo) has been successful in avoiding 74% of the attacks presented to it. To understand better which vulnerability type were more problematic for our model, we can switch to the Data screen.

For example clicking on the DAN vulnerability type we can see that the model has been able to avoid only 6% of these types of attacks.

You can notice that most outputs convey that the DAN (unrestricted) mode has been successfully activated by the attack prompt.