Evaluation in Your CI / CD

Easily integrate quality tests into your CI/CD pipeline using Deepchecks' SDK

When you have an application in production, continuous improvements are essential. Working with LLMs can be unpredictable, as even small changes can significantly affect the outputs. The Deepchecks SDK helps you add automatic tests to your CI/CD pipeline to manage these challenges.

In this guide, we'll demonstrate how to incorporate automatic quality benchmark testing into your pipeline: first we'll run the new application version on a curated evaluation set, then we'll define automated tests over the resulting outputs, and finally we'll wire those tests into your CI/CD tool.

Step 1: Running Your Pipeline on an Evaluation Set

The first step in your CI/CD job is to generate outputs from your new application version using a standardized evaluation dataset. For this guide, we'll assume you have an existing evaluation set named "Ground Truth" already uploaded to the Deepchecks platform.

import os
from deepchecks_llm_client.client import DeepchecksLLMClient
from deepchecks_llm_client.data_types import EnvType, LogInteraction

# Initialize the Deepchecks SDK client
APP_NAME = "YOUR APP NAME"
VERSION_NAME = "NEW VERSION NAME"
DEEPCHECKS_LLM_API_KEY = os.environ.get("DEEPCHECKS_LLM_API_KEY")
dc_client = DeepchecksLLMClient(host="https://app.llm.deepchecks.com", api_token=DEEPCHECKS_LLM_API_KEY)

# Fetch the ground truth data from Deepchecks
ground_truth_data = dc_client.get_data(
    app_name=APP_NAME,
    version_name="Ground Truth",
    env_type=EnvType.EVAL
)

# Run your pipeline with the inputs from the ground truth dataset
interactions = []
for _, row in ground_truth_data.iterrows():
    new_output = my_chatbot(row["input"])  # Replace with your own pipeline function
    interactions.append(
        LogInteraction(
            input=row["input"],
            output=new_output,
            user_interaction_id=str(row["user_interaction_id"]),
            full_prompt=row["full_prompt"],
            expected_output=row["output"],
            custom_props={"Question Type": row["cp_Question Type"]},
        )
    )

dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name=VERSION_NAME,
    env_type=EnvType.EVAL,
    interactions=interactions,
)

Step 2: Defining the Automatic Tests

Let's start by downloading the annotation results for the evaluation data we just uploaded:

evaluation_data = dc_client.get_data_if_calculations_completed(
    app_name=APP_NAME,
    version_name=VERSION_NAME,
    env_type=EnvType.EVAL,
    user_interaction_ids=ground_truth_data["user_interaction_id"].tolist(),
)

Note: For a full description of downloading the enriched data, see the get_data_if_calculations_completed SDK documentation.

Now that we have the new version's data with annotations, we can run tests to check whether the new version meets our quality requirements.

Example 1: Test Overall Pass Rate:
Our test requires that every version achieve at least a 70% pass rate, meaning that at least 70% of the annotated samples (those marked "good" or "bad" in the estimated_annotation column) receive a "good" annotation.

# Filter out NAs and unknowns
good_bad_data = evaluation_data[evaluation_data['estimated_annotation'].isin(['good', 'bad'])]

# Calculate the ratio of "good" samples out of good + bad only
good_ratio = (good_bad_data['estimated_annotation'] == 'good').mean()
threshold = 0.7

# Assert that the ratio meets the threshold
assert good_ratio >= threshold, f"FAILURE: Good ratio is too low ({good_ratio:.2f}). Required: {threshold}"
print(f"SUCCESS: Good ratio is sufficient ({good_ratio:.2f})")

Example 2: Test a Critical Data Segment:
For example, in our GVHD-Demo, questions in critical categories like "treatments & medications" require stricter criteria. This test enforces zero tolerance for hallucinations by checking that every interaction in this segment has a Grounded in Context score of at least 0.7.

# Filter for the relevant segment
segment_data = evaluation_data[evaluation_data["cp_Question Type"] == "treatments & medications"]
threshold = 0.7

# Find any samples in the segment that fail the GIC score threshold
failing_scores = segment_data[segment_data["Grounded in Context"] < threshold]

# Assert that there are no failing samples
assert failing_scores.empty, (
    f"FAILURE: {len(failing_scores)} out of {len(segment_data)} 'treatments & medications' "
    f"samples are below the Grounded in Context threshold of {threshold}."
)

print("SUCCESS: All 'treatments & medications' scores meet the threshold.")

Step 3: Integrating Tests into Your CI/CD Pipeline

The specific implementation depends on your CI/CD tool. Whether you use GitHub Actions, CircleCI, or Jenkins, you'll add a step that runs a script combining Steps 1 and 2. If an assert statement fails, the script will exit with an error, which in turn fails the pipeline build. Your script can also output a link to the Deepchecks app for easy debugging.
Here is a conceptual example of a GitHub Actions workflow:

# In your .github/workflows/ci.yml file

jobs:
  evaluate-llm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Pipeline Evaluation
        env:
          DEEPCHECKS_LLM_API_KEY: ${{ secrets.DEEPCHECKS_LLM_API_KEY }}
          APP_NAME: "your-app-name"
          VERSION_NAME: "test-${{ github.run_number }}-${{ github.sha }}"
        # This script contains the code from Steps 1 & 2
        run: python run_pipeline_evaluation.py
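Here is a minimal sketch of what a run_pipeline_evaluation.py script could look like. The run_evaluation helper is a hypothetical placeholder for the code from Steps 1 and 2; an uncaught AssertionError already exits with a non-zero status, so the script only needs to print a pointer to the Deepchecks app before failing the build.

# run_pipeline_evaluation.py -- a minimal sketch; adapt it to your own pipeline
import os
import sys


def run_evaluation(app_name: str, version_name: str) -> None:
    """Hypothetical placeholder for the code from Steps 1 & 2: log the new
    version's interactions, download the enriched data, and run the asserts."""
    ...


def main() -> None:
    app_name = os.environ["APP_NAME"]
    version_name = os.environ["VERSION_NAME"]
    try:
        run_evaluation(app_name, version_name)
    except AssertionError as err:
        print(err)
        # Point reviewers at the Deepchecks app for easy debugging before failing the build
        print(f"Inspect version '{version_name}' of app '{app_name}' at https://app.llm.deepchecks.com")
        sys.exit(1)
    print("All evaluation tests passed.")


if __name__ == "__main__":
    main()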

Comparing Between Versions

While this example demonstrates testing against a defined benchmark, you can easily adapt the code to compare different model versions. For example, you can ensure your new version performs at least as well as a previous stable version (the baseline).

The process is the same, but instead of a fixed threshold, we'll use the previous version's performance as a dynamic threshold.

# Assume BASELINE_VERSION_NAME is defined
# And the new version's data is the `evaluation_data` variable from Step 2

# First, get the data for the baseline version (v1)
baseline_data = dc_client.get_data_if_calculations_completed(
    app_name=APP_NAME,
    version_name=BASELINE_VERSION_NAME,
    env_type=EnvType.EVAL
)

# Calculate the baseline version's good ratio (good out of good + bad only) to use as a dynamic threshold
baseline_annotated = baseline_data[baseline_data['estimated_annotation'].isin(['good', 'bad'])]
baseline_threshold = (baseline_annotated['estimated_annotation'] == 'good').mean()

# Calculate the new version's good ratio the same way
new_version_annotated = evaluation_data[evaluation_data['estimated_annotation'].isin(['good', 'bad'])]
new_version_good_ratio = (new_version_annotated['estimated_annotation'] == 'good').mean()

# Assert that the new version's performance is at least as good as the baseline
assert new_version_good_ratio >= baseline_threshold, (
    f"FAILURE: New version's good ratio ({new_version_good_ratio:.2f}) is lower than "
    f"baseline version's ({baseline_threshold:.2f})."
)

print(f"SUCCESS: New version's performance ({new_version_good_ratio:.2f}) meets or exceeds baseline ({baseline_threshold:.2f}).")

Why is CI/CD for LLMs important?

The primary goal is to verify that changes don't harm performance. Modifying a prompt, updating a model configuration, or changing your RAG retrieval logic can all have unintended consequences. Automated evaluations in your CI/CD pipeline act as a safety net to catch these regressions before they reach production.

What are the limitations?
Running a full evaluation pipeline can be time-consuming and costly, especially if it involves numerous LLM API calls. Running these comprehensive tests on every single commit is often impractical.

Best Practices

  1. Trigger evaluations based on specific file changes: Configure your CI/CD pipeline to run LLM evaluations only when files that directly impact model behavior are modified. This ensures efficient resource usage while maintaining quality control. Set up path-based triggers for critical components:
    • Prompt templates and configurations: Run tests when files containing prompt templates, system messages, few-shot examples, or preprocessing logic change
    • Model configurations: Trigger evaluations when model selection, parameters (temperature, max tokens, etc.), or inference settings are updated
    • External integrations: Test when RAG retrieval logic, tool definitions, API integrations, or knowledge base connections are modified
    • Evaluation datasets: Run comprehensive tests when your evaluation data is updated or expanded

Example GitHub Actions path filtering:

on:
  push:
    paths:
      - 'prompts/**'
      - 'config/model_config.yaml' 
      - 'src/retrieval/**'
      - 'tools/**'
      - 'data/eval_inputs.csv'
  2. Trigger evaluations on a schedule: Run the full evaluation suite on a weekly or monthly basis, as shown in the workflow sketch after this list. This ensures that even if regressions aren't caught immediately, they are detected in a timely manner.
  3. Maintain quality throughout the deployment lifecycle:
    • Add a quality gate before production: Include a final evaluation step in your deployment pipeline to catch issues before they reach users.
    • Keep your evaluation data current: Regularly review and refresh your evaluation datasets so they reflect real-world usage patterns and edge cases.
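For the scheduled runs mentioned in point 2, a separate workflow with a cron trigger is one option. Here is a sketch that reuses the run_pipeline_evaluation.py script from Step 3; the cron expression and version naming are illustrative, so adjust them to your needs:

# In a separate .github/workflows/scheduled-eval.yml file
on:
  schedule:
    - cron: '0 6 * * 1'  # every Monday at 06:00 UTC; adjust to your cadence
  workflow_dispatch:     # also allow manual runs

jobs:
  evaluate-llm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Pipeline Evaluation
        env:
          DEEPCHECKS_LLM_API_KEY: ${{ secrets.DEEPCHECKS_LLM_API_KEY }}
          APP_NAME: "your-app-name"
          VERSION_NAME: "scheduled-${{ github.run_number }}"
        run: python run_pipeline_evaluation.py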