Integrate into CI/CD

Add automatic quality gates to your CI/CD pipeline - run your new version against an evaluation set and fail the build if quality drops below a threshold.

LLMs are sensitive to small changes. A prompt tweak, model update, or retrieval change can significantly affect output quality in ways that only show up at evaluation time. Adding Deepchecks to your CI/CD pipeline catches these regressions automatically before they reach production.

The pattern is simple: for each new version, run your pipeline on an evaluation set, upload results to Deepchecks, wait for evaluation to complete, then assert that quality meets your thresholds.

Step 1: Run your pipeline on an evaluation set

For this guide, we'll assume you have an evaluation set named "Ground Truth" already uploaded to Deepchecks. For each new version, fetch those inputs, run your pipeline, and upload the outputs:

import os
from deepchecks_llm_client.client import DeepchecksLLMClient
from deepchecks_llm_client.data_types import EnvType, LogInteraction

# Initialize the Deepchecks SDK client
APP_NAME = "YOUR APP NAME"
VERSION_NAME = "NEW VERSION NAME"
DEEPCHECKS_LLM_API_KEY = os.environ.get("DEEPCHECKS_LLM_API_KEY")
dc_client = DeepchecksLLMClient(
    host="https://app.llm.deepchecks.com",
    api_token=DEEPCHECKS_LLM_API_KEY,
)

# Fetch the ground truth data from Deepchecks
ground_truth_data = dc_client.get_data(
    app_name=APP_NAME,
    version_name="Ground Truth",
    env_type=EnvType.EVAL
)

# Run your pipeline with the inputs from the ground truth dataset
interactions = []
for _, row in ground_truth_data.iterrows():
    new_output = my_chatbot(row["input"])  # Replace with your own pipeline function
    interactions.append(
        LogInteraction(
            input=row["input"],
            output=new_output,
            user_interaction_id=str(row["user_interaction_id"]),
            full_prompt=row["full_prompt"],
            expected_output=row["output"],
            custom_props={"Question Type": row["cp_Question Type"]},
        )
    )

dc_client.log_batch_interactions(
    app_name=APP_NAME,
    version_name=VERSION_NAME,
    env_type=EnvType.EVAL,
    interactions=interactions,
)

Step 2: Define quality gates

Download results once evaluation completes, then assert quality thresholds:

evaluation_data = dc_client.get_data_if_calculations_completed(
    app_name=APP_NAME,
    version_name=VERSION_NAME,
    env_type=EnvType.EVAL,
    user_interaction_ids=ground_truth_data["user_interaction_id"].tolist(),
)

Example 1: Test overall pass rate

# Filter out NAs and unknowns
good_bad_data = evaluation_data[evaluation_data['estimated_annotation'].isin(['good', 'bad'])]

# Calculate the ratio of "good" samples out of good + bad only
good_ratio = (good_bad_data['estimated_annotation'] == 'good').mean()
threshold = 0.7

# Assert that the ratio meets the threshold
assert good_ratio >= threshold, f"FAILURE: Good ratio is too low ({good_ratio:.2f}). Required: {threshold}"
print(f"SUCCESS: Good ratio is sufficient ({good_ratio:.2f})")

Example 2: Test a critical data segment

For high-stakes segments, apply stricter criteria. This test tolerates zero failures: every sample in a specific category must meet the Grounded in Context score threshold:

# Filter for the relevant segment
segment_data = evaluation_data[evaluation_data["cp_Question Type"] == "treatments & medications"]
threshold = 0.7

# Find any samples in the segment that fail the GIC score threshold
failing_scores = segment_data[segment_data["Grounded in Context"] < threshold]

# Assert that there are no failing samples
assert failing_scores.empty, (
    f"FAILURE: {len(failing_scores)} out of {len(segment_data)} 'treatments & medications' "
    f"samples are below the Grounded in Context threshold of {threshold}."
)

print("SUCCESS: All 'treatments & medications' scores meet the threshold.")

Step 3: Wire into your CI/CD pipeline

The specific implementation depends on your tool. Add a step that runs a script combining Steps 1 and 2 - if an assertion fails, the script exits with an error, which fails the build. Here's a GitHub Actions example:

# In your .github/workflows/ci.yml file

jobs:
  evaluate-llm:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.10'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run Pipeline Evaluation
        env:
          DEEPCHECKS_LLM_API_KEY: ${{ secrets.DEEPCHECKS_LLM_API_KEY }}
          APP_NAME: "your-app-name"
          VERSION_NAME: "test-${{ github.run_number }}-${{ github.sha }}"
        # This script contains the code from Steps 1 & 2
        run: python run_pipeline_evaluation.py
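
The workflow passes APP_NAME and VERSION_NAME as environment variables, so the script should read them instead of hardcoding values. A minimal skeleton for run_pipeline_evaluation.py might look like this (the layout is a suggestion - drop the Step 1 and Step 2 code in where indicated):

# run_pipeline_evaluation.py
import os
import sys

from deepchecks_llm_client.client import DeepchecksLLMClient

# Read the values the workflow passes in as environment variables
APP_NAME = os.environ["APP_NAME"]
VERSION_NAME = os.environ["VERSION_NAME"]

dc_client = DeepchecksLLMClient(
    host="https://app.llm.deepchecks.com",
    api_token=os.environ["DEEPCHECKS_LLM_API_KEY"],
)

try:
    # Step 1: run your pipeline on the evaluation set and log the interactions
    # Step 2: download the results and assert your quality gates
    pass
except AssertionError as err:
    print(err)
    sys.exit(1)  # a non-zero exit code fails the CI job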

Compare against a baseline version

Instead of a fixed threshold, compare the new version against the previous stable version's performance:

# Assume BASELINE_VERSION_NAME is defined
# And the new version's data is the `evaluation_data` variable from Step 2

# First, get the data for the baseline version (v1)
baseline_data = dc_client.get_data_if_calculations_completed(
    app_name=APP_NAME,
    version_name=BASELINE_VERSION_NAME,
    env_type=EnvType.EVAL
)

# Calculate the baseline's good ratio (good vs. bad only) to use as the threshold
baseline_good_bad = baseline_data[baseline_data['estimated_annotation'].isin(['good', 'bad'])]
baseline_threshold = (baseline_good_bad['estimated_annotation'] == 'good').mean()

# Calculate the new version's good ratio the same way
new_good_bad = evaluation_data[evaluation_data['estimated_annotation'].isin(['good', 'bad'])]
new_version_good_ratio = (new_good_bad['estimated_annotation'] == 'good').mean()

# Assert that the new version's performance is at least as good as the baseline
assert new_version_good_ratio >= baseline_threshold, (
    f"FAILURE: New version's good ratio ({new_version_good_ratio:.2f}) is lower than "
    f"baseline version's ({baseline_threshold:.2f})."
)

print(f"SUCCESS: New version's performance ({new_version_good_ratio:.2f}) meets or exceeds baseline ({baseline_threshold:.2f}).")

Limitations and best practices

Running a full evaluation pipeline on every commit is expensive and slow. The best practices below help you get the most signal with the least cost.

Best Practices

  1. Trigger evaluations based on specific file changes: Configure your CI/CD pipeline to run LLM evaluations only when files that directly affect model behavior change. Set up path-based triggers for critical components:
    • Prompt templates and configurations: Run tests when files containing prompt templates, system messages, few-shot examples, or preprocessing logic change
    • Model configurations: Trigger evaluations when model selection, parameters (temperature, max tokens, etc.), or inference settings are updated
    • External integrations: Test when RAG retrieval logic, tool definitions, API integrations, or knowledge base connections are modified
    • Evaluation datasets: Run comprehensive tests when your evaluation data is updated or expanded

Example GitHub Actions path filtering:

on:
  push:
    paths:
      - 'prompts/**'
      - 'config/model_config.yaml' 
      - 'src/retrieval/**'
      - 'tools/**'
      - 'data/eval_inputs.csv'
  2. Trigger evaluations on a schedule: Run the full evaluation suite weekly or monthly (see the cron example after this list). This ensures that even if a regression isn't caught immediately, it is detected in a timely manner.
  3. Maintain quality throughout the deployment lifecycle:
    • Add a quality gate before production: Include a final evaluation step in your deployment pipeline to catch issues before they reach users.
    • Keep your evaluation data current: Regularly review and refresh your evaluation datasets so they reflect real-world usage patterns and edge cases.
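
For reference, a scheduled trigger in GitHub Actions uses cron syntax. A minimal example (the weekly cadence is arbitrary; pick one that matches your release rhythm):

on:
  schedule:
    # Example cadence: every Monday at 06:00 UTC
    - cron: '0 6 * * 1'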