
Self-Hosted Deployments

Deepchecks Self-Hosted Enterprise runs entirely in your own infrastructure, giving you full control over networking, security, and scaling. The platform is designed to run on Kubernetes to ensure reliable performance and horizontal scalability in production environments. This documentation walks through the infrastructure components and configuration required to deploy Deepchecks successfully.

Prerequisites

Infrastructure Prerequisites

[Architecture diagram: a typical Deepchecks deployment running on AWS]

Prior to deploying Self-Hosted Enterprise, Deepchecks recommends having each of the following infrastructure components ready to go. When possible, it's easiest to have all components running in the same VPC. The provided recommendations are for customers deploying to AWS:

| Component | Recommendation | Notes |
| --- | --- | --- |
| Kubernetes Cluster | Amazon EKS cluster deployed in at least 2 availability zones | We recommend a cluster configured with Karpenter in order to provide automatic node scaling |
| GPU-powered nodes | g4dn.xlarge | Each GPU worker requires a dedicated node |
| Ingress Controller | AWS Load Balancer Controller | A dedicated subdomain is required for access to the application |
| Object Storage | AWS S3 bucket | |
| Dedicated Database | AWS RDS Postgres | We suggest starting with an instance such as db.r6g.large and scaling up if necessary. Note that we do not currently support Postgres connections with TLS |
| Redis Cache | AWS ElastiCache | cache.t4g.micro is the recommended instance size for all deployment sizes. Note that we do not currently support Redis connections with TLS |
| Processing Queue | AWS SQS | See the SQS Queue Configuration section for further details |
| External Secrets Manager | AWS Secrets Manager | (Optional) Recommended for use with the External Secrets Operator for securely providing sensitive data to your Deepchecks Self-Hosted Enterprise deployment |
| Identity Provider | Auth0 or Entra ID | Currently Deepchecks supports Auth0 and Entra ID |

SQS Queue Configuration

Overview

This section describes the AWS SQS queue configuration for the system. All queues are created with a configurable prefix (default: deepchecks) and include dead letter queue (DLQ) configurations for failed message handling. For example, with the default prefix the insights calculator queue is named deepchecks-insights-calculator.fifo.

📘

Queue Prefix

The prefix used must be the same as the environment variable TENANT_NAME that is provided when deploying the Helm chart. See Environment Variables for further details.

Queue Configuration Summary

Note: All FIFO queue names include the .fifo suffix as required by AWS.

Table of required SQS queues and their relevant configuration:

| Queue Name | Visibility Timeout | Max Receive Count |
| --- | --- | --- |
| insights-calculator.fifo | 360 seconds | 3 |
| insights-calculator-dlq.fifo | 360 seconds | - |
| garak-props-calculator | 360 seconds | 3 |
| garak-props-calculator-dlq | 360 seconds | - |
| props-calc-batcher | 360 seconds | 3 |
| props-calc-batcher-dlq | 360 seconds | - |
| translation | 660 seconds | 3 |
| translation-dlq | 660 seconds | - |
| pre-calc-eng | 660 seconds | 3 |
| pre-calc-eng-dlq | 660 seconds | - |
| advanced-llm-prop-calculator | 360 seconds | 3 |
| advanced-llm-prop-calculator-dlq | 360 seconds | - |
| calibrator.fifo | 360 seconds | 3 |
| calibrator-dlq.fifo | 360 seconds | - |
| notifier | 360 seconds | 3 |
| notifier-dlq | 360 seconds | - |
| proba-calculator | 360 seconds | 3 |
| proba-calculator-dlq | 360 seconds | - |
| llm-properties | 660 seconds | 3 |
| llm-properties-dlq | 660 seconds | - |
| topics-inference | 660 seconds | 3 |
| topics-inference-dlq | 660 seconds | - |
| topics-train.fifo | 660 seconds | 3 |
| topics-train-dlq.fifo | 660 seconds | - |
| similarity-annotations | 360 seconds | 3 |
| similarity-annotations-dlq | 360 seconds | - |
| properties-calculator | 360 seconds | 3 |
| properties-calculator-dlq | 360 seconds | - |
| estimate-annotation-calculator | 360 seconds | 3 |
| estimate-annotation-calculator-dlq | 360 seconds | - |

Example Terraform to create the relevant queues

The following Terraform example can help you quickly bootstrap the required queues:

variable "queue_prefix" {
  description = "Prefix for queue names"
  type        = string
  default     = "deepchecks"
}

locals {
  queues = {
    insights-calculator = {
      visibility_timeout = 360
      fifo               = true
      max_receive_count  = 3
    }
    garak-props-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    props-calc-batcher = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    translation = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    pre-calc-eng = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    advanced-llm-prop-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    calibrator = {
      visibility_timeout = 360
      fifo               = true
      max_receive_count  = 3
    }
    notifier = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    proba-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    llm-properties = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    topics-inference = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    topics-train = {
      visibility_timeout = 660
      fifo               = true
      max_receive_count  = 3
    }
    similarity-annotations = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    properties-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    estimate-annotation-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
  }
}

# Dead Letter Queues
resource "aws_sqs_queue" "dlq" {
  for_each = local.queues

  name                        = "${var.queue_prefix}-${each.key}-dlq${each.value.fifo ? ".fifo" : ""}"
  fifo_queue                  = each.value.fifo
  visibility_timeout_seconds  = each.value.visibility_timeout
  content_based_deduplication = each.value.fifo
}

# Main Queues
resource "aws_sqs_queue" "queue" {
  for_each = local.queues

  name                        = "${var.queue_prefix}-${each.key}${each.value.fifo ? ".fifo" : ""}"
  fifo_queue                  = each.value.fifo
  visibility_timeout_seconds  = each.value.visibility_timeout
  content_based_deduplication = each.value.fifo

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
    maxReceiveCount     = each.value.max_receive_count
  })
}
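
Assuming the configuration above is saved as queues.tf (a hypothetical file name), a typical Terraform workflow might look like the following; pass -var only if your prefix differs from the default:

# Initialize the working directory and download the AWS provider
terraform init

# Preview the queues and DLQs that will be created; the prefix must match your TENANT_NAME
terraform plan -var 'queue_prefix=my-tenant'

# Create the queues and their dead letter queues
terraform apply -var 'queue_prefix=my-tenant'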

S3 Bucket

The Deepchecks application requires access to an S3 bucket in order to function.

📘

Bucket Name

Note the name of the bucket you create: you will need it when creating the relevant permissions for the application and when setting it as an environment variable in the Helm chart later on.
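
A minimal sketch of creating the bucket with the AWS CLI; the bucket name deepchecks-models and the region are placeholders to replace with your own values:

# Create the bucket (outside us-east-1 a LocationConstraint is required)
aws s3api create-bucket \
  --bucket deepchecks-models \
  --region eu-west-1 \
  --create-bucket-configuration LocationConstraint=eu-west-1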

Kubernetes Configuration

A few notes on Kubernetes cluster provisioning for Deepchecks Self-Hosted Enterprise:

  • Deepchecks currently supports Amazon Elastic Kubernetes Service (EKS) on EC2.
  • Deepchecks recommends running Deepchecks on a cluster that supports autoscaling such as with Karpenter.
  • Deepchecks doesn't support Amazon EKS on Fargate.

We require you to install and configure the following Kubernetes tooling:

  1. Install Helm by following these instructions.

  2. Install kubectl by following these instructions.

  3. Configure kubectl to connect to your cluster by using kubectl config use-context my-cluster-name

    See here for how to configure your kubecontext for AWS; a minimal example follows below.
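
For EKS, a sketch of this with the AWS CLI (the cluster name, region, and account ID are placeholders):

# Add or update the kubeconfig entry for your EKS cluster
aws eks update-kubeconfig --name my-cluster-name --region eu-west-1

# update-kubeconfig names the context after the cluster ARN by default
kubectl config use-context arn:aws:eks:eu-west-1:123456789012:cluster/my-cluster-name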

We also require you to create a Kubernetes namespace for your Deepchecks deployment:

(This can also be done as part of the Helm deployment described further on)

kubectl create namespace deepchecks

Configuring Kubernetes Secrets

📘

External Secrets Operator

While provisioning a Kubernetes Secret manually will work, we recommend using the External Secrets Operator to securely create this secret on your behalf.

Sensitive credentials must be made available in a Kubernetes Secret during deployment. The name of this secret must be set in your values.yaml file at global.secretName. Ensure all required secrets are configured before deploying Deepchecks Self-Hosted Enterprise. You can find the list of environment variables and secrets in the Environment Variables section.
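
If you do create the Secret manually, a minimal sketch is shown below; the secret name deepchecks-secrets is a placeholder (it must match global.secretName), the keys mirror the sensitive variables from the Environment Variables section, and all values shown are dummies to replace:

# Create the Secret referenced by global.secretName in values.yaml
kubectl create secret generic deepchecks-secrets \
  --namespace deepchecks \
  --from-literal='DATABASE.URI=postgresql://user:password@db-host:5432/deepchecks' \
  --from-literal='OAUTH.CLIENT_SECRET=replace-me' \
  --from-literal='LICENSE_KEY=replace-me' \
  --from-literal="WEBAPP.AUTH_JWT_SECRET=$(openssl rand -base64 32)"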

GPU powered Nodes

In order for Kubernetes to utilize the GPU nodes, the NVIDIA GPU Operator must be installed on your cluster.
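
A minimal install sketch via Helm, assuming NVIDIA's public Helm repository and the gpu-operator chart; check NVIDIA's documentation for the chart version that matches your cluster:

# Add NVIDIA's Helm repository and install the GPU Operator,
# passing a values file such as the example below
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --values values.yaml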

Example values.yaml file for the GPU driver

This example works as of driver version 0.18.0.

nodeSelector:
  role: gpu
affinity: null

tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists

📘

GPU Node Configuration

In this example our GPU nodes have the following label: role: gpu.

Be sure to configure the driver as suits your deployment needs.

The gpu-runner worker must run on a GPU-powered node. You can configure this via the Helm chart (see the client.yaml example below).

Application Permissions

In order for the application to access the relevant AWS APIs, you need to grant it the following permissions. The two recommended ways of doing so are IRSA (IAM Roles for Service Accounts) or EKS Pod Identities.

Required IAM Policy

📘

Variable Replacement

Be sure to replace BUCKET_NAME, REGION, ACCOUNT_ID and TENANT_NAME with the relevant values

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:Get*",
        "s3:List*",
        "s3:Put*",
        "s3:Delete*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>",
        "arn:aws:s3:::<BUCKET_NAME>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl"
      ],
      "Resource": "arn:aws:sqs:REGION:ACCOUNT_ID:<TENANT_NAME>*"
    },
    {
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
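
As a sketch, assuming the policy above is saved as deepchecks-policy.json (a hypothetical file name), you could create it with the AWS CLI and attach it to the role assumed via IRSA or Pod Identities:

# Create the managed policy from the JSON document above
aws iam create-policy \
  --policy-name deepchecks-app-policy \
  --policy-document file://deepchecks-policy.json

# Attach it to the IAM role used by the Deepchecks service account
aws iam attach-role-policy \
  --role-name DeepchecksRole \
  --policy-arn "arn:aws:iam::<ACCOUNT_ID>:policy/deepchecks-app-policy"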

Deploying the Deepchecks Helm Chart

In order to access the Helm chart and container images, you will need your credentials for Deepchecks' registry.

📘

Registry Authentication

In order to authenticate with the Deepchecks registry you will need the credentials provided by the Deepchecks team.

helm registry login registry.llm.deepchecks.com
helm install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack --values values.yaml

📘

Image Pull Secret

In order to pull the container images from Deepchecks' repository, you will need to configure a secret that provides the registry credentials to the Helm chart. See here for more information.
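
A minimal sketch of creating that secret with kubectl follows; the name deepchecks-registry matches the imagePullSecrets entry in the example values file below, the username is the one provided by the Deepchecks team, and per the Environment Variables section your LICENSE_KEY acts as the registry password:

# Create the image pull secret in the deployment namespace
kubectl create secret docker-registry deepchecks-registry \
  --namespace deepchecks \
  --docker-server=registry.llm.deepchecks.com \
  --docker-username=<USERNAME> \
  --docker-password=<LICENSE_KEY>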

To configure the Deepchecks Helm chart, multiple values need to be provided via environment variables. Non-sensitive values can be passed directly in the Helm chart's values; for sensitive values we recommend leveraging an external secrets manager.

Environment Variables

| Variable | Description | Required | Default/Valid Values | Format/Notes | Sensitive |
| --- | --- | --- | --- | --- | --- |
| **Database** | | | | | |
| DATABASE.URI | PostgreSQL database connection string | Yes | | Format: postgresql://<username>:<password>@<host>:<port>/<database>. Ensure the database user has appropriate permissions | Yes |
| **General** | | | | | |
| GENERAL.MODELS_BUCKET_NAME | Name of the cloud storage bucket where ML models are stored | Yes | | This is the name of the bucket that you created in the S3 Bucket section | |
| **Logger** | | | | | |
| LOGGER.FORMATTER | Output format for application logs | No | Default: JSON. Valid: JSON, TEXT | | |
| LOGGER.LEVEL_DEEPCHECKS_LLM | Logging level for Deepchecks LLM components | No | Default: INFO. Valid: DEBUG, INFO, WARNING, ERROR | | |
| **OAuth** | | | | | |
| OAUTH.CLIENT_ID | OAuth client identifier provided by your identity provider | | | | |
| OAUTH.CLIENT_SECRET | OAuth client secret provided by your identity provider | | | Store securely and never commit to version control | Yes |
| OAUTH.PROVIDER | OAuth provider type | | Default: auth0. Valid: auth0, entra_id | Currently Deepchecks supports Auth0 and Entra ID | |
| OAUTH.SERVER_URL | Base URL of your OAuth authorization server | | | | |
| OAUTH.TENANT_URL | Tenant-specific URL for your OAuth provider | | | | |
| **Redis** | | | | | |
| REDIS.HOST | Hostname or IP address of the Redis server | Yes | | | |
| REDIS.PASSWORD | Authentication password for the Redis server | No (only if Redis requires authentication) | | Store securely if authentication is enabled | Yes |
| REDIS.PORT | Port number for the Redis server connection | Yes | Default: 6379 | | |
| **Tenant** | | | | | |
| TENANT_NAME | Name of the default tenant created during application initialization. Used as the prefix for SQS queue names and other tenant-specific resources | Yes | | Format: lowercase alphanumeric characters and hyphens only. This tenant is automatically created when the application first starts | |
| LICENSE_KEY | Your license key, provided by Deepchecks for your deployment | Yes | | This also acts as your password for accessing our container registry | Yes |
| CREATE_ORG_FOR_USER | The email address of the first user in the system. This user will have the 'Owner' role | Yes | | This user must be the first user that logs in to the system. They can subsequently transfer ownership of the application | |
| **Web Application** | | | | | |
| WEBAPP.AUTH_JWT_SECRET | Secret key used to sign JWT tokens for API authentication within the system | Yes | | Generate a strong, random secret (minimum 32 characters recommended), e.g. with openssl rand -base64 32. Store securely and never commit to version control | Yes |
| WEBAPP.DEPLOYMENT_URL | Fully Qualified Domain Name (FQDN) where the application is deployed | Yes | | Format: complete URL including protocol. This URL is used for generating callback URLs and external links | |

values.yaml
# Default values for deepchecks-llm-stack.
# This is a YAML-formatted file.

global:
  image:
    repository: harbor.llmdev.deepchecks.com/deepchecks/llm
    # This sets the pull policy for images.
    pullPolicy: IfNotPresent
    # Image tag; defaults to Chart.appVersion if empty
    tag: ""
  # Pull secrets for private container registries
  imagePullSecrets: []
  # Environment variables passed to all pods via ConfigMap
  env: {}
  # Number of old ReplicaSets to retain for rollback
  revisionHistoryLimit: 3

serviceAccount:
  # Create a shared ServiceAccount for all components
  create: true
  # Mount API credentials into pods
  automount: true
  # Annotations for cloud provider integrations (e.g., IAM roles)
  annotations: {}

web:
  # Number of web server replicas
  replicaCount: 1
  # Override the chart name
  nameOverride: ""
  fullnameOverride: ""

  # Pod annotations for monitoring, logging, or mesh integration
  podAnnotations: {}
  # Additional labels for pods
  podLabels: {}

  # Pod-level security settings (fsGroup, runAsUser, etc.)
  podSecurityContext: {}
    # fsGroup: 2000

  # Container-level security settings
  securityContext: {}
    # capabilities:
    #   drop:
    #   - ALL
    # readOnlyRootFilesystem: true
    # runAsNonRoot: true
    # runAsUser: 1000

  # CPU and memory requests/limits
  resources:
    requests:
      cpu: 2000m
      memory: 8Gi

  autoscaling:
    # Enable Horizontal Pod Autoscaler
    enabled: false
    minReplicas: 1
    maxReplicas: 10
    # Target CPU percentage for scaling
    targetCPUUtilizationPercentage: 80
    # targetMemoryUtilizationPercentage: 80

  # Additional volumes for the deployment
  volumes: []
  # Additional volume mounts for containers
  volumeMounts: []

  # Node selector for pod scheduling
  nodeSelector: {}
  # Tolerations for node taints
  tolerations: []
  # Affinity rules for pod placement
  affinity: {}

  service:
    # Service type: ClusterIP, NodePort, or LoadBalancer
    type: ClusterIP
    # Port the service exposes
    port: 8000

  ingress:
    # Enable ingress resource creation
    enabled: true
    # Ingress controller class (e.g., nginx, kong, traefik)
    className: ""
    # Ingress annotations for TLS, auth, rate-limiting, etc.
    annotations: {}
    # Hostname for the application (required if ingress enabled)
    host: ""
    # TLS configuration
    tls: []

# Default configuration inherited by all workers
workerDefaults: &workerDefaults
  replicaCount: 1
  nameOverride: ""
  fullnameOverride: ""
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  securityContext: {}
  resources:
    limits:
      cpu: 1000m
      memory: 6Gi
    requests:
      cpu: 1000m
      memory: 6Gi
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
  nodeSelector: {}
  tolerations: []
  affinity: {}

# Background worker deployments
workers:
  # Computes advanced LLM properties
  advanced-llm-prop-calculator:
    <<: *workerDefaults
    init: advanced_llm_props_runner

  # Handles batch processing jobs
  batcher-runner:
    <<: *workerDefaults
    init: batcher_runner

  # Runs calibration logic
  dc-calibrator:
    <<: *workerDefaults
    init: dc_calibrator

  # Calculates probability scores
  dc-proba-calculator:
    <<: *workerDefaults
    init: dc_proba_calculator

  # Estimates annotations
  estimate-annotation-calculator:
    <<: *workerDefaults
    init: estimate_annotation_runner

  # Runs Garak security property checks
  garak-props-runner:
    <<: *workerDefaults
    init: garak_props_runner

  # GPU-accelerated processing (requires GPU nodes)
  gpu-runner:
    <<: *workerDefaults
    init: gpu_runner

  # Generates insights from analysis
  insights-runner:
    <<: *workerDefaults
    init: insights_runner

  # Computes LLM properties
  llm-properties-calculator:
    <<: *workerDefaults
    init: llm_props

  # Sends notifications
  notifier:
    <<: *workerDefaults
    init: notifier

  # Pre-calculation engine
  pre-calc-eng:
    <<: *workerDefaults
    init: pre_calc_eng

  # Handles data transformation
  translator:
    <<: *workerDefaults
    init: translator

client.yaml Example
📘

Example Configuration

This is a simple example configuration for a customer deployment.

serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/DeepchecksRole

global:
  image:
    tag: "0.39.2"

  # Supplied here is the secret containing the username and password needed to pull the Deepchecks container images
  imagePullSecrets:
    - name: deepchecks-registry
  env:
    CREATE_ORG_FOR_USER: "[email protected]"
    TENANT_NAME: "example"
    GENERAL.MODELS_BUCKET_NAME: "deepchecks-models"
    LOGGER.LEVEL_DEEPCHECKS_LLM: "INFO"
    REDIS.HOST: "host.redis.com"
    REDIS.PORT: "10000"
    LOGGER.FORMATTER: "TEXT"
    GENERAL.IGNORE_EMAIL_VERIFICATION: "True"

# In this example Kong is used as the ingress controller, along with cert-manager for certificate management
web:
  ingress:
    enabled: true
    className: "kong"
    annotations:
      cert-manager.io/cluster-issuer: issuer
    host: deepchecks.example.com
    tls:
      - hosts:
          - deepchecks.example.com
        secretName: deepchecks-tls

workers:
  # It is important to configure your node selectors and tolerations in order to ensure that your GPU workers run on your GPU-powered nodes
  gpu-runner:
    nodeSelector:
      role: gpu
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists

Deployment Timeout

As part of the deployment, a Kubernetes Job runs to populate the models S3 bucket with the required data. Typically this takes approximately 2 minutes, but it depends on the network's performance. If your Helm commands are timing out, it is recommended to add --timeout 20m0s, as shown below.
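
For example, the install command from earlier with an extended timeout:

helm install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack \
  --values values.yaml --timeout 20m0s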

⚠️

Sync Job

As part of the Helm deployment, each Deployment has an initContainer that checks that a 'sync' Job has completed successfully. If the Job is manually deleted, new pods will fail to start.

📘

DNS Configuration

Don't forget to configure a DNS record for your dedicated subdomain, pointing at your ingress load balancer, so that the Deepchecks UI and SDK are reachable.
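
As a hedged sketch, with Route 53 this could be a CNAME record from your Deepchecks subdomain to the load balancer hostname created by your ingress controller; the hosted zone ID, record name, and hostname are placeholders:

# Upsert a CNAME from your Deepchecks subdomain to the ingress load balancer
aws route53 change-resource-record-sets --hosted-zone-id Z0000000000000 \
  --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"deepchecks.example.com","Type":"CNAME","TTL":300,"ResourceRecords":[{"Value":"k8s-example.elb.amazonaws.com"}]}}]}'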