
Self-Hosted Deployments

Deepchecks Self-Hosted Enterprise runs entirely in your own infrastructure, giving you full control over networking, security, and scaling. The platform is designed to run on Kubernetes to ensure reliable performance and horizontal scalability in production environments. This documentation walks through the infrastructure components and configuration required to deploy Deepchecks successfully.

Prerequisites

Infrastructure Prerequisites

[Architecture diagram: a typical Deepchecks deployment running on AWS]

Prior to deploying Self-Hosted Enterprise, Deepchecks recommends having each of the following infrastructure components in place. The table below provides AWS-specific guidance.

| Component | Recommendation | Notes |
| --- | --- | --- |
| Kubernetes Cluster | Amazon EKS cluster deployed in at least 2 availability zones | We recommend a cluster configured with Karpenter for automatic node scaling. Kubernetes version 1.33 or higher is required. |
| GPU Nodes | g4dn.xlarge (1x NVIDIA T4, 4 vCPU, 16 GiB RAM) | Each GPU worker requires a dedicated node |
| Ingress Controller | AWS Load Balancer Controller | A dedicated subdomain is required for access to the application |
| Object Storage | AWS S3 Bucket | |
| Database | AWS RDS PostgreSQL 16 | We suggest starting with an instance such as db.r6g.large and scaling up if necessary. TLS connections are not currently supported. |
| Cache | AWS ElastiCache (Redis) | cache.t4g.micro is recommended for all deployment sizes. TLS connections are not currently supported. |
| Processing Queue | AWS SQS | See the Queue Configuration section for further details |
| Secrets Manager | AWS Secrets Manager | (Optional) Recommended for use with the External Secrets Operator for securely providing sensitive data to your deployment |
| Identity Provider | | Currently Deepchecks supports Auth0 and Entra ID |
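As a starting point, a cluster matching these recommendations could be described with an eksctl config along these lines. This is a sketch, not a Deepchecks-supplied manifest: the cluster name, region, availability zones, node group sizing, and instance type are placeholder assumptions you should replace with your own values.

```yaml
# Hypothetical eksctl ClusterConfig sketch; all names and sizes are placeholders.
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: deepchecks        # placeholder cluster name
  region: us-east-1       # placeholder region
  version: "1.33"         # Kubernetes 1.33 or higher is required

availabilityZones:        # at least 2 availability zones
  - us-east-1a
  - us-east-1b

managedNodeGroups:
  - name: general
    instanceType: m6i.2xlarge   # example general-purpose instance type, not a requirement
    minSize: 2
    maxSize: 6
    volumeSize: 150             # GB of node storage; see the Node Storage note below
```

If you adopt Karpenter as recommended, this static node group can be kept small and used mainly to bootstrap Karpenter itself.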

Minimum Resource Requirements

The following table summarizes the default CPU and memory requests for each Deepchecks component. These represent the minimum resources your cluster must be able to schedule.

📘

Node Storage

We recommend that each node has a minimum of 150 GB of available storage.

| Component | Replicas | CPU Request | Memory Request |
| --- | --- | --- | --- |
| Web Server | 1 | 2000m | 8Gi |
| Workers (each): | | | |
| advanced-llm-prop-calculator | 1 | 1000m | 6Gi |
| batcher-runner | 1 | 1000m | 6Gi |
| dc-calibrator | 1 | 1000m | 6Gi |
| dc-proba-calculator | 1 | 1000m | 6Gi |
| estimate-annotation-calculator | 1 | 1000m | 6Gi |
| general-calc-runner | 1 | 1000m | 6Gi |
| garak-props-runner | 1 | 1000m | 6Gi |
| gpu-runner | 1 | 1000m | 6Gi |
| insights-runner | 1 | 1000m | 6Gi |
| llm-properties-calculator | 1 | 1000m | 6Gi |
| notifier | 1 | 1000m | 6Gi |
| pre-calc-eng | 1 | 1000m | 6Gi |
| translator | 1 | 1000m | 6Gi |
| Models Sync Job | 1 (one-time) | 2000m | 8Gi |
| Total | 14 + job | 17 vCPU | 94 Gi |
📘

GPU Worker

The gpu-runner worker must run on a dedicated GPU-powered node and is not included in the CPU/memory totals above as it runs on separate hardware. See the Infrastructure Prerequisites section for recommended GPU instance types.

All resource values can be customized in your values.yaml. See the Example Configurations section for details.
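For example, the web server and an individual worker could be given larger requests by overriding the defaults in your values.yaml. The structure below mirrors the default chart values shown in Example Configurations; the specific numbers are illustrative, not recommendations:

```yaml
# Illustrative resource overrides; tune to your workload.
web:
  resources:
    requests:
      cpu: 4000m
      memory: 16Gi

workers:
  llm-properties-calculator:
    resources:
      requests:
        cpu: 2000m
        memory: 12Gi
```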

Queue Configuration

Overview

Deepchecks uses message queues to coordinate background processing tasks. All queues are created with a configurable prefix that must match the TENANT_NAME environment variable provided during Helm chart deployment. See Environment Variables for further details.

All queues are AWS SQS queues. FIFO queue names include the .fifo suffix as required by AWS.
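To illustrate the naming convention, the hypothetical helper below (not part of Deepchecks) builds a full queue name from the tenant prefix, appending the .fifo suffix where required and enforcing the TENANT_NAME format of lowercase alphanumeric characters and hyphens:

```python
import re


def full_queue_name(tenant_name: str, queue: str, fifo: bool = False) -> str:
    """Build the full SQS queue name: '<TENANT_NAME>-<queue>', plus '.fifo' for FIFO queues."""
    # TENANT_NAME must contain only lowercase alphanumeric characters and hyphens
    if not re.fullmatch(r"[a-z0-9-]+", tenant_name):
        raise ValueError(f"invalid TENANT_NAME: {tenant_name!r}")
    return f"{tenant_name}-{queue}" + (".fifo" if fifo else "")


print(full_queue_name("deepchecks", "insights-calculator", fifo=True))
# deepchecks-insights-calculator.fifo
```

The same prefixing scheme appears in the Terraform example further down, where queue names are built as "${var.queue_prefix}-${each.key}".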

Table of required SQS queues and their configuration
| Queue Name | Visibility Timeout | Max Receive Count | FIFO |
| --- | --- | --- | --- |
| insights-calculator.fifo | 360 seconds | 3 | Yes |
| insights-calculator-dlq.fifo | 360 seconds | - | Yes |
| garak-props-calculator | 360 seconds | 3 | No |
| garak-props-calculator-dlq | 360 seconds | - | No |
| props-calc-batcher | 360 seconds | 3 | No |
| props-calc-batcher-dlq | 360 seconds | - | No |
| translation | 660 seconds | 3 | No |
| translation-dlq | 660 seconds | - | No |
| pre-calc-eng | 660 seconds | 3 | No |
| pre-calc-eng-dlq | 660 seconds | - | No |
| advanced-llm-prop-calculator | 360 seconds | 3 | No |
| advanced-llm-prop-calculator-dlq | 360 seconds | - | No |
| calibrator.fifo | 360 seconds | 3 | Yes |
| calibrator-dlq.fifo | 360 seconds | - | Yes |
| notifier | 360 seconds | 3 | No |
| notifier-dlq | 360 seconds | - | No |
| proba-calculator | 360 seconds | 3 | No |
| proba-calculator-dlq | 360 seconds | - | No |
| llm-properties | 660 seconds | 3 | No |
| llm-properties-dlq | 660 seconds | - | No |
| topics-inference | 660 seconds | 3 | No |
| topics-inference-dlq | 660 seconds | - | No |
| topics-train.fifo | 660 seconds | 3 | Yes |
| topics-train-dlq.fifo | 660 seconds | - | Yes |
| similarity-annotations | 360 seconds | 3 | No |
| similarity-annotations-dlq | 360 seconds | - | No |
| properties-calculator | 360 seconds | 3 | No |
| properties-calculator-dlq | 360 seconds | - | No |
| estimate-annotation-calculator | 360 seconds | 3 | No |
| estimate-annotation-calculator-dlq | 360 seconds | - | No |
| general-calc | 660 seconds | 3 | No |
| general-calc-dlq | 660 seconds | - | No |
Example Terraform to create the required queues

The following Terraform example can be used to bootstrap your queues:

variable "queue_prefix" {
  description = "Prefix for queue names (must match TENANT_NAME)"
  type        = string
  default     = "deepchecks"
}

locals {
  queues = {
    insights-calculator = {
      visibility_timeout = 360
      fifo               = true
      max_receive_count  = 3
    }
    garak-props-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    props-calc-batcher = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    translation = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    pre-calc-eng = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    advanced-llm-prop-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    calibrator = {
      visibility_timeout = 360
      fifo               = true
      max_receive_count  = 3
    }
    notifier = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    proba-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    llm-properties = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    topics-inference = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
    topics-train = {
      visibility_timeout = 660
      fifo               = true
      max_receive_count  = 3
    }
    similarity-annotations = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    properties-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    estimate-annotation-calculator = {
      visibility_timeout = 360
      fifo               = false
      max_receive_count  = 3
    }
    general-calc = {
      visibility_timeout = 660
      fifo               = false
      max_receive_count  = 3
    }
  }
}

# Dead Letter Queues
resource "aws_sqs_queue" "dlq" {
  for_each = local.queues

  name                        = "${var.queue_prefix}-${each.key}-dlq${each.value.fifo ? ".fifo" : ""}"
  fifo_queue                  = each.value.fifo
  visibility_timeout_seconds  = each.value.visibility_timeout
  content_based_deduplication = each.value.fifo ? true : false
}

# Main Queues
resource "aws_sqs_queue" "queue" {
  for_each = local.queues

  name                        = "${var.queue_prefix}-${each.key}${each.value.fifo ? ".fifo" : ""}"
  fifo_queue                  = each.value.fifo
  visibility_timeout_seconds  = each.value.visibility_timeout
  content_based_deduplication = each.value.fifo ? true : false

  redrive_policy = jsonencode({
    deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
    maxReceiveCount     = each.value.max_receive_count
  })
}
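If you use the Terraform above, an output block like the following (an optional addition, not required by Deepchecks) makes it easy to confirm the generated queue names against the table:

```hcl
# Optional: expose the created queue names for verification after `terraform apply`.
output "queue_names" {
  value = {
    main = [for q in aws_sqs_queue.queue : q.name]
    dlq  = [for q in aws_sqs_queue.dlq : q.name]
  }
}
```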

Object Storage

Deepchecks requires access to object storage for storing ML models and application data.

Create an S3 bucket for use by the Deepchecks application.

📘

Bucket Name

Note the name of the bucket you create as you will need to set it as the GENERAL.MODELS_BUCKET_NAME environment variable in the Helm chart.

Kubernetes Configuration

Supported Environments:

  • Amazon Elastic Kubernetes Service (EKS) on EC2
  • Kubernetes version 1.33 or higher
  • Clusters configured with Karpenter are recommended for automatic node scaling
  • Amazon EKS on Fargate is not supported

Tooling

Install and configure the following Kubernetes tooling:

  1. Install helm by following these instructions
  2. Install kubectl by following these instructions
  3. Configure kubectl to connect to your cluster:

See here for how to configure your kubecontext for AWS EKS.

Create a Kubernetes namespace for your Deepchecks deployment (this can also be done as part of the Helm deployment):

kubectl create namespace deepchecks

Configuring Kubernetes Secrets

Sensitive credentials must be available in a Kubernetes Secret during deployment. The secret must be referenced in your values.yaml file at global.secretName. Ensure all required secrets are configured before deploying. See the Environment Variables section for the full list.

📘

External Secrets Operator

While creating a Kubernetes Secret manually will work, we recommend using the External Secrets Operator with your secrets manager of choice (e.g., AWS Secrets Manager, Azure Key Vault, Google Secret Manager) to securely create this secret on your behalf.
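As a sketch of the External Secrets Operator approach with AWS Secrets Manager: the ClusterSecretStore name, the Secrets Manager entry path, and the property names below are illustrative assumptions, not values defined by Deepchecks.

```yaml
# Hypothetical ExternalSecret; assumes a ClusterSecretStore named "aws-secrets-manager"
# and a Secrets Manager entry "deepchecks/prod" holding the listed properties.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: deepchecks-secrets
  namespace: deepchecks
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: deepchecks-secrets   # referenced by global.secretName in values.yaml
  data:
    - secretKey: DATABASE.URI
      remoteRef:
        key: deepchecks/prod
        property: DATABASE_URI
    - secretKey: OAUTH.CLIENT_SECRET
      remoteRef:
        key: deepchecks/prod
        property: OAUTH_CLIENT_SECRET
    - secretKey: WEBAPP.AUTH_JWT_SECRET
      remoteRef:
        key: deepchecks/prod
        property: WEBAPP_AUTH_JWT_SECRET
    - secretKey: LICENSE_KEY
      remoteRef:
        key: deepchecks/prod
        property: LICENSE_KEY
```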

Example: creating the secret manually
kubectl create secret generic deepchecks-secrets \
  --namespace deepchecks \
  --from-literal=DATABASE.URI='postgresql://user:pass@host:5432/dbname' \
  --from-literal=OAUTH.CLIENT_SECRET='your-oauth-secret' \
  --from-literal=WEBAPP.AUTH_JWT_SECRET='your-jwt-secret' \
  --from-literal=LICENSE_KEY='your-license-key'

For on-premises deployments, also include the storage credentials:

--from-literal=STORAGE_ACCESS_KEY_ID='your-storage-access-key' \
--from-literal=STORAGE_SECRET_ACCESS_KEY='your-storage-secret-key'

GPU Nodes

In order for Kubernetes to utilize GPU nodes, the NVIDIA GPU Operator must be installed on your cluster. The installation process is the same across all supported Kubernetes distributions (EKS, AKS, GKE, and on-premises).

Example values.yaml for the NVIDIA GPU Operator

This example works as of GPU Operator version 0.18.0:

nodeSelector:
  role: gpu
affinity: null

tolerations:
  - key: CriticalAddonsOnly
    operator: Exists
  - key: nvidia.com/gpu
    operator: Exists
📘

GPU Node Configuration

In this example, GPU nodes have the label role: gpu. Be sure to configure the operator to suit your deployment. The gpu-runner worker must run on a GPU-powered node; you can configure this via the Helm chart.
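If you use Karpenter as recommended, a dedicated GPU NodePool can provide these nodes. The sketch below assumes Karpenter is installed and an EC2NodeClass named "gpu" already exists; the names and the GPU limit are placeholders for illustration.

```yaml
# Hypothetical Karpenter NodePool for GPU nodes; names and limits are placeholders.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels:
        role: gpu                     # matches the nodeSelector used above
    spec:
      taints:
        - key: nvidia.com/gpu
          effect: NoSchedule
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g4dn.xlarge"]     # recommended GPU instance type
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: gpu                     # assumed pre-existing EC2NodeClass
  limits:
    nvidia.com/gpu: 4                 # cap the total GPUs Karpenter may provision
```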

Application Permissions

The Deepchecks application requires access to cloud provider APIs for object storage and message queue operations.

The recommended approach is to use IRSA (IAM Roles for Service Accounts) or EKS Pod Identities to grant AWS permissions to the application.

Required IAM Policy
📘

Variable Replacement

Replace BUCKET_NAME, REGION, ACCOUNT_ID and TENANT_NAME with the relevant values.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:Get*",
        "s3:List*",
        "s3:Put*",
        "s3:Delete*"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>",
        "arn:aws:s3:::<BUCKET_NAME>/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": [
        "sqs:SendMessage",
        "sqs:ReceiveMessage",
        "sqs:DeleteMessage",
        "sqs:GetQueueAttributes",
        "sqs:GetQueueUrl"
      ],
      "Resource": "arn:aws:sqs:<REGION>:<ACCOUNT_ID>:<TENANT_NAME>*"
    },
    {
      "Action": [
        "bedrock:InvokeModel"
      ],
      "Effect": "Allow",
      "Resource": "*"
    }
  ]
}
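When using IRSA, the role that carries this policy also needs a trust policy scoped to the Deepchecks service account. The sketch below follows the standard IRSA pattern; OIDC_ID and SERVICE_ACCOUNT_NAME are placeholders for your cluster's OIDC provider ID and the service account created by the Helm chart.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:deepchecks:<SERVICE_ACCOUNT_NAME>",
          "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```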

Deploying the Deepchecks Helm Chart

Registry Authentication

In order to access the Helm chart and container images you will need your credentials from the Deepchecks team.

helm registry login registry.cdn.deepchecks.com
📘

Image Pull Secret

In order to pull container images from the Deepchecks registry you will need to configure an image pull secret. See here for more information.

Installation

helm install deepchecks oci://registry.cdn.deepchecks.com/deepchecks/llm-stack \
  --values values.yaml \
  --namespace deepchecks \
  --create-namespace

For on-premises deployments, also install the ElasticMQ Helm chart for the message queue:

helm install elasticmq oci://registry.cdn.deepchecks.com/deepchecks/elasticmq \
  --values elasticmq-values.yaml \
  --namespace deepchecks

Deployment Timeout

As part of the deployment, a Kubernetes job will run to populate the object storage bucket with required model data. This job requires internet access to pull model artifacts. It typically takes approximately 2 minutes, though the duration depends on network performance. If your Helm commands time out, add --timeout 20m0s.

⚠️

Sync Job

As part of the Helm deployment each Deployment has an initContainer that checks that a 'sync' Job has completed successfully. If the Job is manually deleted, new pods will fail to start.

📘

DNS Configuration

Configure a DNS record for your chosen subdomain so that the Deepchecks UI and SDK are reachable.

KEDA Autoscaling (Optional)

KEDA can be used to automatically scale workers based on queue depth. This is optional but recommended for production deployments.

Install KEDA on your cluster, then configure auto-wired scaling in your values.yaml:

kedaAutoscaling:
  enabled: true
  provider: "aws"
  aws:
    region: "<AWS_REGION>"
    queuePrefix: "<TENANT_NAME>"
    accountId: "<AWS_ACCOUNT_ID>"
    authenticationRef:
      name: "aws-credentials"
      kind: ClusterTriggerAuthentication

kedaTriggerAuthentication:
  enabled: true
  items:
    - name: aws-credentials
      scope: cluster
      podIdentity:
        provider: aws
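With this configuration the chart wires each worker to its corresponding queue. Conceptually, the resulting scaling resource for a single worker resembles the sketch below, which uses KEDA's aws-sqs-queue scaler; this is an illustration of the trigger shape, not literal chart output, and the queueLength value is an assumption.

```yaml
# Sketch of the KEDA trigger shape for one worker; not literal chart output.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: notifier
  namespace: deepchecks
spec:
  scaleTargetRef:
    name: notifier
  triggers:
    - type: aws-sqs-queue
      authenticationRef:
        name: aws-credentials
        kind: ClusterTriggerAuthentication
      metadata:
        queueURL: https://sqs.<AWS_REGION>.amazonaws.com/<AWS_ACCOUNT_ID>/<TENANT_NAME>-notifier
        queueLength: "5"          # illustrative target messages per replica
        awsRegion: <AWS_REGION>
```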

Environment Variables

Non-sensitive values can be passed directly via the Helm chart's global.env configuration. For sensitive values, create a Kubernetes Secret and reference it via global.secretName.

| Variable | Description | Required | Default / Valid Values | Format / Notes | Sensitive |
| --- | --- | --- | --- | --- | --- |
| Database | | | | | |
| DATABASE.URI | PostgreSQL database connection string | Yes | | Format: postgresql://<username>:<password>@<host>:<port>/<database>. Ensure the database user has appropriate permissions. | Yes |
| General | | | | | |
| GENERAL.MODELS_BUCKET_NAME | Name of the storage bucket/container where ML models are stored | Yes | | This is the name of the bucket/container that you created in the Object Storage section. | |
| Logger | | | | | |
| LOGGER.FORMATTER | Output format for application logs | No | Default: JSON. Valid: JSON, TEXT | | |
| LOGGER.LEVEL_DEEPCHECKS_LLM | Logging level for Deepchecks LLM components | No | Default: INFO. Valid: DEBUG, INFO, WARNING, ERROR | | |
| OAuth | | | | | |
| OAUTH.CLIENT_ID | OAuth client identifier provided by your identity provider | Yes | | | |
| OAUTH.CLIENT_SECRET | OAuth client secret provided by your identity provider | Yes | | Store securely and never commit to version control | Yes |
| OAUTH.PROVIDER | OAuth provider type | Yes | Default: auth0. Valid: auth0, entra_id | Currently Deepchecks supports Auth0 and Entra ID | |
| OAUTH.SERVER_URL | Base URL of your OAuth authorization server | Yes | | | |
| OAUTH.TENANT_URL | Tenant-specific URL for your OAuth provider | Yes | | | |
| Redis | | | | | |
| REDIS.HOST | Hostname or IP address of the Redis/Valkey server | Yes | | | |
| REDIS.PASSWORD | Authentication password for the Redis/Valkey server | No (only if authentication is enabled) | | Store securely if authentication is enabled | Yes |
| REDIS.PORT | Port number for the Redis/Valkey connection | Yes | Default: 6379 | | |
| Tenant | | | | | |
| TENANT_NAME | Name of the default tenant. Used as prefix for queue names and other tenant-specific resources. | Yes | | Format: lowercase alphanumeric characters and hyphens only. This tenant is automatically created when the application first starts. | |
| LICENSE_KEY | License key provided by Deepchecks for your deployment | Yes | | Also used as the password for accessing the container registry. | Yes |
| CREATE_ORG_FOR_USER | Email address of the first user in the system (will have the 'Owner' role) | Yes | | This user must be the first to log in. They can subsequently transfer ownership. | |
| Web Application | | | | | |
| WEBAPP.AUTH_JWT_SECRET | Secret key used to sign JWT tokens for API authentication | Yes | | Generate with openssl rand -base64 32. Minimum 32 characters. Store securely. | Yes |
| WEBAPP.DEPLOYMENT_URL | FQDN where the application is deployed | Yes | | Format: complete URL including protocol (e.g., https://deepchecks.example.com). | |
| On-Premises Only | | | | | |
| STORAGE_ACCESS_KEY_ID | Access key ID for S3-compatible storage | Yes (on-prem only) | | Required when using S3-compatible storage (MinIO, Ceph, etc.) | Yes |
| STORAGE_SECRET_ACCESS_KEY | Secret access key for S3-compatible storage | Yes (on-prem only) | | Required when using S3-compatible storage | Yes |
| STORAGE_ENDPOINT | Endpoint URL of S3-compatible storage | Yes (on-prem only) | | e.g., https://minio.example.com:9000 | |
| QUEUE_ENDPOINT | Endpoint URL of the ElasticMQ service | Yes (on-prem only) | | e.g., http://elasticmq:9324 | |

Example Configurations

values.yaml (default chart values)
global:
  image:
    repository: registry.cdn.deepchecks.com/deepchecks
    name: llm
    pullPolicy: IfNotPresent
    tag: ""
  imagePullSecrets: []
  env: {}
  secretName: ""
  revisionHistoryLimit: 3

serviceAccount:
  create: true
  automount: true
  annotations: {}

web:
  replicaCount: 1
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  securityContext: {}
  resources:
    requests:
      cpu: 2000m
      memory: 8Gi
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
  volumes: []
  volumeMounts: []
  nodeSelector: {}
  tolerations: []
  affinity: {}
  service:
    type: ClusterIP
    port: 8000
  ingress:
    enabled: true
    className: ""
    annotations: {}
    host: ""
    tls: []

workerDefaults: &workerDefaults
  replicaCount: 1
  podAnnotations: {}
  podLabels: {}
  podSecurityContext: {}
  securityContext: {}
  resources:
    limits:
      cpu: 1000m
      memory: 6Gi
    requests:
      cpu: 1000m
      memory: 6Gi
  autoscaling:
    enabled: false
    minReplicas: 1
    maxReplicas: 10
    targetCPUUtilizationPercentage: 80
  nodeSelector: {}
  tolerations: []
  affinity: {}

workers:
  advanced-llm-prop-calculator:
    <<: *workerDefaults
    init: advanced_llm_props_runner
  batcher-runner:
    <<: *workerDefaults
    init: batcher_runner
  dc-calibrator:
    <<: *workerDefaults
    init: dc_calibrator
  dc-proba-calculator:
    <<: *workerDefaults
    init: dc_proba_calculator
  estimate-annotation-calculator:
    <<: *workerDefaults
    init: estimate_annotation_runner
  general-calc-runner:
    <<: *workerDefaults
    init: general_calc_runner
  garak-props-runner:
    <<: *workerDefaults
    init: garak_props_runner
  gpu-runner:
    <<: *workerDefaults
    init: gpu_runner
  insights-runner:
    <<: *workerDefaults
    init: insights_runner
  llm-properties-calculator:
    <<: *workerDefaults
    init: llm_props
  notifier:
    <<: *workerDefaults
    init: notifier
  pre-calc-eng:
    <<: *workerDefaults
    init: pre_calc_eng
  translator:
    <<: *workerDefaults
    init: translator
Example client values.yaml
serviceAccount:
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/DeepchecksRole

global:
  image:
    tag: "0.39.2"
  imagePullSecrets:
    - name: deepchecks-registry
  secretName: "deepchecks-secrets"
  env:
    CREATE_ORG_FOR_USER: "[email protected]"
    TENANT_NAME: "example"
    GENERAL.MODELS_BUCKET_NAME: "deepchecks-models"
    LOGGER.LEVEL_DEEPCHECKS_LLM: "INFO"
    REDIS.HOST: "host.redis.example.com"
    REDIS.PORT: "6379"
    LOGGER.FORMATTER: "TEXT"
    WEBAPP.DEPLOYMENT_URL: "https://deepchecks.example.com"
    OAUTH.PROVIDER: "auth0"
    OAUTH.SERVER_URL: "https://example.auth0.com"
    OAUTH.TENANT_URL: "https://example.auth0.com"
    OAUTH.CLIENT_ID: "your-client-id"

web:
  ingress:
    enabled: true
    className: "alb"
    annotations:
      alb.ingress.kubernetes.io/scheme: internet-facing
    host: deepchecks.example.com
    tls:
      - hosts:
          - deepchecks.example.com
        secretName: deepchecks-tls

workers:
  gpu-runner:
    nodeSelector:
      role: gpu
    tolerations:
      - effect: NoSchedule
        key: nvidia.com/gpu
        operator: Exists