Self-Hosted Deployments
Deepchecks Self-Hosted Enterprise runs entirely in your own infrastructure, giving you full control over networking, security, and scaling. The platform is designed to run on Kubernetes to ensure reliable performance and horizontal scalability in production environments. This documentation walks through the infrastructure components and configuration required to deploy Deepchecks successfully.
Prerequisites
Infrastructure Prerequisites
(Diagram: a typical Deepchecks deployment running on AWS.)
Prior to deploying Self-Hosted Enterprise, Deepchecks recommends having each of the following infrastructure components ready to go. When possible, it's easiest to have all components running in the same VPC. The provided recommendations are for customers deploying to AWS:
| Component | Recommendation | Notes |
|---|---|---|
| Kubernetes Cluster | Amazon EKS cluster deployed in at least 2 availability zones | We recommend a cluster configured with Karpenter in order to provide automatic node scaling |
| GPU powered nodes | g4dn.xlarge | Each GPU worker requires a dedicated Node |
| Ingress Controller | AWS Load Balancer Controller | A dedicated subdomain is required for access to the application |
| Object Storage | AWS S3 Bucket | |
| Dedicated Database | AWS RDS Postgres | We suggest starting with an instance such as db.r6g.large and scaling up if necessary. Note that currently we do not support Postgres connections with TLS. |
| Redis Cache | AWS ElastiCache | cache.t4g.micro is the recommended instance size for all deployment sizes. Note that currently we do not support Redis connections with TLS. |
| Processing Queue | AWS SQS | See the SQS Queue Configuration section for further details |
| External Secrets Manager | AWS Secrets Manager | (Optional) Recommended for use with the external secrets operator for securely providing sensitive data to your Deepchecks Self-Hosted Enterprise solution |
| Identity Provider | Auth0 or Entra ID | Currently Deepchecks supports Auth0 and Entra ID |
SQS Queue Configuration
Overview
This section describes the AWS SQS queue configuration for the system. All queues are created with the configurable prefix (default: deepchecks) and include dead letter queue (DLQ) configurations for failed message handling.
Queue Prefix: The prefix used must match the TENANT_NAME environment variable provided when deploying the Helm chart. See the Environment Variables section for further details.
Queue Configuration Summary
Note: All FIFO queue names include the .fifo suffix as required by AWS.
Table of required SQS queues and their relevant configuration
| Queue Name | Visibility Timeout | Max Receive Count |
|---|---|---|
| insights-calculator.fifo | 360 seconds | 3 |
| insights-calculator-dlq.fifo | 360 seconds | - |
| garak-props-calculator | 360 seconds | 3 |
| garak-props-calculator-dlq | 360 seconds | - |
| props-calc-batcher | 360 seconds | 3 |
| props-calc-batcher-dlq | 360 seconds | - |
| translation | 660 seconds | 3 |
| translation-dlq | 660 seconds | - |
| pre-calc-eng | 660 seconds | 3 |
| pre-calc-eng-dlq | 660 seconds | - |
| advanced-llm-prop-calculator | 360 seconds | 3 |
| advanced-llm-prop-calculator-dlq | 360 seconds | - |
| calibrator.fifo | 360 seconds | 3 |
| calibrator-dlq.fifo | 360 seconds | - |
| notifier | 360 seconds | 3 |
| notifier-dlq | 360 seconds | - |
| proba-calculator | 360 seconds | 3 |
| proba-calculator-dlq | 360 seconds | - |
| llm-properties | 660 seconds | 3 |
| llm-properties-dlq | 660 seconds | - |
| topics-inference | 660 seconds | 3 |
| topics-inference-dlq | 660 seconds | - |
| topics-train.fifo | 660 seconds | 3 |
| topics-train-dlq.fifo | 660 seconds | - |
| similarity-annotations | 360 seconds | 3 |
| similarity-annotations-dlq | 360 seconds | - |
| properties-calculator | 360 seconds | 3 |
| properties-calculator-dlq | 360 seconds | - |
| estimate-annotation-calculator | 360 seconds | 3 |
| estimate-annotation-calculator-dlq | 360 seconds | - |
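As an illustration of the naming convention, with the default prefix deepchecks the insights-calculator.fifo queue would be created as deepchecks-insights-calculator.fifo. A quick way to sanity-check the queues after creating them is with the AWS CLI; the account ID, region, and queue URL below are placeholders:
# List all queues created with the default Deepchecks prefix
aws sqs list-queues --queue-name-prefix deepchecks
# Inspect a single queue's visibility timeout and redrive (DLQ) policy
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/deepchecks-insights-calculator.fifo \
  --attribute-names VisibilityTimeout RedrivePolicy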
Example Terraform to create the relevant queues
The following Terraform example can help you quickly bootstrap your queues:
variable "queue_prefix" {
description = "Prefix for queue names"
type = string
default = "deepchecks"
}
locals {
queues = {
insights-calculator = {
visibility_timeout = 360
fifo = true
max_receive_count = 3
}
garak-props-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
props-calc-batcher = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
translation = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
pre-calc-eng = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
advanced-llm-prop-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
calibrator = {
visibility_timeout = 360
fifo = true
max_receive_count = 3
}
notifier = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
proba-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
llm-properties = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
topics-inference = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
topics-train = {
visibility_timeout = 660
fifo = true
max_receive_count = 3
}
similarity-annotations = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
properties-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
estimate-annotation-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
}
}
# Dead Letter Queues
resource "aws_sqs_queue" "dlq" {
for_each = local.queues
name = "${var.queue_prefix}-${each.key}-dlq${each.value.fifo ? ".fifo" : ""}"
fifo_queue = each.value.fifo
visibility_timeout_seconds = each.value.visibility_timeout
content_based_deduplication = each.value.fifo ? true : false
}
# Main Queues
resource "aws_sqs_queue" "queue" {
for_each = local.queues
name = "${var.queue_prefix}-${each.key}${each.value.fifo ? ".fifo" : ""}"
fifo_queue = each.value.fifo
visibility_timeout_seconds = each.value.visibility_timeout
content_based_deduplication = each.value.fifo ? true : false
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
maxReceiveCount = each.value.max_receive_count
})
}
S3 Bucket
The Deepchecks application requires access to an S3 bucket in order to function.
Bucket Name: Note the name of the bucket you create; you will need it to create the relevant application permissions and to set it as an environment variable in the Helm chart later on.
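A minimal sketch of creating the bucket with the AWS CLI, assuming the illustrative name deepchecks-models and the us-east-1 region (buckets in other regions also need a LocationConstraint):
# Create the models bucket (name and region are placeholders)
aws s3api create-bucket --bucket deepchecks-models --region us-east-1
# Keep the bucket private
aws s3api put-public-access-block \
  --bucket deepchecks-models \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true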
Kubernetes Configuration
A few notes on Kubernetes cluster provisioning for Deepchecks Self-Hosted Enterprise:
- Deepchecks currently supports Amazon Elastic Kubernetes Service (EKS) on EC2.
- Deepchecks recommends running on a cluster that supports autoscaling, such as with Karpenter.
- Deepchecks doesn't support Amazon EKS on Fargate.
We require you to install and configure the following Kubernetes tooling:
- Install helm by following these instructions
- Install kubectl by following these instructions
- Configure kubectl to connect to your cluster by using kubectl config use-context my-cluster-name. See here for how to configure your kubecontext for AWS; a minimal example follows this list.
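For an EKS cluster, a minimal sketch of wiring up kubectl looks like the following (the cluster name, region, and account ID are placeholders):
# Add or update the kubeconfig entry for your EKS cluster
aws eks update-kubeconfig --region us-east-1 --name my-cluster-name
# update-kubeconfig names the context after the cluster ARN; switch to it and verify access
kubectl config use-context arn:aws:eks:us-east-1:123456789012:cluster/my-cluster-name
kubectl get nodes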
We also require you to create a Kubernetes namespace for your Deepchecks deployment:
(This can also be done as part of the Helm deployment described further on)
kubectl create namespace deepchecks
Configuring Kubernetes Secrets
External Secrets Operator: While provisioning a Kubernetes Secret manually will work, we recommend using the External Secrets Operator to securely create this secret on your behalf.
Sensitive credentials must be made available in a Kubernetes Secret during deployment. The name of this secret must be set in your values.yaml file at global.secretName. Ensure all required secrets are configured before deploying Deepchecks Self-Hosted Enterprise. You can find the list of environment variables and secrets in the Environment Variables section.
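A minimal sketch of creating the secret manually is shown below; the secret name deepchecks-secrets is illustrative, and the keys are assumed to mirror the sensitive variables from the Environment Variables table, so align both with your actual configuration:
# Create the Kubernetes Secret holding the sensitive Deepchecks settings
# (secret name and key names are assumptions; match them to your values.yaml)
kubectl create secret generic deepchecks-secrets \
  --namespace deepchecks \
  --from-literal=DATABASE.URI='postgresql://user:password@db-host:5432/deepchecks' \
  --from-literal=OAUTH.CLIENT_SECRET='your-oauth-client-secret' \
  --from-literal=WEBAPP.AUTH_JWT_SECRET="$(openssl rand -base64 32)" \
  --from-literal=LICENSE_KEY='your-license-key'
Then set global.secretName: deepchecks-secrets in your values.yaml so the chart can reference it.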
GPU powered Nodes
For Kubernetes to utilize the GPU nodes, the NVIDIA GPU Operator must be installed on your cluster.
Example values.yaml file for the GPU driver
This example works as of driver version 0.18.0
nodeSelector:
role: gpu
affinity: null
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- key: nvidia.com/gpu
operator: Exists
GPU Node Configuration: In this example our GPU nodes have the label role: gpu. Be sure to configure the driver as suits your deployment needs.
The gpu-runner worker must run on a GPU powered Node. You can configure this via the Helm chart.
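A hedged sketch of installing the operator with Helm, using the values file above saved as gpu-values.yaml; the repository URL and chart name follow NVIDIA's public Helm repository, so confirm them and the values layout against NVIDIA's documentation for your target version:
# Add NVIDIA's Helm repository and install the GPU operator with the values shown above
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --values gpu-values.yaml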
Application Permissions
For the application to access the AWS APIs it needs to function, you must grant it the following permissions. The two recommended ways of doing so are IRSA or Pod Identities.
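If you choose IRSA, one possible way to wire it up is with eksctl, creating a role with an IAM policy containing the statements below attached (the cluster, service account, role, and policy names are placeholders); the resulting role ARN is then set as a serviceAccount annotation in the Helm values, as in the client.yaml example further on:
# Create an IAM role (IRSA) for the Deepchecks service account without creating the account itself
eksctl create iamserviceaccount \
  --cluster my-cluster-name \
  --namespace deepchecks \
  --name deepchecks \
  --role-name DeepchecksRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::123456789012:policy/DeepchecksPolicy \
  --approve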
Required IAM Policy
Variable Replacement: Be sure to replace BUCKET_NAME, REGION, ACCOUNT_ID, and TENANT_NAME with the relevant values.
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:Get*",
"s3:List*",
"s3:Put*",
"s3:Delete*"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>",
"arn:aws:s3:::<BUCKET_NAME>/*"
]
},
{
"Effect": "Allow",
"Action": [
"sqs:SendMessage",
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl"
],
"Resource": "arn:aws:sqs:REGION:ACCOUNT_ID:<TENANT_NAME>*"
},
{
"Action": [
"bedrock:InvokeModel"
],
"Effect": "Allow",
"Resource": "*"
}
]
}
Deploying the Deepchecks Helm Chart
To access the Helm chart and container images you will need credentials for Deepchecks' registry.
Registry Authentication: To authenticate with the Deepchecks registry you will need the credentials provided by the Deepchecks team.
helm registry login registry.llm.deepchecks.com
helm install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack --values values.yaml
Image Pull Secret: To pull the container images from Deepchecks' repository you will need to configure a secret that provides the credentials to the Helm chart. See here for more information.
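A minimal sketch of creating that secret with kubectl; the secret name deepchecks-registry matches the client.yaml example below, the username placeholder is whatever the Deepchecks team provides, and the license key acts as the password:
# Create a docker-registry secret so pods can pull the Deepchecks images
kubectl create secret docker-registry deepchecks-registry \
  --namespace deepchecks \
  --docker-server=registry.llm.deepchecks.com \
  --docker-username='<REGISTRY_USERNAME>' \
  --docker-password='<LICENSE_KEY>'
Reference it in your values under global.imagePullSecrets, as shown in the client.yaml example.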
Configuring the Deepchecks Helm chart requires providing several values via environment variables. Non-sensitive values can be passed directly through the Helm chart; for sensitive values we recommend leveraging an external secrets manager.
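For example, non-sensitive values can also be set inline at install time; note that literal dots inside a key such as GENERAL.MODELS_BUCKET_NAME must be escaped with a backslash when using --set (the bucket name here is a placeholder):
# Pass non-sensitive environment variables directly on the command line
helm install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack \
  --namespace deepchecks \
  --values values.yaml \
  --set global.env.TENANT_NAME=example \
  --set 'global.env.GENERAL\.MODELS_BUCKET_NAME=deepchecks-models'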
Environment Variables
| Variable | Description | Required | Default/Valid Values | Format/Notes | Sensitive |
|---|---|---|---|---|---|
| Database | | | | | |
| DATABASE.URI | PostgreSQL database connection string | Yes | | Format: postgresql://<username>:<password>@<host>:<port>/<database>. Ensure the database user has appropriate permissions | Yes |
| General | | | | | |
| GENERAL.MODELS_BUCKET_NAME | Name of the cloud storage bucket where ML models are stored | Yes | | This is the name of the bucket that you created in the S3 Bucket section | |
| Logger | | | | | |
| LOGGER.FORMATTER | Output format for application logs | No | Default: JSON; Valid: JSON, TEXT | | |
| LOGGER.LEVEL_DEEPCHECKS_LLM | Logging level for Deepchecks LLM components | No | Default: INFO; Valid: DEBUG, INFO, WARNING, ERROR | | |
| OAuth | | | | | |
| OAUTH.CLIENT_ID | OAuth client identifier provided by your identity provider | | | | |
| OAUTH.CLIENT_SECRET | OAuth client secret provided by your identity provider | | | Store securely and never commit to version control | Yes |
| OAUTH.PROVIDER | OAuth provider type | | Default: auth0; Valid: auth0, entra_id | Currently Deepchecks supports Auth0 and Entra ID | |
| OAUTH.SERVER_URL | Base URL of your OAuth authorization server | | | | |
| OAUTH.TENANT_URL | Tenant-specific URL for your OAuth provider | | | | |
| Redis | | | | | |
| REDIS.HOST | Hostname or IP address of the Redis server | Yes | | | |
| REDIS.PASSWORD | Authentication password for the Redis server | No (only if Redis requires authentication) | | Store securely if authentication is enabled | Yes |
| REDIS.PORT | Port number for the Redis server connection | Yes | Default: 6379 | | |
| Tenant | | | | | |
| TENANT_NAME | Name of the default tenant created during application initialization. Used as the prefix for SQS queue names and other tenant-specific resources | Yes | | Format: lowercase alphanumeric characters and hyphens only. This tenant is automatically created when the application first starts | |
| LICENSE_KEY | Your license key provided by Deepchecks for your deployment | Yes | | This also acts as your password for accessing the Deepchecks container registry | Yes |
| CREATE_ORG_FOR_USER | The email address of the first user in the system. This user will have the 'Owner' role | Yes | | This user must be the first user to log in to the system. They can subsequently transfer ownership of the application | |
| Web Application | | | | | |
| WEBAPP.AUTH_JWT_SECRET | Secret key used to sign JWT tokens for API authentication within the system | Yes | | Generate a strong, random secret (minimum 32 characters recommended); use openssl rand -base64 32 to generate a secure secret. Store securely and never commit to version control | Yes |
| WEBAPP.DEPLOYMENT_URL | Fully Qualified Domain Name (FQDN) where the application is deployed | Yes | | Format: complete URL including protocol. This URL is used for generating callback URLs and external links | |
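As noted in the table, WEBAPP.AUTH_JWT_SECRET should be a strong random value; a quick way to generate one:
# Generate a 32-byte random secret for WEBAPP.AUTH_JWT_SECRET
openssl rand -base64 32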
values.yaml
# Default values for deepchecks-llm-stack.
# This is a YAML-formatted file.
global:
image:
repository: harbor.llmdev.deepchecks.com/deepchecks/llm
# This sets the pull policy for images.
pullPolicy: IfNotPresent
# Image tag; defaults to Chart.appVersion if empty
tag: ""
# Pull secrets for private container registries
imagePullSecrets: []
# Environment variables passed to all pods via ConfigMap
env: {}
# Number of old ReplicaSets to retain for rollback
revisionHistoryLimit: 3
serviceAccount:
# Create a shared ServiceAccount for all components
create: true
# Mount API credentials into pods
automount: true
# Annotations for cloud provider integrations (e.g., IAM roles)
annotations: {}
web:
# Number of web server replicas
replicaCount: 1
# Override the chart name
nameOverride: ""
fullnameOverride: ""
# Pod annotations for monitoring, logging, or mesh integration
podAnnotations: {}
# Additional labels for pods
podLabels: {}
# Pod-level security settings (fsGroup, runAsUser, etc.)
podSecurityContext: {}
# fsGroup: 2000
# Container-level security settings
securityContext: {}
# capabilities:
# drop:
# - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000
# CPU and memory requests/limits
resources:
requests:
cpu: 2000m
memory: 8Gi
autoscaling:
# Enable Horizontal Pod Autoscaler
enabled: false
minReplicas: 1
maxReplicas: 10
# Target CPU percentage for scaling
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 80
# Additional volumes for the deployment
volumes: []
# Additional volume mounts for containers
volumeMounts: []
# Node selector for pod scheduling
nodeSelector: {}
# Tolerations for node taints
tolerations: []
# Affinity rules for pod placement
affinity: {}
service:
# Service type: ClusterIP, NodePort, or LoadBalancer
type: ClusterIP
# Port the service exposes
port: 8000
ingress:
# Enable ingress resource creation
enabled: true
# Ingress controller class (e.g., nginx, kong, traefik)
className: ""
# Ingress annotations for TLS, auth, rate-limiting, etc.
annotations: {}
# Hostname for the application (required if ingress enabled)
host: ""
# TLS configuration
tls: []
# Default configuration inherited by all workers
workerDefaults: &workerDefaults
replicaCount: 1
nameOverride: ""
fullnameOverride: ""
podAnnotations: {}
podLabels: {}
podSecurityContext: {}
securityContext: {}
resources:
limits:
cpu: 1000m
memory: 6Gi
requests:
cpu: 1000m
memory: 6Gi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 80
nodeSelector: {}
tolerations: []
affinity: {}
# Background worker deployments
workers:
# Computes advanced LLM properties
advanced-llm-prop-calculator:
<<: *workerDefaults
init: advanced_llm_props_runner
# Handles batch processing jobs
batcher-runner:
<<: *workerDefaults
init: batcher_runner
# Runs calibration logic
dc-calibrator:
<<: *workerDefaults
init: dc_calibrator
# Calculates probability scores
dc-proba-calculator:
<<: *workerDefaults
init: dc_proba_calculator
# Estimates annotations
estimate-annotation-calculator:
<<: *workerDefaults
init: estimate_annotation_runner
# Runs Garak security property checks
garak-props-runner:
<<: *workerDefaults
init: garak_props_runner
# GPU-accelerated processing (requires GPU nodes)
gpu-runner:
<<: *workerDefaults
init: gpu_runner
# Generates insights from analysis
insights-runner:
<<: *workerDefaults
init: insights_runner
# Computes LLM properties
llm-properties-calculator:
<<: *workerDefaults
init: llm_props
# Sends notifications
notifier:
<<: *workerDefaults
init: notifier
# Pre-calculation engine
pre-calc-eng:
<<: *workerDefaults
init: pre_calc_eng
# Handles data transformation
translator:
<<: *workerDefaults
    init: translator
client.yaml Example
Example Configuration: This is a simple example configuration for a customer deployment.
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::12345567890:role/DeepchecksRole
global:
image:
tag: "0.39.2"
# supplied here is the secret containing the username and password needed to pull the Deepchecks containers
imagePullSecrets:
- name: deepchecks-registry
env:
CREATE_ORG_FOR_USER: "[email protected]"
TENANT_NAME: "example"
GENERAL.MODELS_BUCKET_NAME: "deepchecks-models"
LOGGER.LEVEL_DEEPCHECKS_LLM: "INFO"
REDIS.HOST: "host.redis.com"
REDIS.PORT: "10000"
LOGGER.FORMATTER: "TEXT"
GENERAL.IGNORE_EMAIL_VERIFICATION: "True"
# in this example kong is used as the ingress controller along with cert manager for certificate management
web:
ingress:
enabled: true
className: "kong"
annotations:
cert-manager.io/cluster-issuer: issuer
host: deepchecks.example.com
tls:
- hosts:
- deepchecks.example.com
secretName: deepchecks-tls
workers:
# it is important to configure your node selectors and taints in order to ensure that your GPU workers run on your GPU powered nodes
gpu-runner:
nodeSelector:
role: gpu
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Deployment Timeout: As part of the deployment, a Kubernetes Job runs to populate the models S3 bucket with the required data. Typically this takes approximately 2 minutes, but it depends on network performance. If your Helm commands are timing out, it is recommended to add --timeout 20m0s.
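For example, a sketch of an install or upgrade with the longer timeout applied (same chart reference as above):
# Give the sync job up to 20 minutes to finish before Helm gives up
helm upgrade --install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack \
  --namespace deepchecks \
  --values values.yaml \
  --timeout 20m0s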
Sync Job: As part of the Helm deployment, each Deployment has an initContainer that checks that a 'sync' Job has completed successfully. If the Job is manually deleted, new pods will fail to start.
DNS Configuration: Don't forget to configure your DNS record so that you can access the Deepchecks UI and SDK.
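If you use Route 53, a hedged sketch of pointing the subdomain at the ingress load balancer (the hosted zone ID, domain, and load balancer hostname are placeholders):
# Create or update a CNAME record for the Deepchecks subdomain
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "deepchecks.example.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "my-ingress-lb-1234567890.us-east-1.elb.amazonaws.com"}]
      }
    }]
  }'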