Self-Hosted Deployments
Deepchecks Self-Hosted Enterprise runs entirely in your own infrastructure, giving you full control over networking, security, and scaling. The platform is designed to run on Kubernetes to ensure reliable performance and horizontal scalability in production environments. This documentation walks through the infrastructure components and configuration required to deploy Deepchecks successfully.
Prerequisites
Infrastructure Prerequisites
The following diagram illustrates a typical Deepchecks deployment running on AWS:
Prior to deploying Self-Hosted Enterprise, Deepchecks recommends having each of the following infrastructure components in place. Select your deployment environment below for provider-specific guidance.
| Component | Recommendation | Notes |
|---|---|---|
| Kubernetes Cluster | Amazon EKS cluster deployed in at least 2 availability zones | We recommend a cluster configured with Karpenter for automatic node scaling. Kubernetes version 1.33 or higher is required. |
| GPU Nodes | g4dn.xlarge (1x NVIDIA T4, 4 vCPU, 16 GiB RAM) | Each GPU worker requires a dedicated node |
| Ingress Controller | AWS Load Balancer Controller | A dedicated subdomain is required for access to the application |
| Object Storage | AWS S3 Bucket | |
| Database | AWS RDS PostgreSQL 16 | We suggest starting with an instance such as db.r6g.large and scaling up if necessary. TLS connections are not currently supported. |
| Cache | AWS ElastiCache (Redis) | cache.t4g.micro is recommended for all deployment sizes. TLS connections are not currently supported. |
| Processing Queue | AWS SQS | See the Queue Configuration section for further details |
| Secrets Manager | AWS Secrets Manager | (Optional) Recommended for use with the External Secrets Operator for securely providing sensitive data to your deployment |
| Identity Provider | Currently Deepchecks supports Auth0 and Entra ID | |
Minimum Resource Requirements
The following table summarizes the default CPU and memory requests for each Deepchecks component. These represent the minimum resources your cluster must be able to schedule.
**Node Storage:** We recommend that each node has a minimum of 150 GB of available storage.
| Component | Replicas | CPU Request | Memory Request |
|---|---|---|---|
| Web Server | 1 | 2000m | 8Gi |
| **Workers (each)** | | | |
| advanced-llm-prop-calculator | 1 | 1000m | 6Gi |
| batcher-runner | 1 | 1000m | 6Gi |
| dc-calibrator | 1 | 1000m | 6Gi |
| dc-proba-calculator | 1 | 1000m | 6Gi |
| estimate-annotation-calculator | 1 | 1000m | 6Gi |
| general-calc-runner | 1 | 1000m | 6Gi |
| garak-props-runner | 1 | 1000m | 6Gi |
| gpu-runner | 1 | 1000m | 6Gi |
| insights-runner | 1 | 1000m | 6Gi |
| llm-properties-calculator | 1 | 1000m | 6Gi |
| notifier | 1 | 1000m | 6Gi |
| pre-calc-eng | 1 | 1000m | 6Gi |
| translator | 1 | 1000m | 6Gi |
| Models Sync Job | 1 (one-time) | 2000m | 8Gi |
| Total | 14 + job | 17 vCPU | 94 Gi |
**GPU Worker:** The `gpu-runner` worker must run on a dedicated GPU-powered node. Its requests are counted in the totals above, but they must be schedulable on separate GPU hardware rather than the general-purpose node pool. See the Infrastructure Prerequisites section for recommended GPU instance types.
All resource values can be customized in your values.yaml. See the Example Configurations section for details.
Queue Configuration
Overview
Deepchecks uses message queues to coordinate background processing tasks. All queues are created with a configurable prefix that must match the TENANT_NAME environment variable provided during Helm chart deployment. See Environment Variables for further details.
All queues are AWS SQS queues. FIFO queue names include the .fifo suffix as required by AWS.
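As a hedged illustration of this naming convention (derived from the table and the Terraform example below; `queue_name` is a hypothetical helper, not part of the product), the full queue names can be computed like so:

```python
# Hypothetical helper illustrating the queue naming convention:
# "<TENANT_NAME>-<queue>[-dlq][.fifo]".
def queue_name(tenant_name: str, base: str, fifo: bool = False, dlq: bool = False) -> str:
    name = f"{tenant_name}-{base}"
    if dlq:
        name += "-dlq"    # dead-letter queues get a -dlq suffix...
    if fifo:
        name += ".fifo"   # ...and FIFO queues end with .fifo, as AWS requires
    return name

print(queue_name("deepchecks", "insights-calculator", fifo=True))
# deepchecks-insights-calculator.fifo
print(queue_name("deepchecks", "topics-train", fifo=True, dlq=True))
# deepchecks-topics-train-dlq.fifo
```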
Table of required SQS queues and their configuration
| Queue Name | Visibility Timeout | Max Receive Count | FIFO |
|---|---|---|---|
| insights-calculator.fifo | 360 seconds | 3 | Yes |
| insights-calculator-dlq.fifo | 360 seconds | - | Yes |
| garak-props-calculator | 360 seconds | 3 | No |
| garak-props-calculator-dlq | 360 seconds | - | No |
| props-calc-batcher | 360 seconds | 3 | No |
| props-calc-batcher-dlq | 360 seconds | - | No |
| translation | 660 seconds | 3 | No |
| translation-dlq | 660 seconds | - | No |
| pre-calc-eng | 660 seconds | 3 | No |
| pre-calc-eng-dlq | 660 seconds | - | No |
| advanced-llm-prop-calculator | 360 seconds | 3 | No |
| advanced-llm-prop-calculator-dlq | 360 seconds | - | No |
| calibrator.fifo | 360 seconds | 3 | Yes |
| calibrator-dlq.fifo | 360 seconds | - | Yes |
| notifier | 360 seconds | 3 | No |
| notifier-dlq | 360 seconds | - | No |
| proba-calculator | 360 seconds | 3 | No |
| proba-calculator-dlq | 360 seconds | - | No |
| llm-properties | 660 seconds | 3 | No |
| llm-properties-dlq | 660 seconds | - | No |
| topics-inference | 660 seconds | 3 | No |
| topics-inference-dlq | 660 seconds | - | No |
| topics-train.fifo | 660 seconds | 3 | Yes |
| topics-train-dlq.fifo | 660 seconds | - | Yes |
| similarity-annotations | 360 seconds | 3 | No |
| similarity-annotations-dlq | 360 seconds | - | No |
| properties-calculator | 360 seconds | 3 | No |
| properties-calculator-dlq | 360 seconds | - | No |
| estimate-annotation-calculator | 360 seconds | 3 | No |
| estimate-annotation-calculator-dlq | 360 seconds | - | No |
| general-calc | 660 seconds | 3 | No |
| general-calc-dlq | 660 seconds | - | No |
Example Terraform to create the required queues
The following Terraform example can be used to bootstrap your queues:
variable "queue_prefix" {
description = "Prefix for queue names (must match TENANT_NAME)"
type = string
default = "deepchecks"
}
locals {
queues = {
insights-calculator = {
visibility_timeout = 360
fifo = true
max_receive_count = 3
}
garak-props-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
props-calc-batcher = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
translation = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
pre-calc-eng = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
advanced-llm-prop-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
calibrator = {
visibility_timeout = 360
fifo = true
max_receive_count = 3
}
notifier = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
proba-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
llm-properties = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
topics-inference = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
topics-train = {
visibility_timeout = 660
fifo = true
max_receive_count = 3
}
similarity-annotations = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
properties-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
estimate-annotation-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
general-calc = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
}
}
# Dead Letter Queues
resource "aws_sqs_queue" "dlq" {
for_each = local.queues
name = "${var.queue_prefix}-${each.key}-dlq${each.value.fifo ? ".fifo" : ""}"
fifo_queue = each.value.fifo
visibility_timeout_seconds = each.value.visibility_timeout
content_based_deduplication = each.value.fifo ? true : false
}
# Main Queues
resource "aws_sqs_queue" "queue" {
for_each = local.queues
name = "${var.queue_prefix}-${each.key}${each.value.fifo ? ".fifo" : ""}"
fifo_queue = each.value.fifo
visibility_timeout_seconds = each.value.visibility_timeout
content_based_deduplication = each.value.fifo ? true : false
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
maxReceiveCount = each.value.max_receive_count
})
}
Object Storage
Deepchecks requires access to object storage for storing ML models and application data.
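If you are provisioning the bucket with Terraform, a minimal sketch (the bucket name `deepchecks-models` is an example; it must match the GENERAL.MODELS_BUCKET_NAME value you pass to the Helm chart):

```terraform
# Example bucket for Deepchecks model and application data.
resource "aws_s3_bucket" "deepchecks_models" {
  bucket = "deepchecks-models" # must match GENERAL.MODELS_BUCKET_NAME
}

# Keep the bucket private; the application reaches it via IAM, not public URLs.
resource "aws_s3_bucket_public_access_block" "deepchecks_models" {
  bucket                  = aws_s3_bucket.deepchecks_models.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```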
Kubernetes Configuration
Supported Environments:
- Amazon Elastic Kubernetes Service (EKS) on EC2
- Kubernetes version 1.33 or higher
- Clusters configured with Karpenter are recommended for automatic node scaling
- Amazon EKS on Fargate is not supported
Tooling
Install and configure the following Kubernetes tooling:
- Install `helm` by following these instructions
- Install `kubectl` by following these instructions
- Configure `kubectl` to connect to your cluster:
See here for how to configure your kubecontext for AWS EKS.
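For AWS EKS, this typically amounts to a single AWS CLI call (region and cluster name are placeholders):

```shell
aws eks update-kubeconfig --region <REGION> --name <CLUSTER_NAME>
```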
Create a Kubernetes namespace for your Deepchecks deployment (this can also be done as part of the Helm deployment):
kubectl create namespace deepchecks
Configuring Kubernetes Secrets
Sensitive credentials must be available in a Kubernetes Secret during deployment. The secret must be referenced in your values.yaml file at global.secretName. Ensure all required secrets are configured before deploying. See the Environment Variables section for the full list.
**External Secrets Operator:** While creating a Kubernetes Secret manually will work, we recommend using the External Secrets Operator with your secrets manager of choice (e.g., AWS Secrets Manager, Azure Key Vault, Google Secret Manager) to securely create this secret on your behalf.
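As a sketch of the External Secrets Operator approach — assuming a ClusterSecretStore named `aws-secrets-manager` backed by AWS Secrets Manager, a remote secret named `deepchecks/prod`, and the `v1beta1` API version; adjust all of these to your setup:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: deepchecks-secrets
  namespace: deepchecks
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager  # example ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: deepchecks-secrets   # referenced by global.secretName in values.yaml
  data:
    - secretKey: DATABASE.URI
      remoteRef:
        key: deepchecks/prod   # example Secrets Manager secret name
        property: DATABASE.URI
    - secretKey: OAUTH.CLIENT_SECRET
      remoteRef:
        key: deepchecks/prod
        property: OAUTH.CLIENT_SECRET
```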
Example: creating the secret manually
kubectl create secret generic deepchecks-secrets \
--namespace deepchecks \
--from-literal=DATABASE.URI='postgresql://user:pass@host:5432/dbname' \
--from-literal=OAUTH.CLIENT_SECRET='your-oauth-secret' \
--from-literal=WEBAPP.AUTH_JWT_SECRET='your-jwt-secret' \
--from-literal=LICENSE_KEY='your-license-key'
For on-premises deployments, also include the storage credentials:
--from-literal=STORAGE_ACCESS_KEY_ID='your-storage-access-key' \
--from-literal=STORAGE_SECRET_ACCESS_KEY='your-storage-secret-key'
GPU Nodes
In order for Kubernetes to utilize GPU nodes, the NVIDIA GPU Operator must be installed on your cluster. The installation process is the same across all supported Kubernetes distributions (EKS, AKS, GKE, and on-premises).
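As a sketch, the operator can be installed from NVIDIA's Helm repository (verify the repo, chart, and version against the current NVIDIA GPU Operator documentation; `gpu-operator-values.yaml` is a hypothetical filename for values such as the example below):

```shell
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --values gpu-operator-values.yaml
```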
Example values.yaml for the NVIDIA GPU Operator
This example works as of GPU Operator version 0.18.0:
nodeSelector:
role: gpu
affinity: null
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- key: nvidia.com/gpu
operator: Exists
**GPU Node Configuration:** In this example, GPU nodes have the label `role: gpu`. Be sure to configure the operator to suit your deployment needs. The `gpu-runner` worker must run on a GPU-powered node; you can configure this via the Helm chart.
Application Permissions
The Deepchecks application requires access to cloud provider APIs for object storage and message queue operations.
The recommended approach is to use IAM Roles for Service Accounts (IRSA) or EKS Pod Identities to grant AWS permissions to the application.
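For IRSA, the IAM role carrying the policy below also needs a trust policy allowing the Deepchecks service account to assume it. A sketch (replace the placeholders, including your cluster's OIDC provider ID and the service account name created by the Helm chart):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:deepchecks:<SERVICE_ACCOUNT_NAME>",
          "oidc.eks.<REGION>.amazonaws.com/id/<OIDC_ID>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```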
Required IAM Policy
**Variable Replacement:** Replace `BUCKET_NAME`, `REGION`, `ACCOUNT_ID`, and `TENANT_NAME` with the relevant values.
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:Get*",
"s3:List*",
"s3:Put*",
"s3:Delete*"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>",
"arn:aws:s3:::<BUCKET_NAME>/*"
]
},
{
"Effect": "Allow",
"Action": [
"sqs:SendMessage",
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl"
],
"Resource": "arn:aws:sqs:<REGION>:<ACCOUNT_ID>:<TENANT_NAME>*"
},
{
"Action": [
"bedrock:InvokeModel"
],
"Effect": "Allow",
"Resource": "*"
}
]
}
Deploying the Deepchecks Helm Chart
Registry Authentication
To access the Helm chart and container images, you will need your credentials from the Deepchecks team.
helm registry login registry.cdn.deepchecks.com
**Image Pull Secret:** In order to pull container images from the Deepchecks registry you will need to configure an image pull secret. See here for more information.
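A minimal sketch of creating that pull secret with kubectl — the username placeholder is hypothetical (use whatever the Deepchecks team provides), and per the Environment Variables table the license key doubles as the registry password:

```shell
kubectl create secret docker-registry deepchecks-registry \
  --namespace deepchecks \
  --docker-server=registry.cdn.deepchecks.com \
  --docker-username=<REGISTRY_USERNAME> \
  --docker-password=<LICENSE_KEY>
```

Reference it from the chart via `global.imagePullSecrets`; the example client values.yaml later in this page uses the name `deepchecks-registry`.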
Installation
helm install deepchecks oci://registry.cdn.deepchecks.com/deepchecks/llm-stack \
--values values.yaml \
--namespace deepchecks \
--create-namespace
For on-premises deployments, also install the ElasticMQ Helm chart for the message queue:
helm install elasticmq oci://registry.cdn.deepchecks.com/deepchecks/elasticmq \
--values elasticmq-values.yaml \
--namespace deepchecks
**Deployment Timeout:** As part of the deployment, a Kubernetes job runs to populate the object storage bucket with required model data. This job requires internet access to pull model artifacts and typically takes approximately 2 minutes, depending on network performance. If your Helm commands are timing out, add `--timeout 20m0s`.
**Sync Job:** As part of the Helm deployment, each Deployment has an initContainer that checks that a 'sync' Job has completed successfully. If the Job is manually deleted, new pods will fail to start.
**DNS Configuration:** Don't forget to configure your DNS record so that you can access the Deepchecks UI and SDK.
KEDA Autoscaling (Optional)
KEDA can be used to automatically scale workers based on queue depth. This is optional but recommended for production deployments.
Install KEDA on your cluster, then configure auto-wired scaling in your values.yaml:
kedaAutoscaling:
enabled: true
provider: "aws"
aws:
region: "<AWS_REGION>"
queuePrefix: "<TENANT_NAME>"
accountId: "<AWS_ACCOUNT_ID>"
authenticationRef:
name: "aws-credentials"
kind: ClusterTriggerAuthentication
kedaTriggerAuthentication:
enabled: true
items:
- name: aws-credentials
scope: cluster
podIdentity:
provider: aws
Environment Variables
Non-sensitive values can be passed directly via the Helm chart's global.env configuration. For sensitive values, create a Kubernetes Secret and reference it via global.secretName.
Environment Variables
| Variable | Description | Required | Default/Valid Values | Format/Notes | Sensitive |
|---|---|---|---|---|---|
| Database | |||||
DATABASE.URI | PostgreSQL database connection string | Yes | Format: postgresql://<username>:<password>@<host>:<port>/<database>. Ensure the database user has appropriate permissions. | Yes | |
| General | |||||
GENERAL.MODELS_BUCKET_NAME | Name of the storage bucket/container where ML models are stored | Yes | This is the name of the bucket/container that you created in the Object Storage section. | ||
| Logger | |||||
LOGGER.FORMATTER | Output format for application logs | No | Default: JSON, Valid: JSON, TEXT | ||
LOGGER.LEVEL_DEEPCHECKS_LLM | Logging level for Deepchecks LLM components | No | Default: INFO, Valid: DEBUG, INFO, WARNING, ERROR | ||
| OAuth | |||||
OAUTH.CLIENT_ID | OAuth client identifier provided by your identity provider | Yes | |||
OAUTH.CLIENT_SECRET | OAuth client secret provided by your identity provider | Yes | Store securely and never commit to version control | Yes | |
OAUTH.PROVIDER | OAuth provider type | Yes | Default: auth0, Valid: auth0, entra_id | Currently Deepchecks supports Auth0 and Entra ID | |
OAUTH.SERVER_URL | Base URL of your OAuth authorization server | Yes | |||
OAUTH.TENANT_URL | Tenant-specific URL for your OAuth provider | Yes | |||
| Redis | |||||
REDIS.HOST | Hostname or IP address of the Redis/Valkey server | Yes | |||
REDIS.PASSWORD | Authentication password for Redis/Valkey server | No (only if authentication is enabled) | Store securely if authentication is enabled | Yes | |
REDIS.PORT | Port number for Redis/Valkey connection | Yes | Default: 6379 | ||
| Tenant | |||||
TENANT_NAME | Name of the default tenant. Used as prefix for queue names and other tenant-specific resources. | Yes | Format: Lowercase alphanumeric characters and hyphens only. This tenant is automatically created when the application first starts. | ||
LICENSE_KEY | License key provided by Deepchecks for your deployment | Yes | Also used as the password for accessing the container registry. | Yes | |
CREATE_ORG_FOR_USER | Email address of the first user in the system (will have 'Owner' role) | Yes | This user must be the first to log in. They can subsequently transfer ownership. | ||
| Web Application | |||||
WEBAPP.AUTH_JWT_SECRET | Secret key used to sign JWT tokens for API authentication | Yes | Generate with openssl rand -base64 32. Minimum 32 characters. Store securely. | Yes | |
WEBAPP.DEPLOYMENT_URL | FQDN where the application is deployed | Yes | Format: Complete URL including protocol (e.g., https://deepchecks.example.com). | ||
| On-Premises Only | |||||
STORAGE_ACCESS_KEY_ID | Access key ID for S3-compatible storage | Yes (on-prem only) | Required when using S3-compatible storage (MinIO, Ceph, etc.) | Yes | |
STORAGE_SECRET_ACCESS_KEY | Secret access key for S3-compatible storage | Yes (on-prem only) | Required when using S3-compatible storage | Yes | |
STORAGE_ENDPOINT | Endpoint URL of S3-compatible storage | Yes (on-prem only) | e.g., https://minio.example.com:9000 | ||
QUEUE_ENDPOINT | Endpoint URL of ElasticMQ service | Yes (on-prem only) | e.g., http://elasticmq:9324 |
Example Configurations
values.yaml (default chart values)
global:
image:
repository: registry.cdn.deepchecks.com/deepchecks
name: llm
pullPolicy: IfNotPresent
tag: ""
imagePullSecrets: []
env: {}
secretName: ""
revisionHistoryLimit: 3
serviceAccount:
create: true
automount: true
annotations: {}
web:
replicaCount: 1
podAnnotations: {}
podLabels: {}
podSecurityContext: {}
securityContext: {}
resources:
requests:
cpu: 2000m
memory: 8Gi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 80
volumes: []
volumeMounts: []
nodeSelector: {}
tolerations: []
affinity: {}
service:
type: ClusterIP
port: 8000
ingress:
enabled: true
className: ""
annotations: {}
host: ""
tls: []
workerDefaults: &workerDefaults
replicaCount: 1
podAnnotations: {}
podLabels: {}
podSecurityContext: {}
securityContext: {}
resources:
limits:
cpu: 1000m
memory: 6Gi
requests:
cpu: 1000m
memory: 6Gi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 80
nodeSelector: {}
tolerations: []
affinity: {}
workers:
advanced-llm-prop-calculator:
<<: *workerDefaults
init: advanced_llm_props_runner
batcher-runner:
<<: *workerDefaults
init: batcher_runner
dc-calibrator:
<<: *workerDefaults
init: dc_calibrator
dc-proba-calculator:
<<: *workerDefaults
init: dc_proba_calculator
estimate-annotation-calculator:
<<: *workerDefaults
init: estimate_annotation_runner
general-calc-runner:
<<: *workerDefaults
init: general_calc_runner
garak-props-runner:
<<: *workerDefaults
init: garak_props_runner
gpu-runner:
<<: *workerDefaults
init: gpu_runner
insights-runner:
<<: *workerDefaults
init: insights_runner
llm-properties-calculator:
<<: *workerDefaults
init: llm_props
notifier:
<<: *workerDefaults
init: notifier
pre-calc-eng:
<<: *workerDefaults
init: pre_calc_eng
translator:
<<: *workerDefaults
init: translator
Example client values.yaml
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::123456789012:role/DeepchecksRole
global:
image:
tag: "0.39.2"
imagePullSecrets:
- name: deepchecks-registry
secretName: "deepchecks-secrets"
env:
CREATE_ORG_FOR_USER: "[email protected]"
TENANT_NAME: "example"
GENERAL.MODELS_BUCKET_NAME: "deepchecks-models"
LOGGER.LEVEL_DEEPCHECKS_LLM: "INFO"
REDIS.HOST: "host.redis.example.com"
REDIS.PORT: "6379"
LOGGER.FORMATTER: "TEXT"
WEBAPP.DEPLOYMENT_URL: "https://deepchecks.example.com"
OAUTH.PROVIDER: "auth0"
OAUTH.SERVER_URL: "https://example.auth0.com"
OAUTH.TENANT_URL: "https://example.auth0.com"
OAUTH.CLIENT_ID: "your-client-id"
web:
ingress:
enabled: true
className: "alb"
annotations:
alb.ingress.kubernetes.io/scheme: internet-facing
host: deepchecks.example.com
tls:
- hosts:
- deepchecks.example.com
secretName: deepchecks-tls
workers:
gpu-runner:
nodeSelector:
role: gpu
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists