Self-Hosted Deployments
Deepchecks Self-Hosted Enterprise runs entirely in your own infrastructure, giving you full control over networking, security, and scaling. The platform is designed to run on Kubernetes to ensure reliable performance and horizontal scalability in production environments. This documentation walks through the infrastructure components and configuration required to deploy Deepchecks successfully.
Prerequisites
Infrastructure Prerequisites
(Diagram: a typical Deepchecks deployment running on AWS.)
Prior to deploying Self-Hosted Enterprise, Deepchecks recommends having each of the following infrastructure components ready to go. When possible, it's easiest to have all components running in the same VPC. The provided recommendations are for customers deploying to AWS:
| Component | Recommendation | Notes |
|---|---|---|
| Kubernetes Cluster | Amazon EKS cluster deployed in at least 2 availability zones | We recommend a cluster configured with Karpenter in order to provide automatic node scaling |
| GPU powered nodes | g4dn.xlarge | Each GPU worker requires a dedicated Node |
| Ingress Controller | AWS Load Balancer Controller | A dedicated subdomain is required for access to the application |
| Object Storage | AWS S3 Bucket | |
| Dedicated Database | AWS RDS Postgres | We suggest starting with an instance such as db.r6g.large and scaling up if necessary. Note that currently we do not support Postgres connections with TLS. |
| Redis Cache | AWS ElastiCache | cache.t4g.micro is the recommended instance size for all deployment sizes. Note that currently we do not support Redis connections with TLS. |
| Processing Queue | AWS SQS | See the SQS Queue Configuration section for further details |
| External Secrets Manager | AWS Secrets Manager | (Optional) Recommended for use with the external secrets operator for securely providing sensitive data to your Deepchecks Self-Hosted Enterprise solution |
| Identity Provider | Auth0 or Entra ID | Currently Deepchecks supports Auth0 and Entra ID |
SQS Queue Configuration
Overview
This section describes the AWS SQS queue configuration for the system. All queues are created with the configurable prefix (default: deepchecks) and include dead letter queue (DLQ) configurations for failed message handling.
Queue Prefix: The prefix used must match the TENANT_NAME environment variable provided when deploying the Helm chart. See the Environment Variables section for further details.
Queue Configuration Summary
Note: All FIFO queue names include the .fifo suffix as required by AWS.
Table of required SQS queues and their relevant configuration
| Queue Name | Visibility Timeout | Max Receive Count |
|---|---|---|
| insights-calculator.fifo | 360 seconds | 3 |
| insights-calculator-dlq.fifo | 360 seconds | - |
| garak-props-calculator | 360 seconds | 3 |
| garak-props-calculator-dlq | 360 seconds | - |
| props-calc-batcher | 360 seconds | 3 |
| props-calc-batcher-dlq | 360 seconds | - |
| translation | 660 seconds | 3 |
| translation-dlq | 660 seconds | - |
| pre-calc-eng | 660 seconds | 3 |
| pre-calc-eng-dlq | 660 seconds | - |
| advanced-llm-prop-calculator | 360 seconds | 3 |
| advanced-llm-prop-calculator-dlq | 360 seconds | - |
| calibrator.fifo | 360 seconds | 3 |
| calibrator-dlq.fifo | 360 seconds | - |
| notifier | 360 seconds | 3 |
| notifier-dlq | 360 seconds | - |
| proba-calculator | 360 seconds | 3 |
| proba-calculator-dlq | 360 seconds | - |
| llm-properties | 660 seconds | 3 |
| llm-properties-dlq | 660 seconds | - |
| topics-inference | 660 seconds | 3 |
| topics-inference-dlq | 660 seconds | - |
| topics-train.fifo | 660 seconds | 3 |
| topics-train-dlq.fifo | 660 seconds | - |
| similarity-annotations | 360 seconds | 3 |
| similarity-annotations-dlq | 360 seconds | - |
| properties-calculator | 360 seconds | 3 |
| properties-calculator-dlq | 360 seconds | - |
| estimate-annotation-calculator | 360 seconds | 3 |
| estimate-annotation-calculator-dlq | 360 seconds | - |
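As an illustration of the naming convention, with the default prefix deepchecks the insights-calculator.fifo queue would be created as deepchecks-insights-calculator.fifo. A quick way to sanity-check the queues after creating them is with the AWS CLI; the account ID, region, and queue URL below are placeholders:
# List all queues created with the default Deepchecks prefix
aws sqs list-queues --queue-name-prefix deepchecks
# Inspect a single queue's visibility timeout and redrive (DLQ) policy
aws sqs get-queue-attributes \
  --queue-url https://sqs.us-east-1.amazonaws.com/123456789012/deepchecks-insights-calculator.fifo \
  --attribute-names VisibilityTimeout RedrivePolicy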
Example Terraform to create the relevant queues
The following Terraform example can help you quickly bootstrap your queues:
variable "queue_prefix" {
description = "Prefix for queue names"
type = string
default = "deepchecks"
}
locals {
queues = {
insights-calculator = {
visibility_timeout = 360
fifo = true
max_receive_count = 3
}
garak-props-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
props-calc-batcher = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
translation = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
pre-calc-eng = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
advanced-llm-prop-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
calibrator = {
visibility_timeout = 360
fifo = true
max_receive_count = 3
}
notifier = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
proba-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
llm-properties = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
topics-inference = {
visibility_timeout = 660
fifo = false
max_receive_count = 3
}
topics-train = {
visibility_timeout = 660
fifo = true
max_receive_count = 3
}
similarity-annotations = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
properties-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
estimate-annotation-calculator = {
visibility_timeout = 360
fifo = false
max_receive_count = 3
}
}
}
# Dead Letter Queues
resource "aws_sqs_queue" "dlq" {
for_each = local.queues
name = "${var.queue_prefix}-${each.key}-dlq${each.value.fifo ? ".fifo" : ""}"
fifo_queue = each.value.fifo
visibility_timeout_seconds = each.value.visibility_timeout
content_based_deduplication = each.value.fifo ? true : false
}
# Main Queues
resource "aws_sqs_queue" "queue" {
for_each = local.queues
name = "${var.queue_prefix}-${each.key}${each.value.fifo ? ".fifo" : ""}"
fifo_queue = each.value.fifo
visibility_timeout_seconds = each.value.visibility_timeout
content_based_deduplication = each.value.fifo ? true : false
redrive_policy = jsonencode({
deadLetterTargetArn = aws_sqs_queue.dlq[each.key].arn
maxReceiveCount = each.value.max_receive_count
})
}
S3 Bucket
The Deepchecks application requires access to an S3 bucket in order to function.
Bucket Name: Note the name of the bucket you create; you will need it to create the relevant application permissions and to set it as an environment variable in the Helm chart later on.
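A minimal sketch of creating the bucket with the AWS CLI, assuming the illustrative name deepchecks-models and the us-east-1 region (buckets in other regions also need a LocationConstraint):
# Create the models bucket (name and region are placeholders)
aws s3api create-bucket --bucket deepchecks-models --region us-east-1
# Keep the bucket private
aws s3api put-public-access-block \
  --bucket deepchecks-models \
  --public-access-block-configuration BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true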
Kubernetes Configuration
A few notes on Kubernetes cluster provisioning for Deepchecks Self-Hosted Enterprise:
- Deepchecks currently supports Amazon Elastic Kubernetes Service (EKS) on EC2.
- Deepchecks recommends running on a cluster that supports autoscaling, such as with Karpenter.
- Deepchecks doesn't support Amazon EKS on Fargate.
We require you to install and configure the following Kubernetes tooling:
- Install helm by following these instructions
- Install kubectl by following these instructions
- Configure kubectl to connect to your cluster by using kubectl config use-context my-cluster-name. See here for how to configure your kubecontext for AWS; a minimal example follows this list.
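For an EKS cluster, a minimal sketch of wiring up kubectl looks like the following (the cluster name, region, and account ID are placeholders):
# Add or update the kubeconfig entry for your EKS cluster
aws eks update-kubeconfig --region us-east-1 --name my-cluster-name
# update-kubeconfig names the context after the cluster ARN; switch to it and verify access
kubectl config use-context arn:aws:eks:us-east-1:123456789012:cluster/my-cluster-name
kubectl get nodes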
We also require you to create a Kubernetes namespace for your Deepchecks deployment:
(This can also be done as part of the Helm deployment described further on)
kubectl create namespace deepchecks
Configuring Kubernetes Secrets
External Secrets Operator: While provisioning a Kubernetes Secret manually will work, we recommend using the External Secrets Operator to securely create this secret on your behalf.
Sensitive credentials must be made available in a Kubernetes Secret during deployment. The name of this secret must be set in your values.yaml file at global.secretName. Ensure all required secrets are configured before deploying Deepchecks Self-Hosted Enterprise. You can find the list of environment variables and secrets in the Environment Variables section.
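A minimal sketch of creating the secret manually is shown below; the secret name deepchecks-secrets is illustrative, and the keys are assumed to mirror the sensitive variables from the Environment Variables table, so align both with your actual configuration:
# Create the Kubernetes Secret holding the sensitive Deepchecks settings
# (secret name and key names are assumptions; match them to your values.yaml)
kubectl create secret generic deepchecks-secrets \
  --namespace deepchecks \
  --from-literal=DATABASE.URI='postgresql://user:password@db-host:5432/deepchecks' \
  --from-literal=OAUTH.CLIENT_SECRET='your-oauth-client-secret' \
  --from-literal=WEBAPP.AUTH_JWT_SECRET="$(openssl rand -base64 32)" \
  --from-literal=LICENSE_KEY='your-license-key'
Then set global.secretName: deepchecks-secrets in your values.yaml so the chart can reference it.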
GPU powered Nodes
For Kubernetes to utilize the GPU nodes, the NVIDIA GPU Operator must be installed on your cluster.
Example values.yaml file for the GPU driver
This example works as of driver version 0.18.0
nodeSelector:
role: gpu
affinity: null
tolerations:
- key: CriticalAddonsOnly
operator: Exists
- key: nvidia.com/gpu
operator: Exists
GPU Node Configuration: In this example our GPU nodes have the label role: gpu. Be sure to configure the driver as suits your deployment needs.
The gpu-runner worker must run on a GPU powered Node. You can configure this via the Helm chart.
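A hedged sketch of installing the operator with Helm, using the values file above saved as gpu-values.yaml; the repository URL and chart name follow NVIDIA's public Helm repository, so confirm them and the values layout against NVIDIA's documentation for your target version:
# Add NVIDIA's Helm repository and install the GPU operator with the values shown above
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace \
  --values gpu-values.yaml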
Application Permissions
For the application to access the AWS APIs it needs to function, you must grant it the following permissions. The two recommended ways of doing so are IRSA or Pod Identities.
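If you choose IRSA, one possible way to wire it up is with eksctl, creating a role with an IAM policy containing the statements below attached (the cluster, service account, role, and policy names are placeholders); the resulting role ARN is then set as a serviceAccount annotation in the Helm values, as in the client.yaml example further on:
# Create an IAM role (IRSA) for the Deepchecks service account without creating the account itself
eksctl create iamserviceaccount \
  --cluster my-cluster-name \
  --namespace deepchecks \
  --name deepchecks \
  --role-name DeepchecksRole \
  --role-only \
  --attach-policy-arn arn:aws:iam::123456789012:policy/DeepchecksPolicy \
  --approve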
Required IAM Policy
Variable Replacement: Be sure to replace BUCKET_NAME, REGION, ACCOUNT_ID, and TENANT_NAME with the relevant values.
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"s3:Get*",
"s3:List*",
"s3:Put*",
"s3:Delete*"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::<BUCKET_NAME>",
"arn:aws:s3:::<BUCKET_NAME>/*"
]
},
{
"Effect": "Allow",
"Action": [
"sqs:SendMessage",
"sqs:ReceiveMessage",
"sqs:DeleteMessage",
"sqs:GetQueueAttributes",
"sqs:GetQueueUrl"
],
"Resource": "arn:aws:sqs:REGION:ACCOUNT_ID:<TENANT_NAME>*"
},
{
"Action": [
"bedrock:InvokeModel"
],
"Effect": "Allow",
"Resource": "*"
}
]
}
Deploying the Deepchecks Helm Chart
To access the Helm chart and container images you will need credentials for Deepchecks' registry.
Registry Authentication: To authenticate with the Deepchecks registry you will need the credentials provided by the Deepchecks team.
helm registry login registry.llm.deepchecks.com
helm install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack --values values.yaml
Image Pull Secret: To pull the container images from Deepchecks' repository you will need to configure a secret that provides the credentials to the Helm chart. See here for more information.
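A minimal sketch of creating that secret with kubectl; the secret name deepchecks-registry matches the client.yaml example below, the username placeholder is whatever the Deepchecks team provides, and the license key acts as the password:
# Create a docker-registry secret so pods can pull the Deepchecks images
kubectl create secret docker-registry deepchecks-registry \
  --namespace deepchecks \
  --docker-server=registry.llm.deepchecks.com \
  --docker-username='<REGISTRY_USERNAME>' \
  --docker-password='<LICENSE_KEY>'
Reference it in your values under global.imagePullSecrets, as shown in the client.yaml example.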
Configuring the Deepchecks Helm chart requires providing several values via environment variables. Non-sensitive values can be passed directly through the Helm chart; for sensitive values we recommend leveraging an external secrets manager.
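For example, non-sensitive values can also be set inline at install time; note that literal dots inside a key such as GENERAL.MODELS_BUCKET_NAME must be escaped with a backslash when using --set (the bucket name here is a placeholder):
# Pass non-sensitive environment variables directly on the command line
helm install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack \
  --namespace deepchecks \
  --values values.yaml \
  --set global.env.TENANT_NAME=example \
  --set 'global.env.GENERAL\.MODELS_BUCKET_NAME=deepchecks-models'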
Environment Variables
| Variable | Description | Required | Default/Valid Values | Format/Notes | Sensitive |
|---|---|---|---|---|---|
| Database | | | | | |
| DATABASE.URI | PostgreSQL database connection string | Yes | | Format: postgresql://<username>:<password>@<host>:<port>/<database>. Ensure the database user has appropriate permissions | Yes |
| General | | | | | |
| GENERAL.MODELS_BUCKET_NAME | Name of the cloud storage bucket where ML models are stored | Yes | | This is the name of the bucket that you created in the S3 Bucket section | |
| Logger | | | | | |
| LOGGER.FORMATTER | Output format for application logs | No | Default: JSON; Valid: JSON, TEXT | | |
| LOGGER.LEVEL_DEEPCHECKS_LLM | Logging level for Deepchecks LLM components | No | Default: INFO; Valid: DEBUG, INFO, WARNING, ERROR | | |
| OAuth | | | | | |
| OAUTH.CLIENT_ID | OAuth client identifier provided by your identity provider | | | | |
| OAUTH.CLIENT_SECRET | OAuth client secret provided by your identity provider | | | Store securely and never commit to version control | Yes |
| OAUTH.PROVIDER | OAuth provider type | | Default: auth0; Valid: auth0, entra_id | Currently Deepchecks supports Auth0 and Entra ID | |
| OAUTH.SERVER_URL | Base URL of your OAuth authorization server | | | | |
| OAUTH.TENANT_URL | Tenant-specific URL for your OAuth provider | | | | |
| Redis | | | | | |
| REDIS.HOST | Hostname or IP address of the Redis server | Yes | | | |
| REDIS.PASSWORD | Authentication password for the Redis server | No (only if Redis requires authentication) | | Store securely if authentication is enabled | Yes |
| REDIS.PORT | Port number for the Redis server connection | Yes | Default: 6379 | | |
| Tenant | | | | | |
| TENANT_NAME | Name of the default tenant created during application initialization. Used as the prefix for SQS queue names and other tenant-specific resources | Yes | | Format: lowercase alphanumeric characters and hyphens only. This tenant is automatically created when the application first starts | |
| LICENSE_KEY | Your license key provided by Deepchecks for your deployment | Yes | | This also acts as your password for accessing the Deepchecks container registry | Yes |
| CREATE_ORG_FOR_USER | The email address of the first user in the system. This user will have the 'Owner' role | Yes | | This user must be the first user to log in to the system. They can subsequently transfer ownership of the application | |
| Web Application | | | | | |
| WEBAPP.AUTH_JWT_SECRET | Secret key used to sign JWT tokens for API authentication within the system | Yes | | Generate a strong, random secret (minimum 32 characters recommended); use openssl rand -base64 32 to generate a secure secret. Store securely and never commit to version control | Yes |
| WEBAPP.DEPLOYMENT_URL | Fully Qualified Domain Name (FQDN) where the application is deployed | Yes | | Format: complete URL including protocol. This URL is used for generating callback URLs and external links | |
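As noted in the table, WEBAPP.AUTH_JWT_SECRET should be a strong random value; a quick way to generate one:
# Generate a 32-byte random secret for WEBAPP.AUTH_JWT_SECRET
openssl rand -base64 32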
values.yaml
# Default values for deepchecks-llm-stack.
# This is a YAML-formatted file.
global:
image:
repository: harbor.llmdev.deepchecks.com/deepchecks/llm
# This sets the pull policy for images.
pullPolicy: IfNotPresent
# Image tag; defaults to Chart.appVersion if empty
tag: ""
# Pull secrets for private container registries
imagePullSecrets: []
# Environment variables passed to all pods via ConfigMap
env: {}
# Number of old ReplicaSets to retain for rollback
revisionHistoryLimit: 3
serviceAccount:
# Create a shared ServiceAccount for all components
create: true
# Mount API credentials into pods
automount: true
# Annotations for cloud provider integrations (e.g., IAM roles)
annotations: {}
web:
# Number of web server replicas
replicaCount: 1
# Override the chart name
nameOverride: ""
fullnameOverride: ""
# Pod annotations for monitoring, logging, or mesh integration
podAnnotations: {}
# Additional labels for pods
podLabels: {}
# Pod-level security settings (fsGroup, runAsUser, etc.)
podSecurityContext: {}
# fsGroup: 2000
# Container-level security settings
securityContext: {}
# capabilities:
# drop:
# - ALL
# readOnlyRootFilesystem: true
# runAsNonRoot: true
# runAsUser: 1000
# CPU and memory requests/limits
resources:
requests:
cpu: 2000m
memory: 8Gi
autoscaling:
# Enable Horizontal Pod Autoscaler
enabled: false
minReplicas: 1
maxReplicas: 10
# Target CPU percentage for scaling
targetCPUUtilizationPercentage: 80
# targetMemoryUtilizationPercentage: 80
# Additional volumes for the deployment
volumes: []
# Additional volume mounts for containers
volumeMounts: []
# Node selector for pod scheduling
nodeSelector: {}
# Tolerations for node taints
tolerations: []
# Affinity rules for pod placement
affinity: {}
service:
# Service type: ClusterIP, NodePort, or LoadBalancer
type: ClusterIP
# Port the service exposes
port: 8000
ingress:
# Enable ingress resource creation
enabled: true
# Ingress controller class (e.g., nginx, kong, traefik)
className: ""
# Ingress annotations for TLS, auth, rate-limiting, etc.
annotations: {}
# Hostname for the application (required if ingress enabled)
host: ""
# TLS configuration
tls: []
# Default configuration inherited by all workers
workerDefaults: &workerDefaults
replicaCount: 1
nameOverride: ""
fullnameOverride: ""
podAnnotations: {}
podLabels: {}
podSecurityContext: {}
securityContext: {}
resources:
limits:
cpu: 1000m
memory: 6Gi
requests:
cpu: 1000m
memory: 6Gi
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 10
targetCPUUtilizationPercentage: 80
nodeSelector: {}
tolerations: []
affinity: {}
# Background worker deployments
workers:
# Computes advanced LLM properties
advanced-llm-prop-calculator:
<<: *workerDefaults
init: advanced_llm_props_runner
# Handles batch processing jobs
batcher-runner:
<<: *workerDefaults
init: batcher_runner
# Runs calibration logic
dc-calibrator:
<<: *workerDefaults
init: dc_calibrator
# Calculates probability scores
dc-proba-calculator:
<<: *workerDefaults
init: dc_proba_calculator
# Estimates annotations
estimate-annotation-calculator:
<<: *workerDefaults
init: estimate_annotation_runner
# Runs Garak security property checks
garak-props-runner:
<<: *workerDefaults
init: garak_props_runner
# GPU-accelerated processing (requires GPU nodes)
gpu-runner:
<<: *workerDefaults
init: gpu_runner
# Generates insights from analysis
insights-runner:
<<: *workerDefaults
init: insights_runner
# Computes LLM properties
llm-properties-calculator:
<<: *workerDefaults
init: llm_props
# Sends notifications
notifier:
<<: *workerDefaults
init: notifier
# Pre-calculation engine
pre-calc-eng:
<<: *workerDefaults
init: pre_calc_eng
# Handles data transformation
translator:
<<: *workerDefaults
    init: translator
client.yaml Example
Example Configuration: This is a simple example configuration for a customer deployment.
serviceAccount:
annotations:
eks.amazonaws.com/role-arn: arn:aws:iam::12345567890:role/DeepchecksRole
global:
image:
tag: "0.39.2"
# supplied here is the secret containing the username and password needed to pull the Deepchecks containers
imagePullSecrets:
- name: deepchecks-registry
env:
CREATE_ORG_FOR_USER: "[email protected]"
TENANT_NAME: "example"
GENERAL.MODELS_BUCKET_NAME: "deepchecks-models"
LOGGER.LEVEL_DEEPCHECKS_LLM: "INFO"
REDIS.HOST: "host.redis.com"
REDIS.PORT: "10000"
LOGGER.FORMATTER: "TEXT"
GENERAL.IGNORE_EMAIL_VERIFICATION: "True"
# in this example kong is used as the ingress controller along with cert manager for certificate management
web:
ingress:
enabled: true
className: "kong"
annotations:
cert-manager.io/cluster-issuer: issuer
host: deepchecks.example.com
tls:
- hosts:
- deepchecks.example.com
secretName: deepchecks-tls
workers:
# it is important to configure your node selectors and taints in order to ensure that your GPU workers run on your GPU powered nodes
gpu-runner:
nodeSelector:
role: gpu
tolerations:
- effect: NoSchedule
key: nvidia.com/gpu
operator: Exists
Deployment Timeout: As part of the deployment, a Kubernetes Job runs to populate the models S3 bucket with the required data. Typically this takes approximately 2 minutes, but it depends on network performance. If your Helm commands are timing out, it is recommended to add --timeout 20m0s.
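For example, a sketch of an install or upgrade with the longer timeout applied (same chart reference as above):
# Give the sync job up to 20 minutes to finish before Helm gives up
helm upgrade --install deepchecks oci://registry.llm.deepchecks.com/deepchecks/deepchecks-llm-stack \
  --namespace deepchecks \
  --values values.yaml \
  --timeout 20m0s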
Sync Job: As part of the Helm deployment, each Deployment has an initContainer that checks that a 'sync' Job has completed successfully. If the Job is manually deleted, new pods will fail to start.
DNS Configuration: Don't forget to configure your DNS record so that you can access the Deepchecks UI and SDK.
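If you use Route 53, a hedged sketch of pointing the subdomain at the ingress load balancer (the hosted zone ID, domain, and load balancer hostname are placeholders):
# Create or update a CNAME record for the Deepchecks subdomain
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789ABCDEFGHIJ \
  --change-batch '{
    "Changes": [{
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "deepchecks.example.com",
        "Type": "CNAME",
        "TTL": 300,
        "ResourceRecords": [{"Value": "my-ingress-lb-1234567890.us-east-1.elb.amazonaws.com"}]
      }
    }]
  }'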