Datadog Integration
Datadog is a leading APM solution. In this guide we'll cover how you can use our dedicated server-to-server integration with Datadog to monitor your LLM application.
If your organization uses Datadog, we recommend using this integration to take advantage of Deepchecks' estimated annotations, properties, and more to monitor the quality of your LLM application in production.
How to enable the integration?
Configuring Datadog API Key in Deepchecks
In Datadog, under Organization Settings, configure a new, dedicated API key for the integration.
Using the Deepchecks REST API, enable the Datadog integration and pass it the Datadog API key.
Getting Started with our REST API
Set Datadog User Config
Make sure the integration is enabled in both system_config and user_config via the Get Config API. If system_config is disabled, please reach out to your account manager at Deepchecks to inquire about enabling it for your organization.
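The check described above can be sketched as a small helper. The response shape is an assumption, not the documented API: it treats the Get Config result as a mapping with `system_config` and `user_config` entries, each holding a boolean `enabled` flag. Those field names are hypothetical, so adapt them to the response you actually receive.

```python
from typing import Any, Mapping


def datadog_integration_status(config: Mapping[str, Any]) -> str:
    """Summarize whether the Datadog integration is active.

    ASSUMPTION: `config` mirrors a Get Config API response with
    `system_config` and `user_config` mappings, each with a boolean
    `enabled` flag. These key names are hypothetical.
    """
    system_on = bool(config.get("system_config", {}).get("enabled", False))
    user_on = bool(config.get("user_config", {}).get("enabled", False))

    # system_config is controlled by Deepchecks, so it is checked first.
    if not system_on:
        return "system_config disabled: contact your Deepchecks account manager"
    if not user_on:
        return "user_config disabled: enable it via Set Datadog User Config"
    return "enabled"
```

Both flags must be on for data to flow; the helper reports the first one that blocks the integration.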
Upload production data to Deepchecks
To test the integration, you can upload data using a CSV file from the UI (make sure to upload the data into the "Production" environment).
For an ongoing production integration, you can use our Python SDK or work directly with our REST API. Deepchecks sends the data about a specific interaction to Datadog only after estimated annotations have been computed for it. Make sure the estimated annotations have been computed for the data you uploaded before moving to the next phase.
Verifying that the data reached Datadog
Deepchecks sends the data to Datadog as log entries. With logs, you can decide which log fields to use as measures or facets, and enable monitors and dashboard widgets.
Go to "Logs" in Datadog and make sure the Deepchecks data arrived successfully.
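If you prefer to check from a script rather than the UI, the following is a minimal sketch that queries Datadog's Logs Search API (`POST /api/v2/logs/events/search`). It assumes the US1 site (`api.datadoghq.com`) and that the logs carry `source:deepchecks_llm`, the default value of the `source` template variable in the dashboard in this guide. This endpoint needs a Datadog application key in addition to the API key.

```python
import json
import urllib.request

# ASSUMPTION: US1 site; use your own Datadog site's API host if different.
DATADOG_API = "https://api.datadoghq.com"


def build_log_search_request(api_key: str, app_key: str,
                             minutes: int = 15) -> urllib.request.Request:
    """Build a Logs Search request for recent Deepchecks log entries."""
    body = {
        "filter": {
            # Source value taken from the dashboard's template variables.
            "query": "source:deepchecks_llm",
            "from": f"now-{minutes}m",
            "to": "now",
        },
        "page": {"limit": 5},
    }
    return urllib.request.Request(
        url=f"{DATADOG_API}/api/v2/logs/events/search",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "DD-API-KEY": api_key,
            "DD-APPLICATION-KEY": app_key,
        },
        method="POST",
    )


def count_recent_logs(api_key: str, app_key: str) -> int:
    """Send the request and return how many matching logs came back."""
    request = build_log_search_request(api_key, app_key)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return len(payload.get("data", []))
```

A result of zero shortly after an upload usually means the estimated annotations have not finished computing yet, since data is sent only after that step.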
Configuring Log attributes as measures and facets
Go over the log attributes and convert them into measures (for numeric values) and facets (for strings).
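The rule of thumb above (numbers become measures, strings become facets) can be illustrated with a short helper that walks one log entry and suggests a type for each attribute path. The nested layout of the sample entry is an assumption inferred from the dotted facet names used by the dashboard in this guide, and the values are made up for illustration.

```python
from typing import Any, Dict


def suggest_field_types(attributes: Dict[str, Any], prefix: str = "") -> Dict[str, str]:
    """Map each leaf attribute path to 'measure' (numeric) or 'facet' (string)."""
    suggestions: Dict[str, str] = {}
    for key, value in attributes.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested attributes, extending the dotted path.
            suggestions.update(suggest_field_types(value, path))
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            # bool is a subclass of int in Python, so it is excluded here.
            suggestions[path] = "measure"
        elif isinstance(value, str):
            suggestions[path] = "facet"
        # Other types (bool, list, None) are left for manual review.
    return suggestions


# Hypothetical log entry; only the attribute paths come from the dashboard.
sample_log = {
    "deepchecks_interaction": {
        "estimated_annotation": "good",
        "context": {"application_name": "my-app"},
        "output_properties": {"output_grounded_in_context": 0.35},
    }
}

for path, kind in suggest_field_types(sample_log).items():
    print(f"@{path}: {kind}")
```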
Import a Dashboard into Datadog
If you want to use our pre-configured dashboard for Deepchecks, copy the following JSON, save it locally, and import it into Datadog:
{"title":"Monitor LLM","description":"[[suggested_dashboards]]","widgets":[{"id":4873129968030932,"definition":{"type":"image","url":"https://files.readme.io/9400bc7-logo-dc-llm.svg","sizing":"contain","margin":"md","has_background":false,"has_border":false,"vertical_align":"center","horizontal_align":"center"},"layout":{"x":0,"y":0,"width":6,"height":3}},{"id":6633413186649232,"definition":{"title":"Number of Interactions Per Application/Version","title_size":"16","title_align":"left","requests":[{"response_format":"scalar","formulas":[{"formula":"query1"}],"queries":[{"data_source":"logs","name":"query1","indexes":["*"],"compute":{"aggregation":"count"},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"$application $version $environment $source"},"storage":"hot"}],"style":{"palette":"datadog16"},"sort":{"count":10,"order_by":[{"type":"formula","index":0,"order":"desc"}]}}],"type":"sunburst","legend":{"type":"automatic"}},"layout":{"x":6,"y":0,"width":3,"height":3}},{"id":3419482205985922,"definition":{"title":"Interactions Count by Estimated Annotations","title_size":"16","title_align":"left","requests":[{"response_format":"scalar","formulas":[{"formula":"query1"}],"queries":[{"data_source":"logs","name":"query1","indexes":["*"],"compute":{"aggregation":"count"},"group_by":[{"facet":"@deepchecks_interaction.estimated_annotation","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"$application $version $environment $source"},"storage":"hot"}],"style":{"palette":"datadog16"},"sort":{"count":10,"order_by":[{"type":"formula","index":0,"order":"desc"}]}}],"type":"sunburst","legend":{"type":"automatic"}},"layout":{"x":9,"y":0,"width":3,"height":3}},{"id":8336015370009554,"definition":{"title":"Good Estimated Annotation Ratio By Application/Version","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"alias":"good ratio","formula":"query1 / query2"}],"queries":[{"data_source":"logs","name":"query1","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"@deepchecks_interaction.estimated_annotation:good $application $version $environment $source"},"storage":"hot"},{"data_source":"logs","name":"query2","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"@deepchecks_interaction.estimated_annotation:(good OR bad) $application $version $environment $source"},"storage":"hot"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":0,"y":3,"width":6,"height":3}},{"id":3925972544507286,"definition":{"title":"Bad Grounded In Context Ratio By Application/Version (\"under the threshold\" / \"all\")","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"alias":"bad grounded in context ratio","formula":"query1 / query2"}],"queries":[{"data_source":"logs","name":"query1","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"@deepchecks_interaction.output_properties.output_grounded_in_context:[0.0 TO 0.4] $application $version $environment $source"},"storage":"hot"},{"data_source":"logs","name":"query2","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"$application $version $environment $source"},"storage":"hot"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":6,"y":3,"width":6,"height":3}},{"id":8506330137618792,"definition":{"title":"Unknown Estimated Annotation Ratio By Application/Version","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"alias":"unknown ratio","formula":"query3 / query4"}],"queries":[{"data_source":"logs","name":"query3","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"@deepchecks_interaction.estimated_annotation:unknown $application $version $environment $source"},"storage":"hot"},{"data_source":"logs","name":"query4","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"$application $version $environment $source"},"storage":"hot"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":0,"y":6,"width":6,"height":3}},{"id":1648519253537176,"definition":{"title":"Number of Interactions By Application/Version","title_size":"16","title_align":"left","show_legend":true,"legend_layout":"auto","legend_columns":["avg","min","max","value","sum"],"type":"timeseries","requests":[{"formulas":[{"formula":"query1"}],"queries":[{"data_source":"logs","name":"query1","indexes":["*"],"compute":{"aggregation":"count","interval":600000},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"$application $version $environment $source"},"storage":"hot"}],"response_format":"timeseries","style":{"palette":"dog_classic","order_by":"values","line_type":"solid","line_width":"normal"},"display_type":"line"}]},"layout":{"x":6,"y":6,"width":6,"height":3}}],"template_variables":[{"name":"source","prefix":"source","available_values":[],"default":"deepchecks_llm"},{"name":"application","prefix":"@deepchecks_interaction.context.application_name","available_values":[],"default":"*"},{"name":"version","prefix":"@deepchecks_interaction.context.application_version_name","available_values":[],"default":"*"},{"name":"environment","prefix":"@deepchecks_interaction.context.env_type","available_values":[],"default":"*"}],"layout_type":"ordered","notify_list":[],"template_variable_presets":[{"name":"Deepchecks PROD data","template_variables":[]}],"reflow_type":"fixed"}
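Because the dashboard definition is one long JSON document, it is easy to damage while copying. A stray line break inside a string value, for example, makes the JSON invalid. The sketch below is an optional sanity check before importing: it parses the saved text and lists the widget titles. The file name in the usage note is hypothetical.

```python
import json
from typing import List


def dashboard_widget_titles(raw_json: str) -> List[str]:
    """Parse a dashboard definition and return its widget titles.

    Raises json.JSONDecodeError if the text is not valid JSON, and
    ValueError if a top-level key used by this guide's dashboard is missing.
    """
    dashboard = json.loads(raw_json)
    for key in ("title", "widgets", "template_variables"):
        if key not in dashboard:
            raise ValueError(f"dashboard JSON is missing '{key}'")

    titles = []
    for widget in dashboard["widgets"]:
        definition = widget["definition"]
        # Some widgets (such as images) have no title; show their type instead.
        titles.append(definition.get("title", f"<{definition['type']} widget>"))
    return titles
```

For example, after saving the JSON as `deepchecks_dashboard.json` (a name chosen here for illustration), `dashboard_widget_titles(open("deepchecks_dashboard.json").read())` should return one entry per widget.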
Import Example Monitor into Datadog
This monitor alerts you if, over a period of 5 minutes, fewer than 20% of the samples that received an estimated annotation were annotated "Good". To define it, copy the following JSON and import it into Datadog:
{"id":146458791,"name":"Good Estimated Annotation ratio alert","type":"log alert","query":"formula(\"query / query1\").last(\"5m\") < 0.2","message":"To investigate in Deepchecks:\nhttps://app.llm.deepchecks.com?appName={{urlencode \"[@deepchecks_interaction.context.application_name].name\"}}&versionName={{urlencode \"[@deepchecks_interaction.context.application_version_name].name\"}}\n\nDeepchecks' dashboard in Datadog:\nhttps://app.datadoghq.com/dashboard/yge-c92-33z/monitor-llm?from_ts={{eval \"last_triggered_at_epoch-10*60*1000\"}}&to_ts={{eval \"last_triggered_at_epoch+10*60*1000\"}}&live=false\n\n@slack-test-datadog-dc-integration","tags":[],"options":{"thresholds":{"critical":0.2,"warning":0.4},"enable_logs_sample":false,"notify_audit":false,"on_missing_data":"default","include_tags":false,"variables":[{"data_source":"logs","name":"query","indexes":["*"],"compute":{"aggregation":"count"},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":"@deepchecks_interaction.estimated_annotation:good"},"storage":"hot"},{"data_source":"logs","name":"query1","indexes":["*"],"compute":{"aggregation":"count"},"group_by":[{"facet":"@deepchecks_interaction.context.application_name","limit":10,"sort":{"order":"desc","aggregation":"count"}},{"facet":"@deepchecks_interaction.context.application_version_name","limit":10,"sort":{"order":"desc","aggregation":"count"}}],"search":{"query":""},"storage":"hot"}],"new_group_delay":0,"silenced":{}},"priority":null,"restricted_roles":null}
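To make the monitor's arithmetic concrete, here is a small worked sketch of the formula `query / query1` with the thresholds from the definition above (critical 0.2, warning 0.4). It only illustrates the ratio and comparison; it does not reproduce Datadog's evaluation engine, and the "no data" branch is a simplification of how missing data is handled.

```python
def good_ratio_status(good_count: int, total_count: int,
                      critical: float = 0.2, warning: float = 0.4) -> str:
    """Classify one application/version group over the evaluation window.

    good_count  ~ logs matching estimated_annotation:good   ("query")
    total_count ~ all logs in the same group                 ("query1")
    """
    if total_count == 0:
        # Simplification: Datadog's own missing-data handling is not modeled.
        return "no data"
    ratio = good_count / total_count
    if ratio < critical:
        return "alert"
    if ratio < warning:
        return "warn"
    return "ok"


# 3 "good" interactions out of 20 in the last 5 minutes gives 0.15 < 0.2.
print(good_ratio_status(3, 20))
```

Because the monitor groups by application name and version name, each application/version pair is evaluated separately in this way.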
Next steps
Once you have enough data in Datadog, go over the dashboard and tune it according to your needs.
Do the same for the monitors: tune the existing monitor to balance "false alarms" against "real issues", and add more monitors based on other properties to help you catch production regressions in time.