Production Monitoring
Production data usually arrives without labels. The Deepchecks auto annotation pipeline is based on properties that do not require a ground truth response, so it can evaluate model performance without labels, which makes evaluation faster and more flexible.
In production, we apply the same auto annotation pipeline to monitor whether our application continues to perform similarly to what we saw in the evaluation set.
Production Sampling Configuration
For high-volume production applications, evaluate a strategic sample of your data rather than every interaction. For detailed sampling configuration options, see our Production Sampling Guide.
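As an illustration of what sampling can look like on the application side, here is a minimal sketch of deterministic, hash-based sampling. It is not the Deepchecks sampling mechanism. The function name in_sample, the sample_rate parameter, and the interaction ID format are assumptions made for this example. Hashing the interaction ID means the same interaction is always either in or out of the sample, which keeps repeated runs consistent.

```python
import hashlib


def in_sample(interaction_id: str, sample_rate: float) -> bool:
    """Return True if this interaction falls inside the evaluation sample.

    Hypothetical helper: not part of any Deepchecks API.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(interaction_id.encode("utf-8")).digest()
    # Interpret the first 8 bytes of the hash as an integer in [0, 2**64)
    # and keep the interaction if it lands in the lowest sample_rate share.
    bucket = int.from_bytes(digest[:8], "big")
    return bucket < sample_rate * 2**64


# Illustrative IDs only; real interaction IDs come from your application.
interaction_ids = [f"interaction-{i}" for i in range(10_000)]
sampled = [i for i in interaction_ids if in_sample(i, 0.1)]
print(f"Selected {len(sampled)} of {len(interaction_ids)} interactions")
```

With a rate of 0.1, roughly one in ten interactions is selected, and the selection does not change between runs.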
Production Performance Summary:
Tool Use: 218 interactions with 85.32% overall score, showing strong planning (4.62) and tool calling (4.82) performance
Generation: 141 interactions with 90.91% overall score, demonstrating solid instruction fulfillment (4.29) and appropriate response handling
To confirm that your application behaves as expected in production, monitor your annotation score and property trends over time. This lets you detect performance drift and verify that agent behavior stays consistent. You can read more about production monitoring in our Production Monitoring guide.
Property Trends
Click on any property to see its performance trend over your selected time range. This helps identify when and why performance changes occur.
Example: Avoided Answer Trend Analysis
The trend reveals interesting patterns: baseline performance around 0.2 from March through May, followed by a significant spike to 0.6 in mid-June, then a return to baseline levels by July. This type of sudden change warrants investigation.
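The same kind of check can be sketched outside the dashboard. The example below flags any point that exceeds the series median by a chosen margin. The function flag_spikes, the min_jump threshold, and the exact per-period values are assumptions for illustration; the values only approximate the trend described above and are not exported data.

```python
from statistics import median


def flag_spikes(series, min_jump=0.2):
    """Return the labels of points that exceed the series median by min_jump or more.

    Hypothetical helper; min_jump is an arbitrary threshold chosen for this sketch.
    """
    baseline = median(value for _, value in series)
    return [label for label, value in series if value - baseline >= min_jump]


# Approximate "Avoided Answer" scores per period, as described in the text.
trend = [
    ("March", 0.2),
    ("April", 0.2),
    ("May", 0.2),
    ("mid-June", 0.6),
    ("July", 0.2),
]

flagged = flag_spikes(trend)
print(flagged)  # ['mid-June']
```

Using the median as the baseline keeps a single spike from shifting the reference level, which a plain mean would do.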
Investigating Property Trends
When you notice concerning trends, click on the corresponding bar in the main dashboard to examine the interactions from that time range.
The June spike in "Avoided Answer" scores reveals an influx of irrelevant queries—account management, cryptocurrency, and general customer service requests. Fortunately, the "Relevant Topic" property we built during root cause analysis now properly classifies these interactions, recognizing that avoiding irrelevant questions is good behavior rather than a performance problem.
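The reasoning above can be sketched as a simple split of avoided interactions by topic relevance. The record layout and the field names avoided_answer and relevant_topic are hypothetical stand-ins for the property values, not an actual export schema, and the sample records are invented for illustration.

```python
def split_avoided(interactions):
    """Split avoided interactions into expected avoidance and cases needing review.

    Hypothetical helper: an avoided answer on an irrelevant topic is expected
    behavior, while an avoided answer on a relevant topic deserves a closer look.
    """
    expected, needs_review = [], []
    for item in interactions:
        if not item["avoided_answer"]:
            continue  # answered interactions are out of scope for this check
        if item["relevant_topic"]:
            needs_review.append(item)
        else:
            expected.append(item)
    return expected, needs_review


# Invented sample records, for illustration only.
sample = [
    {"id": 1, "avoided_answer": True, "relevant_topic": False},
    {"id": 2, "avoided_answer": True, "relevant_topic": True},
    {"id": 3, "avoided_answer": False, "relevant_topic": True},
]

expected, needs_review = split_avoided(sample)
print([item["id"] for item in expected])      # [1]
print([item["id"] for item in needs_review])  # [2]
```

Only the second group points to a possible performance problem; the first group reflects the agent correctly declining out-of-scope questions.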
Important Consideration: While it's good that the agent appropriately avoids answering topics outside its current scope, the high volume of irrelevant queries suggests potential opportunities. Consider whether adding support for these common question domains (like basic account management or cryptocurrency market data) would provide additional value to users while maintaining focus on core financial advisory capabilities.
Congratulations!
You've successfully completed the Investment Agent evaluation workflow.