About Evaluation Studio

Evaluation Studio is a unified workspace for evaluating AI system performance across two primary areas: Model Evaluation and Agentic Evaluation. It enables users to systematically assess both the quality of large language model (LLM) outputs and the behavior of agentic applications in real-world scenarios.

By supporting both model-level and agentic evaluation, Evaluation Studio provides a comprehensive foundation for improving LLM quality and agentic application behavior. Whether you're validating prompt effectiveness, debugging tool chains, or auditing full workflows, Evaluation Studio enables scalable, data-driven iteration—helping you build safer, more reliable, and higher-performing AI systems.

Model Evaluation

Model Evaluation enables you to assess the performance of large language models (LLMs) using configurable quality and safety metrics. You can upload datasets of input-output pairs, apply built-in or custom evaluators, and analyze model effectiveness through visual scoring, score thresholds, and collaborative projects. This type of evaluation is ideal for fine-tuning, comparing, and validating models before or after deployment. Learn more
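
To make the idea concrete, here is a minimal sketch in Python of a dataset of input-output pairs scored by a custom evaluator against a threshold. The record fields, the exact_match evaluator, and the threshold value are illustrative assumptions, not Evaluation Studio's actual data format or API.

    # A minimal sketch, assuming a simple list-of-dicts dataset; the field
    # names, evaluator, and threshold are assumptions for illustration only.
    dataset = [
        {"input": "What is the capital of France?", "output": "Paris", "expected": "Paris"},
        {"input": "2 + 2 = ?", "output": "5", "expected": "4"},
    ]

    def exact_match(record: dict) -> float:
        """Custom evaluator: 1.0 if the model output matches the expected answer."""
        return 1.0 if record["output"].strip().lower() == record["expected"].strip().lower() else 0.0

    # Score each record, then compare the aggregate score against a threshold.
    scores = [exact_match(r) for r in dataset]
    accuracy = sum(scores) / len(scores)
    THRESHOLD = 0.8  # assumed quality threshold
    print(f"accuracy={accuracy:.2f}, passed={accuracy >= THRESHOLD}")

In practice a custom evaluator can encode any quality or safety check, and the per-record scores roll up into the visual summaries and thresholds described above.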

Agentic Evaluation

Agentic Evaluation is designed to assess how effectively an agentic application performs in production. You can import app sessions and trace data, then run multi-level evaluations across sessions and traces to understand how well the agentic app achieves goals, adheres to workflows, utilizes tools, and handles other tasks and interactions. These evaluations offer deep insight into how orchestrators, agents, and tools operate in production, helping you uncover coordination issues, workflow failures, and opportunities for optimization. Learn more
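
As a rough illustration, the sketch below shows how a session with nested traces might be scored at two levels: a session-level check for goal completion and a trace-level check of tool-call success. The session schema, field names, and evaluator logic are assumptions made up for this example, not Evaluation Studio's actual import format or evaluators.

    # Hypothetical session/trace structure for an agentic app; the schema is
    # an assumption for illustration, not Evaluation Studio's import format.
    session = {
        "session_id": "s-001",
        "goal": "Book a flight from NYC to SFO",
        "goal_achieved": True,
        "traces": [
            {"agent": "planner", "tool": None, "status": "ok"},
            {"agent": "flight_bot", "tool": "search_flights", "status": "ok"},
            {"agent": "flight_bot", "tool": "book_flight", "status": "error"},
        ],
    }

    def session_level_eval(s: dict) -> float:
        """Session-level check: did the app achieve the user's goal?"""
        return 1.0 if s["goal_achieved"] else 0.0

    def trace_level_eval(s: dict) -> float:
        """Trace-level check: fraction of tool calls that completed without error."""
        tool_calls = [t for t in s["traces"] if t["tool"] is not None]
        if not tool_calls:
            return 1.0
        return sum(t["status"] == "ok" for t in tool_calls) / len(tool_calls)

    print(f"goal score:     {session_level_eval(session):.2f}")
    print(f"tool-use score: {trace_level_eval(session):.2f}")

Scoring the same session at both levels is what surfaces coordination and workflow issues: a session can reach its goal overall while individual traces reveal failing tool calls or misbehaving agents.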