Overview

Smart Studio provides a range of methods for evaluating a single model or comparing several models. Generate detailed reports and gain a clear understanding of your model's capabilities.

Evaluation Types

Smart Studio offers two primary ways to structure your evaluation, depending on your goal.

Single Model Evaluation

Evaluate the performance of a single model in-depth to understand its strengths and weaknesses.

  • Validate performance after fine-tuning
  • Identify specific failure modes and areas for improvement
  • Establish a performance baseline for a new task

Comparative Evaluation

Compare two or more models side-by-side to determine the best-performing one for your use case.

  • Benchmark your fine-tuned model against a base model or industry standards
  • Select the optimal model from multiple candidates
  • Track performance improvements across different model versions
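
To make the ranking step concrete, the minimal sketch below tallies per-sample wins for two hypothetical models and sorts them into a simple leaderboard. The model names, scores, and win-counting rule are illustrative assumptions, not Smart Studio's actual comparison logic:

```python
from collections import defaultdict

# Illustrative per-sample scores for two hypothetical models; in a real
# comparative evaluation these would come from the chosen evaluation method.
scores = [
    {"model-a": 0.9, "model-b": 0.7},
    {"model-a": 0.6, "model-b": 0.8},
    {"model-a": 0.8, "model-b": 0.8},
]

wins = defaultdict(int)
for sample in scores:
    best = max(sample.values())
    for model, score in sample.items():
        if score == best:
            wins[model] += 1  # ties count as a win for every tied model

# Sort by per-sample wins to form a simple leaderboard.
leaderboard = sorted(wins.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, win_count) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {win_count}/{len(scores)} samples")
```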

Single vs. Comparative Evaluation

Feature | Single Model Evaluation | Comparative Evaluation
Primary Goal | Deep-dive analysis of one model | Ranking and selection among multiple models
Input | One model | Two or more models
Output | A detailed performance report | A leaderboard and side-by-side comparisons
Best For | Understanding "How good is this model?" | Answering "Which model is better?"

Evaluation Methods

Choose the evaluation method that best fits your dataset, requirements, and resources.

AI Auto Evaluation

Uses a powerful Large Language Model (LLM) as a judge to score model outputs based on custom or predefined criteria.

  • Evaluate open-ended responses
  • Assess performance on custom datasets
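
Conceptually, the judge receives each input, the model's response, and a scoring rubric, and returns a score that is parsed and aggregated. The sketch below is a minimal illustration of that flow; the `call_judge` placeholder, the prompt template, and the 1-5 scale are assumptions, not Smart Studio's actual judge configuration:

```python
JUDGE_PROMPT = """You are an impartial judge. Rate the response on a 1-5 scale
for the criterion "{criterion}".

Question: {question}
Response: {response}

Reply with only the number."""


def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge LLM via your provider's API."""
    raise NotImplementedError


def score_response(question: str, response: str, criterion: str) -> int:
    """Ask the judge model to score one response against one criterion."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question, response=response)
    reply = call_judge(prompt)
    return int(reply.strip())  # assumes the judge replies with a bare number
```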

Benchmark Evaluation

Evaluates models against standardized, public datasets to provide objective and reproducible metrics.

  • Measure performance on academic benchmarks (e.g., MMLU, GSM8K)
  • Compare your model against state-of-the-art (SOTA) models
  • Ensure objective and consistent scoring
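
For intuition, a benchmark run generates an answer for every item in a fixed public dataset and scores it with a fixed metric. The sketch below computes exact-match accuracy over two GSM8K-style items; the `generate_answer` stub and the items themselves are illustrative assumptions, not part of the Smart Studio API:

```python
# Illustrative GSM8K-style items; a real benchmark run loads the full public dataset.
benchmark_items = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "What is 15 + 27?", "answer": "42"},
]


def generate_answer(question: str) -> str:
    """Placeholder for the model under evaluation; replace with a real model call."""
    return "84"  # dummy response so the sketch runs end to end


def exact_match_accuracy(items: list[dict]) -> float:
    """Fixed metric: exact string match between model output and the reference answer."""
    correct = sum(
        generate_answer(item["question"]).strip() == item["answer"]
        for item in items
    )
    return correct / len(items)


print(exact_match_accuracy(benchmark_items))  # 0.5 with the dummy model above
```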

AI Auto vs. Benchmark

Feature | AI Auto Evaluation | Benchmark Evaluation
Evaluation Core | LLM as Judge | Standardized Datasets
Best For | Custom & open-ended tasks | Standard & objective tasks
Flexibility | High (custom metrics & scenes) | Low (fixed metrics & datasets)
Evaluation Time | Varies (often fast on custom sets) | Long

Detailed Evaluation Reports

Regardless of the type or method you choose, every evaluation job generates a detailed report. The report provides a comprehensive overview of your model's performance, including:

  • Overall scores and a summary of key findings.
  • Metric-level breakdowns to identify specific strengths and weaknesses.
  • Side-by-side comparisons and a leaderboard to rank models against each other.
  • Sample-level analysis to review individual inputs and outputs for error analysis.
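
As a rough picture of how those pieces fit together, the hypothetical structure below groups overall scores, per-metric results, a leaderboard, and per-sample records; the field names are illustrative, not Smart Studio's actual report schema:

```python
# Hypothetical shape of an evaluation report; all field names are illustrative only.
report = {
    "overall": {"score": 0.82, "summary": "Strong reasoning, weaker formatting."},
    "metrics": {"accuracy": 0.85, "helpfulness": 0.79},
    "leaderboard": [            # present for comparative evaluations
        {"model": "model-a", "rank": 1},
        {"model": "model-b", "rank": 2},
    ],
    "samples": [                # per-sample records for error analysis
        {"input": "…", "output": "…", "score": 4},
    ],
}
```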

Next Steps

Start Evaluation

Create a new evaluation job to measure and compare your models

Create Deployment

Create a deployment endpoint for your model