Overview

Smart Studio provides a range of methods for evaluating a single model or comparing several models. Generate detailed reports and gain a clear understanding of your model's capabilities.

Evaluation Types

Smart Studio offers two primary ways to structure your evaluation, depending on your goal.

Single Model Evaluation

Evaluate the performance of a single model in-depth to understand its strengths and weaknesses.

  • Validate performance after fine-tuning
  • Identify specific failure modes and areas for improvement
  • Establish a performance baseline for a new task

Comparative Evaluation

Compare two or more models side-by-side to determine the best-performing one for your use case.

  • Benchmark your fine-tuned model against a base model or industry standards
  • Select the optimal model from multiple candidates
  • Track performance improvements across different model versions
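
To make the ranking step concrete, the minimal sketch below tallies per-sample wins for two hypothetical models and sorts them into a simple leaderboard. The model names, scores, and win-counting rule are illustrative assumptions, not Smart Studio's actual comparison logic:

```python
from collections import defaultdict

# Illustrative per-sample scores for two hypothetical models; in a real
# comparative evaluation these would come from the chosen evaluation method.
scores = [
    {"model-a": 0.9, "model-b": 0.7},
    {"model-a": 0.6, "model-b": 0.8},
    {"model-a": 0.8, "model-b": 0.8},
]

wins = defaultdict(int)
for sample in scores:
    best = max(sample.values())
    for model, score in sample.items():
        if score == best:
            wins[model] += 1  # ties count as a win for every tied model

# Sort by per-sample wins to form a simple leaderboard.
leaderboard = sorted(wins.items(), key=lambda kv: kv[1], reverse=True)
for rank, (model, win_count) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {win_count}/{len(scores)} samples")
```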

Single vs. Comparative Evaluation

Feature | Single Model Evaluation | Comparative Evaluation
Primary Goal | Deep-dive analysis of one model | Ranking and selection among multiple models
Input | One model | Two or more models
Output | A detailed performance report | A leaderboard and side-by-side comparisons
Best For | Understanding "How good is this model?" | Answering "Which model is better?"

Evaluation Methods

Choose the evaluation method that best fits your dataset, requirements, and resources.

AI Auto Evaluation

Uses a powerful Large Language Model (LLM) as a judge to score model outputs based on custom or predefined criteria.

  • Evaluate open-ended responses
  • Assess performance on custom datasets
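
Conceptually, the judge receives each input, the model's response, and a scoring rubric, and returns a score that is parsed and aggregated. The sketch below is a minimal illustration of that flow; the `call_judge` placeholder, the prompt template, and the 1-5 scale are assumptions, not Smart Studio's actual judge configuration:

```python
JUDGE_PROMPT = """You are an impartial judge. Rate the response on a 1-5 scale
for the criterion "{criterion}".

Question: {question}
Response: {response}

Reply with only the number."""


def call_judge(prompt: str) -> str:
    """Placeholder for a call to the judge LLM via your provider's API."""
    raise NotImplementedError


def score_response(question: str, response: str, criterion: str) -> int:
    """Ask the judge model to score one response against one criterion."""
    prompt = JUDGE_PROMPT.format(criterion=criterion, question=question, response=response)
    reply = call_judge(prompt)
    return int(reply.strip())  # assumes the judge replies with a bare number
```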

Benchmark Evaluation

Evaluates models against standardized, public datasets to provide objective and reproducible metrics.

  • Measure performance on academic benchmarks (e.g., MMLU, GSM8K)
  • Compare your model against state-of-the-art (SOTA) models
  • Ensure objective and consistent scoring
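
For intuition, a benchmark run generates an answer for every item in a fixed public dataset and scores it with a fixed metric. The sketch below computes exact-match accuracy over two GSM8K-style items; the `generate_answer` stub and the items themselves are illustrative assumptions, not part of the Smart Studio API:

```python
# Illustrative GSM8K-style items; a real benchmark run loads the full public dataset.
benchmark_items = [
    {"question": "What is 12 * 7?", "answer": "84"},
    {"question": "What is 15 + 27?", "answer": "42"},
]


def generate_answer(question: str) -> str:
    """Placeholder for the model under evaluation; replace with a real model call."""
    return "84"  # dummy response so the sketch runs end to end


def exact_match_accuracy(items: list[dict]) -> float:
    """Fixed metric: exact string match between model output and the reference answer."""
    correct = sum(
        generate_answer(item["question"]).strip() == item["answer"]
        for item in items
    )
    return correct / len(items)


print(exact_match_accuracy(benchmark_items))  # 0.5 with the dummy model above
```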

AI Auto vs. Benchmark

Feature | AI Auto Evaluation | Benchmark Evaluation
Evaluation Core | LLM as Judge | Standardized Datasets
Best For | Custom & open-ended tasks | Standard & objective tasks
Flexibility | High (custom metrics & scenes) | Low (fixed metrics & datasets)
Evaluation Time | Varies (often fast on custom sets) | Long

Detailed Evaluation Reports

Regardless of the type or method you choose, every evaluation job generates a detailed report. The report provides a comprehensive overview of your model's performance, including:

  • Overall scores and a summary of key findings.
  • Metric-level breakdowns to identify specific strengths and weaknesses.
  • Side-by-side comparisons and a leaderboard to rank models against each other.
  • Sample-level analysis to review individual inputs and outputs for error analysis.
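
As a rough picture of how those pieces fit together, the hypothetical structure below groups overall scores, per-metric results, a leaderboard, and per-sample records; the field names are illustrative, not Smart Studio's actual report schema:

```python
# Hypothetical shape of an evaluation report; all field names are illustrative only.
report = {
    "overall": {"score": 0.82, "summary": "Strong reasoning, weaker formatting."},
    "metrics": {"accuracy": 0.85, "helpfulness": 0.79},
    "leaderboard": [            # present for comparative evaluations
        {"model": "model-a", "rank": 1},
        {"model": "model-b", "rank": 2},
    ],
    "samples": [                # per-sample records for error analysis
        {"input": "…", "output": "…", "score": 4},
    ],
}
```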

Next Steps

Start Evaluation

Create a new evaluation job to measure and compare your models

Create Deployment

Create a deployment endpoint for your model