Overview
Smart Studio provides a range of methods for evaluating a single model or multiple models at once, generating detailed reports that give you a clear picture of each model's capabilities.
Evaluation Types
Smart Studio offers two primary ways to structure your evaluation, depending on your goal.
Single Model Evaluation: Evaluate the performance of a single model in depth to understand its strengths and weaknesses. Use it to:
- Validate performance after fine-tuning
- Identify specific failure modes and areas for improvement
- Establish a performance baseline for a new task
Comparative Evaluation: Compare two or more models side by side to determine the best-performing one for your use case. Use it to:
- Benchmark your fine-tuned model against a base model or industry standards
- Select the optimal model from multiple candidates
- Track performance improvements across different model versions
Single vs. Comparative Evaluation
| Feature | Single Model Evaluation | Comparative Evaluation |
|---|---|---|
| Primary Goal | Deep-dive analysis of one model | Ranking and selection among multiple models |
| Input | One model | Two or more models |
| Output | A detailed performance report | A leaderboard and side-by-side comparisons |
| Best For | Understanding "How good is this model?" | Answering "Which model is better?" |
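The difference also shows up in how a job is specified: a single evaluation takes exactly one model, while a comparative evaluation takes a list of candidates. The sketch below is purely illustrative; the field names (`type`, `models`, `dataset`, `method`) are hypothetical and do not reflect Smart Studio's actual configuration schema.

```python
# Hypothetical job specifications -- field names are illustrative only,
# not Smart Studio's actual configuration schema.

single_model_job = {
    "type": "single",                # deep-dive report for one model
    "models": ["my-finetuned-llm"],  # exactly one model under test
    "dataset": "support-tickets-v2",
    "method": "ai_auto",             # see Evaluation Methods below
}

comparative_job = {
    "type": "comparative",           # leaderboard + side-by-side report
    "models": [                      # two or more candidates to rank
        "my-finetuned-llm",
        "base-llm",
        "vendor-llm-large",
    ],
    "dataset": "support-tickets-v2",
    "method": "benchmark",           # see Evaluation Methods below
}
```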
Evaluation Methods
Choose the evaluation method that best fits your dataset, requirements, and resources.
AI-Auto Evaluation: Uses a powerful Large Language Model (LLM) as a judge to score model outputs based on custom or predefined criteria. Use it to:
- Evaluate open-ended responses
- Assess performance on custom datasets
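To make the LLM-as-judge idea concrete, here is a minimal, generic sketch of how judge-based scoring typically works. It is not Smart Studio's implementation: the rubric text and the `judge` callable are stand-ins for whichever judge model and criteria you configure.

```python
import json
from typing import Callable

# Illustrative rubric -- in practice this comes from your custom or predefined criteria.
RUBRIC = (
    "Score the RESPONSE to the PROMPT from 1 (poor) to 5 (excellent) on each "
    "criterion: helpfulness, factual accuracy, and tone. Return JSON like "
    '{"helpfulness": 4, "accuracy": 5, "tone": 3, "rationale": "..."}'
)

def judge_one(judge: Callable[[str], str], prompt: str, response: str) -> dict:
    """Ask a judge LLM to grade one model output against the rubric.

    `judge` is any function that sends text to your judge model and returns
    its raw reply -- a deliberate stand-in for a real API call.
    """
    judge_prompt = f"{RUBRIC}\n\nPROMPT:\n{prompt}\n\nRESPONSE:\n{response}"
    return json.loads(judge(judge_prompt))

def judge_dataset(judge: Callable[[str], str], samples: list[dict]) -> list[dict]:
    """Grade every {"prompt": ..., "response": ...} sample and attach its scores."""
    return [
        {**sample, "scores": judge_one(judge, sample["prompt"], sample["response"])}
        for sample in samples
    ]
```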
Benchmark Evaluation: Evaluates models against standardized, public datasets to provide objective and reproducible metrics. Use it to:
- Measure performance on academic benchmarks (e.g., MMLU, GSM8K)
- Compare your model against state-of-the-art (SOTA) models
- Ensure objective and consistent scoring
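Benchmark scoring, by contrast, is typically deterministic: the model's answer is compared with a gold reference using a fixed metric. The sketch below shows GSM8K-style exact-match scoring on final numeric answers; it illustrates the general approach, not Smart Studio's internal benchmark harness.

```python
import re

def extract_final_number(text: str) -> str | None:
    """Pull the last number out of a free-text answer (GSM8K-style extraction)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", text.replace(",", ""))
    return matches[-1] if matches else None

def exact_match_accuracy(predictions: list[str], references: list[str]) -> float:
    """Fraction of samples whose final number matches the gold answer exactly."""
    hits = sum(
        extract_final_number(pred) == extract_final_number(ref)
        for pred, ref in zip(predictions, references)
    )
    return hits / len(references)

# Example: two correct answers out of three gives ~0.67 accuracy.
preds = ["The total is 42.", "She has 7 apples left.", "Answer: 100"]
golds = ["42", "7", "99"]
print(exact_match_accuracy(preds, golds))  # 0.666...
```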
AI-Auto vs. Benchmark Evaluation
| Feature | AI-Auto Evaluation | Benchmark Evaluation |
|---|---|---|
| Evaluation Core | LLM as Judge | Standardized Datasets |
| Best For | Custom & open-ended tasks | Standard & objective tasks |
| Flexibility | High (custom metrics & scenes) | Low (fixed metrics & datasets) |
| Evaluation Time | Varies (often fast on custom datasets) | Typically longer |
Detailed Evaluation Reports
Regardless of the type or method you choose, every evaluation job generates a detailed report. The report provides a comprehensive overview of your model's performance, including:
- Overall scores and a summary of key findings.
- Metric-level breakdowns to identify specific strengths and weaknesses.
- Side-by-side comparisons and a leaderboard that ranks models against each other (for comparative evaluations).
- Sample-level views of individual inputs and outputs for error analysis.
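For comparative runs, the per-sample scores roll up into the leaderboard. A minimal sketch of that aggregation, assuming each model already has a list of per-sample scores, might look like this:

```python
def build_leaderboard(per_sample_scores: dict[str, list[float]]) -> list[tuple[str, float]]:
    """Rank models by their mean score, highest first.

    per_sample_scores maps a model name to one score per evaluation sample.
    """
    mean_scores = {
        model: sum(scores) / len(scores)
        for model, scores in per_sample_scores.items()
    }
    return sorted(mean_scores.items(), key=lambda item: item[1], reverse=True)

# Example: the fine-tuned model edges out the base model.
leaderboard = build_leaderboard({
    "base-llm": [3.0, 4.0, 2.5],
    "my-finetuned-llm": [4.5, 4.0, 3.5],
})
for rank, (model, score) in enumerate(leaderboard, start=1):
    print(f"{rank}. {model}: {score:.2f}")
```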