
AI Dataset Preparation

Leverage our intelligent data engine to automatically prepare, label, and optimize your datasets. This system uses advanced data processing and quality assurance to make your data train-ready for any LLM.

When to Use AI Dataset Preparation

Use this feature when you have raw, unlabeled data and need to create a structured dataset for fine-tuning. An LLM helps you generate high-quality labeled data.

Prerequisite

This feature consumes tokens. You must first configure your API Keys in the Provider Keys section. If your provider is not listed, please contact us.

Step 1: Upload & Configure

In this step, you upload your raw data, select a dataset generation method, and define your training requirements. The system uses these inputs to generate labeling rules and build your dataset.

1. Method Selection

Choose the data preparation method that best suits your needs:

  • Base Data Agent: Standard processing from your raw data.
  • SSS-Bench: An advanced version offering a "T-shaped" evaluation framework to balance task specialization with general robustness. (Currently supports LLMs only.)

(Screenshot: Method Selection)


When to Use Each Method

|  | Base Data Agent | SSS-Bench |
| --- | --- | --- |
| Best for | Quick dataset generation with standard labeling | Improving domain expertise while preserving the base model's general capabilities |
| Model support | LLM, VLM | LLM only |
| Token consumption | Lower | Higher |
| Use when | You need train-ready data fast and do not require domain benchmarking | You want to ensure domain adaptation gains without significant degradation in general performance |
| Key advantage | Simple and efficient | Injects industry open-source data to improve both depth and breadth of your dataset, leading to better training results |

How SSS-Bench Works

Post-training adaptation (such as SFT) risks catastrophic forgetting. SSS-Bench provides a "Specialization-Generalization" dynamic evaluation across three dimensions:

  • Vertical Depth (Task Specialization): Generates diverse task-specific questions from your raw data to quantify domain adaptation gains.
  • Domain Breadth (Domain General Knowledge): Evaluates coverage of broad industry knowledge (e.g., medical, financial) using open-source datasets.
  • Horizontal Base (General Robustness): Monitors retention of basic reasoning capabilities (based on MMLU, etc.) to detect degradation in general utility.
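
The sketch below shows one hypothetical way results along these three dimensions might be organized; all field names and values are illustrative, not the platform's actual report schema.

```python
# Hypothetical shape of an SSS-Bench result; all field names and
# values are illustrative, not the platform's actual report schema.
sss_bench_report = {
    "vertical_depth": {            # task specialization on your raw data
        "score_before_training": 0.41,
        "score_after_training": 0.78,   # quantifies domain adaptation gain
    },
    "domain_breadth": {            # broad industry knowledge (open-source sets)
        "benchmark": "Medical Subfields Knowledge",
        "accuracy": 0.66,
    },
    "horizontal_base": {           # general robustness (MMLU-style)
        "benchmark": "MMLU",
        "accuracy_delta": -0.02,   # small drop = little catastrophic forgetting
    },
}
```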

SSS-Bench Configuration (SSS-Bench Only)

If you select SSS-Bench, three additional configuration steps are required:

Select Industry

Choose the target industry for your model: Finance, Healthcare, Legal, or Others.

Note: If your domain is not listed, select Others. SSS-Bench generates diverse datasets from your raw data and may produce better training results than Base Data Agent, but consumes more tokens.


(Screenshot: Industry Selection)


Select Domain Benchmarks

Choose specific sub-datasets within the industry to focus your evaluation.

| Industry | Sub-datasets |
| --- | --- |
| Finance | Financial Report QA, Financial Math (test set only) |
| Healthcare | Medical Subfields Knowledge, Medical Exams |
| Legal | Contracts & Transactions, Legal Knowledge & Exams, Judicial Analysis |

(Screenshot: Domain Benchmarks)


Select General Capability Benchmarks (Optional)

Select the general capabilities to monitor so you can verify that your model retains its foundational strengths.

Available benchmarks:

  • Knowledge & Comprehension: commonsense reasoning, world knowledge, reading comprehension
  • Language: natural language understanding, natural language inference, multilingual
  • Reasoning & Math: reasoning, math
  • Long Context: long_context (test set only)

(Screenshot: General Capabilities)


Feature: Even without raw data, SSS-Bench can synthesize customized training and test sets from open-source datasets, allowing you to experience the full fine-tuning flow.

2. Upload Unlabeled Data

Upload raw data to generate training and evaluation sets.

Supported File Types

Supported formats depend on the model type:

  • VLM (Vision-Language Model): image files only (e.g., JPG, PNG)
  • LLM (Large Language Model): image and text files (e.g., TXT, PDF, MD)

Note: Files that do not meet the format requirements are ignored and not included in your dataset.

Drag and drop files into the upload area, or click to select them.
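
If you want to pre-check your files before uploading, the sketch below shows one way to filter them by extension. The extension lists are assumptions based on the examples above, not the platform's authoritative whitelist.

```python
# A minimal pre-upload check. The extension lists mirror the format
# rules described above but are assumptions, not an official whitelist.
from pathlib import Path

ALLOWED_EXTENSIONS = {
    "VLM": {".jpg", ".jpeg", ".png"},                         # image files only
    "LLM": {".jpg", ".jpeg", ".png", ".txt", ".pdf", ".md"},  # image and text files
}

def check_uploads(files: list[str], model_type: str):
    """Split candidate files into (accepted, ignored) for a model type."""
    allowed = ALLOWED_EXTENSIONS[model_type]
    accepted = [f for f in files if Path(f).suffix.lower() in allowed]
    ignored = [f for f in files if Path(f).suffix.lower() not in allowed]
    return accepted, ignored

accepted, ignored = check_uploads(["report.pdf", "scan.png", "notes.docx"], "LLM")
print(accepted)  # ['report.pdf', 'scan.png']
print(ignored)   # ['notes.docx']
```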

3. Evaluation Dataset

Choose how to generate the evaluation set:

  • Auto Split: Automatically reserve a percentage of your data for evaluation (default: 30%).
  • Upload Your Own: Upload a pre-made evaluation set.
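
Conceptually, Auto Split works like the holdout split sketched below. This is a simplified illustration; the platform's actual splitting logic may differ (e.g., it may stratify samples).

```python
# A simplified illustration of a 70/30 holdout split; the platform's
# actual Auto Split logic may differ (e.g., stratified sampling).
import random

def auto_split(items, eval_ratio=0.30, seed=42):
    """Shuffle items and split them into (train, eval) sets."""
    rng = random.Random(seed)
    shuffled = items[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_ratio)
    return shuffled[n_eval:], shuffled[:n_eval]

train_set, eval_set = auto_split(list(range(100)))
print(len(train_set), len(eval_set))  # 70 30
```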



4. Requirements & Scenario

Describe your use case and expected response style. The system uses your description to prepare a training-ready dataset that aligns with your goals.

You may include:

  • Target audience
  • Scenario or domain
  • Expected response style
  • Boundaries or taboo topics
  • Evaluation considerations
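
For example (an illustrative description, not system output): "A customer-support assistant for a retail banking app; the audience is non-technical account holders; responses should be concise, friendly, and compliant; never provide investment advice; evaluate primarily on factual accuracy and tone."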

5. Task Name

Enter a descriptive name for your task.

6. Model Type & Post‑Training Method

Select the model type (LLM or VLM) and the post‑training method (SFT, REF DISTILL, or DPO) you plan to use.

Tip

If you are unsure which options to choose, click Recommend for Me, and the system will suggest the best configuration based on your inputs.

Step 2: Evaluation Set Labeling

In this step, you review the AI-generated labeling rules and generate the evaluation dataset.

(Screenshot: Evaluation Set Labeling)

1. AI-Generated Labeling Rules

After completing Step 1, the system automatically analyzes your uploaded data and scenario description to generate labeling rules and a dataset construction plan.

These rules define how the AI will label and structure your evaluation dataset. Review them carefully and correct anything that looks wrong.

The rules typically include:

  • Response style guidelines
  • Content and structural requirements
  • Quality and accuracy expectations
  • Dataset construction logic

You can edit or refine any part of the rules before proceeding.
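
As an illustration (not actual system output), a generated rule might read: "Answers must be grounded in the uploaded documents, stay under 150 words, and maintain a formal tone."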

2. Token Usage Reminder

Once the system generates the labeling rules, the interface displays:

  • Tokens already consumed
  • Estimated tokens required to generate the full evaluation dataset

(Screenshot: Token Budget)

Insufficient tokens may interrupt or delay dataset generation. Make sure your remaining token budget is sufficient before proceeding.

3. Generate Evaluation Dataset

After confirming the AI-Generated Labeling Rules, click Generate to create your evaluation dataset.

Note: You must manually confirm the labeling rules before generation.

4. Review and Edit Generated Labels

When the evaluation dataset is generated, you can review each labeled item. The interface allows you to:

  • Modify incorrect or incomplete labels
  • Delete unwanted labeled samples
  • Confirm labels that look correct

You can review items one by one or process them in batches.

Step 3: Training Set Labeling

In this step, the system automatically generates labeled training data based on the confirmed labeling rules and the dataset construction plan from the previous steps.

(Screenshot: Training Set Labeling)

1. Auto‑Generated Training Data

The model produces labeled outputs in bulk. The output format matches your selected training model and method.

Review the generated results to ensure they meet your expectations.
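
For context on what "format matches your method" can mean, the records below show commonly used shapes for SFT and DPO data. They are generic examples, not this platform's exact export schema.

```python
# Commonly used record shapes for post-training data. These are
# generic examples, not this platform's exact export schema.

sft_record = {  # supervised fine-tuning: prompt plus target response
    "prompt": "Summarize the attached quarterly report.",
    "response": "Revenue grew 12% year over year, driven by ...",
}

dpo_record = {  # direct preference optimization: chosen vs. rejected response
    "prompt": "Summarize the attached quarterly report.",
    "chosen": "Revenue grew 12% year over year, driven by ...",
    "rejected": "The report is about money stuff.",
}
```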

2. Regenerate If Needed

If the generated labels are inaccurate, incomplete, or misaligned with your rules, you can click Regenerate.

The system will recreate the training labels for the dataset based on the same rules. Use this option whenever you feel the overall quality needs improvement.

3. Proceed When Satisfied

Once you confirm the training labels are correct, continue to the Complete step to finalize your dataset.

Step 4: Complete

In the final step, you review and confirm your fully processed dataset. Once confirmed, your dataset is ready for training.

(Screenshot: Complete)

1. Final Dataset Overview

After the system generates the training set and evaluation set, a summary of your completed dataset appears.
The summary includes key information such as:

  • Dataset name
  • Target vs. actual training sample counts
  • Target vs. actual evaluation sample counts
  • Number of duplicates removed
  • Dataset purpose
  • Total processing time

Review the summary to verify that the dataset meets your expectations.

2. Review Data Quality Report

The data quality report shows how the system constructed the dataset and whether the output aligns with your initial requirements.
If needed, you can still update the dataset name at this stage.

3. Finalize the Dataset

When everything looks correct, click Complete & Return to Datasets.
You can then download, store, or use your newly generated train‑ready dataset for further fine‑tuning.

Next Steps

Fine-Tuning

Use your created dataset to fine-tune models for better performance on your specific tasks.

Model Evaluation

Evaluate your models using your dataset. SSS-Bench datasets support granular reporting on Specialization, Domain Knowledge, and General Robustness.