Skip to main content

Create Dataset

Learn how to create and manage datasets for training and evaluation purposes.

For detailed information about fine-tuning methods, see the Fine‑Tuning documentation.

Dataset Creation Methods

Smart Studio offers multiple ways to create datasets for training your models.

Manual Upload

Upload your existing datasets in various formats including JSON, CSV, TXT, and more.

Features:

  • Multiple file formats
  • Batch upload support
  • Data validation
AI Dataset Preparation

Let AI automatically prepare and label your datasets with intelligent data processing.

Features:

  • Automated labeling
  • Smart data splitting
  • Quality assurance

Manual Upload Configuration

Select your dataset settings and upload your data.

Create Dataset


Dataset Name

Enter a name for your dataset. We recommend including a version number to distinguish between datasets.

Example: qwen-sft-v1, qwen-dpo-v2

Dataset Type

Select the type of dataset you want to create.

  • Training Set: The data used to teach the model patterns and knowledge for your specific task.
  • Evaluation Set: A separate set of unseen data used to measure your model's performance after training.

Training Category

Select a training category that matches your fine-tuning task. The required data format depends on the category you choose. See Data Format Reference for details.

  • LLM SFT Generation: Supervised fine-tuning for large language models.
  • LLM DPO Generation: Preference alignment training for large language models.
  • VLM SFT Generation: Supervised fine-tuning for vision-language models.
  • VLM DPO Generation: Preference alignment training for vision-language models.

Upload Method

You can upload your dataset in one of two ways.

Local Upload

Click the upload area or drag and drop your file. Supported formats: TAR, JSONL.

OSS

Enter the full OSS path to your dataset file.

Format: oss://bucket-name.oss-region/path/to/file.jsonl

Data Format Reference

The required data format depends on the training category you selected. Each line should be a valid JSON object.

LLM SFT Generation

{
"messages": [
{"role": "system", "content": "<system>"},
{"role": "user", "content": "<query1>"},
{"role": "assistant", "content": "<response1>"},
{"role": "user", "content": "<query2>"},
{"role": "assistant", "content": "<response2>"}
]
}

LLM DPO Generation

{
"messages": [
{"role": "system", "content": "<system>"},
{"role": "user", "content": "<query1>"},
{"role": "assistant", "content": "<response1>"},
],
"rejected_response": "<reject_response>"
}

VLM SFT Generation

{
"messages": [
{"role": "system", "content": "<system>"},
{"role": "user", "content": "<query1>"},
{"role": "assistant", "content": "<response1>"},
],
"images": ["/xxx/x.jpg", "/xxx/x.png"]
}
Image Placeholders

The <image> placeholder in the user content is optional. If included, the number of <image> placeholders must match the number of paths in the images array.

VLM DPO Generation

{
"messages": [
{"role": "system", "content": "<system>"},
{"role": "user", "content": "<query1>"},
{"role": "assistant", "content": "<response1>"},
],
"images": ["/xxx/x.jpg", "/xxx/x.png"],
"rejected_response": "<reject_response>"
}

Note: The <image> placeholder in the user content is optional. If included, the number of <image> placeholders must match the number of paths in the images array.

Manage Datasets

My Datasets page lists all datasets that you create. Use this page to view dataset details, edit configurations, and delete datasets you no longer need.

Datasets list

Next Steps

Fine-Tuning

Use your created dataset to fine-tune models for better performance on your specific tasks.

Model Evaluation

Evaluate your models using your dataset to measure performance and accuracy.