Direct Preference Optimization - VLM

Learn how to train vision-language models using Direct Preference Optimization (DPO) to align multimodal model behavior with human preferences through comparative feedback.

Purpose and Overview

Direct Preference Optimization (DPO) for Vision-Language Models (VLM) optimizes multimodal models based on human preferences and comparative feedback. DPO-VLM learns from preference rankings on image-text tasks to align model behavior with human values and expectations for multimodal outputs.
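
For background, DPO skips training a separate reward model and optimizes the policy directly on preference pairs. The standard objective from the DPO literature is shown below for reference only; the platform applies it internally. Here x is the (multimodal) prompt, y_w the chosen response, y_l the rejected response, and π_ref a frozen copy of the base model:

\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]

The coefficient β is exposed as the beta parameter in Step 3: larger values penalize drift from the reference model more heavily.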

Step 1: Choose Base Model

Select an appropriate vision-language base model as the foundation for your DPO training. The choice of base model significantly impacts the final performance and alignment capabilities of the optimized model. For detailed model comparisons and selection criteria, see How to Choose Models.

Selection Tips
  • Start with Instruct models for most conversational applications (e.g., Qwen3-VL-4B-Instruct).
  • Choose Thinking models when your task requires step-by-step reasoning (e.g., Qwen3-VL-4B-Thinking).

Step 2: Dataset & Evaluation

Upload your multimodal preference dataset containing comparative examples with images and configure evaluation settings to monitor training progress and alignment quality.

Smart Studio provides multiple ways to prepare datasets:

  • Upload a dataset directly. For instructions, see Create Datasets.
  • Use AI Dataset Preparation to automate the dataset creation process.
  • Provide the OSS address of the data without uploading the file to the platform.

Dataset Requirements

File Format

Files must be in JSONL format, with each line containing one multimodal preference pair: a conversation with the chosen response plus a rejected response. Image references must be included in the conversation content.

Dataset Size

Recommended size: 100-100,000 examples. Vision models typically require fewer examples than text-only models.

Image Quality

High-quality images are essential. Ensure images are clear, properly formatted, and relevant to the task.

Preference Quality

Clear preference distinctions are essential. Ensure chosen responses demonstrate better visual understanding and description quality than rejected ones.

DPO-VLM Data Quality Tips
  • Ensure clear quality differences between chosen and rejected responses for image-text tasks
  • Include diverse visual scenarios covering different image types and alignment aspects
  • Maintain consistent preference criteria for visual descriptions throughout the dataset
  • Validate that preference rankings reflect accurate visual understanding and human values

Required Data Format

{
  "messages": [
    {"role": "system", "content": "<system>"},
    {"role": "user", "content": "<query1>"},
    {"role": "assistant", "content": "<response1>"}
  ],
  "images": ["/xxx/x.jpg", "/xxx/x.png"],
  "rejected_response": "<reject_response>"
}

Format Explanation

  • <image> placeholders in the user content correspond, in order, to the entries in the images field; the number of placeholders must match the number of image paths.
  • When the user content contains no <image> placeholders, the example is treated as a text-only (LLM) preference example, so zero images are allowed.
  • <image> placeholders may appear only in user content.
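
Before uploading, you can catch most formatting problems with a quick local check of these rules. The following Python sketch is illustrative only (the file name dataset.jsonl is a placeholder, and it checks images but not videos); it is not a platform tool:

import json

def validate_example(ex: dict) -> list[str]:
    """Return a list of problems found in one preference example."""
    problems = []
    messages = ex.get("messages", [])
    if not any(m.get("role") == "assistant" for m in messages):
        problems.append("missing assistant (chosen) response")
    if "rejected_response" not in ex:
        problems.append("missing rejected_response")
    # <image> may appear only in user content, and the placeholder
    # count must match the number of entries in "images".
    placeholders = 0
    for m in messages:
        count = m.get("content", "").count("<image>")
        if m.get("role") == "user":
            placeholders += count
        elif count:
            problems.append(f"<image> found in {m.get('role')} content")
    if placeholders != len(ex.get("images", [])):
        problems.append(f"{placeholders} <image> placeholder(s) vs "
                        f"{len(ex.get('images', []))} image path(s)")
    return problems

with open("dataset.jsonl", encoding="utf-8") as f:
    for i, line in enumerate(f, 1):
        for problem in validate_example(json.loads(line)):
            print(f"line {i}: {problem}")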

Example Data Formats

{
  "messages": [
    {"role": "system", "content": "You are a helpful and harmless assistant"},
    {"role": "user", "content": "<image>Describe this image in detail"},
    {"role": "assistant", "content": "The image shows a golden retriever playing in a sunny park with green grass and trees in the background"}
  ],
  "images": ["/data/park_dog.jpg"],
  "rejected_response": "It's a dog"
}
{
  "messages": [
    {"role": "user", "content": "<image><image>Which image shows a safer driving environment? Explain your reasoning"},
    {"role": "assistant", "content": "The first image shows a safer driving environment. The road is dry with clear visibility and light traffic. The second image shows wet road conditions with reduced visibility and heavy traffic, which increases the risk of accidents"}
  ],
  "images": ["/data/road_clear.jpg", "/data/road_wet.jpg"],
  "rejected_response": "The first one looks safer"
}
{
  "messages": [
    {"role": "system", "content": "You are a medical imaging assistant"},
    {"role": "user", "content": "<image>Analyze this X-ray image and describe your findings"},
    {"role": "assistant", "content": "The chest X-ray shows clear lung fields with no signs of consolidation or effusion. The cardiac silhouette appears normal in size. The bony structures are intact with no visible fractures"}
  ],
  "images": ["/data/chest_xray.jpg"],
  "rejected_response": "The X-ray looks normal"
}
{
  "messages": [
    {"role": "user", "content": "<image>Describe the image, <video>and summarize what happens in the video"},
    {"role": "assistant", "content": "The image shows a chef preparing ingredients on a cutting board. The video demonstrates the complete recipe process, starting with chopping vegetables, followed by sautéing in a pan, and finishing with plating the dish"}
  ],
  "images": ["/data/chef.jpg"],
  "videos": ["/data/cooking.mp4"],
  "rejected_response": "Someone is cooking"
}
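
Note that the examples above are pretty-printed for readability; in the uploaded JSONL file, each example must occupy exactly one line. A minimal serialization sketch (the file name and variable names are illustrative):

import json

examples = [
    {
        "messages": [
            {"role": "user", "content": "<image>Describe this image in detail"},
            {"role": "assistant", "content": "A golden retriever plays in a sunny park"},
        ],
        "images": ["/data/park_dog.jpg"],
        "rejected_response": "It's a dog",
    },
]

# JSONL: one compact JSON object per line, no pretty-printing.
with open("dataset.jsonl", "w", encoding="utf-8") as f:
    for example in examples:
        f.write(json.dumps(example, ensure_ascii=False) + "\n")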

Step 3: Settings & Options

Configure DPO-VLM training parameters and model settings. Default values are optimized for multimodal preference learning, but you can adjust them based on your specific alignment requirements and dataset characteristics.


Basic Configuration

Custom Model Name

This name is displayed in My Models for management purposes. Choose a descriptive name that helps you identify the model's purpose and version.

Example: "VLM-Safety-Aligned-v1" or "Visual-Preference-Model"

Task Display Name

Set a display name for this DPO-VLM training task. The name appears in the Fine-tuning task list and helps you track alignment progress and history.

Example: "Q1-2025-DPO-VLM-Alignment" or "Visual-Safety-DPO-Jan"

Training Parameters

The following parameters apply to all fine-tuning methods unless otherwise noted. LoRA is the default fine-tuning method.

  • epoch: The number of complete passes through the training dataset.
    Increase: More learning opportunities, but the model may perform well on training data while producing poor results on new inputs.
    Decrease: Trains faster, but the model may not learn enough to perform well.
  • batch_size: The number of training examples processed in a single group. A large batch size may cause out-of-memory errors.
    Increase: Produces more consistent training updates, but uses significantly more GPU memory.
    Decrease: Reduces GPU memory usage, but training updates may become less consistent.
  • learning_rate: Controls the size of each adjustment the model makes during training.
    Increase: The model learns faster, but training may become unstable and fail to reach a good solution.
    Decrease: Training becomes more stable and precise, but takes longer and may settle on a suboptimal solution.
  • lora_rank: Sets the learning capacity of the LoRA adapters.
    Increase (e.g., 16, 32): Improves the model's ability to learn complex tasks, but uses more GPU memory.
    Decrease (e.g., 4, 8): Reduces GPU memory usage, but the model may struggle with complex tasks.
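
As a rough sizing rule (general LoRA arithmetic, not a platform-specific figure), an adapter of rank r on a weight matrix with input dimension d_in and output dimension d_out adds

r × (d_in + d_out)

trainable parameters. For example, on a hypothetical 4096 × 4096 attention projection, rank 16 adds 16 × (4096 + 4096) = 131,072 parameters per adapted matrix, twice the 65,536 added by rank 8, which is why GPU memory usage grows with lora_rank.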

Advanced Parameters

These parameters work well with default values for most use cases. Adjust only when needed.

  • max_context_length: Sets the maximum token limit per example; text exceeding this limit is truncated.
    Increase to learn from longer texts, at the cost of significantly more GPU memory.
  • warmup_ratio: The fraction of training used for a warm-up phase, during which the learning rate slowly increases to prevent early training instability.
    A small value (0.03–0.1) is generally recommended. This is primarily a stability mechanism, not a performance tuning parameter.
  • gradient_accumulation_steps: The number of small batches to process before the model performs a single learning update, simulating a larger batch size to save memory.
    Increase for more stable training at the cost of slower speed. A value of 1 disables this feature.
  • target_modules: The specific internal components (layers) of the model that LoRA modifies.
    Adding more modules allows more comprehensive adaptation but increases the number of trainable parameters.
  • MAX_PIXELS: The maximum allowed pixel count (height × width) for input images; the system automatically resizes any image exceeding this limit to prevent memory errors.
    Setting a value is recommended to prevent out-of-memory errors. Leave as None only if your dataset contains no large images.
  • beta: The DPO KL regularization coefficient; controls how closely the trained model stays to the reference model's behavior during DPO training.
    Increase to keep the model closer to its original behavior, reducing the risk of overfitting to preference data.
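
To make the effect of beta concrete, here is a minimal PyTorch sketch of the per-pair DPO loss (the standard formulation shown in the overview, not the platform's internal implementation). The inputs are summed log-probabilities of each full response under the trained policy and the frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed response log-probabilities."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp        # log pi/pi_ref, chosen
    rejected_ratio = policy_rejected_logp - ref_rejected_logp  # log pi/pi_ref, rejected
    # Larger beta scales the log-ratio margin, pulling the policy
    # more strongly back toward the reference model's behavior.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Example: a batch of two preference pairs (illustrative log-prob values).
loss = dpo_loss(torch.tensor([-10.0, -12.0]), torch.tensor([-15.0, -14.0]),
                torch.tensor([-11.0, -12.5]), torch.tensor([-14.0, -13.5]))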

After reviewing the configuration, click Create Task to begin the training process.

Step 4: Monitor Training Progress

During and after training, check key training metrics at any time.


The Model Loss chart displays two metrics:

  • Training Loss: Measures how well the model learns from your training data.
  • Validation Loss: Measures how well the model generalizes to unseen data.
Interpret training metrics
  • If both losses decrease steadily, your model is learning well. Continue training.
  • If training loss decreases but validation loss increases, your model may be overfitting. Stop training and deploy the current model.
  • If both losses remain high or increase, your training data or configuration may need adjustment. Review your dataset and parameters.
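
If you track the loss values yourself, the rules above can be automated with a crude trend check such as this sketch (the window size is arbitrary, and the input lists are whatever your own monitoring records provide):

def diagnose(train_losses, val_losses, window=5):
    """Compare the trend of the last `window` evaluation points."""
    t, v = train_losses[-window:], val_losses[-window:]
    if t[-1] < t[0] and v[-1] > v[0]:
        return "possible overfitting: consider stopping and deploying"
    if t[-1] < t[0] and v[-1] <= v[0]:
        return "learning well: continue training"
    return "losses not improving: review dataset and parameters"

print(diagnose([1.2, 1.0, 0.8, 0.6, 0.5], [1.1, 1.0, 1.0, 1.1, 1.2]))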

DPO-VLM Parameter Guidelines

  • Start with defaults: default values are optimized for vision tasks.

  • MAX_PIXELS: controls image resolution; higher values preserve detail but increase memory usage (see the resizing sketch after this list).

  • Learning rate: lower values give more stable training; higher values give faster convergence.
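
For intuition about what the MAX_PIXELS cap does, the Pillow sketch below downscales any image above a pixel budget while preserving aspect ratio. It is illustrative only; the platform performs its own resizing, and the budget value here is an assumption:

from PIL import Image

MAX_PIXELS = 1024 * 1024  # assumed budget of roughly 1 megapixel

def cap_pixels(path: str) -> Image.Image:
    """Downscale an image so that width * height <= MAX_PIXELS."""
    img = Image.open(path)
    w, h = img.size
    if w * h > MAX_PIXELS:
        scale = (MAX_PIXELS / (w * h)) ** 0.5  # uniform scale keeps aspect ratio
        img = img.resize((max(1, int(w * scale)), max(1, int(h * scale))))
    return img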

Next Steps

Deploy the Fine-Tuned Model

Once training is complete, deploy your fine-tuned model to a production endpoint for real-world usage.

Test in Model Lab

Test your fine-tuned model's performance and compare against base models in our interactive testing environment.