Qwen2.5-VL - Vision Language Model¶

Qwen2.5-VL is a state-of-the-art vision-language model developed by Alibaba Cloud that excels at understanding images and answering questions about visual content.

Overview¶

Qwen2.5-VL is a powerful multimodal model capable of:

Visual Question Answering - Answer complex questions about image content
Image Captioning - Generate detailed descriptions of images
Multi-image Understanding - Reason across multiple images simultaneously
Fine-grained Recognition - Identify specific objects, text, and details
Spatial Reasoning - Understand spatial relationships and layouts

GPU Recommended

Qwen2.5-VL works best with GPU acceleration. CPU inference may be very slow or may not work properly.

License & Attribution

License: Apache 2.0
Source: Qwen Team
Paper: Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Pre-trained Model IDs¶

Qwen2.5-VL pre-trained models are available and do not require a Roboflow API key.

Model ID	Description
`qwen25-vl-7b`	7B parameter model - balanced performance and speed

You can also use fine-tuned models from Roboflow by specifying project/version as the model ID (requires API key).

Supported Backends¶

Backend	Extras Required
`torch`	`torch-cpu`, `torch-cu118`, `torch-cu124`, `torch-cu126`, `torch-cu128`, `torch-jp6-cu126`

Roboflow Platform Compatibility¶

Feature	Supported
Training	✅ LoRA fine-tuning only
Upload Weights	✅ Upload fine-tuned models
Serverless API (v2)	⚠️ Limited support (not yet fully stable)
Workflows	✅ Use in Workflows via Qwen2.5-VL block
Edge Deployment (Jetson)	❌ Not supported
Self-Hosting	✅ Deploy with `inference-models` (GPU recommended)

Usage Examples¶

Visual Question Answering¶

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
image = cv2.imread("path/to/image.jpg")

# Ask a question
answers = model.prompt(
    images=image,
    prompt="What objects are visible in this image?",
    max_new_tokens=512
)
print(f"Answer: {answers[0]}")

Image Captioning¶

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
image = cv2.imread("path/to/image.jpg")

# Generate detailed caption
captions = model.prompt(
    images=image,
    prompt="Describe this image in detail.",
    max_new_tokens=512
)
print(f"Caption: {captions[0]}")

Multi-image Understanding¶

import cv2
from inference_models import AutoModel

# Load model
model = AutoModel.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Load multiple images
image1 = cv2.imread("path/to/image1.jpg")
image2 = cv2.imread("path/to/image2.jpg")

# Compare images
answers = model.prompt(
    images=[image1, image2],
    prompt="What are the differences between these two images?",
    max_new_tokens=512
)
print(f"Answer: {answers[0]}")

Using Fine-tuned Models¶

import cv2
from inference_models import AutoModel

# Load your fine-tuned model from Roboflow
model = AutoModel.from_pretrained(
    "your-project/version",
    api_key="your_roboflow_api_key"
)

image = cv2.imread("path/to/image.jpg")

# Use with custom prompt
answers = model.prompt(
    images=image,
    prompt="your custom question",
    max_new_tokens=512
)
print(f"Answer: {answers[0]}")

Workflows Integration¶

Qwen2.5-VL can be used in Roboflow Workflows for complex computer vision pipelines.

Learn more: Workflows Documentation

Performance Tips¶

Use GPU - Qwen2.5-VL requires GPU for acceptable performance
Optimize prompts - Clear, specific prompts yield better results
Adjust max_new_tokens - Increase for longer responses, decrease for faster inference