AI research
Nov 28, 2025

Jonathan Kemper

Qwen3-VL can scan two-hour videos and pinpoint nearly every detail

Image: Alibaba


A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage.

The system handles massive inputs, processing two-hour videos or hundreds of document pages within a 256,000-token context window that can be extended to roughly one million tokens.
In "needle-in-a-haystack" tests, the flagship 235-billion-parameter model located individual frames in 30-minute videos with 100 percent accuracy. Even in two-hour videos containing roughly one million tokens, accuracy held at 99.5 percent. The test works by inserting a semantically important "needle" frame at random positions in long videos, which the system must then find and analyze.
The needle-in-a-haystack test measures the model's ability to locate specific frames in long videos. | Image: Alibaba
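
To make the protocol concrete, here is a minimal harness in the spirit of such a test. It is a sketch under stated assumptions: `answer_fn` stands in for the model being evaluated, and the frame representation and tolerance are illustrative rather than details from Alibaba's report.

```python
import random

def run_needle_test(num_trials, num_frames, answer_fn, tolerance=1):
    """Insert a marked 'needle' frame at a random index in a synthetic frame
    sequence and ask the model for that index; answers within `tolerance`
    frames count as correct."""
    hits = 0
    for _ in range(num_trials):
        needle_pos = random.randrange(num_frames)
        frames = ["ordinary frame"] * num_frames
        frames[needle_pos] = "needle frame"     # the semantically distinct frame
        predicted = answer_fn(frames)           # model returns a frame index
        if abs(predicted - needle_pos) <= tolerance:
            hits += 1
    return hits / num_trials

# Toy usage with an oracle "model" that always finds the needle:
oracle = lambda frames: frames.index("needle frame")
print(run_needle_test(num_trials=100, num_frames=3600, answer_fn=oracle))  # 1.0
```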


In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI's GPT-5, and Claude Opus 4.1, even when those competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent).


Gemini's older 2.5 Pro model maintains a slight lead in general image understanding. | Image: Alibaba
The model also shows range in specialized benchmarks. It scored 96.5 percent on the DocVQA document comprehension test and 875 points on OCRBench. Its optical character recognition now covers 39 languages, nearly four times as many as its predecessor.
Qwen3-VL achieves over 70 percent accuracy on OCR tasks in 32 of the 39 supported languages. | Image: Alibaba
Alibaba claims the system demonstrates new capabilities in GUI agent tasks. It achieved 61.8 percent accuracy on ScreenSpot Pro, which tests navigation in graphical user interfaces. On AndroidWorld, where the system must independently operate Android apps, Qwen3-VL-32B hit 63.7 percent.
The model handles complex, multi-page PDF documents as well. It scored 56.2 percent on MMLongBench-Doc for long document analysis. On the CharXiv benchmark for scientific charts, it reached 90.5 percent on description tasks and 66.2 percent on complex reasoning questions.
It is not a clean sweep, however. In the complex MMMU-Pro test, Qwen3-VL scored 69.3 percent, trailing GPT-5's 78.4 percent. Commercial competitors also generally lead in video QA benchmarks. The data suggests Qwen3-VL is a specialist in visual math and documents, but still lags in general reasoning.
Key technical advances for multimodal AI
The technical report outlines three main architectural upgrades. First, "interleaved MRoPE" replaces the previous position embedding method. Instead of assigning contiguous blocks of embedding dimensions to time, height, and width, the new approach interleaves the three axes across the full frequency range. This change aims to boost performance on long videos.
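
A rough way to picture the difference, as a sketch rather than code from the report: classic MRoPE hands each axis a contiguous block of rotary dimensions, while the interleaved variant cycles the time, height, and width axes across the whole range. The split sizes below are placeholders.

```python
def grouped_assignment(split=(16, 24, 24)):
    """Contiguous blocks: the first dims encode time (t), then height (h), then width (w)."""
    axes = []
    for axis, count in zip("thw", split):
        axes.extend(axis * count)
    return axes

def interleaved_assignment(num_dims):
    """Round-robin: time, height, and width each touch low- and high-frequency dims."""
    return ["thw"[i % 3] for i in range(num_dims)]

print("".join(grouped_assignment()))        # tttt...hhhh...wwww
print("".join(interleaved_assignment(64)))  # thwthwthw...
```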


Qwen3-VL combines a vision encoder and language model to process text, images, and videos simultaneously. DeepStack uses visual information from different processing levels. | Image: Alibaba
Second, DeepStack technology allows the model to access intermediate results from the vision encoder, not just the final output. This gives the system access to visual information at different levels of detail.
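
The following PyTorch sketch shows one way such multi-level fusion could look. The tapped layer indices, the simple additive fusion, and all tensor sizes are assumptions made for illustration; the actual DeepStack mechanism in Qwen3-VL is more involved.

```python
import torch
import torch.nn as nn

class DeepStackStyleFusion(nn.Module):
    """Illustrative only: project features from several intermediate vision-encoder
    layers into the language model's hidden size and add them to the visual tokens."""
    def __init__(self, vision_dim, llm_dim, tap_layers=(8, 16, 24)):
        super().__init__()
        self.tap_layers = tap_layers
        self.projs = nn.ModuleList(nn.Linear(vision_dim, llm_dim) for _ in tap_layers)

    def forward(self, vision_hidden_states, visual_token_slots):
        # vision_hidden_states: list of per-layer ViT outputs, each [batch, tokens, vision_dim]
        # visual_token_slots:   the visual positions inside an LLM layer, [batch, tokens, llm_dim]
        fused = visual_token_slots
        for proj, layer_idx in zip(self.projs, self.tap_layers):
            fused = fused + proj(vision_hidden_states[layer_idx])
        return fused

# Toy shapes: a 32-layer vision encoder producing 256 visual tokens.
vit_outputs = [torch.randn(1, 256, 1024) for _ in range(32)]
fusion = DeepStackStyleFusion(vision_dim=1024, llm_dim=2048)
print(fusion(vit_outputs, torch.zeros(1, 256, 2048)).shape)  # torch.Size([1, 256, 2048])
```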
Third, a text-based timestamp system replaces the complex T-RoPE method found in Qwen2.5-VL. Instead of assigning a mathematical time position to every video frame, the system now inserts simple text markers like "<3.8 seconds>" directly into the input. This simplifies the process and improves the model's grasp of time-based video tasks.
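
A minimal sketch of what that interleaving could look like when the input is assembled; the marker format and sampling rate are assumptions, and real frame placeholders are embeddings rather than strings.

```python
def build_video_prompt(frame_placeholders, seconds_per_frame=2.0):
    """Interleave plain-text timestamps with per-frame placeholders."""
    parts = []
    for i, frame in enumerate(frame_placeholders):
        parts.append(f"<{i * seconds_per_frame:.1f} seconds>")  # assumed marker format
        parts.append(frame)
    return " ".join(parts)

print(build_video_prompt(["[frame_0]", "[frame_1]", "[frame_2]"]))
# <0.0 seconds> [frame_0] <2.0 seconds> [frame_1] <4.0 seconds> [frame_2]
```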
Training at scale with one trillion tokens
Alibaba trained the model in four phases on up to 10,000 GPUs. After learning to link images and text, the system underwent full multimodal training on about one trillion tokens. Data sources included web scrapes, 3 million PDFs from Common Crawl, and over 60 million STEM tasks.
In later phases, the team gradually expanded the context window from 8,000 to 32,000 and finally to 262,000 tokens. The "Thinking" variants received specific chain-of-thought training, allowing them to explicitly map out reasoning steps for better results on complex problems.
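
Read as a schedule, the staged training might look roughly like the following reconstruction. Everything beyond the context lengths and the one-trillion-token figure is an assumption, not Alibaba's actual configuration.

```python
# Illustrative reconstruction of the staged training described above.
training_phases = [
    {"phase": 1, "name": "vision-language alignment",                  "max_context": 8_192},
    {"phase": 2, "name": "full multimodal pretraining (~1T tokens)",   "max_context": 8_192},
    {"phase": 3, "name": "long-context extension",                     "max_context": 32_768},
    {"phase": 4, "name": "ultra-long context (videos, documents)",     "max_context": 262_144},
]
for p in training_phases:
    print(f"Phase {p['phase']}: {p['name']}, context up to {p['max_context']:,} tokens")
```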


Open weights under Apache 2.0
All Qwen3-VL models released since September are available under the Apache 2.0 license with open weights on Hugging Face. The lineup includes dense variants ranging from 2B to 32B parameters, as well as mixture-of-experts models: the 30B-A3B and the massive 235B-A22B.
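
Because the weights are on Hugging Face, a short loading sketch with the transformers library is shown below. The repository name, the auto classes, and the chat-template call are assumptions based on how earlier Qwen-VL releases are typically used; the model card has the officially supported snippet.

```python
# Hedged sketch: loading a Qwen3-VL checkpoint via transformers. Repo id and
# auto classes are assumptions; consult the Hugging Face model card for the
# officially supported usage.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-30B-A3B-Instruct"  # assumed repository name
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "What is the value of the tallest bar?"},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```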
While features like extracting frames from long videos aren't new - Google's Gemini 1.5 Pro handled this in early 2024 - Qwen3-VL offers competitive performance in an open package. With the previous Qwen2.5-VL already common in research, the new model is likely to drive further open-source development.


Summary

Alibaba's Qwen3-VL, launched in September, outperforms GPT-5 and Gemini 2.5 Pro on benchmarks that require solving math questions using images, analyzing videos, and understanding documents.
The latest technical report highlights that Qwen3-VL can process very long videos and large amounts of text at the same time, accurately identify video frames, and recognize text in 39 languages.
The model was trained on about one trillion tokens of text and image data using up to 10,000 GPUs, and it is openly available under the Apache 2.0 license.

Sources

Arxiv
