Qwen3-VL can scan two-hour videos and pinpoint nearly every detail
Recorded: Dec. 3, 2025, 3:04 a.m.
Original:
THE DECODER | AI research | By Jonathan Kemper. Jonathan writes for THE DECODER about how AI tools can improve both work and creative projects.
Content summary: A few months after launching Qwen3-VL, Alibaba has released a detailed technical report on the open multimodal model. The data shows the system excels at image-based math tasks and can analyze hours of video footage.

The system handles massive data loads, processing two-hour videos or hundreds of document pages within a 256,000-token context window.

In published benchmarks, the Qwen3-VL-235B-A22B model often beats Gemini 2.5 Pro, OpenAI GPT-5, and Claude Opus 4.1, even when competitors use reasoning features or high thinking budgets. The model dominates visual math tasks, scoring 85.8 percent on MathVista compared to GPT-5's 81.3 percent. On MathVision, it leads with 74.6 percent, ahead of Gemini 2.5 Pro (73.3 percent) and GPT-5 (65.8 percent). Gemini's older 2.5 Pro model maintains a slight lead in general image understanding. | Image: Alibaba

Qwen3-VL combines a vision encoder and a language model to process text, images, and videos simultaneously. DeepStack uses visual information from different processing levels. | Image: Alibaba

Open weights under Apache 2.0: the released checkpoints span 2B to 235B parameters.
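For readers who want to try the open weights, here is a minimal sketch of image-based inference. It assumes the instruct variants follow the standard transformers image-text-to-text interface used by earlier Qwen VL releases; the repo id, class choice, and chat format below are assumptions, not details confirmed by the report.

```python
# Hedged sketch: loading an open-weight Qwen3-VL checkpoint for image Q&A.
# The repo id and Auto* classes are assumptions based on how earlier Qwen VL
# models ship on Hugging Face; check the actual model card before use.
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Qwen/Qwen3-VL-2B-Instruct"  # assumed repo id for the 2B variant

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

# One user turn containing an image plus a text question.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/chart.png"},  # placeholder image
        {"type": "text", "text": "What is the highest value in this chart?"},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```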
Summary: Alibaba's Qwen3-VL, launched in September, outperforms GPT-5 and Gemini 2.5 Pro on benchmarks that require solving math questions using images, analyzing videos, and understanding documents. Sources: arXiv
Summarized:

Alibaba's Qwen3-VL represents a significant advance in multimodal AI, demonstrating the capacity to analyze extensive video content and complex documents in exceptional detail. Released in September, the model's core strength lies in tasks that require combined visual and mathematical reasoning.

The technical report highlights several architectural upgrades behind this capability. First, "interleaved MRoPE," a revised position embedding method, improves how the model represents positions across lengthy video sequences, addressing a core limitation of previous iterations. Second, DeepStack gives the language model access to intermediate visual features from multiple processing levels, enhancing its ability to dissect visual information. Finally, a text-based timestamp system replaces the more complex T-RoPE method, simplifying time-based video analysis.

The training regimen, which processed approximately one trillion tokens across 10,000 GPUs, underscores the model's scale and ambition. Data sources included web scrapes, Common Crawl PDFs, and a substantial STEM task dataset, reflecting a diverse and expansive learning environment.

Qwen3-VL's capabilities are borne out by rigorous benchmarks. In the "needle-in-a-haystack" test, it located specific frames within two-hour videos with 99.5 percent accuracy, even at a million tokens of context. It outperforms GPT-5 and Gemini 2.5 Pro on visual math tasks, scoring 85.8 percent on MathVista and 74.6 percent on MathVision. The model also excels at document comprehension, scoring 96.5 percent on DocVQA and 56.2 percent on the MMLongBench-Doc benchmark for long documents. OCR support covers 39 languages (evaluated on OCRBench), and the model reaches 61.8 percent on the ScreenSpot Pro GUI agent task and 63.7 percent on AndroidWorld, where it operates Android apps on its own.

Despite these results, the model is not without limitations: it trails GPT-5 on the demanding MMMU-Pro test, indicating a remaining gap in general reasoning. Even so, its release under the Apache 2.0 license, with weights ranging from 2B to 235B parameters, positions it as a cornerstone for the open-source community, and its accessible architecture combined with the substantial training data is expected to accelerate multimodal AI research and application.
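The interleaved MRoPE change can be pictured as a different way of dividing rotary position-embedding frequencies among the time, height, and width axes. Below is an illustrative sketch, assuming the earlier block-wise MRoPE assigned each axis a contiguous chunk of frequencies while the interleaved variant round-robins them across the spectrum; the report's exact allocation rule is not reproduced here.

```python
# Illustrative sketch of block-wise vs. interleaved multimodal RoPE frequency
# assignment. The dimension count and round-robin rule are assumptions for
# demonstration, not the Qwen3-VL report's actual scheme.

def blockwise_axes(num_freqs: int) -> list[str]:
    """Earlier MRoPE style: contiguous frequency blocks per axis."""
    third = num_freqs // 3
    return ["t"] * third + ["h"] * third + ["w"] * (num_freqs - 2 * third)

def interleaved_axes(num_freqs: int) -> list[str]:
    """Interleaved style: time, height, and width alternate across the
    spectrum, so each axis sees both high and low rotary frequencies."""
    axes = ("t", "h", "w")
    return [axes[i % 3] for i in range(num_freqs)]

print(blockwise_axes(12))    # ['t', 't', 't', 't', 'h', 'h', 'h', 'h', 'w', ...]
print(interleaved_axes(12))  # ['t', 'h', 'w', 't', 'h', 'w', ...]
```

Since rotary frequencies decrease along the dimension index, a block layout pins the temporal axis to one end of the spectrum; interleaving gives every axis a mix of high and low frequencies, which is presumably what improves positional handling over hours of video.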
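DeepStack's multi-level access can be sketched as routing features from several vision-encoder depths into matching language-model layers, rather than feeding only the final visual features in at the input. The following PyTorch toy uses hypothetical shapes and layer pairings; it shows the injection pattern, not the report's actual implementation.

```python
# Hypothetical sketch of DeepStack-style multi-level feature injection.
# Dimensions, the one-level-per-layer pairing, and the linear projections
# are assumptions for illustration.
import torch
import torch.nn as nn

class DeepStackSketch(nn.Module):
    """Project features tapped from several ViT depths and add each one to
    the hidden states of an early LLM layer at the visual token positions."""

    def __init__(self, vit_dim: int = 1024, llm_dim: int = 2048, levels: int = 3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(vit_dim, llm_dim) for _ in range(levels))

    def forward(self, llm_hidden: list[torch.Tensor],
                vit_levels: list[torch.Tensor], vis_pos: slice) -> list[torch.Tensor]:
        # llm_hidden: per-layer hidden states, each [batch, seq, llm_dim]
        # vit_levels: features from `levels` ViT depths, each [batch, n_vis, vit_dim]
        # vis_pos: where the visual tokens sit in the token sequence
        for hidden, proj, feats in zip(llm_hidden, self.proj, vit_levels):
            hidden[:, vis_pos, :] = hidden[:, vis_pos, :] + proj(feats)
        return llm_hidden

# Toy usage with random tensors:
batch, n_vis, seq = 1, 16, 64
hidden = [torch.zeros(batch, seq, 2048) for _ in range(3)]
vit_feats = [torch.randn(batch, n_vis, 1024) for _ in range(3)]
out = DeepStackSketch()(hidden, vit_feats, slice(0, n_vis))
print(out[0].shape)  # torch.Size([1, 64, 2048])
```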
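The move from T-RoPE to text-based timestamps means temporal position travels as ordinary tokens the language model already knows how to read. A sketch of the general idea follows; the "<frame>" placeholder and the timestamp format are made up for illustration and do not match the report's actual token layout.

```python
# Sketch: interleave human-readable timestamps with per-frame placeholders so
# the model reads time as plain text. Format and marker are assumptions.

def timestamped_prompt(num_frames: int, seconds_per_frame: float) -> str:
    parts = []
    for i in range(num_frames):
        total = int(i * seconds_per_frame)
        h, rem = divmod(total, 3600)
        m, s = divmod(rem, 60)
        parts.append(f"{h}:{m:02d}:{s:02d} <frame>")
    return "\n".join(parts)

print(timestamped_prompt(num_frames=4, seconds_per_frame=2.0))
# 0:00:00 <frame>
# 0:00:02 <frame>
# 0:00:04 <frame>
# 0:00:06 <frame>
```

Answering a grounding question such as "when does the car first appear?" then reduces to generating a timestamp string, which is why dropping the specialized T-RoPE machinery can simplify time-based video analysis.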