Alibaba's Qwen3.7-Max Ran Autonomously for 35 Hours on Unfamiliar Hardware. It Still Kept Getting Better. - Firethering
By Mohit Geryani, May 25, 2026
Last updated: May 25, 2026
Alibaba gave Qwen3.7-Max a kernel optimization task on a hardware platform the model had never encountered before. No documentation, no profiling data, no example kernels for the architecture. Just a task description, an existing implementation, and an evaluation script.

The model ran for 35 hours and made 1,158 tool calls. It wrote, compiled, profiled, and rewrote the kernel repeatedly, diagnosing failures, fixing bugs, identifying bottlenecks, and redesigning the architecture multiple times without anyone watching. After 30 hours it was still finding meaningful improvements. The final result was a 10x speedup over the reference implementation.

For context: GLM 5.1 ran the same task and reached 7.3x. Kimi K2.6 reached 5x. DeepSeek V4 Pro reached 3.3x. The models that stopped early did so because they issued no tool calls for five consecutive rounds; they concluded they couldn’t make further progress and stopped. Qwen3.7-Max didn’t stop.

What the task actually was

The kernel in question is Extend Attention, a production component in SGLang, a widely used inference framework. It handles attention between newly generated tokens and a prefix KV-cache of up to 32K entries, a memory-bound, latency-critical operation that directly affects how fast LLMs serve responses.

The hardware was T-Head ZW-M890 PPUs, a processor architecture that Alibaba says was absent from the model’s training data. The model had no prior knowledge of how it behaved. It started cold.

Over 35 hours it performed 432 kernel evaluations. Each cycle meant writing code, compiling it, running it, reading the profiling output, deciding what to change, and trying again.
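To put the run's scale in perspective, the three figures quoted above (35 hours, 1,158 tool calls, 432 kernel evaluations) can be turned into rough averages. This is a back-of-the-envelope sketch: the function and variable names are illustrative, and the article does not say how the work was distributed over time, so these are means only.

```python
# Figures quoted in the article for the 35-hour run.
RUN_HOURS = 35
TOOL_CALLS = 1158
KERNEL_EVALUATIONS = 432


def run_averages(hours, tool_calls, evaluations):
    """Return simple averages; they say nothing about how the work was spread out."""
    minutes = hours * 60
    return {
        "minutes_per_evaluation": minutes / evaluations,
        "tool_calls_per_evaluation": tool_calls / evaluations,
        "tool_calls_per_hour": tool_calls / hours,
    }


averages = run_averages(RUN_HOURS, TOOL_CALLS, KERNEL_EVALUATIONS)
for name, value in averages.items():
    print(f"{name}: {value:.2f}")
```

On average that is roughly one kernel evaluation every five minutes, sustained for a day and a half.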
The model diagnosed compilation failures it hadn’t seen before, identified performance bottlenecks through runtime feedback rather than prior knowledge, and redesigned the kernel architecture multiple times when incremental improvements stopped working.

This matters because it tests something different from standard benchmarks. Most evaluations measure whether a model can produce a correct answer given a well-defined problem. This one measured whether a model could sustain a coherent strategy across more than a thousand tool calls on an open-ended optimization problem with no human guidance. Those are different skills, and most models don’t have the second one.

What the benchmarks show

The numbers below are from Alibaba’s own evaluation, as published on the Qwen Blog.

Benchmark            Qwen3.7-Max   Claude Opus 4.6   DeepSeek V4 Pro
SWE-Verified         80.4          80.8              80.6
Terminal Bench 2.0   69.7          65.4              67.9
GPQA Diamond         92.4          91.3              90.1
HLE                  41.4          40.0              37.7
HMMT 2026 Feb        97.1          96.2              95.2
BFCL-V4              75.0          76.7              70.6

On coding-agent benchmarks it trades blows with Opus 4.6 and DeepSeek V4 Pro rather than clearly beating either. Terminal Bench 2.0 is the exception, where it leads. The reasoning numbers are where the gap opens up more consistently: GPQA Diamond, HLE, and HMMT all show Qwen3.7-Max ahead of both comparison models. The remaining benchmarks give a clearer idea of whether it is the right model for your use case.

Why this model trains differently

Most models get better by seeing more text. Qwen3.7-Max got better by seeing more situations.

Alibaba calls it environment scaling. Instead of optimizing for specific benchmarks, the team built a large and diverse set of agentic training environments (different tasks, different tools, different harnesses) and trained the model across all of them. The idea is the same as the reason a model trained on diverse text generalizes better than one trained on narrow text: diversity of experience produces capability that transfers.
The practical result is cross-harness generalization. Qwen3.7-Max performs consistently whether it’s running through Claude Code, Qwen Code, or a custom tool-use framework. It learned to solve problems rather than the patterns of a specific scaffold. That’s rarer than it sounds: most agentic models quietly overfit to the evaluation setup they were trained on.

Limitations

It’s a proprietary API model. No open weights, no local deployment, no self-hosting. For teams with data privacy requirements or anyone who wants to run models on their own infrastructure, that’s a hard stop regardless of the benchmark numbers.

The instruction-following gap is also real. IFBench at 79.1 is strong but lower than some competitors on complex multi-step instruction adherence. For workflows that require strict formatting or a precise output structure across long sessions, that’s worth testing before committing.

And the benchmark table is entirely self-reported. Alibaba ran these evaluations. Independent reproduction is needed to complete the picture.

Who is this for?

A model that ran autonomously for 35 hours on hardware it had never seen, kept improving past the 30-hour mark, and finished with a kernel 10x faster than the reference implementation is not a normal result. The benchmark numbers are competitive with the best available models. The kernel run is in a different category of evidence entirely.

If you’re building agentic workflows and can work within a proprietary API, this is worth serious evaluation. If open weights are a requirement, it isn’t an option yet. That’s the whole story, honestly told.
Alibaba ran an autonomous evaluation of Qwen3.7-Max to test whether it could optimize a complex kernel on unfamiliar hardware, and the run yielded significant performance improvements. The model was instructed to optimize the Extend Attention kernel, a latency-critical, memory-bound operation within the SGLang inference framework that handles attention between newly generated tokens and a prefix KV-cache of up to 32K entries. The hardware used for the test, T-Head ZW-M890 PPUs, was a platform outside the model's training data, so the model started with no knowledge of the system's behavior.
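For readers unfamiliar with the operation, the following is a naive, single-head reference of what attention between new tokens and a prefix KV-cache computes. It is a sketch of the general idea in pure Python, not SGLang's kernel and not the code the model wrote. The function names are illustrative, and the choice to let each new token also attend causally to the new tokens before it is an assumption about the usual formulation rather than something stated in the article.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def extend_attention(new_q, new_k, new_v, prefix_k, prefix_v):
    """Attend each new token to the cached prefix plus new tokens up to itself."""
    scale = 1.0 / math.sqrt(len(new_q[0]))
    outputs = []
    for i, q in enumerate(new_q):
        # Assumption: causal masking among the new tokens themselves.
        keys = prefix_k + new_k[: i + 1]
        values = prefix_v + new_v[: i + 1]
        weights = softmax([dot(q, k) * scale for k in keys])
        width = len(values[0])
        outputs.append(
            [sum(w * v[d] for w, v in zip(weights, values)) for d in range(width)]
        )
    return outputs


# Toy input: zero queries give uniform weights, so each output is a plain mean.
prefix_k = [[1.0, 0.0], [0.0, 1.0]]
prefix_v = [[1.0], [3.0]]
new_q = [[0.0, 0.0], [0.0, 0.0]]
new_k = [[1.0, 1.0], [2.0, 0.0]]
new_v = [[5.0], [7.0]]
result = extend_attention(new_q, new_k, new_v, prefix_k, prefix_v)
```

Every new token has to read the whole cached prefix, which is why the operation is dominated by memory traffic once the prefix grows toward tens of thousands of entries.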
Over a duration of 35 hours, the model autonomously performed extensive iterative work, executing 432 kernel evaluations. This process required the model to repeatedly write code, compile it, run it, analyze profiling outputs, diagnose unforeseen compilation failures, identify runtime performance bottlenecks without prior knowledge, and subsequently redesign the kernel architecture multiple times as incremental improvements stalled. This capacity to self-correct and redesign architecture without human guidance is highlighted as testing a capability beyond standard benchmark performance, focusing instead on sustained, coherent strategy execution across an open-ended optimization problem.
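The cycle described above can be sketched as a loop that evaluates one candidate at a time, survives failures, and remembers the best speedup so far. This is a simplified illustration, not Alibaba's harness: `optimize`, `fake_evaluate`, the candidate names, and every latency value are hypothetical and exist only to make the example runnable.

```python
def optimize(reference_latency_ms, candidates, evaluate):
    """Evaluate candidates in order and track the best speedup over the reference.

    `evaluate` is a hypothetical callback: it returns a latency in milliseconds,
    or raises RuntimeError when a candidate fails to compile or run.
    """
    best_name, best_speedup = None, 1.0
    history = []
    for name in candidates:
        try:
            latency_ms = evaluate(name)
        except RuntimeError as error:
            # A failed build is recorded and the loop moves on.
            history.append((name, f"failed: {error}"))
            continue
        speedup = reference_latency_ms / latency_ms
        history.append((name, f"{speedup:.1f}x"))
        if speedup > best_speedup:
            best_name, best_speedup = name, speedup
    return best_name, best_speedup, history


# Made-up latencies for illustration only; not measurements from the run.
FAKE_LATENCIES_MS = {"v1_tiled": 50.0, "v3_vectorized": 20.0, "v4_redesign": 10.0}


def fake_evaluate(name):
    """Stand-in for compile + run + profile; unknown candidates fail to build."""
    if name not in FAKE_LATENCIES_MS:
        raise RuntimeError("compile error")
    return FAKE_LATENCIES_MS[name]


best_name, best_speedup, history = optimize(
    100.0, ["v1_tiled", "v2_broken", "v3_vectorized", "v4_redesign"], fake_evaluate
)
print(best_name, best_speedup)
```

What the sketch leaves out is the hard part: in the real run the model itself decided what the next candidate should be, based on the profiling output from the previous one.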
The optimization produced a 10x speedup over the reference implementation. Other models given the same task fell short of that: GLM 5.1 reached 7.3x, Kimi K2.6 reached 5x, and DeepSeek V4 Pro reached 3.3x. A key distinction was the stopping criterion: a run ended once a model issued no tool calls for five consecutive rounds, which is how the other models' runs stopped early, whereas Qwen3.7-Max kept working, suggesting a greater capacity for sustained, autonomous problem-solving.
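The stopping rule is simple enough to state in code. The five-round threshold comes from the article; the function name, the representation of a run as a list of per-round tool-call counts, and the sample data are assumptions made for illustration.

```python
def rounds_until_stop(tool_calls_per_round, patience=5):
    """Return the 1-based round at which the run ends, or None if it never does.

    A run ends once `patience` consecutive rounds contain zero tool calls.
    """
    idle = 0
    for round_number, calls in enumerate(tool_calls_per_round, start=1):
        idle = idle + 1 if calls == 0 else 0
        if idle >= patience:
            return round_number
    return None


# Hypothetical traces: one model goes quiet, the other keeps issuing calls.
gave_up = rounds_until_stop([3, 0, 0, 0, 0, 0, 2])
kept_going = rounds_until_stop([1, 0, 0, 0, 0, 1, 0])
print(gave_up, kept_going)
```

Under this rule a single tool call resets the counter, so a model only ends its run by staying idle for the full five rounds.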
Alibaba calls its training approach environment scaling: the model improved by being trained across a diverse set of agentic environments, spanning various tasks, tools, and harnesses, rather than through narrow benchmark optimization. The approach aims to foster cross-harness generalization, so that the model learns to solve problems in general rather than the patterns of a single evaluation setup, reducing the overfitting to a training scaffold that often affects agentic models.
Compared with other models across the reported benchmarks, Qwen3.7-Max is competitive without being clearly superior everywhere. It is strongest on reasoning benchmarks such as GPQA Diamond, HLE, and HMMT 2026 Feb, where it scores above both Claude Opus 4.6 and DeepSeek V4 Pro. On coding-agent benchmarks it is roughly level with Opus 4.6 and DeepSeek V4 Pro, with a lead on Terminal Bench 2.0.
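The comparison can be checked directly against the article's benchmark table. The sketch below copies those self-reported scores and computes, for each benchmark, which model leads and by how many points; the helper name `leaders` is illustrative.

```python
# Self-reported scores from the article's benchmark table (Alibaba's evaluation).
SCORES = {
    "SWE-Verified": {"Qwen3.7-Max": 80.4, "Claude Opus 4.6": 80.8, "DeepSeek V4 Pro": 80.6},
    "Terminal Bench 2.0": {"Qwen3.7-Max": 69.7, "Claude Opus 4.6": 65.4, "DeepSeek V4 Pro": 67.9},
    "GPQA Diamond": {"Qwen3.7-Max": 92.4, "Claude Opus 4.6": 91.3, "DeepSeek V4 Pro": 90.1},
    "HLE": {"Qwen3.7-Max": 41.4, "Claude Opus 4.6": 40.0, "DeepSeek V4 Pro": 37.7},
    "HMMT 2026 Feb": {"Qwen3.7-Max": 97.1, "Claude Opus 4.6": 96.2, "DeepSeek V4 Pro": 95.2},
    "BFCL-V4": {"Qwen3.7-Max": 75.0, "Claude Opus 4.6": 76.7, "DeepSeek V4 Pro": 70.6},
}


def leaders(scores):
    """Map each benchmark to (leading model, margin over the runner-up)."""
    result = {}
    for benchmark, row in scores.items():
        leader = max(row, key=row.get)
        runner_up = max(value for model, value in row.items() if model != leader)
        result[benchmark] = (leader, round(row[leader] - runner_up, 1))
    return result


summary = leaders(SCORES)
qwen_wins = sum(1 for leader, _ in summary.values() if leader == "Qwen3.7-Max")
for benchmark, (leader, margin) in summary.items():
    print(f"{benchmark}: {leader} (+{margin})")
```

Qwen3.7-Max leads four of the six rows, and no margin in either direction exceeds two points, which is why independent reproduction matters for a table this close.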
Despite these autonomous results, several limitations must be acknowledged. The model is available only through a proprietary API: there are no open weights, which rules out self-hosting and makes it unsuitable for organizations with strict data privacy requirements. Gaps also remain in instruction following for complex, multi-step workflows. The IFBench score of 79.1 is strong but below some competitors, so strict adherence to specific formatting across extended sessions should be tested before committing. Finally, the benchmark data is self-reported by Alibaba, and independent reproduction is needed to validate it. The model is therefore best suited to agentic workflows where a proprietary API is acceptable and these limits on transparency and instruction adherence are understood.