LmCast :: Stay tuned in

The Math on AI Agents Doesn’t Add Up

Recorded: Jan. 24, 2026, 11:01 a.m.

Original

The Math on AI Agents Doesn’t Add Up
Steven Levy | Business | Jan 23, 2026 11:00 AM
A research paper suggests AI agents are mathematically doomed to fail. The industry doesn’t agree.
Photo-Illustration: WIRED Staff; Getty Images

The big AI companies promised us that 2025 would be “the year of the AI agents.” It turned out to be the year of talking about AI agents, and kicking the can for that transformational moment to 2026 or maybe later. But what if the answer to the question “When will our lives be fully automated by generative AI robots that perform our tasks for us and basically run the world?” is, like that New Yorker cartoon, “How about never?”

That was basically the message of a paper published without much fanfare some months ago, smack in the middle of the overhyped year of “agentic AI.” Entitled “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models,” it purports to mathematically show that “LLMs are incapable of carrying out computational and agentic tasks beyond a certain complexity.” Though the science is beyond me, the authors—a former SAP CTO who studied AI under one of the field’s founding intellects, John McCarthy, and his teenage prodigy son—punctured the vision of agentic paradise with the certainty of mathematics. Even reasoning models that go beyond the pure word-prediction process of LLMs, they say, won’t fix the problem.

“There is no way they can be reliable,” Vishal Sikka, the dad, tells me. After a career that, in addition to SAP, included a stint as Infosys CEO and an Oracle board member, he currently heads an AI services startup called Vianai. “So we should forget about AI agents running nuclear power plants?” I ask. “Exactly,” he says. Maybe you can get it to file some papers or something to save time, but you might have to resign yourself to some mistakes.

The AI industry begs to differ. For one thing, a big success in agent AI has been coding, which took off last year. Just this week at Davos, Google’s Nobel-winning head of AI, Demis Hassabis, reported breakthroughs in minimizing hallucinations, and hyperscalers and startups alike are pushing the agent narrative. Now they have some backup. A startup called Harmonic is reporting a breakthrough in AI coding that also hinges on mathematics—and tops benchmarks on reliability.

Harmonic, which was cofounded by Robinhood CEO Vlad Tenev and Tudor Achim, a Stanford-trained mathematician, claims this recent improvement to its product called Aristotle (no hubris there!) is an indication that there are ways to guarantee the trustworthiness of AI systems. “Are we doomed to be in a world where AI just generates slop and humans can't really check it? That would be a crazy world,” says Achim. Harmonic’s solution is to use formal methods of mathematical reasoning to verify an LLM’s output. Specifically, it encodes outputs in the Lean programming language, which is known for its ability to verify the coding. To be sure, Harmonic’s focus to date has been narrow—its key mission is the pursuit of “mathematical superintelligence,” and coding is a somewhat organic extension. Things like history essays—which can’t be mathematically verified—are beyond its boundaries. For now.

Nonetheless, Achim doesn’t seem to think that reliable agentic behavior is as much an issue as some critics believe. “I would say that most models at this point have the level of pure intelligence required to reason through booking a travel itinerary,” he says.

Both sides are right—or maybe even on the same side. On one hand, everyone agrees that hallucinations will continue to be a vexing reality. In a paper published last September, OpenAI scientists wrote, “Despite significant progress, hallucinations continue to plague the field, and are still present in the latest models.” They proved that unhappy claim by asking three models, including ChatGPT, to provide the title of the lead author’s dissertation. All three made up fake titles and all misreported the year of publication. In a blog about the paper, OpenAI glumly stated that in AI models, “accuracy will never reach 100 percent.”

Right now, those inaccuracies are serious enough to discourage the wide adoption of agents in the corporate world. “The value has not been delivered,” says Himanshu Tyagi, cofounder of an open source AI company called Sentient. He points out that dealing with hallucinations can disrupt an entire work flow, negating much of the value of an agent.

Yet the big AI powers and many startups believe these inaccuracies can be dealt with. The key to coexisting with hallucinations, they say, is creating guardrails that filter out the imaginative bullshit that LLMs love to produce. Even Sikka thinks that this is a probable outcome. “Our paper is saying that a pure LLM has this inherent limitation—but at the same time it is true that you can build components around LLMs that overcome those limitations,” he says.

Achim, the mathematical verification guy, agrees that hallucinations will always be around—but considers this a feature, not a bug. “I think hallucinations are intrinsic to LLMs and also necessary for going beyond human intelligence,” he says. “The way that systems learn is by hallucinating something. It’s often wrong, but sometimes it's something that no human has ever thought before.”

The bottom line is that like generative AI itself, agentic AI is both impossible and inevitable at the same time. There may not be a specific annum that will be looked back upon as “the year of the agent.” But hallucinations or not, every year from now on is going to be “the year of more agents,” as the delta between guardrails and hallucinations narrows. The industry has too much at stake not to make this happen. The tasks that agents perform will always require some degree of verification—and of course people will get sloppy and we’ll suffer small and large disasters—but eventually agents will match or surpass the reliability of human beings, while being faster and cheaper.

At that point, some bigger questions arise. One person I contacted to discuss the hallucination paper was computer pioneer Alan Kay, who is friendly with Sikka. His view is that “their argument was posed well enough to get comments from real computational theorists.” (A statement reminiscent of his 1984 take on the Macintosh as “the first personal computer good enough to be criticized.”) But ultimately, he says, the mathematical question is beside the point. Instead, he suggests people consider the issue in light of Marshall McLuhan’s famous “Medium is the message” dictum. “Don’t ask whether something is good or bad, right or wrong,” he paraphrases. “Find out what is going on.”

Here’s what’s going on: We may well be on the cusp of a massive automation of human cognitive activity. It’s an open question whether this will improve the quality of our work and our lives. I suspect that the ultimate assessment of that will not be mathematically verifiable.

This is an edition of Steven Levy’s Backchannel newsletter. Read previous newsletters here.

Summarized

The core argument of the research paper “Hallucination Stations: On Some Basic Limitations of Transformer-Based Language Models” is that AI agents built on LLMs face a fundamental limit on the complexity of the computational tasks they can carry out. The authors, former SAP CTO Vishal Sikka and his teenage son, argue that transformer-based language models are inherently incapable of performing tasks beyond a certain level of complexity, which would make the promised future of agentic AI effectively impossible. Their central claim is that even reasoning models that go beyond pure word prediction will remain unable to reliably perform tasks requiring genuine computation.

The authors frame this limitation mathematically: not as a shortcoming of today's technology, but as a constraint inherent to the transformer architecture itself. That view contrasts sharply with the prevailing optimism in the AI industry, particularly after the hype surrounding “agentic AI” in 2025. While companies from Google, whose AI chief Demis Hassabis reported breakthroughs in minimizing hallucinations at Davos, to startups like Harmonic continue to push the agent narrative, the paper's findings serve as a sobering counterpoint.
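The article does not reproduce the paper's proofs, but one way to see why a fixed transformer could hit a complexity ceiling is a simple counting argument. The sketch below illustrates that style of reasoning only; it is not the authors' actual derivation, and the symbols L, n, d, and m (layers, context length, model width, output length) are introduced here for illustration.

```latex
% Illustrative counting argument (not the paper's derivation).
% A transformer with $L$ layers, context length $n$, and width $d$ spends
% roughly $O\!\left(L\,(n^{2}d + n d^{2})\right)$ operations to emit one token,
% so producing an $m$-token answer costs on the order of
\[
  \underbrace{O\!\left(L\,(n^{2}d + n d^{2})\right)}_{\text{cost per token}} \;\times\; m .
\]
% If computing a task's answer exactly requires asymptotically more work than
% this fixed budget, the model cannot be performing that computation; it can
% only emit tokens that look like an answer, which is where hallucination enters.
```

On this reading, letting a reasoning model spend more tokens enlarges the budget but does not make it unbounded, which is consistent with the authors' claim that reasoning models do not escape the limit.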

Harmonic, cofounded by Robinhood CEO Vlad Tenev and Stanford-trained mathematician Tudor Achim, offers one approach to working around these limits. Its product, Aristotle, uses formal methods of mathematical reasoning to verify an LLM's output, encoding results in the Lean programming language, a proof assistant whose kernel either accepts a proof or rejects it. The idea is to bridge the gap between LLMs and reliable computation by adding a rigorous, mechanically checked layer on top of the model. Harmonic's focus to date remains narrow: its stated mission is “mathematical superintelligence,” with coding as an organic extension, and output that cannot be mathematically verified, such as history essays, is beyond its scope for now.
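To make the verification idea concrete, here is a minimal Lean 4 sketch of the general pattern. It is a hypothetical toy, not Harmonic's Aristotle pipeline, and the theorem name is invented for illustration.

```lean
-- Minimal illustration of machine-checked output (hypothetical example,
-- not Harmonic's actual system). A claim is stated as a theorem; Lean's
-- kernel accepts it only if the supplied proof really establishes it.
theorem proposed_claim (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- A hallucinated claim, e.g. `theorem bad (a b : Nat) : a + b = a * b`,
-- has no proof, so Lean rejects any attempt to close it: at this layer
-- there is no such thing as "mostly correct" output.
```

The catch the article points out is that this only works for claims that can be stated formally in the first place; a history essay has no such encoding.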

Hallucinations, the inaccuracies and fabrications LLMs produce, remain the most visible impediment to agentic AI. OpenAI's own research concedes the point: a paper its scientists published last September states that “hallucinations continue to plague the field” despite significant progress, and as a demonstration the authors asked three models, including ChatGPT, for the title of the lead author's dissertation; all three invented titles and misreported the year of publication. In an accompanying blog post, OpenAI stated that in AI models “accuracy will never reach 100 percent.” Formal verification of the kind Harmonic pursues helps only where output can be checked mathematically, so it is not a complete solution on its own.

The debate between the skeptical researchers and the AI industry is less a disagreement than a necessary tension. Both sides accept that hallucinations are here to stay; they differ on the remedy. The industry's answer is guardrails: components built around the model that catch and filter its fabrications before they cause harm. Even Sikka concedes the point, noting that while a pure LLM has this inherent limitation, “you can build components around LLMs that overcome those limitations.” Achim goes further, calling hallucination intrinsic to LLMs and even necessary for producing ideas no human has had before.
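As a sketch of what such a guardrail loop might look like, assuming hypothetical `propose` and `verify` callables rather than any vendor's actual API, the pattern is: never act on model output until an external, deterministic check has accepted it, and escalate to a human when verification keeps failing.

```python
from typing import Callable, Optional, Tuple

def run_with_guardrail(
    propose: Callable[[str], str],                   # untrusted generator (e.g. an LLM call)
    verify: Callable[[str, str], Tuple[bool, str]],  # deterministic, task-specific check
    task: str,
    max_attempts: int = 3,
) -> Optional[str]:
    """Release model output only if an external check accepts it."""
    for _ in range(max_attempts):
        draft = propose(task)
        ok, reason = verify(task, draft)
        if ok:
            return draft                             # verified: safe to act on
        task = f"{task}\n(previous attempt rejected: {reason})"
    return None                                      # give up and escalate to a human

# Toy usage with stand-in functions; no real model is involved.
if __name__ == "__main__":
    propose = lambda prompt: "4" if "2 + 2" in prompt else "probably 5"
    verify = lambda prompt, draft: (draft.strip() == "4", "expected the arithmetic result 4")
    print(run_with_guardrail(propose, verify, "What is 2 + 2?"))  # prints "4"
```

The design point is that `verify` lives outside the model, echoing Sikka's framing of components built around LLMs rather than fixes inside them.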

Ultimately, the paper's central argument, that the architecture of LLMs makes them inherently incapable of complex, reliable reasoning, is a nuanced and potentially prescient one. The technology keeps evolving, but the limits the paper identifies have not gone away, and the industry's agentic ambitions arguably still run ahead of what the underlying models can reliably deliver. The long-term future of dependable agents hinges on whether guardrails and mathematical verification can genuinely overcome these limitations, or whether, as Levy puts it, agentic AI remains both impossible and inevitable at the same time, with every year from now on simply “the year of more agents.”