Amazon’s bet that AI benchmarks don’t matter
Recorded: Dec. 3, 2025, 2:02 a.m.
While OpenAI, Anthropic, and Google trade blows at the top of the charts, Amazon is asking everyone to look somewhere else.

by Alex Heath, a contributing writer and author of the Sources newsletter
Dec 2, 2025, 10:00 PM UTC

[Photo: Rohit Prasad, Amazon’s SVP of AGI. Credit: Getty]

This is an excerpt of Sources by Alex Heath, a newsletter about AI and the tech industry, syndicated just for The Verge subscribers once a week.

Amazon’s AI chief has a message for the model benchmark obsessives: stop looking at the leaderboards.

“I want real-world utility. None of these benchmarks are real,” Rohit Prasad, Amazon’s SVP of AGI, told me ahead of today’s announcements at AWS re:Invent in Las Vegas. “The only way to do real benchmarking is if everyone conforms to the same training data and the evals are completely held out. That’s not what’s happening. The evals are frankly getting noisy, and they’re not showing the real power of these models.”

It’s a contrarian stance when every other AI lab is quick to boast about how fast its new models climb the leaderboards. It’s also convenient for Amazon, given that the previous version of Nova, its flagship model, was sitting at spot 79 on LMArena when Prasad and I spoke last week. Still, dismissing benchmarks only works if Amazon can offer a different story about what progress looks like.

The centerpiece of today’s re:Invent announcements is Nova Forge, a service that Amazon claims lets companies train custom AI models in ways previously impossible without spending billions of dollars. The problem Forge addresses is real.
Most companies trying to customize AI models face three bad options: fine-tune a closed model (but only at the edges), train on open-weight models (but without the original training data, and risking capability regression, where the AI becomes an expert on new data but forgets its original, broader skills), or build a model from scratch at enormous cost.

Forge offers something else: access to Amazon’s Nova model checkpoints at the pre-training, mid-training, and post-training stages. Companies can inject their proprietary data early in the process, when the model’s “learning capacity is highest,” as Prasad put it, rather than just tweaking model behavior at the end.

“What we have done is democratize AI and frontier model development for your use cases at fractions of what it would cost [before],” Prasad said. Forge was created because Amazon’s internal teams wanted a tool to inject their domain expertise into a base model without having to build from scratch.

“We built Forge because our internal teams wanted Forge,” he said. It’s a familiar Amazon pattern: AWS itself famously began as infrastructure built for Amazon’s own retail operation before becoming the company’s profit engine.

Reddit has been using Forge to build custom safety models trained on 23 years of community moderation data. “I haven’t seen anything like it yet,” Chris Slowe, Reddit’s CTO and first employee, told me. “We’ve had a distinguished engineer who’s just been like a kid in the candy shop.”

Slowe said Reddit ran a continued pre-training job last week that’s “looking really promising.” The goal: replace multiple bespoke safety models with a single Reddit-expert model that understands the nuances of community moderation, including the notoriously subjective rule that appears across subreddits everywhere: “Don’t be a jerk.”

“Having an expert model, it’s going to understand the community,” Slowe said. “It’s gonna have a pretty good notion of what jerk means.”

He explained that Forge lets Reddit control its models, avoid surprises from API changes, retain ownership of its weights, and avoid sending sensitive data to third-party model providers. He said Reddit is already exploring the same approach for Reddit Answers and other products.

When I asked Slowe whether it mattered that Nova isn’t a top-tier model on benchmarks, he was blunt: “In this context, what matters is the Reddit expertness of the model.” That’s the thread Amazon wants developers to pull on: not raw IQ points, but control and specialization.

With Forge, Amazon is making a calculated bet that the model race has commoditized, and that it can succeed by being the place where companies build specialized AI for specific business problems. It’s a very AWS-shaped view of the world: infrastructure over intelligence, and customization over raw capability. The strategy also lets Amazon sidestep direct comparisons with OpenAI and Anthropic, both of which it once hoped to compete with at the model layer.

Whether Forge is genuinely pioneering or just clever positioning depends, of course, on developer adoption. Amazon insists that the model race, as it’s widely understood, doesn’t matter. If that ends up being true, the scoreboard shifts to something much quieter and harder to game: whether AI models actually deliver real-world utility.