Show HN: Only 1 LLM can fly a drone
Recorded: Jan. 26, 2026, 3 p.m.
Original:
GitHub - kxzk/snapbench: 📸 gotta find 'em all; spatial reasoning benchmark for LLMs
Repository layout (59 commits): assets/, bench/, data/, images/, llm_drone/, src/, .gitignore, Makefile, README.md, build.zig, build.zig.zon

SnapBench

Inspired by Pokémon Snap (1999). A VLM pilots a drone through a 3D world to locate and identify creatures.

[Architecture diagram: the Controller sends a screenshot + prompt to the VLM (OpenRouter) and drives the Simulation (Zig/raylib) with the commands that come back.]

[Demo video: demo_3x.mov]

Gotta catch 'em all? Is this a rigorous benchmark? No. However, it's a reasonably fair comparison: same prompt, same seeds, same iteration limits. I'm sure with enough refinement you could coax better results out of each model. But that's kind of the point: out of the box, with zero hand-holding, only one model figured out how to actually fly.

Gemini Flash actively adjusts altitude, descends to creature level, and identifies. This left me puzzled. Claude Opus is arguably the most capable model in the lineup. It knows it needs to identify creatures. It tries, aggressively. But it never adjusts its approach angle.

In most other runs, Flash found one creature quickly but ran out of iterations searching for the others. The world is big. 50 iterations isn't a lot of time.

I expected Claude Opus 4.5 (the most expensive) to dominate. Instead, the cheapest model beat models costing 10x more. Spatial reasoning doesn't scale with model size, at least not yet. I genuinely don't know why. But if you're building an LLM-powered agent that needs to navigate physical or virtual space, the most expensive model might not be your best choice. The simulation has rough edges (it's a side project, not a polished benchmark suite).

Try it yourself

Requirements: Zig, Rust, Python, uv. You'll also need an OpenRouter API key.

# set your API key
# terminal 2: start the drone controller

Model-specific prompts: tune instructions to each model's strengths.

Attribution: Drone by NateGazzard, CC-BY, via Poly Pizza. Donated to Poly Pizza to support the platform.
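The loop in the architecture diagram above is simple: grab a screenshot, send it with the piloting prompt to the VLM over OpenRouter, apply whatever command comes back, repeat. Below is a minimal Python sketch of how such a loop could be wired up, assuming OpenRouter's OpenAI-compatible chat-completions endpoint; the command vocabulary, the OPENROUTER_API_KEY variable, and the `sim` handle with its `screenshot()`/`apply()` methods are hypothetical stand-ins, not the repo's actual Rust/Zig interfaces.

```python
# Minimal sketch of the screenshot -> prompt -> command loop, assuming an
# OpenAI-compatible OpenRouter endpoint. The command vocabulary and the
# sim object are hypothetical stand-ins for the real controller/simulation.
import base64
import os

import requests

OPENROUTER_URL = "https://openrouter.ai/api/v1/chat/completions"
API_KEY = os.environ["OPENROUTER_API_KEY"]

PROMPT = (
    "You are piloting a drone in a 3D voxel world. Find and identify 3 creatures. "
    "Reply with exactly one command: MOVE <dx> <dz>, ASCEND <dy>, DESCEND <dy>, "
    "TURN <degrees>, or IDENTIFY <creature>."
)


def ask_vlm(model: str, screenshot_png: bytes) -> str:
    """Send the latest screenshot plus the piloting prompt, return the raw command text."""
    image_b64 = base64.b64encode(screenshot_png).decode()
    resp = requests.post(
        OPENROUTER_URL,
        headers={"Authorization": f"Bearer {API_KEY}"},
        json={
            "model": model,
            "messages": [
                {
                    "role": "user",
                    "content": [
                        {"type": "text", "text": PROMPT},
                        {
                            "type": "image_url",
                            "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                        },
                    ],
                }
            ],
        },
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"].strip()


def run_episode(model: str, sim, max_iterations: int = 50) -> int:
    """Drive the (assumed) simulation handle for up to 50 iterations."""
    found = 0
    for _ in range(max_iterations):
        command = ask_vlm(model, sim.screenshot())  # sim API is assumed, not real
        found += sim.apply(command)                 # assume it returns 1 on a correct IDENTIFY
        if found >= 3:
            break
    return found
```

In the actual project the controller is written in Rust and the simulation in Zig/raylib; the sketch only mirrors the screenshot-to-command message flow.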
Languages: Zig, Rust, Python, GLSL, Makefile
Summarized:

This GitHub repository, "snapbench," presents a fascinating, if rough-around-the-edges, experiment in evaluating Large Language Models (LLMs) on spatial reasoning and navigation. The concept is inspired by the classic game Pokémon Snap (1999): a vision-language model (VLM) pilots a drone to locate and identify creatures in a 3D environment. Rather than a polished game, snapbench offers a rudimentary benchmark for how well LLMs handle a simple embodied task: finding and identifying three creatures in a procedurally generated voxel world.

The architecture is deliberately simple, with three components: a Rust-based controller, an OpenRouter-hosted VLM, and a Zig/raylib simulation. The controller orchestrates each run and manages communication between the VLM and the simulation; the VLM interprets what it sees and generates commands; the Zig/raylib simulation renders the 3D world and handles drone movement and physics. Using OpenRouter as the interface lets the VLM be treated as a fairly standard LLM endpoint, which makes comparing different models easy.

The simulation generates a procedural voxel world and spawns four creatures (cats, dogs, pigs, and sheep) for the drone to discover. The core interaction loop has the drone move through the environment, take a screenshot, and receive its next command from the VLM based on that observation. The VLM is prompted to "pilot a drone through a 3D voxel world and find 3 creatures." The objective is to identify three creatures, where an identification only counts if the drone is within 5 units of the target. The benchmark is run via the `uv` command-line tool, which makes manual testing easy.

The experiment produced a notable disparity among the seven frontier LLMs evaluated. Gemini Flash emerged as the sole success, navigating to and identifying the three creatures. The other models consistently failed, with Claude Opus showing particularly puzzling behavior: it aggressively tries to identify creatures but never adjusts its approach angle. This suggests a fundamental difference in how the models approach the task: Flash appears to understand that it must descend to creature level to identify, whereas the others treat the prompt as a flat instruction and never prioritize altitude control. One quirk, a "two-creature anomaly," arose when two creatures randomly spawned close together, letting a model find both at once; given the 50-iteration limit, the coincidence underscores how tight the benchmark's time budget is.

The headline finding is that larger, more expensive models don't necessarily do better at simple embodied tasks. The success of the cheaper Gemini Flash model suggests that, at least in this setting, model size is not the primary driver of success; instruction following and each model's training data likely matter more.
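Going back to the success criterion described above, a minimal sketch of the "within 5 units" identification check might look like the following; the function and variable names are invented for illustration and are not the repo's code.

```python
import math

# Hypothetical success check matching the summary's criterion: an IDENTIFY
# only counts if the drone is within 5 units of the named creature.
IDENTIFY_RADIUS = 5.0


def identify_succeeds(drone_pos, creature_pos, radius: float = IDENTIFY_RADIUS) -> bool:
    """True if the drone is close enough for the identification to count."""
    dx, dy, dz = (d - c for d, c in zip(drone_pos, creature_pos))
    return math.sqrt(dx * dx + dy * dy + dz * dz) <= radius


# e.g. identify_succeeds((10.0, 6.0, -3.0), (12.0, 2.0, -3.0)) -> True (distance ~4.47)
```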
Further work could explore model-specific prompts, richer feedback mechanisms (spatial context such as distance readings and a minimap), and multi-agent configurations. snapbench has rough edges; it is an experimental side project rather than a polished benchmark suite: a single blanket prompt is shared across all models, the simulation's scope is limited, and the iteration budget is short. The results are still informative, and further experimentation could yield useful insights into the capabilities and limits of LLMs at spatial reasoning and navigation. The project credits its drone asset (by NateGazzard, CC-BY, via Poly Pizza). Future iterations could also move toward a more realistic drone benchmark (e.g., a BetaFPV drone instead of a simulator). snapbench may not be a rigorous benchmark, but it is an effective starting point for experimenting with, and understanding, what LLMs can do in embodied and spatial tasks.
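To make the "richer feedback" idea concrete, here is a small, hedged Python sketch of how per-creature distance and bearing telemetry might be appended to the piloting prompt each iteration; the `Reading` structure, its fields, and the observation format are invented for illustration and are not part of the repo.

```python
# Hypothetical prompt augmentation for the "richer feedback" idea: alongside
# the screenshot, give the model a coarse distance and bearing per creature.
from dataclasses import dataclass


@dataclass
class Reading:
    name: str        # e.g. "pig"
    distance: float  # units from the drone
    bearing: float   # degrees clockwise from the drone's current heading


def feedback_block(readings: list[Reading]) -> str:
    """Render telemetry as extra prompt text, nearest creature first."""
    lines = ["Sensor readings (not visible in the screenshot):"]
    for r in sorted(readings, key=lambda r: r.distance):
        lines.append(f"- {r.name}: {r.distance:.1f} units away, bearing {r.bearing:+.0f} deg")
    return "\n".join(lines)


# Appended to the piloting prompt each iteration, this turns a pure-vision
# task into vision plus telemetry.
print(feedback_block([Reading("pig", 23.4, -40.0), Reading("cat", 7.9, 15.0)]))
```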