LmCast :: Stay tuned in

Show HN: Only 1 LLM can fly a drone

Recorded: Jan. 26, 2026, 3 p.m.


GitHub - kxzk/snapbench: 📸 gotta find 'em all; spatial reasoning benchmark for LLMs

SnapBench

Inspired by Pokémon Snap (1999). A VLM pilots a drone through a 3D world to locate and identify creatures.


Architecture

%%{init: {'theme': 'base', 'themeVariables': { 'background': '#ffffff', 'primaryColor': '#ffffff'}}}%%
flowchart LR
subgraph Controller["**Controller** (Rust)"]
C[Orchestration]
end

subgraph VLM["**VLM** (OpenRouter)"]
V[Vision-Language Model]
end

subgraph Simulation["**Simulation** (Zig/raylib)"]
S[Game State]
end

C -->|"screenshot + prompt"| V
C <-->|"cmds + state<br>**UDP:9999**"| S

style Controller fill:#8B5A2B,stroke:#5C3A1A,color:#fff
style VLM fill:#87CEEB,stroke:#5BA3C6,color:#1a1a1a
style Simulation fill:#4A7C23,stroke:#2D5A10,color:#fff
style C fill:#B8864A,stroke:#8B5A2B,color:#fff
style V fill:#B5E0F7,stroke:#87CEEB,color:#1a1a1a
style S fill:#6BA33A,stroke:#4A7C23,color:#fff


Overview
The simulation generates procedural terrain and spawns creatures (cat, dog, pig, sheep) for the drone to discover. It handles drone physics and collision detection, accepting 8 movement commands plus identify and screenshot. The Rust controller captures frames from the simulation, constructs prompts enriched with position and state data, then parses VLM responses into executable command sequences. The objective: locate and successfully identify 3 creatures, where identify succeeds when the drone is within 5 units of a target.
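To make the loop concrete, here is a minimal Python sketch of one controller iteration under stated assumptions: the README only says that commands and state travel over UDP:9999, so the newline-free text wire format, the specific command names, and the ask_vlm helper (a stand-in for the OpenRouter call) are all hypothetical; the real controller is written in Rust.

import socket

SIM_ADDR = ("127.0.0.1", 9999)   # architecture diagram: cmds + state over UDP:9999

# hypothetical command set: 8 movement commands plus identify and screenshot
COMMANDS = {
    "forward", "back", "left", "right", "up", "down", "turn_left", "turn_right",
    "identify", "screenshot",
}

def send_command(sock: socket.socket, cmd: str) -> str:
    """Send one command to the simulation and return its (assumed) text state reply."""
    if cmd not in COMMANDS:
        raise ValueError(f"unknown command: {cmd}")
    sock.sendto(cmd.encode(), SIM_ADDR)
    reply, _ = sock.recvfrom(65536)
    return reply.decode()

def control_step(sock, ask_vlm, history):
    """One iteration: grab a frame, prompt the VLM, execute the commands it returns."""
    state = send_command(sock, "screenshot")               # simulation saves a frame, reports state
    commands = ask_vlm(state=state, recent=history[-5:])   # stand-in for the OpenRouter request
    for cmd in commands:                                   # VLM response parsed into a command sequence
        if cmd in COMMANDS:
            history.append(cmd)
            state = send_command(sock, cmd)
    return state

# usage: sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)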

Demo video: demo_3x.mov

Gotta catch 'em all?
I gave 7 frontier LLMs a simple task: pilot a drone through a 3D voxel world and find 3 creatures.
Only one could do it.

Is this a rigorous benchmark? No. However, it's a reasonably fair comparison - same prompt, same seeds, same iteration limits. I'm sure with enough refinement you could coax better results out of each model. But that's kind of the point: out of the box, with zero hand-holding, only one model figured out how to actually fly.
Why can't Claude look down?
The core differentiator wasn't intelligence - it was altitude control. Creatures sit on the ground. To identify them, you need to descend.

Gemini Flash: Actively adjusts altitude, descends to creature level, identifies
GPT-5.2-chat: Gets close horizontally but never lowers
Claude Opus: Attempts identification 160+ times, never succeeds - approaching at wrong angles
Others: Wander randomly or get stuck

This left me puzzled. Claude Opus is arguably the most capable model in the lineup. It knows it needs to identify creatures. It tries - aggressively. But it never adjusts its approach angle.
The two-creature anomaly
Run 13 (seed 72) was the only run where any model found 2 creatures. Why? They happened to spawn near each other. Gemini Flash found one, turned around, and spotted the second.

In most other runs, Flash found one creature quickly but ran out of iterations searching for the others. The world is big. 50 iterations isn't a lot of time.
Bigger ≠ better
This was the most surprising finding. I expected:

Claude Opus 4.5 (most expensive) to dominate
Gemini 3 Pro to outperform Gemini 3 Flash (same family, more capability)

Instead, the cheapest model beat models costing 10x more.
What's going on here? A few theories:

Spatial reasoning doesn't scale with model size - at least not yet
Flash was trained differently - maybe more robotics data, more embodied scenarios?
Smaller models follow instructions more literally - "go down" means go down, not "consider the optimal trajectory"

I genuinely don't know. But if you're building an LLM-powered agent that needs to navigate physical or virtual space, the most expensive model might not be your best choice.
Color theory, maybe
Anecdotally, creatures with higher contrast (gray sheep, pink pigs) seemed easier to spot than brown-ish creatures that blended into the terrain. A future version might normalize creature visibility. Or maybe that's the point - real-world object detection isn't normalized either.
Prior work
Before this, I tried having LLMs pilot a real DJI Tello drone.
Results: it flew straight up, hit the ceiling, and did donuts until I caught it. (I was using Haiku 4.5, which in hindsight explains a lot.)
The Tello is now broken. I've ordered a BetaFPV and might get another Tello since they're so easy to program. Now that I know Gemini Flash can actually navigate, a real-world follow-up might be worth revisiting.
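The README doesn't say how the Tello was driven. As a rough illustration of why Tellos are so easy to program, the community djitellopy package exposes calls like the ones below; the specific flight here is just an example, not a reconstruction of the original (ceiling-finding) run.

from djitellopy import Tello   # third-party Tello SDK wrapper, not part of this repo

tello = Tello()
tello.connect()
print("battery %:", tello.get_battery())

tello.takeoff()
tello.move_up(30)            # distances are in centimetres; keep them small indoors
tello.rotate_clockwise(90)
tello.move_forward(50)
tello.land()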
Rough edges
This is half-serious research, half "let's see what happens."

The simulation has rough edges (it's a side project, not a polished benchmark suite)
One blanket prompt is used for all models - model-specific tuning would likely improve results
The feedback loop is basic (position, screenshot, recent commands) - there's room to get creative with what information gets passed back (a rough sketch of a prompt in that shape follows this list)
Iteration limits (50) may artificially cap models that are slower but would eventually succeed
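For reference, here is a sketch of what a feedback prompt of that shape could look like. Every field name, the wording, and the compass example are assumptions rather than the repo's actual prompt, and the screenshot would be attached separately as the VLM's image input.

def build_prompt(position, recent_commands, extra=None):
    """Assemble the text half of a basic feedback prompt (the screenshot travels
    separately as the VLM's image input). All wording here is hypothetical."""
    lines = [
        "You are piloting a drone in a 3D voxel world.",
        "Find and identify 3 creatures. identify only succeeds within 5 units of a target.",
        "Current position: x={:.1f}, y={:.1f}, z={:.1f}".format(*position),
        "Recent commands: " + (", ".join(recent_commands) or "none"),
    ]
    if extra:  # room to get creative: compass heading, distance readings, a minimap, ...
        lines += [f"{k}: {v}" for k, v in extra.items()]
    lines.append("Reply with a short sequence of commands, one per line.")
    return "\n".join(lines)

print(build_prompt((12.0, 8.5, -3.0), ["forward", "forward", "down"], extra={"compass": "NE"}))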

Try it yourself
Prerequisites

Tool     Version                Install
Zig      ≥0.15.2                ziglang.org/download
Rust     stable (2024 edition)  rust-lang.org/tools/install
Python   ≥3.11                  python.org
uv       latest                 docs.astral.sh/uv

You'll also need an OpenRouter API key.
Setup
gh repo clone kxzk/snapbench
cd snapbench

# set your API key
export OPENROUTER_API_KEY="sk-or-..."
Running the simulation manually
# terminal 1: start the simulation (with optional seed)
zig build run -Doptimize=ReleaseFast -- 42
# or
make sim

# terminal 2: start the drone controller
cargo run --release --manifest-path llm_drone/Cargo.toml -- --model google/gemini-3-flash-preview
# or
make drone
Running the benchmark suite
# runs all models defined in bench/models.toml
uv run bench/bench_runner.py
# or
make bench
Results get saved to data/run_<id>.csv.
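The README doesn't document the CSV columns, so the following is only a generic sketch for pulling every run into one pandas DataFrame to inspect; pandas is assumed to be available and is not a dependency of the repo itself.

import glob
import pandas as pd

paths = sorted(glob.glob("data/run_*.csv"))
runs = pd.concat(
    (pd.read_csv(p).assign(run=p) for p in paths),  # tag each row with its source file
    ignore_index=True,
)
print(runs.head())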
Where this could go

Model-specific prompts: Tune instructions to each model's strengths
Richer feedback: Pass more spatial context (distance readings, compass, minimap?)
Multi-agent runs: What if you gave each model a drone and made them compete?
Extended iterations: Let slow models run longer to isolate reasoning from speed
Real drone benchmark: Gemini Flash vs. the BetaFPV
Pokémon assets: Found low-poly Pokémon models on Poly Pizza—leaning into the Pokémon Snap inspiration
World improvements: Larger terrain, better visuals, performance optimizations

Attribution

Drone by NateGazzard CC-BY via Poly Pizza
Cube World Kit by Quaternius via Poly Pizza

Donated to Poly Pizza to support the platform.


Summary

This GitHub repository, “snapbench,” presents a fascinating, if rough-around-the-edges, experiment in evaluating large language models (LLMs) on spatial reasoning and navigation. Inspired by Pokémon Snap (1999) on the Nintendo 64, the benchmark has a vision-language model (VLM) pilot a drone through a 3D environment to locate and identify creatures. Rather than a polished game, snapbench is a rudimentary benchmark for how well LLMs handle a simple embodied task: finding and identifying three creatures in a procedurally generated voxel world.

The architecture is deliberately simple, with three components: a Rust-based controller, a VLM reached through OpenRouter, and a Zig/raylib simulation. The controller orchestrates each run, shuttling screenshots and prompts to the VLM and relaying its commands to the simulation. The VLM interprets the scene and produces commands, while the simulation renders the 3D world, handles drone physics, and manages game state. Going through OpenRouter keeps the model interface generic, which makes it easy to swap in and compare different models.

The simulation generates a procedural voxel world and spawns creatures of four types (cat, dog, pig, sheep) for the drone to discover. The core interaction loop has the drone move through the environment, take screenshots, and receive commands from the VLM based on what it sees. Each model's task is to pilot the drone and find 3 creatures; an identification counts only when the drone is within 5 units of the target. The full benchmark suite is driven with the `uv` command-line tool, and the simulation and controller can also be started manually for one-off runs.

The snapbench experiment revealed a stark performance gap among the seven frontier LLMs evaluated. Gemini Flash was the only model that managed the task at all, actively navigating the world and identifying creatures. The others consistently failed, with Claude Opus showing a particularly puzzling pattern: it attempted identification aggressively, more than 160 times, but never adjusted its approach angle or altitude. This points to a basic difference in how the models handle the task: Flash understands that it has to descend to creature level to identify, while the other models never prioritize altitude control. One run stood out as a "two-creature anomaly": two creatures happened to spawn near each other, letting Gemini Flash find both. Given the 50-iteration cap, that coincidence underlines how tight the benchmark's time budget is.

The most surprising finding is that larger, more expensive models did not perform better at this simple embodied task. The cheapest model in the lineup, Gemini 3 Flash, beat models costing ten times more, which suggests that model size isn't the primary driver here and that instruction following, and perhaps training data, matter more. Follow-up work could explore model-specific prompts, richer feedback (spatial context such as distance readings, a compass, or a minimap), and multi-agent configurations.

Snapbench has rough edges; it is an experimental side project rather than a polished benchmark suite. A single blanket prompt is used across all models, the simulation is limited in scope, and the 50-iteration cap is short. The resulting data is still useful, and further experimentation could yield real insight into the capabilities and limits of LLMs in spatial reasoning and navigation. The author also found low-poly Pokémon models on Poly Pizza and plans to lean further into the Pokémon Snap inspiration, and future iterations could include a real-world drone benchmark (for example, with a BetaFPV drone instead of the simulator). Snapbench may not be a rigorous benchmark, but it is an effective starting point for probing what LLMs can do in embodied and spatial tasks.