The Generative Burrito Test
Recorded: Nov. 26, 2025, 1:03 a.m.
Original

This was originally inspired by the horse riding astronaut meme way back in 2023. But I think Simon's Pelican benchmark is what keeps the idea alive for me, even though they are testing different modalities. Burritos are obviously more important than both pelicans and equestrian absurdism.

The Prompt

A partially eaten burrito with cheese, sour cream, guacamole, lettuce, salsa, pinto beans, and chicken.

Model Gallery

SD 1.5
Fast SDXL
Fast Lightning SDXL
Flux Schnell
Flux Dev
Ideogram V2
SD v3.5 Large
SD v3.5 Medium
Flux Pro v1.1 Ultra
Ideogram V2a
Ideogram V3
HiDream I1 Full
HiDream I1 Dev
Imagen4 Preview Fast
Imagen4 Preview
Bagel
Bria 3.2
Qwen Image
Nano Banana 1
Seedream V4
Wan 2.5 Preview
Hunyuan V3
Nano Banana Pro
Flux 2 Flex
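Reproducing the test against any single model takes only a few lines. Here is a minimal sketch using the Hugging Face diffusers library with the SD 1.5 checkpoint; the `runwayml/stable-diffusion-v1-5` model id and the CUDA device are assumptions, and this is not necessarily the harness that produced the gallery above. Any diffusers-compatible text-to-image checkpoint can be swapped in.

```python
import torch
from diffusers import StableDiffusionPipeline

# The exact prompt used by the test.
PROMPT = (
    "A partially eaten burrito with cheese, sour cream, guacamole, "
    "lettuce, salsa, pinto beans, and chicken."
)

# Assumed checkpoint id; substitute any text-to-image model you want to test.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")  # or "mps" / "cpu"

# A fixed seed keeps reruns comparable when swapping checkpoints.
generator = torch.Generator(device="cuda").manual_seed(0)
image = pipe(PROMPT, num_inference_steps=30, generator=generator).images[0]
image.save("burrito_sd15.png")
```

Holding the seed and step count fixed means the gallery entries differ only in the model, which is what the comparison is actually about.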
Summarized

The Generative Burrito Test is a novel and surprisingly insightful benchmark for evaluating image generation models. Originally conceived as a playful response to the “horse riding astronaut” meme trend of 2023, the test endures because it exposes fundamental challenges in generative AI, particularly the nuanced understanding of complex, composite scenes.

The test’s core premise, generating a realistic depiction of a partially eaten burrito laden with various components, is deceptively simple. In practice, the outcomes diverge significantly in quality and fidelity across models, highlighting how unevenly these systems render intricate visual detail and maintain logical consistency within a single image.

The creator was initially taken aback by how difficult the requested image was to produce, attributing this to a lack of parallel examples in the training data. That explanation is plausible given the complexity of the prompt: the challenge lies not simply in representing each ingredient (cheese, sour cream, guacamole, lettuce, salsa, pinto beans, and chicken) but in accurately portraying their spatial relationships, textural variations, and the amalgamation left behind by a few bites. The “smushed and smashed and congealed” state is precisely what models struggle to produce consistently, suggesting an inability to simulate the physical properties of food, specifically the messy, unpredictable look of a half-eaten burrito.

The resulting gallery of images, produced by the more than twenty models listed above, from SD 1.5 through Wan 2.5 Preview, showcases a wide spectrum of results. Some models generate identifiable depictions of the requested elements, but the overall compositions consistently lack the required realism and coherence. Rendering errors such as misplaced ingredients or unnatural layering of textures point to a limited ability to integrate multiple visual components into a stable, believable scene. Image generation, particularly of food imagery, is evidently still maturing and needs further refinement to capture subtle spatial relationships and materiality. The Generative Burrito Test therefore offers a valuable, if unorthodox, metric for the evolving state of image generation models, and a prompt for continued research toward stronger visual realism and compositional understanding.