25 Years of Eggs
Published: February 23, 2026

Everyone needs a rewarding hobby. I’ve been scanning all of my receipts since 2001. I never typed in a single price - just kept the images. I figured someday the technology to read them would catch up, and the data would be interesting.

This year I tested it. Two AI coding agents, 11,345 receipts. I started with eggs. If you can track one item across 25 years of garbled thermal prints, OCR failures, and folder typos, you can track anything.

14 days. 1.6 billion tokens. 589 egg receipts found. Here’s what the data says.
Here’s what I typed on Day 1:
Ok so let’s make a project plan. In the ~/Records/ we have a ton of receipts. Many are pdf/image/etc. I want to go through and extract the actual content of the receipts to find how much we spend on eggs. Receipts are notoriously terrible to OCR, so we might need to do something more advanced.
Codex explored my file system, found two existing SQLite databases I’d forgotten about, discovered 11,345 receipts across PDFs, emails, and images, and came back with a project plan. I said “write this out to a plan.md please.” It did. We were building within the hour. The whole thing took 14 days. Maybe 15 hours of me actually at the keyboard - short bursts of direction-giving separated by long stretches of the agents just running. Codex ran 15 interactive sessions. Claude handled 10.
Shades of White

The oldest receipts were flatbed scans - multiple receipts per page, random orientations, white paper on a white scanner bed. Codex and I tried seven classical CV approaches to find receipt boundaries. Edge detection. Adaptive thresholding. Contour analysis. Morphological operations. Watershed segmentation. Template matching. A grid-based decomposition I pitched as “a classic HackerRank problem.” None of them worked. The core issue: receipts are white and so is the scanner bed. I started calling it the “shades of white” problem.

The cleverest attempt was inspired by removing tourists from landmark photos - stack all scans, compute the median pixel at each position, subtract to reveal edges. I thought that one was going to work. Best F1: 0.302.

We also threw macOS Vision OCR at it (via a Swift script Codex wrote on the fly), Tesseract, and several other tools. I was starting to think the flatbed scans might just be a loss. Then I tried Meta’s SAM3.
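For the curious, the tourist-removal trick is only a few lines of NumPy. This is a minimal sketch of the idea, not the project’s actual code; it assumes same-size grayscale scans and a hand-picked difference threshold:

```python
import numpy as np

def median_background(scans: list) -> np.ndarray:
    """Per-pixel median across a stack of same-size grayscale scans.
    With enough scans, receipt content averages out and the median
    approximates the empty scanner bed."""
    return np.median(np.stack(scans), axis=0)

def foreground_mask(scan: np.ndarray, background: np.ndarray,
                    threshold: int = 12) -> np.ndarray:
    """Pixels that differ from the learned background by more than
    `threshold` gray levels are treated as receipt foreground."""
    diff = np.abs(scan.astype(np.int16) - background.astype(np.int16))
    return diff > threshold
```

On white-on-white scans the per-pixel differences are tiny, which is consistent with the weak F1 the post reports for this approach.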
One API call with text="receipt". 0.92-0.98 confidence on every boundary. Four seconds per scan. 1,873 receipts from 760 multi-receipt pages. Seven approaches in hours; SAM3 in an afternoon.

Wait - You Already Know the Answer

Receipts land at random angles, and OCR needs them upright. We tried Tesseract’s orientation detection, macOS Vision OCR, Moondream 2 and 3 - each one better than the last but none reliable enough. Then I realized that every time I pasted a receipt into our Claude conversation to debug something, it was already reading the text perfectly. Rotated, faded, didn’t matter. Why am I building a rotation pipeline when the tool I’m talking to already solves this? So we sent all 11,345 receipts through Sonnet and Codex. Sometimes the answer is staring you right in the face.

Replacing Tesseract Overnight

Halfway through the project, Tesseract was the weak link. It read “OAT MILK” as “OATH ILK.” It dropped decimals - $4.37 became $437. On old thermal prints it produced nothing at all. Codex opened 20 of the worst ones by hand and found that some weren’t even receipts. A family photo. A postcard. A greeting card. All filed under “Receipts.”

I found PaddleOCR-VL - a 0.9B-parameter vision-language model that runs locally on Apple Silicon. First test on a sample bank statement: clean, accurate text in 2.1 seconds. Tesseract was faster but dramatically noisier. Second test on a tall Fred Meyer receipt: disaster. The model entered a repetition loop, hallucinating “TILL YGRT” endlessly. The fix turned out to be simple - split tall receipts into slices, with the count based on aspect ratio: num_slices = max(2, round(aspect_ratio / 1.5)).

Five parallel shards ran overnight. GPU pegged at 100% for 10.8 hours. In the morning: 11,345 receipts OCR’d successfully. Cleaner text for every receipt in the archive. PaddleOCR-VL isn’t a Codex replacement - it can’t do structured extraction or follow instructions. It’s a better Tesseract.
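The slicing heuristic is simple enough to show in full. A sketch, not the pipeline’s actual code: it computes crop boxes from a receipt’s dimensions, and the small vertical overlap is my addition (the post doesn’t say whether slices overlapped), there to avoid cutting a text line in half:

```python
def receipt_slices(width: int, height: int, overlap_px: int = 40) -> list:
    """Split a tall receipt into horizontal crop boxes
    (left, top, right, bottom) so the OCR model never sees an
    extreme aspect ratio, which triggered repetition loops.
    Slice count follows the heuristic from the post:
    num_slices = max(2, round(aspect_ratio / 1.5))."""
    aspect_ratio = height / width
    num_slices = max(2, round(aspect_ratio / 1.5))
    slice_h = height // num_slices
    boxes = []
    for i in range(num_slices):
        top = max(0, i * slice_h - overlap_px)
        # Last slice always extends to the bottom edge.
        bottom = height if i == num_slices - 1 else min(height, (i + 1) * slice_h + overlap_px)
        boxes.append((0, top, width, bottom))
    return boxes
```

A 600×5400 scan (aspect ratio 9) yields six slices; anything near square still gets the minimum of two.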
The real pipeline: receipt image → PaddleOCR-VL (local, clean text) → Codex/Claude (structured extraction).

Twelve Hours to Three

Once receipts were segmented, oriented, and OCR’d, they needed structured extraction - find the egg line items, pull prices and quantities. It started with regex. The models love regex. Keyword matching for “egg,” money patterns for prices. Heuristics found eggs in 25/25 positive samples with 0 false positives. Not bad. But on the full corpus, false negatives piled up - Fred Meyer abbreviated codes like STO LRG BRUNN, Whole Foods truncated to EDGS, OCR mangled “EGGS” into LG EGO 12 CT. No regex catches these.

So I told Codex “we have unlimited tokens, let’s use them all,” and we pivoted to sending every receipt through Codex for structured extraction. From that one sentence, Codex came back with a parallel worker architecture - sharding, health management, checkpointing, retry logic. The whole thing. When I ran out of tokens on Codex mid-run, it auto-switched to Claude and kept going. I didn’t ask it to do that. I didn’t know it had happened until I read the logs.

But the runs kept crashing. Long CLI jobs died when sessions timed out. The script committed results at end-of-run, so early deaths lost everything. I watched it happen three times. On the fourth attempt I said “I would have expected we start a new process per batch.” That was the fix - one fresh process per batch, hard call cap, exit cleanly, resume from cache. Codex patched it, launched it in a tmux session, and the ETA dropped from 12 hours to 3. Not a hard fix. Just the kind of thing you know after you’ve watched enough overnight jobs die at 3 AM. 11,345 receipts processed. The thing that was supposed to take all night finished before I went to bed.

The Classifier That Beat Its Own Ground Truth

First I needed ground truth.
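The fresh-process-per-batch pattern looks roughly like this. A sketch under stated assumptions, not the actual script: `worker_cmd` is a hypothetical CLI that takes a JSON-encoded batch as its last argument and exits 0 on success, and a checkpoint file stands in for the resume-from-cache logic:

```python
import json
import subprocess
from pathlib import Path

def run_batches(batches, worker_cmd, checkpoint="done_batches.json"):
    """Run each batch in its own short-lived OS process.
    Results are committed as soon as a batch exits, so a crash or
    session timeout only loses the single batch in flight; on restart,
    already-committed batches are skipped."""
    ckpt = Path(checkpoint)
    done = set(json.loads(ckpt.read_text())) if ckpt.exists() else set()
    for i, batch in enumerate(batches):
        if i in done:
            continue  # resume from cache
        proc = subprocess.run(worker_cmd + [json.dumps(batch)], timeout=600)
        if proc.returncode == 0:
            done.add(i)
            ckpt.write_text(json.dumps(sorted(done)))  # commit per batch
    return done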
I asked Claude to build me a labeling tool - keyboard-first, receipt image on the left, classification data on the right, arrow keys to navigate, single keypress to verdict. It built the whole Flask app in 22 minutes. I sat down and hand-labeled 375 receipts.

Regex found 650 receipts mentioning “egg.” Against those 375 labels: 88% recall. The misses told the story - abbreviated codes, OCR garble, truncated descriptions. No keyword search catches STO LRG BRUNN.

The fix: use those hand-labeled edge cases as few-shot examples in an LLM classifier. Twenty examples of what “eggs” looks like on a garbled thermal print from 2003. Batch 10 receipts per call. Eight parallel workers. Two hours. 11,345 receipts classified. Final accuracy: 99%+. Every supposed “miss” by the LLM turned out to be a mislabel in the ground truth. A bicycle shop receipt the old heuristic had flagged. A barcode-only scan. Egg noodles. The classifier was more correct than my labels.

Then more QA. A second tool for eyeballing 497 weak images: Space for no-eggs, X for has-eggs. A third for data entry on 95 receipts with missing fields - numpad-optimized, auto-advancing. Four tools total, each built in minutes, each one I ground through by hand.

Quality

So how good is the data? I pulled 372 random samples and checked them by hand. Initially: 96% correct. The errors were mostly garbled OCR on old scans. One was a hallucination - the pipeline fabricated egg data for a receipt that contained no eggs at all. Real-world data is messy:
- Receipt folder typos spanning years: “Reciepts” (2016-2017) and “Recipts”
- A receipt scanned backwards - Claude decoded the mirrored OCR character by character
- Email receipts silently preferring text/plain over text/html, dropping pricing lines that only existed in the HTML part
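That last failure mode is worth a sketch. Using Python’s stdlib email module (an assumption - the post doesn’t say what the pipeline used), preferring the HTML part over the plain-text part looks like:

```python
from email import message_from_string

def best_body(msg) -> str:
    """Return the body most likely to contain pricing lines.
    Prefers the text/html part of a multipart email over text/plain,
    since some receipt emails only carry prices in the HTML part;
    falls back to plain text, then to an empty string."""
    if not msg.is_multipart():
        return msg.get_payload()
    html = plain = None
    for part in msg.walk():
        ctype = part.get_content_type()
        payload = part.get_payload(decode=True)
        if ctype == "text/html" and html is None:
            html = payload
        elif ctype == "text/plain" and plain is None:
            plain = payload
    chosen = html if html is not None else plain
    return (chosen or b"").decode(errors="replace")
```

The default behavior of naive extractors is often the opposite - take the first (plain) part and stop - which is exactly how the pricing lines went missing.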
Here’s what made the quality good: every time I caught something, I could show the agents what to look for and they’d go fix it everywhere. I caught a store address hiding in OCR noise: “915 Ny 45th St” was 915 NW 45th St, Seattle. I showed them the pattern, and they ran a recovery pass on 40 missing-location receipts - all 40 resolved.

By the Numbers
Wall clock time: 14 days (Feb 8-22, 2026)
Hands-on time: 15 hours
Tokens consumed: 1.6 billion
Estimated token cost: $1,591
Confirmed egg receipts: 589
Total egg spend captured: $1,972
Total eggs: 8,604
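Some back-of-the-envelope arithmetic on those numbers (my derivation from the table above, not figures from the post; the per-egg price is a naive average across 25 years of inflation):

```python
# Derived sanity checks on the stats above - not the author's figures.
total_spend = 1_972        # dollars, confirmed egg spend
total_eggs = 8_604
tokens = 1_600_000_000
receipts = 11_345
token_cost = 1_591         # dollars, estimated

price_per_egg = total_spend / total_eggs    # naive 25-year average
tokens_per_receipt = tokens / receipts
cost_per_receipt = token_cost / receipts

print(f"${price_per_egg:.2f}/egg, {tokens_per_receipt:,.0f} tokens "
      f"and ${cost_per_receipt:.2f} of compute per receipt")
```

Roughly $0.23 per egg, and on the order of 141k tokens (about 14 cents) of agent compute per receipt processed.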
Codex and Claude are excellent at building tools and extracting structured data, but they couldn’t segment an image or replace an OCR engine. The right answer was a stack of specialized models - SAM3 for segmentation, PaddleOCR for text, Codex and Claude for everything else. I expected this, but it was worth trying the simple path first. These are the days of miracle and wonder. I can’t wait to see what 30 years of eggs looks like.