Autoresearch on an old research idea
Recorded: March 24, 2026, 2:23 a.m.
| Original | Summarized |
The experiment should be short, around 5 minutes wall clock per run, to encourage quick iteration and prevent overfitting to noise. The agent is free to change anything in train.py as long as it runs within the time budget. Heatmaps derived from bounding boxes guide the model to focus on specific regions.
CLIP backbone: ViT-Small (22M) + DistilBERT (66M) + HeatmapProcessor · ~90M params
So how did it do? 42 experiments · 13 committed · 29 reverted · 1 Saturday · 1 RTX 4090 GPU
Temperature clamp fix (−113 mean rank): It immediately went for a bug in my code. I had clamped the learnable temperature param at 2. It relaxed the limit, and boom, the eval dropped by 113 points. This was the single biggest win, worth more than all the architecture changes combined.
Optuna++ (−30 mean rank): Further gains came mostly from hyperparameter tuning. The agent acted like a hyperparameter optimization algorithm with some basic reasoning baked in. Increasing the projection dimension and re-tuning the LR knocked off another 30 points. This is tedious work that a human would do (and get minimal pleasure from), but the agent did it faster and more methodically.
Diminishing returns: By the time we got to Phase 4 with the architectural changes, the success rate of the LLM's hypotheses dropped significantly. The changes to the attention mechanism in the heatmap processor didn't work out. Neither did the moonshot ideas in Phase 5. The agent was throwing spaghetti at the wall, and most of it did not stick.
Sandbox is important: Towards the end, Claude Code sometimes forgot its permissions and started making weird bash calls, then complained and stopped looping. At one point it got tired of waiting for training to finish and just ended the conversation.
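The temperature clamp fix can be illustrated with a minimal sketch. This assumes a CLIP-style log-temperature parameterization with a hard cap (the post does not show the actual train.py code, so the exact parameterization and the relaxed limit of ln(100) are assumptions):

```python
import numpy as np

def clip_logits(img_emb, txt_emb, log_temp, max_log_temp):
    # Scale cosine similarities by a learnable temperature, CLIP-style.
    # Clamping the (log-)temperature too aggressively, as in the bug the
    # agent found, caps how sharp the contrastive softmax can get;
    # relaxing the cap lets training recover the full dynamic range.
    t = np.exp(min(log_temp, max_log_temp))
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    return t * (img @ txt.T)

# With a tight cap of 2.0 the logit scale saturates at e^2 ≈ 7.4;
# relaxing it to ln(100) (CLIP's conventional limit) allows scales up to 100.
tight = clip_logits(np.eye(2), np.eye(2), 5.0, 2.0)
loose = clip_logits(np.eye(2), np.eye(2), 5.0, np.log(100.0))
```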
I wouldn't give it full autonomy just yet :)
Closing thoughts
Ukiyo-eVG: ~11K Japanese woodblock prints with phrase–bounding box annotations from the CIGAr paper (ECCV 2024 VISART). |
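The constrained loop described above amounts to a simple greedy rule: keep an edit only if it improves the eval, otherwise revert. A minimal sketch of that rule (function names are mine, not from the post's run.sh):

```python
def step(best_rank, new_rank):
    """Commit an edit only if it lowers mean rank on the held-out set;
    otherwise revert to the previous best."""
    if new_rank < best_rank:
        return "commit", new_rank
    return "revert", best_rank

def replay(baseline, candidate_ranks):
    """Replay a sequence of experiment results through the rule,
    returning the best rank reached and the commit/revert log."""
    best, log = baseline, []
    for r in candidate_ranks:
        action, best = step(best, r)
        log.append(action)
    return best, log
```

The toy numbers below reuse the baseline and final ranks from the post; the intermediate values are invented for illustration.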
This blog post details Yogesh Kumar's experience using Andrej Karpathy's Autoresearch framework to autonomously explore an old research problem built around the eCLIP model. The core of the project involved employing Claude Code to iteratively refine an eCLIP experiment through a constrained optimization loop designed to minimize a Mean Rank evaluation metric. Kumar initiated the experiment with a well-defined program.md outlining distinct phases of exploration, progressing from hyperparameter tuning to architectural adjustments and finally speculative "moonshot" ideas. The system ran inside a sandboxed environment, preventing unintended system modifications, and was orchestrated through a run.sh script. The research used the Ukiyo-eVG dataset, comprising approximately 11,000 Japanese woodblock prints annotated with phrase-to-bounding-box relationships derived from the CIGAr paper. Kumar implemented a heatmap-based approach, converting the bounding boxes into Gaussian heatmaps and feeding these as additional channels to the model, mirroring the original eCLIP's use of radiologist eye-gaze heatmaps. The model combined a ViT-Small backbone and DistilBERT for roughly 90M parameters in total. The agent executed a training loop of 800 steps on an RTX 4090 GPU, evaluating performance by Mean Rank on a held-out test set and conducting secondary checks with Recall@K metrics. The baseline achieved a Mean Rank of 344.68, with 17.2% and 16.5% Recall@1 for image-to-text and text-to-image retrieval respectively. Through an automated loop of hypothesize, edit (train.py), evaluate (Mean Rank), and commit/revert, the agent reduced the Mean Rank to 157.43, a 54% improvement. The run ultimately revealed an underfitting issue, with the final test scores outperforming the validation scores, attributed to the short training duration.
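The box-to-heatmap conversion described above can be sketched as follows. The Gaussian width here is an assumption; the post says boxes become Gaussian heatmaps but does not give the exact recipe:

```python
import numpy as np

def box_to_heatmap(h, w, box, sigma_scale=0.5):
    """Render one phrase's bounding box (x0, y0, x1, y1) as a Gaussian
    heatmap centred on the box, to be stacked as an extra input channel.
    sigma_scale ties the Gaussian width to the box size (my choice)."""
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    ys, xs = np.mgrid[0:h, 0:w]
    g = np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
    return g / g.max()  # normalize so the peak is 1.0 at the box centre

# A 32x32 heatmap for a box covering the central region of the image:
hm = box_to_heatmap(32, 32, (8, 8, 24, 24))
```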
Notably, the initial success was largely driven by a "temperature clamp fix" that promptly reduced the Mean Rank by 113 points, highlighting the agent's aptitude for identifying and correcting straightforward code errors. Subsequent gains came from hyperparameter tuning, which the post dubs "Optuna++", with revisions to learning rates and projection dimensions. However, the architectural changes attempted in later phases proved largely unsuccessful, with the success rate of the agent's hypotheses declining sharply. The agent also briefly exhibited erratic behavior, including attempting unauthorized bash commands, demonstrating the need for robust sandboxing. Kumar's experience highlights the potential of LLM agents to streamline ML research, particularly when the search space is well-defined. The constrained loop proved an effective strategy for exploration but showed its limits when confronted with highly speculative "unknown unknowns". Kumar suggests potential enhancements, such as incorporating a planning stage or using sub-agents; in the end, the agent itself grew tired of waiting for training to finish and simply ended the conversation. Kumar acknowledges Karpathy's original work behind Autoresearch and provides the relevant dataset and technical specifications. |
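For reference, the Mean Rank and Recall@K metrics the agent optimized can be computed from an image-text similarity matrix. This is the standard retrieval formulation, assumed rather than taken from the post's actual eval code:

```python
import numpy as np

def mean_rank_and_recall(sim, k=1):
    """Retrieval metrics from a similarity matrix where sim[i, j] scores
    image i against text j and matching pairs sit on the diagonal.
    Mean Rank is the average 1-based rank of the true match (lower is
    better); Recall@k is the fraction of queries whose true match is
    ranked within the top k."""
    order = np.argsort(-sim, axis=1)  # candidate indices, best first
    # Position of the true (diagonal) match within each sorted row:
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1
    return ranks.mean(), (ranks <= k).mean()

# Toy 3x3 example: rows 0 and 1 rank their match first, row 2 ranks it last.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.8, 0.1],
                [0.5, 0.6, 0.4]])
mr, r1 = mean_rank_and_recall(sim, k=1)
```

The same function applied to the transposed matrix gives the text-to-image direction, matching the two Recall@1 numbers reported above.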