Effective harnesses for long-running agents
Recorded: Nov. 29, 2025, 1:08 a.m.
Original
Effective harnesses for long-running agents
Published Nov 26, 2025

Agents still face challenges working across many context windows. We looked to human engineers for inspiration in creating a more effective harness for long-running agents.

As AI agents become more capable, developers are increasingly asking them to take on complex tasks requiring work that spans hours or even days. However, getting agents to make consistent progress across multiple context windows remains an open problem.

The core challenge of long-running agents is that they must work in discrete sessions, and each new session begins with no memory of what came before. Imagine a software project staffed by engineers working in shifts, where each new engineer arrives with no memory of what happened on the previous shift. Because context windows are limited, and because most complex projects cannot be completed within a single window, agents need a way to bridge the gap between coding sessions.

We developed a two-fold solution to enable the Claude Agent SDK to work effectively across many context windows: an initializer agent that sets up the environment on the first run, and a coding agent that is tasked with making incremental progress in every session while leaving clear artifacts for the next session. You can find code examples in the accompanying quickstart.

The long-running agent problem

The Claude Agent SDK is a powerful, general-purpose agent harness adept at coding, as well as other tasks that require the model to use tools to gather context, plan, and execute. It has context management capabilities such as compaction, which enables an agent to work on a task without exhausting the context window. Theoretically, given this setup, it should be possible for an agent to continue to do useful work for an arbitrarily long time.

However, compaction isn't sufficient. Out of the box, even a frontier coding model like Opus 4.5 running on the Claude Agent SDK in a loop across multiple context windows will fall short of building a production-quality web app if it's only given a high-level prompt, such as "build a clone of claude.ai."

Claude's failures manifested in two patterns. First, the agent tended to try to do too much at once, essentially attempting to one-shot the app. Often, this led to the model running out of context in the middle of its implementation, leaving the next session to start with a feature half-implemented and undocumented. The agent would then have to guess at what had happened, and spend substantial time trying to get the basic app working again. This happened even with compaction, which doesn't always pass perfectly clear instructions to the next agent.

A second failure mode would often occur later in a project: after some features had already been built, a later agent instance would look around, see that progress had been made, and declare the job done.

These failure modes led us to decompose the problem into two parts. First, we need to set up an initial environment that lays the foundation for all the features a given prompt requires, which positions the agent to work step-by-step and feature-by-feature. Second, we should prompt each agent to make incremental progress towards its goal while also leaving the environment in a clean state at the end of the session. By "clean state" we mean the kind of code that would be appropriate for merging to a main branch: there are no major bugs, the code is orderly and well-documented, and in general, a developer could easily begin work on a new feature without first having to clean up an unrelated mess.
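As a rough illustration of this structure (not taken from the post or its quickstart), the sketch below drives one initializer session followed by repeated coding sessions in fresh context windows until the feature list is fully passing. `run_agent_session`, the prompt file paths, and the `feature_list.json` name are hypothetical; the feature list itself is described under "Feature list" below.

```python
# Rough sketch of the overall harness loop: one initializer session, then
# repeated coding sessions in fresh context windows until every feature in
# the initializer's feature list passes. run_agent_session(), the prompt
# file paths, and the feature_list.json name are illustrative assumptions.
import json
from pathlib import Path

PROJECT_DIR = Path("my-app")
INITIALIZER_PROMPT = Path("prompts/initializer.md").read_text()
CODING_PROMPT = Path("prompts/coding.md").read_text()
MAX_SESSIONS = 100  # safety cap so the loop always terminates


def run_agent_session(prompt: str, cwd: Path) -> None:
    """Run one agent session in a fresh context window.

    Stub: wire this up to your agent runtime of choice (for example, the
    Claude Agent SDK mentioned in the post).
    """
    raise NotImplementedError


def all_features_pass(feature_file: Path) -> bool:
    """True once every feature written by the initializer is marked passing."""
    features = json.loads(feature_file.read_text())["features"]
    return all(f["status"] == "passing" for f in features)


# First session: set up init.sh, claude-progress.txt, the feature list,
# and an initial git commit.
run_agent_session(INITIALIZER_PROMPT, cwd=PROJECT_DIR)

# Later sessions: incremental progress on one feature at a time, leaving
# the repository in a clean, documented state for the next session.
for _ in range(MAX_SESSIONS):
    if all_features_pass(PROJECT_DIR / "feature_list.json"):
        break
    run_agent_session(CODING_PROMPT, cwd=PROJECT_DIR)
```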
When experimenting internally, we addressed these problems using a two-part solution:

- Initializer agent: The very first agent session uses a specialized prompt that asks the model to set up the initial environment: an init.sh script, a claude-progress.txt file that keeps a log of what agents have done, and an initial git commit that shows what files were added.
- Coding agent: Every subsequent session asks the model to make incremental progress, then leave structured updates.

The key insight here was finding a way for agents to quickly understand the state of work when starting with a fresh context window, which is accomplished with the claude-progress.txt file alongside the git history. Inspiration for these practices came from knowing what effective software engineers do every day.

Environment management

In the updated Claude 4 prompting guide, we shared some best practices for multi-context-window workflows, including a harness structure that uses "a different prompt for the very first context window." This "different prompt" requests that the initializer agent set up the environment with all the necessary context that future coding agents will need to work effectively. Here, we provide a deeper dive on some of the key components of such an environment.

Feature list

To address the problem of the agent one-shotting an app or prematurely considering the project complete, we prompted the initializer agent to write a comprehensive file of feature requirements expanding on the user's initial prompt. In the claude.ai clone example, this meant over 200 features, such as "a user can open a new chat, type in a query, press enter, and see an AI response." These features were all initially marked as "failing" so that later coding agents would have a clear outline of what full functionality looked like.
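As a concrete illustration, here is a minimal sketch of what such a feature list might look like; the file name and field names (`feature_list.json`, `status`, and so on) are illustrative assumptions rather than the exact format used in the post's quickstart.

```python
# Hypothetical shape of the feature list the initializer agent writes out.
# The post describes a structured JSON file of end-to-end features, each
# initially marked as failing; the file name and field names here are
# illustrative assumptions, not the quickstart's exact format.
import json

feature_list = {
    "features": [
        {
            "id": 1,
            "description": (
                "A user can open a new chat, type in a query, press enter, "
                "and see an AI response."
            ),
            "status": "failing",  # flipped to "passing" only after verification
        },
        {
            "id": 2,
            "description": "A user can rename an existing conversation.",
            "status": "failing",
        },
        # ...over 200 features in the claude.ai clone example
    ]
}

with open("feature_list.json", "w") as f:
    json.dump(feature_list, f, indent=2)
```

Keeping the pass/fail status machine-readable makes it easy for both the harness and later coding agents to check how much work remains.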
Summarized

Anthropic's research tackles the significant challenge of enabling AI agents to maintain consistent progress across extended tasks, particularly those requiring hours or days of work. The core problem lies in the limitations of context windows (the amount of information an agent can process at once) and in the fact that agents work in discrete sessions with no memory of previous work. This summary details Anthropic's solution, focusing on practical implementation and key insights.

The primary challenge addressed is the agent's tendency to attempt overly ambitious, "one-shot" solutions to complex problems. When an agent is given a high-level prompt like "build a clone of claude.ai", it might try to complete the entire project in a single session, often leading to a fragmented and ultimately unsuccessful outcome. This stems from the agent's inability to retain context between sessions: the agent would frequently fall short of producing a production-quality web application, particularly when starting with a minimal prompt.

Anthropic developed a two-fold solution centered on the Claude Agent SDK. First, an *initializer agent* establishes the initial environment on the first run, generating an `init.sh` script, a `claude-progress.txt` file to log agent activity, and an initial Git commit. Second, a *coding agent* is tasked with making incremental progress, updating the feature list file (which expands on the user's initial prompt) and committing its work.

Crucially, the process relies on a structured approach to development. The feature list, formatted in JSON, acts as a central repository of requirements, ensuring the agent focuses on specific, manageable features. The document is not merely a list of objectives but a carefully crafted roadmap that prevents the agent from "one-shotting" the project. To keep the agent from concluding a project prematurely, the initializer agent is required to produce a structured JSON file of end-to-end feature descriptions, each marked as "failing" until it has been implemented and verified. This prevents the agent from declaring features complete without rigorous testing.

The coding agent's incremental approach is equally critical. The agent is prompted to work on only one feature at a time and to leave the environment in a clean state after each change, committing progress to Git with descriptive commit messages and writing summaries in a progress file. This addresses the tendency for agents to make large, unstructured changes that are difficult to track, and the use of git enables easy rollback of bad changes, keeping the codebase continuously working. Notably, Anthropic uses strong prompts to discourage the agent from removing or editing tests, since doing so could let missing or buggy functionality slip through.

To further enhance reliability, Anthropic introduced testing procedures. Originally, the agent had a tendency to mark a feature as complete without proper testing. Anthropic now prompts the agent to use browser automation tools such as the Puppeteer MCP and to test thoroughly, mirroring how a human developer would verify functionality. This resulted in improved performance and earlier bug detection.
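To make this concrete, here is a rough sketch of the kind of per-session prompt the summary describes, written as a Python string constant. The wording is paraphrased from the behaviors described above, not copied from Anthropic's actual prompts; `feature_list.json` is an assumed file name carried over from the earlier sketches.

```python
# Rough sketch of a per-session coding-agent prompt capturing the behaviors
# described above; paraphrased for illustration, not Anthropic's actual prompt.
CODING_PROMPT = """\
You are continuing work on an existing project. First, get up to speed:
run `pwd` to confirm the working directory, read claude-progress.txt,
review the recent git log, and read feature_list.json.

Then pick ONE feature still marked "failing" and implement it end to end.
Before marking it "passing", verify it through the running app using the
browser automation tools available to you. Do not remove or edit existing
tests.

When you finish, leave the repository in a clean, mergeable state: commit
your work with a descriptive message and append a short summary of what
you did and what remains to claude-progress.txt.
"""
```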
One key challenge remains with the ability of browser automation tools to surface every type of bug: Anthropic has identified that Claude cannot see browser-native alert modals through the Puppeteer MCP.

Finally, to streamline the agent's workflow, Anthropic developed a standardized "getting up to speed" routine: running `pwd` to verify the current directory, reading the git logs and progress files, and reading the feature list. This process saves the agent tokens and helps avoid redundant work, ensuring that it can quickly understand the project's state and confidently proceed with the next task. For end-to-end verification, the standard process is to start the local development server and then use the Puppeteer MCP to start a new chat, send a message, and confirm that a response arrives.

Anthropic's research represents a critical step towards realizing the full potential of long-running AI agents. While further investigation is needed to determine the optimal agent architecture, this two-part solution (the initializer agent and the coding agent) provides a robust framework for achieving consistent progress in complex, extended tasks.
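As a final illustration, the sketch below gathers the same "getting up to speed" information from the harness side so it can be prepended to a fresh session's prompt. In the post, the agent runs these steps itself, so this is a variation on that idea, and `feature_list.json` remains an assumed file name.

```python
# Sketch of the "getting up to speed" information described above, gathered
# from the harness side so it can be prepended to the next session's prompt.
# In the post the agent itself runs these steps (pwd, git log, progress file,
# feature list); doing it in the harness is a variation for illustration.
import json
import subprocess
from pathlib import Path


def status_summary(project_dir: Path, max_log_lines: int = 10) -> str:
    """Build a short summary of project state for a fresh coding session."""
    git_log = subprocess.run(
        ["git", "log", "--oneline", f"-{max_log_lines}"],
        cwd=project_dir, capture_output=True, text=True,
    ).stdout

    progress = (project_dir / "claude-progress.txt").read_text()
    features = json.loads((project_dir / "feature_list.json").read_text())["features"]
    failing = [f for f in features if f["status"] == "failing"]

    return (
        f"Working directory: {project_dir.resolve()}\n"
        f"Recent commits:\n{git_log}\n"
        f"Progress notes (latest entries):\n{progress[-2000:]}\n"
        f"{len(failing)} of {len(features)} features still failing."
    )
```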