Why AI Agents Cannot Change Software Systems

Recorded: May 27, 2026, 2 p.m.

Agents Cannot Maintain Systems: The Additive–Transformative Gap in LLM Software Delivery

Jh Evans

Thu 21 May 2026

This article explains why current LLMs cannot safely modify real software
systems, despite impressive code‑generation demos.
The Promise of Automated Software Delivery
In 2026, the automated software delivery dream is for an agent to:

read a repository
understand project structure
plan a multi‑step change
write code, tests, and docs
run the code and fix its own mistakes
produce a PR‑ready diff

The first three tasks are additive; the last three are transformative. The
first three add information without changing the behaviour of the system: they
require reading, mapping, and planning, but not altering any existing causal
structure in the codebase. The last three change the system itself.
Writing new code is self-contained, additive work; modifying an existing system
is transformative work that requires an understanding of dependencies,
invariants, and consequences. This distinction, additive versus transformative,
is the core reason current LLMs can assist with software delivery but cannot
deliver software autonomously.
Parts of this list can be done today, but only in tightly controlled demos on
simple code that is tens of lines long, not on real-world repositories with
thousands of lines of code that have existed for years and been updated by
dozens of people.
What the Labs Have Actually Delivered
The agentic work of OpenAI, Google, Cognition Labs, GitHub (Microsoft),
Sourcegraph, JetBrains, Replit, Amazon, Meta, and Anthropic listed in
Further Reading was published in 2023 and 2024.
Depending on where you look, you may have been given another impression: that
"agents are here". However, reality tells a different story.
Agents are improving, but are not reliable, not autonomous, and not production‑safe.
LLMs can assist with software delivery, but they cannot own it.
Why is this?
LLMs generate statistically plausible continuations of text. This works well
for self-contained tasks like writing a function or drafting documentation
because these are pattern‑extension problems. But pattern‑matching is not
system understanding, and plausibility is not correctness.
Software systems are causal: components depend on each other, invariants
constrain behaviour, and changes propagate through the system. The moment a
task stops being self‑contained and becomes system‑dependent — requiring
dependency coherence, persistent state, or awareness of how changes ripple
through a real codebase — pattern‑matching is no longer sufficient.
Currently, LLMs can imitate the shape of engineering work, but they cannot
maintain a stable internal representation of a system that must be coherently
changed, and that gap is exactly why LLMs fail the moment the task becomes
system‑level.
Persistent state creates temporal dependencies
A self‑contained task has no past and no future. A system‑dependent task does.
As soon as a change depends on:

previous writes
accumulated data
cached values
long‑lived objects
external system state

any agentic model must reason about how the system got here and how it will
behave after the change.
LLMs cannot maintain that internal causal chain.
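A small sketch may make the temporal dependency concrete. Everything in it is hypothetical and chosen only for illustration: the `store` dictionary stands in for a database or cache, and the function names and the cents-versus-dollars change are assumptions, not something taken from a real system.

```python
# Hypothetical illustration; none of these names come from a real codebase.

# Persistent state that outlives any single code change. It stands in for a
# database or cache populated by an earlier version of the code.
store = {"order-1": 1999}  # written by the OLD code: price in integer cents


def save_price_v2(order_id: str, price_dollars: float) -> None:
    """NEW code: a locally reasonable change that stores dollars, not cents."""
    store[order_id] = price_dollars


def format_price_v2(order_id: str) -> str:
    """NEW code: correct for anything written by save_price_v2."""
    return f"${store[order_id]:.2f}"


# In isolation, the new code is self-consistent.
save_price_v2("order-2", 24.50)
new_record = format_price_v2("order-2")  # "$24.50"

# Against state written before the change, the same code is silently wrong:
# 1999 cents is rendered as if it were 1999 dollars.
old_record = format_price_v2("order-1")  # "$1999.00", not "$19.99"
```

Nothing in the new functions is wrong when read on its own. The error only appears once the history of `store` is taken into account, which is the kind of reasoning about past writes described above.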
From Writing Code to Agentic Systems: The Fundamental Gap
The gap becomes clear when you compare two activities: writing new code and
modifying an existing system.
Code generation is local and additive: the model extends a pattern without
needing to understand the system.
But agentic work is global and transformative: the LLM must change the system
itself, which requires understanding dependencies, invariants, interactions,
and downstream consequences.
This is causal reasoning, not pattern extension. LLMs predict tokens, not
consequences — and that is why the leap from writing code to producing a safe,
system‑aware PR‑ready diff is not incremental but a shift into a fundamentally
different problem space.
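The difference between a local edit and a system-level change can be sketched in a few lines. This is an illustrative example only: the functions `load_ids`, `load_ids_changed`, and `contains` are invented for the sketch, and the "sorted output" invariant is an assumption standing in for the unstated contracts that real codebases accumulate.

```python
import bisect


def load_ids(raw):
    """Original code: unique ids. Unstated invariant: the result is sorted."""
    return sorted(set(raw))


def load_ids_changed(raw):
    """Locally plausible edit: de-duplicate while keeping first-seen order."""
    return list(dict.fromkeys(raw))


def contains(ids, target):
    """Downstream caller elsewhere in the system. It uses binary search,
    so it silently depends on the sorted invariant."""
    i = bisect.bisect_left(ids, target)
    return i < len(ids) and ids[i] == target


raw = [30, 10, 20, 10]

# A local check on the edited function passes: same ids, no duplicates.
local_ok = set(load_ids_changed(raw)) == set(load_ids(raw))  # True

# The downstream caller gives a different answer after the edit.
before = contains(load_ids(raw), 10)          # True
after = contains(load_ids_changed(raw), 10)   # False: 10 is present but missed
```

The edited function looks correct when judged by itself. The failure lives in a caller that the edit never touched, so catching it requires knowing what the rest of the system assumes.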
Producing a PR‑ready diff (the section in question)
A pull request (PR) is a proposed change to a system's code.
For that change to be safe, it must respect the system's current
architecture, its intent, and all downstream consequences.
Software engineers work hard to ensure that such a change is safe, through
testing and their own judgement and experience, before having a colleague
review the change.
Applying a change is no longer pattern-matching but understanding causal
behaviour: how will the system change if this PR is applied?
The correctness of the PR depends on understanding the whole system, not just
generating text.
The LLM must change the system, which requires understanding dependencies,
invariants, interactions and consequences, all of which demand causal
reasoning, not pattern matching.
Pattern‑matching can write code; only causal reasoning can maintain systems.
What can I do?
Verify for yourself any claim that you see. Choose a realistic, real-world
repository of your own to test against: one with thousands of lines of code
and a history of real work.
Your own results on your own repository will tell you far more than any
press release or online anecdote.
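One way to keep such results honest is to record every attempt in the same shape and count only the changes that survive every gate. The sketch below is a minimal assumption of what that record could look like. The `Trial` fields, the `summarise` helper, and the three entries are placeholders invented for illustration; they are not measurements of any tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trial:
    """One attempt by a tool at one task in your own repository."""
    task: str
    applied_cleanly: bool   # did the proposed diff apply without conflicts?
    tests_passed: bool      # did the existing test suite still pass?
    review_accepted: bool   # would a human reviewer have merged it?


def summarise(trials: list[Trial]) -> dict[str, float]:
    """Count how many trials survived every gate, not just the first one."""
    total = len(trials)
    mergeable = sum(
        t.applied_cleanly and t.tests_passed and t.review_accepted
        for t in trials
    )
    return {
        "trials": total,
        "tests_passed": sum(t.tests_passed for t in trials),
        "mergeable": mergeable,
        "mergeable_rate": mergeable / total if total else 0.0,
    }


# Placeholder entries that only show the shape of the record.
# These are NOT measurements.
example_trials = [
    Trial("rename a config key", True, True, True),
    Trial("change a cache format", True, True, False),
    Trial("split a module", False, False, False),
]
summary = summarise(example_trials)
```

Separating "tests passed" from "a reviewer would merge it" is a deliberate choice in this sketch: a change can pass the tests and still fail human review.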
For the moment:

treat agentic AI as a strategic direction
treat current tools as assistants, not engineers
invest in clarity, architecture, and test discipline
expect progress, but not miracles
do not plan delivery pipelines around unproven capabilities

Maintain human judgement as the centre of the system.
The dream is intact. The evidence is not yet here.
Why this matters: code is cheap, judgement is not
LLM-augmented software delivery does not remove engineering.
It moves engineering up a level.
Humans need to focus on:

intent
constraints
architecture
correctness
safety
trade‑offs

The desired end state is not "AI writes code" but "AI maintains systems". If we
get there, humans will still need to maintain intent.
The consequence of an agentic system is not to remove engineering, but to
elevate it, so that teams spend less time on mechanical construction and more time on
judgement, alignment, and shaping the environment in which agents operate.
The organisations that benefit most will be those that treat agentic development
not as automation, but as a structural shift in how software is conceived,
validated, and maintained.
Final Thought
Until AI can reason causally about systems, human judgement remains the
foundation of software delivery.
Related Work

The real gains from AI come from improving the shared work between engineers — planning, coordination, review, debugging, and delivery — not from speeding up individual coding.
Software engineers must understand tokens, structure, and probabilistic behaviour to build reliable systems and avoid mismatches between test and production behaviour.
AI systems behave like probabilistic components; engineers must build structured interfaces and layered constraints to make them reliable inside software systems.

Further Reading
OpenAI o1/o3, OpenAI, September, 2024
- https://openai.com/index/introducing-openai-o1-preview/
Gemini Code Demos, Google, December, 2023
- https://blog.google/technology/ai/google-gemini-ai/
Devin, Cognition Labs, March, 2024
- https://www.cognition-labs.com/
GitHub Copilot, GitHub (Microsoft), November, 2023
- https://github.blog/2023-11-08-the-new-github-copilot-your-ai-pair-programmer/
Cody, Sourcegraph, April, 2024
- https://sourcegraph.com/blog/cody-2-0
AI Assistant in JetBrains IDEs, JetBrains, December, 2023
- https://blog.jetbrains.com/blog/2023/12/06/jetbrains-ai-assistant-is-now-available/
Replit Agents, Replit, November, 2023
- https://blog.replit.com/agents
Amazon CodeWhisperer, Amazon, April, 2023
- https://aws.amazon.com/codewhisperer/
Code Llama, Meta, August, 2023
- https://ai.meta.com/blog/code-llama-large-language-model-coding/
Claude 3 Code Reasoning, Anthropic, March, 2024
- https://www.anthropic.com/news/claude-3-family
