On the Coming Industrialisation of Exploit Generation with LLMs – Sean Heelan's Blog
Skip to content
Open Menu Home All Posts Research & Publications About Me
Search
Search for:
Close
Sean Heelan's Blog Software Exploitation and Optimisation
AI On the Coming Industrialisation of Exploit Generation with LLMs January 18, 2026January 19, 2026 seanhnLeave a comment
Follow @seanhn Recently I ran an experiment where I built agents on top of Opus 4.5 and GPT-5.2 and then challenged them to write exploits for a zeroday vulnerability in the QuickJS Javascript interpreter. I added a variety of modern exploit mitigations, various constraints (like assuming an unknown heap starting state, or forbidding hardcoded offsets in the exploits) and different objectives (spawn a shell, write a file, connect back to a command and control server). The agents succeeded in building over 40 distinct exploits across 6 different scenarios, and GPT-5.2 solved every scenario. Opus 4.5 solved all but two. I’ve put a technical write-up of the experiments and the results on Github, as well as the code to reproduce the experiments. In this post I’m going to focus on the main conclusion I’ve drawn from this work, which is that we should prepare for the industrialisation of many of the constituent parts of offensive cyber security. We should start assuming that in the near future the limiting factor on a state or group’s ability to develop exploits, break into networks, escalate privileges and remain in those networks, is going to be their token throughput over time, and not the number of hackers they employ. Nothing is certain, but we would be better off having wasted effort thinking through this scenario and have it not happen, than be unprepared if it does. A Brief Overview of the Experiment All of the code to re-run the experiments, a detailed write-up of them, and the raw data the agents produced are on Github, but just to give a flavour of what the agents accomplished:
Both agents turned the QuickJS vulnerability into an ‘API’ to allow them to read and arbitrarily modify the address space of the target process. As the vulnerability is a zeroday with no public exploits for it, this capability had to be developed by the agents through reading source code, debugging and trial and error. A sample of the notable exploits is here and I have written up one of them in detail here. They solved most challenges in less than an hour and relatively cheaply. I set a token limit of 30M per agent run and ran ten runs per agent. This was more than enough to solve all but the hardest task. With Opus 4.5 30M total tokens (input and output) ends up costing about $30 USD. In the hardest task I challenged GPT-5.2 it to figure out how to write a specified string to a specified path on disk, while the following protections were enabled: address space layout randomisation, non-executable memory, full RELRO, fine-grained CFI on the QuickJS binary, hardware-enforced shadow-stack, a seccomp sandbox to prevent shell execution, and a build of QuickJS where I had stripped all functionality in it for accessing the operating system and file system. To write a file you need to chain multiple function calls, but the shadow-stack prevents ROP and the sandbox prevents simply spawning a shell process to solve the problem. GPT-5.2 came up with a clever solution involving chaining 7 function calls through glibc’s exit handler mechanism. The full exploit is here and an explanation of the solution is here. It took the agent 50M tokens and just over 3 hours to solve this, for a cost of about $50 for that agent run. (As I was running four agents in parallel the true cost was closer to $150).
Before going on there are two important caveats that need to be kept in mind with these experiments:
While QuickJS is a real Javascript interpreter, it is an order of magnitude less code, and at least an order of magnitude less complex, than the Javascript interpreters in Chrome and Firefox. We can observe the exploits produced for QuickJS and the manner in which they were produced and conclude, as I have, that it appears that LLMs are likely to solve these problems either now or in the near future, but we can’t say definitively that they can without spending the tokens and seeing it happen. The exploits generated do not demonstrate novel, generic breaks in any of the protection mechanisms. They take advantage of known flaws in those protection mechanisms and gaps that exist in real deployments of them. These are the same gaps that human exploit developers take advantage of, as they also typically do not come up with novel breaks of exploit mitigations for each exploit. I’ve explained those gaps in detail here. What is novel are the overall exploit chains. This is true by definition as the QuickJS vulnerability was previously unknown until I found it (or, more correctly: my Opus 4.5 vulnerability discovery agent found it). The approach GPT-5.2 took to solving the hardest challenge mentioned above was also novel to me at least, and I haven’t been able to find any example of it written down online. However, I wouldn’t be surprised if it’s known by CTF players and professional exploit developers, and just not written down anywhere.
The Industrialisation of Intrusion By ‘industrialisation’ I mean that the ability of an organisation to complete a task will be limited by the number of tokens they can throw at that task. In order for a task to be ‘industrialised’ in this way it needs two things:
An LLM-based agent must be able to search the solution space. It must have an environment in which to operate, appropriate tools, and not require human assistance. The ability to do true ‘search’, and cover more of the solution space as more tokens are spent also requires some baseline capability from the model to process information, react to it, and make sensible decisions that move the search forward. It looks like Opus 4.5 and GPT-5.2 possess this in my experiments. It will be interesting to see how they do against a much larger space, like v8 or Firefox. The agent must have some way to verify its solution. The verifier needs to be accurate, fast and again not involve a human.
Exploit development is the ideal case for industrialisation. An environment is easy to construct, the tools required to help solve it are well understood, and verification is straightforward. I have written up the verification process I used for the experiments here, but the summary is: an exploit tends to involve building a capability to allow you to do something you shouldn’t be able to do. If, after running the exploit, you can do that thing, then you’ve won. For example, some of the experiments involved writing an exploit to spawn a shell from the Javascript process. To verify this the verification harness starts a listener on a particular local port, runs the Javascript interpreter and then pipes a command into it to run a command line utility that connects to that local port. As the Javascript interpreter has no ability to do any sort of network connections, or spawning of another process in normal execution, you know that if you receive the connect back then the exploit works as the shell that it started has run the command line utility you sent to it. There is a third attribute of problems in this space that may influence how/when they are industrialisable: if an agent can solve a problem in an offline setting and then use its solution, then it maps to the sort of large scale solution search that models seem to be good at today. If offline search isn’t feasible, and the agent needs to find a solution while interacting with the real environment, and that environment has the attribute that certain actions by the agent permanently terminate the search, then industrialisation may be more difficult. Or, at least, it’s less apparent that the capabilities of current LLMs map directly to problems with this attribute. There are several tasks involved in cyber intrusions that have this third property: initial access via exploitation, lateral movement, maintaining access, and the use of access to do espionage (i.e. exfiltrate data). You can’t perform the entire search ahead of time and then use the solution. Some amount of search has to take place in the real environment, and that environment is adversarial in that if a wrong action is taken it can terminate the entire search. i.e. the agent is detected and kicked out of the network, and potentially the entire operation is burned. For these tasks I think my current experiments provide less information. They are fundamentally not about trading tokens for search space coverage. That said, if we think we can build models for automating coding and SRE work, then it would seem unusual to think that these sorts of hacking-related tasks are going to be impossible. Where are we now? We are already at a point where with vulnerability discovery and exploit development you can trade tokens for real results. There’s evidence for this from the Aardvark project at OpenAI where they have said they’re seeing this sort of result: the more tokens you spend, the more bugs you find, and the better quality those bugs are. You can also see it in my experiments. As the challenges got harder I was able to spend more and more tokens to keep finding solutions. Eventually the limiting factor was my budget, not the models. I would be more surprised if this isn’t industrialised by LLMs, than if it is. For the other tasks involved in hacking/cyber intrusion we have to speculate. There’s less public information on how LLMs perform on these tasks in real environments (for obvious reasons). We have the report from Anthropic on the Chinese hacking team using their API to orchestrate attacks, so we can at least conclude that organisations are trying to get this to work. One hint that we might not be yet at a place where post-access hacking-related tasks are automated is that there don’t appear to be any companies that have entirely automated SRE work (or at least, that I am aware of). The types of problems that you encounter if you want to automate the work of SREs, system admins and developers that manage production networks are conceptually similar to those of a hacker operating within an adversary’s network. An agent for SRE can’t just do arbitrary search for solutions without considering the consequences of actions. There are actions that if it takes the search is terminated and it loses permanently (i.e. dropping the production database). While we might not get public confirmation that the hacking-related tasks with this third property are now automatable, we do have a ‘canary’. If there are companies successfully selling agents to automate the work of an SRE, and using general purpose models from frontier labs, then it’s more likely that those same models can be used to automate a variety of hacking-related tasks where an agent needs to operate within the adversary’s network. Conclusion These experiments shifted my expectations regarding what is and is not likely to get automated in the cyber domain, and my time line for that. It also left me with a bit of a wish list from the AI companies and other entities doing evaluations. Right now, I don’t think we have a clear idea of the real abilities of current generation models. The reason for that is that CTF-based evaluations and evaluations using synthetic data or old vulnerabilities just aren’t that informative when your question relates to finding and exploiting zerodays in hard targets. I would strongly urge the teams at frontier labs that are evaluating model capabilities, as well as for AI Security Institutes, to consider evaluating their models against real, hard, targets using zeroday vulnerabilities and reporting those evaluations publicly. With the next major release from a frontier lab I would love to read something like “We spent X billion tokens running our agents against the Linux kernel and Firefox and produced Y exploits“. It doesn’t matter if Y=0. What matters is that X is some very large number. Both companies have strong security teams so it’s entirely possible they are already moving towards this. OpenAI already have the Aardvark project and it would be very helpful to pair that with a project trying to exploit the vulnerabilities they are already finding. For the AI Security Institutes it’s would be worth spending time identifying gaps in the evaluations that the model companies are doing, and working with them to get those gaps addressed. For example, I’m almost certain that you could drop the firmware from a huge number of IoT devices (routers, IP cameras, etc) into an agent based on Opus 4.5 or GPT-5.2 and get functioning exploits out the other end in less a week of work. It’s not ideal that evaluations focus on CTFs, synthetic environments and old vulnerabilities, but don’t provide this sort of direct assessment against real targets. In general, if you’re a researcher or engineer, I would encourage you to pick the most interesting exploitation related problem you can think of, spend as many tokens as you can afford on it, and write up the results. You may be surprised by how well it works. Hopefully the source code for my experiments will be of some use in that. Share this: Share on Facebook (Opens in new window) Facebook
Share on X (Opens in new window) X Like Loading...
Related
Post navigation
Previous PostHow I used o3 to find CVE-2025-37899, a remote zeroday vulnerability in the Linux kernel’s SMB implementation
Leave a comment Cancel reply
Δ This site uses Akismet to reduce spam. Learn how your comment data is processed.
Blog at WordPress.com.
Back to top
Comment
Reblog
Subscribe
Subscribed
Sean Heelan's Blog
Join 46 other subscribers
Sign me up
Already have a WordPress.com account? Log in now.
Sean Heelan's Blog
Subscribe
Subscribed
Sign up Log in
Copy shortlink
Report this content
View post in Reader
Manage subscriptions
Collapse this bar
%d |
Sean Heelan’s blog post explores the implications of leveraging large language models (LLMs) for automating exploit generation, arguing that this marks the beginning of a broader industrialization of offensive cybersecurity. Heelan’s experiments involved training agents based on Opus 4.5 and GPT-5.2 to develop exploits for a zero-day vulnerability in the QuickJS JavaScript interpreter, a task requiring navigating complex mitigations such as address space layout randomization (ASLR), non-executable memory, and hardware-enforced shadow stacks. The agents successfully generated over 40 distinct exploits across six scenarios, with GPT-5.2 solving all challenges and Opus 4.5 missing only two. These results highlight the growing capability of LLMs to autonomously identify and exploit vulnerabilities, raising critical questions about the future of cybersecurity defenses. Heelan emphasizes that the industrialization of exploit development hinges on two key factors: the ability of LLM-based agents to search solution spaces without human intervention and the existence of automated verification mechanisms. Exploit development, with its controlled environments and straightforward validation criteria (e.g., confirming a shell spawn or file write), serves as an ideal case for this shift. However, Heelan also acknowledges limitations, noting that QuickJS is significantly less complex than modern JavaScript engines like those in Chrome or Firefox. While the experiments demonstrate that LLMs can exploit known gaps in mitigation strategies—such as leveraging glibc’s exit handler mechanism to bypass shadow stacks—their solutions do not represent novel breaks in security architectures. Instead, they mirror techniques already used by human exploit developers, suggesting that LLMs are augmenting rather than revolutionizing existing methodologies.
A central argument in Heelan’s analysis is that the future of offensive cyber operations will be constrained not by the number of skilled hackers but by an organization’s capacity to allocate computational resources, measured in tokens. This shift redefines the economic and technical barriers to entry for cyber threats, as token throughput becomes a proxy for operational capability. Heelan’s experiments illustrate this dynamic: while the hardest challenge—writing a file to disk under multiple mitigations—took GPT-5.2 50 million tokens and $50 in costs, the scalability of such approaches implies that resource-rich actors could rapidly generate exploits at a fraction of traditional development costs. However, the post also underscores that not all cyber tasks are equally amenable to industrialization. For instance, activities like initial network access or lateral movement involve real-world environments where incorrect actions risk detection and termination of the exploit attempt. These tasks require adaptive, context-aware strategies that current LLMs may struggle to execute autonomously. Heelan speculates that while automation of system administration and software reliability engineering (SRE) tasks is already underway, the transition to hacking-related activities within adversarial networks remains uncertain. He points to the absence of fully automated SRE tools as a cautionary sign, suggesting that even if LLMs can replicate human-like problem-solving in controlled settings, their effectiveness in unpredictable, hostile environments is still unproven.
The blog post also critiques the current state of LLM evaluation in cybersecurity, arguing that standardized benchmarks like CTF competitions or synthetic datasets fail to capture the complexity of real-world zero-day exploitation. Heelan cites OpenAI’s Aardvark project as evidence that token investment correlates with bug discovery, but he calls for more rigorous testing against hard targets such as the Linux kernel or Firefox. He envisions a future where frontier labs publicly report results from large-scale token-driven experiments, even if no exploits are found, to establish a baseline for model capabilities. This transparency, he argues, is essential for understanding the true potential and limitations of LLMs in cybersecurity. Additionally, Heelan highlights gaps in existing evaluations, such as the lack of focus on real-world systems like IoT firmware. He suggests that deploying LLM agents against such targets could yield practical exploits within weeks, underscoring the urgency of rethinking security practices.
Heelan’s conclusions emphasize the need for proactive adaptation in both research and industry. He urges AI security institutes to collaborate with model developers to address evaluation gaps, advocating for assessments that prioritize real-world relevance over artificial benchmarks. For researchers and engineers, he encourages experimentation with high-stakes exploitation problems, leveraging token budgets to test LLM capabilities. The post serves as a call to action, framing the industrialization of exploit generation not as an inevitability but as a scenario requiring careful preparation. By highlighting the interplay between technological progress and security posture, Heelan challenges readers to consider how emerging AI tools might reshape the landscape of cyber threats and defenses. His work underscores a critical tension: while LLMs are increasingly capable of automating complex tasks, the human element remains indispensable in navigating the unpredictable realities of offensive cyber operations. |