
AI Components for a Deterministic System (An Example)

Posted on August 24, 2025 by Eric Evans
When we set out to incorporate AI components into larger systems that are mostly conventional software, we encounter various difficulties. How do we wrangle behavior that is intrinsically non-deterministic so that it can be used in structured, deterministic systems? The flexibility of input is great! But the variation of output makes it difficult to do further processing by conventional software.
In this simple example I’ll characterize and constrain a non-deterministic result to make it usable in deterministic software. This leads into domain modeling and strategic design.
What follows isn’t rocket science, but it is the sort of basics I think we need to apply in order to get results.
1. A Question Conventional Code Can’t Easily Answer
Let’s start with a use-case I actually have. When I’m trying to get my bearings in a software system, I usually want to know what domains are addressed and in which parts of the code. So imagine an app that would generate that sort of view of a repo:

A list of domains addressed in the project, produced from a scan of the whole code-base.
Navigation to the relevant directories or files with high domain content, with a list of the domains addressed in each.
To start simply — a list of domains addressed in the project as a whole or in any given module.

To be concrete, let’s look at the open source project “OpenEMR”. Here’s a very small code sample from that project:
namespace OpenEMR\Cqm\Qdm;

class Patient extends \OpenEMR\Cqm\Qdm\BaseTypes\Any
{
    use Traits\PatientExtension;

    public $birthDatetime = null;

    public $qdmVersion = '5.5';

    public $_type = 'QDM::Patient';
}

We might ask, “what domains are addressed in this code?” Conventional code does not lend itself to that kind of question, but it is a natural use of an LLM.
Prompt: "What domain does this code address?"Response:"This code defines a Patient class within the OpenEMR Clinical Quality Measures (CQM) Quality Data Model (QDM) framework. The domain it addresses is healthcare information systems, specifically electronic medical records and clinical quality measurement..."
An intelligent answer! But we couldn’t pass that to conventional software for further processing. Of course, we would instruct the LLM to structure and format its output.
Prompt: What domain does this code address? Format your response as JSON:
{ "domains": [ { "name": "Name of domain", "confidence": 0-100 } ] }

Response:
{ "domains": [ { "name": "Electronic Health Records (EHR)", "confidence": 95 }, { "name": "Clinical Quality Measures (CQM)", "confidence": 90 }, { "name": "Quality Data Model (QDM)", "confidence": 85 }, { "name": "Healthcare Informatics", "confidence": 80 } ] }
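This lets conventional code treat the LLM like any other data source. Here is a minimal sketch in Python; call_llm is a hypothetical helper standing in for whatever LLM client you use, not part of any particular library:

import json

def domains_for(code_text: str, call_llm) -> list[dict]:
    # call_llm: any function that sends a prompt string to an LLM and returns its text.
    prompt = (
        "What domain does this code address? Format your response as JSON:\n"
        '{ "domains": [ { "name": "Name of domain", "confidence": 0-100 } ] }\n\n'
        + code_text
    )
    # Parse the structured response exactly as we would any other JSON payload.
    return json.loads(call_llm(prompt))["domains"]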
Okay, so now we have an answer that could be integrated in a technical way. Yet this will not support the comparisons and hierarchical roll-ups I was hoping for.
Because categories are chosen freely in each run, the classification of different files will not be easy to compare. To illustrate the point, I’ll repeat the same question using the same file. Every time I ask the question I get a different answer:

{ "domains": [ { "name": "Healthcare/Electronic Medical Records", "confidence": 95 }, { "name": "Clinical Quality Measures", "confidence": 90 }, { "name": "Medical Data Management", "confidence": 85 } ]}
{ "domains": [ { "name": "Healthcare Information Technology", "confidence": 95 }, { "name": "Electronic Medical Records", "confidence": 90 }, { "name": "Clinical Quality Measures", "confidence": 85 }, { "name": "Health Data Standards", "confidence": 80 } ] }

The answers make sense individually but would be difficult to compare or combine.
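To make that concrete, here is a quick sketch using the two runs above: naive string matching finds only one shared category name, leaving five near-synonyms that a program cannot reconcile on its own.

run_1 = {"Healthcare/Electronic Medical Records", "Clinical Quality Measures",
         "Medical Data Management"}
run_2 = {"Healthcare Information Technology", "Electronic Medical Records",
         "Clinical Quality Measures", "Health Data Standards"}

# Exact-name matching: only one category was phrased identically in both runs.
print(run_1 & run_2)        # {'Clinical Quality Measures'}
print(len(run_1 ^ run_2))   # 5 categories that don't line up across runs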
Modeling Tasks vs Classification Tasks
The stochastic nature of LLMs can be a challenge in making reliable systems. However, in this case, I see it differently.

Assigning categories is a classification task, which LLMs are good at.
Creating the categorization scheme is a modeling task, which is fundamentally different.
We are giving the LLM a fairly difficult modeling task: model the relationship of a code sample to various kinds of business activities. It draws on its general knowledge of what software is typically used for and connects that with the language and functionality of the code included in the context. Out of that comes a set of categories that fit this particular code sample, and probably generalizes a bit. But it would be very surprising if the various categories produced in this way fit together.
Classification, to be useful, must be somewhat repeatable. Modeling, when done well, produces a diverse range of possibilities. There are correct and incorrect answers in classification tasks. There is no “correct” model for a domain. If we want consistent categories for different code modules or layers of the hierarchy, we must select a model and use it throughout the process.
2. Create Canonical Categories, Then Classify
Let’s separate the modeling task and the classification task into separate prompts. For the domain modeling task (the creation of the classification scheme), we want to give the LLM a broad view of the whole project. Maximally, we could put all the code in the project into the prompt, along with instructions to make a list of the domains being addressed (similar to the prompt above). Using the entirety of a large code base in the prompt would be expensive and might exceed the context window. In practice, we would probably get similar results from randomly sampling modules from the project.
However we do it, the generated taxonomy would be different every time we ran this prompt, so we would need to take one output and keep it as a kind of canonical model. Then, this frozen category list could be included in a distinct prompt focused only on classification. This we would run independently for each file or module we wanted to analyze.
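A rough sketch of that modeling step, under the sampling approach just described; call_llm is the same hypothetical helper as before, and the point is that the taxonomy is generated once, reviewed, and then frozen to a file rather than regenerated on every run:

import json
import random
from pathlib import Path

def build_canonical_taxonomy(repo_root: str, call_llm, sample_size: int = 20) -> list[dict]:
    # Sample a handful of source files instead of sending the whole code base.
    files = list(Path(repo_root).rglob("*.php"))
    sample = random.sample(files, min(sample_size, len(files)))
    code = "\n\n".join(f.read_text(errors="ignore") for f in sample)
    prompt = (
        "List the business domains addressed in this code. Format your response as JSON:\n"
        '{ "domains": [ { "code": "01", "name": "Name of domain" } ] }\n\n' + code
    )
    return json.loads(call_llm(prompt))["domains"]

# Run once, review the output, then freeze it for all later classification runs:
# taxonomy = build_canonical_taxonomy("openemr/", call_llm)
# Path("canonical_domains.json").write_text(json.dumps(taxonomy, indent=2))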
Prompt:
Analyze this code and identify the business domains it addresses using the following classification.
01 - Electronic Health Records
02 - Notifications
03 - Authorization and Security
04 - Healthcare Informatics
05 - Finance
06 - Insurance
<etc...>
... format your response as JSON ...
Here is the code to analyze:
... whichever module we want to look at ...

Response:
{ "domains": [ { "code": "01", "name": "Electronic Health Records", "confidence": 95 }, { "code": "04", "name": "Healthcare Informatics", "confidence": 85 } ] }
This would allow an application to relate different parts of the project or aggregate results hierarchically. Results still couldn’t be compared to an independent run of the software that didn’t use the same frozen categories.
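For example, once each file has been classified against the frozen list, a roll-up by directory is ordinary deterministic code. A sketch, assuming results maps file paths to the parsed "domains" lists returned by the classification prompt:

from collections import defaultdict
from pathlib import Path

def roll_up_by_directory(results: dict[str, list[dict]]) -> dict[str, set[str]]:
    # results: file path -> parsed "domains" list from the classification prompt.
    by_dir: dict[str, set[str]] = defaultdict(set)
    for path, domains in results.items():
        for d in domains:
            if d["confidence"] >= 80:  # keep only confident labels
                by_dir[str(Path(path).parent)].add(d["code"])
    return by_dir

# e.g. {"src/Cqm/Qdm": {"01", "04"}, "src/Billing": {"05", "06"}, ...}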
3. Incremental Modeling
Up to this point, we’ve assumed that you have all the files at the outset, and that new things fit into the old categories. Sometimes we want our process to be more of a stream, where we can use new code samples to update an existing classification scheme incrementally.
One viable alternative to the all-in-one prompt is to incrementally accumulate a classification by feeding files (or chunks of whatever size) to a prompt something like this:
{code_chunk}

Now analyze this code, considering these previously observed domains that were addressed in other files in the same project:

{previous_domains_list}

First, look to see if the previously observed domains are also being addressed in this file.
Add a JSON field called "repeated-domains" and list your conclusion: domains from the previous list that are also addressed in this file.
Next, look for other domains that are addressed in this file that were not mentioned in the previous list.
Provide your analysis in valid JSON format as shown below:
{{ "repeated-domains": ["domain1", "domain2"], "new-domains": ["domain3", "domain4"] }}
So we can produce or update a classification scheme automatically and then classify individual modules with it (and redo the classification of previously classified modules).
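A sketch of that incremental loop, with the running domain list threaded through each call; call_llm is the same hypothetical helper, and the prompt template is abbreviated from the one above:

import json

INCREMENTAL_PROMPT = """{code_chunk}

Now analyze this code, considering these previously observed domains from other files in the same project:
{previous_domains_list}

Provide your analysis in valid JSON format:
{{ "repeated-domains": ["domain1"], "new-domains": ["domain2"] }}"""

def accumulate_domains(chunks: list[str], call_llm) -> list[str]:
    domains: list[str] = []
    for chunk in chunks:
        prompt = INCREMENTAL_PROMPT.format(
            code_chunk=chunk,
            previous_domains_list=", ".join(domains) or "none")
        answer = json.loads(call_llm(prompt))
        # Grow the scheme with any domains not seen before.
        domains += [d for d in answer["new-domains"] if d not in domains]
    return domains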
The results of this, I find, are hit and miss. Sometimes, the categorization scheme is just not very good. Part of that is because modeling is a very difficult task. We probably would want to use the biggest model available to us for this, whereas the classification might be done competently by a smaller model. There is another reason we might not get good results: Even for expert humans, creating good models calls for iteration.
4. Iteration
To get iterative refinement out of a model is a bit more complicated. There are various ways it can be done and people (so far only people!) keep coming up with new ones.

Take some samples from the code base and the existing category list, and prompt the model to refine the classification scheme. Then take a new sample and prompt the model to refine again. Repeat.
Call a “critic model” (which could be the same LLM, but with a different prompt) whose job is to give feedback about the categories. Use the refinement prompt again, including the feedback. Repeat. (A sketch of this variation follows the list.)
Call the prompt for generating or refining a classification scheme multiple times, which produces multiple candidate schemes. Pass these to a “judge”, a model prompted not to produce such schemes but to rate them. (This rating must be based on suitability to the use-cases, not a general, abstract preference.) Now select based on the ratings. Repeat.
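Here is a rough sketch of the second variation, the critic loop. Both roles are played by the same hypothetical call_llm helper, just with different prompts; the prompt wording is illustrative only:

def refine_with_critic(taxonomy: str, code_samples: list[str], call_llm, rounds: int = 3) -> str:
    # Alternate a "critic" prompt and a "refiner" prompt over the same scheme.
    for _ in range(rounds):
        feedback = call_llm(
            "Critique this domain classification scheme for gaps, overlaps, "
            "and over-specific categories:\n" + taxonomy)
        taxonomy = call_llm(
            "Refine this domain classification scheme.\n"
            "Current scheme:\n" + taxonomy +
            "\nCritic feedback:\n" + feedback +
            "\nCode samples from the project:\n" + "\n\n".join(code_samples))
    return taxonomy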

I’ve only played with these techniques. They are very interesting to me! However, for the particular use case I’m focused on here, the approach I’ll describe in the next section gave a better result. We have to have the self-discipline to choose the best solution even when it isn’t the coolest.
I’ll repeat this point: Creating a classification system is a modeling task, which is much harder than the classification task itself and calls for techniques like iteration and critique. There is no one model that such a process would converge on. Rather, we need to be clear about the intended uses for the resulting model. Those goals must be incorporated into the search and selection process. In other words, modeling is still modeling when an AI does it, and many of our techniques apply.
5. Using an Established Standard Model
Although identifying the domain of a software module is a bit niche, the idea of classifying business domains is not. In fact there are multiple well-established standard systems of categories — hierarchies of varying levels of specificity. I considered a few and chose NAICS, using only the top (most general) level.
Prompt:
Analyze this code and identify the business domains it addresses using the following classification.
Focus on top-level (2-digit) NAICS sectors:
11 - Agriculture, Forestry, Fishing and Hunting
21 - Mining, Quarrying, and Oil and Gas Extraction
22 - Utilities
23 - Construction
31 - Manufacturing (Food, Beverage, Textile, Apparel)
32 - Manufacturing (Wood, Paper, Printing, Petroleum, Chemical)
33 - Manufacturing (Metal, Electronics, Transportation Equipment)
42 - Wholesale Trade
44 - Retail Trade (Stores)
45 - Retail Trade (Online, Direct Selling)
48 - Transportation
49 - Warehousing and Storage
51 - Information
52 - Finance and Insurance
<etc...>

Response:
{ "domains": [ { "naics_code": "62", "domain_name": "Health Care and Social Assistance", "confidence": 95 }, { "naics_code": "54", "domain_name": "Professional, Scientific, and Technical Services", "confidence": 20 } ] }
At first glance, this doesn’t seem better or even much different from the AI generated categories. However, it offers some significant advantages. These classification schemes have been used widely and shown to be broadly applicable. This fits the “Published Language” pattern from DDD. There is almost always some ambiguity in any classification, but far less in such mature models. As a result, the LLM’s classification output is more consistent. When we run this prompt multiple times on the same file, the results are similar:

{ "domains": [ { "naics_code": "62", "domain_name": "Health Care and Social Assistance", "confidence": 95 }, { "naics_code": "54", "domain_name": "Professional, Scientific, and Technical Services", "confidence": 20 } ] }
{ "domains": [ { "naics_code": "62", "domain_name": "Health Care and Social Assistance", "confidence": 90 } ] }

Note that there is still variation, but it would be easy to filter out using the confidence level. In this case, the high-confidence categories (say anything above 80%) turned out to be stable enough to dispense with any reconciliation processes. This advantage would come from any comprehensive, low-ambiguity categorization scheme, whether created by humans or by an LLM. But finding such a model is often harder than it looks. Battle-hardened, well-documented, published languages eliminate that task.
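A sketch of that filter applied to the two runs above: with an 80% threshold, both runs reduce to the same single NAICS sector.

run_a = [{"naics_code": "62", "confidence": 95}, {"naics_code": "54", "confidence": 20}]
run_b = [{"naics_code": "62", "confidence": 90}]

def high_confidence(domains: list[dict], threshold: int = 80) -> set[str]:
    # Keep only the sectors the LLM reported with high confidence.
    return {d["naics_code"] for d in domains if d["confidence"] >= threshold}

print(high_confidence(run_a) == high_confidence(run_b))  # True: both are {'62'}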
Using a standard classification also takes away our flexibility to choose our own model. Depending on the application, that may be an unacceptable tradeoff, but watch out for our bias toward believing we need a custom model! In the case of this application, the actual taxonomy of domains is really a “generic subdomain”. The core domain/differentiator is more related to the ability to automatically classify any sort of code, and possibly some aspects of how we look at the hierarchy and roll-up of the smaller parts into the larger parts (for example, if we could recognize context boundaries or track intended boundaries). When a subdomain is generic for an application, it is best to treat it generically and use a standard model whenever possible.
Another advantage of a standard model, and especially a published language, is that it can make it easier to integrate with external systems. (However, that is not a known requirement of this particular application, so in this case I wouldn’t put much value on potential integration.)
Published languages have great advantages! They are worth looking for. Of course, even for fairly common generic subdomains, there are often no mature models available. In this case you’ll have to create your own categories. Using an LLM to do this is an option to seriously consider, but do expect to use a relatively large model, a large context window, and perhaps some iterative refinement, plus some human review and editing.
If you are truly convinced that the classification is part of your core domain, then, as of 2025, I’d suggest having humans drive the modeling in an exploratory, iterative process such as the ones we’ve talked about for over 20 years. Once these carefully chosen categories are in place, an LLM will probably be a good classifier.

