I’ve often written here about developing systems that leverage simulation. Simulation combining physical processes, information systems, and crowd behaviours. Simulations that support organisational decision making, customer experiences, and learning.
And in late 2025 we had a moment where more people started describing Large Language Models (LLMs) as simulators of text. Simulation of the text product, based on statistical reconstruction, it is important to note – not simulation of the human process of thinking and writing. But more on that later.

So how do we understand LLMs as text simulators? I’ll share my thinking on the similarities and differences, and I hope to come back in a future article to explore the implications for using LLMs and derivatives in our work and digital products. As LLMs become embedded into ever more workflows in ever more complex agentic architectures, text simulation seems like an interesting lens through which to understand where we might use them most effectively, while guarding against failure modes.
Explicit simulation vs machine learning
Both explicit simulation and Machine Learning (ML) are predictive techniques, in that they predict a possible result, outcome or future from inputs such as the historical or current state of the world. They are also sometimes complementary, for instance when simulated environments are used to generate data to train ML models, such as for self-driving vehicles.
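As a minimal sketch of that complementarity, the toy code below uses an explicit physics model (projectile range) to generate labelled training data, then "learns" a trivial nearest-neighbour predictor from it. The simulator, the noise level and the 1-NN model are all illustrative choices, not anything from a real pipeline:

```python
import math
import random

def simulate_range(angle_deg, v=20.0, g=9.81):
    """Explicit physics model: projectile range on flat ground."""
    theta = math.radians(angle_deg)
    return v * v * math.sin(2 * theta) / g

# Use the simulator to generate labelled training data for an ML model,
# with some noise standing in for real-world measurement error.
random.seed(0)
angles = [random.uniform(5, 85) for _ in range(200)]
train = [(a, simulate_range(a) + random.gauss(0, 0.5)) for a in angles]

def predict_1nn(angle_deg):
    """A minimal 'learned' model: 1-nearest-neighbour over the training set."""
    return min(train, key=lambda p: abs(p[0] - angle_deg))[1]

# Should land close to the analytic maximum range of ~40.8 m at 45 degrees.
print(round(predict_1nn(45.0), 1))
```

The learned model here predicts purely from correlations in the generated data; it has no concept of gravity, which is the distinction the surrounding text draws.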

Simulation is based on an explicit model of the world, a model that encodes distinct behaviours that we observe in the real world and that we can reason about. In contrast, ML learns a transformation (a model) that maximises, by whatever means, the number of correct results on its training data, which is generally difficult to reason about.
This learned model may be described as an opaque box. When probed or inspected, it may comprise things that we recognise as approximate world models, but they are typically fuzzy, incomplete, poorly factored (both inefficient and highly coupled) and often just conceptually wrong. Learned models typically do not reproduce with any fidelity what we understand to be the real process. Ultimately, ML predictions are based on correlations in data (that may be very useful, or may be spurious or unethical) rather than a model of cause and effect between entities in an environment.
Explicit simulation also has its limitations, namely that we may misunderstand the fundamental real world processes, or approximate them incorrectly in developing algorithms, in which case we won’t properly replicate the desired phenomena. We also won’t see phenomena that can’t emerge from the algorithmic paradigm we adopt, such as air resistance if we only simulate gravity.
The outputs or products of predictive models of either type are therefore imperfect. They are sometimes right and sometimes wrong. We will explore this below.
Even so, we find a lot of value in predictive models, in applications that can help us make sense of the future, when we can tolerate occasional faults. Predictive models can inform our plans and guide our operations, and in the process dramatically reduce time, reduce cost, increase safety and increase learning in many applications.
LLMs as machine learning
Returning to LLMs as text simulators, we recognise that LLMs are a special type of ML, trained on enormous volumes of text humans have generated, to be able to reproduce similar text in similar circumstances. As above, they simulate the product, but not the human process of generating text.
The text from a single LLM invocation will read like text a human might produce. Often the content will be suitable for the intended purpose. However, this will be a result of correlations in the data, not the result of a human reasoning and writing process. So a not insignificant fraction of the time, while the text will remain fluent, the content will not be suitable – inaccurate, irrelevant, discomforting or even dangerous – otherwise known as hallucinations.
Although they can be quite frequent, it can be difficult to detect hallucinations, due to framing, confirmation or automation bias, due to a lack of expertise in the relevant field, or due to the challenge of maintaining vigilance. So framing, expertise and vigilance are required to evaluate answers from LLMs. (Ironically, evaluation expertise may negate the need for using LLMs for predictions if we can otherwise encode this expertise in a valid frame, especially in a more reliable delivery mechanism that needs less vigilance.)
But hang on, you say, aren’t LLMs a type of generative AI? We’ll explore predictive and generative paradigms below, after we look a little closer at evaluation.
Evaluating predictive models
How do we know when a predictive model is right (enough)? The answer for new predictions is that we don’t know at the time of making the prediction (otherwise we wouldn’t need a predictive model). Sometimes we’ll never know.
However, we can evaluate any model against predictions with known results. Results are known either because we’re using historical instances, or we have an independent way of generating (largely) correct results, either through human expertise, real-world tests, or provably correct or at least more reliable models.

In classical ML, before making entirely new predictions that we may never be able to validate, the established approach is to run the model against a held-out “validation” data set–one that wasn’t part of the training data, but is representative of the range of operational inputs we expect.
The simplest validation metric for a binary prediction is accuracy. But there are many other metrics we might use, depending on the shape of the data.
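As a sketch, accuracy (and one alternative, precision) over a held-out validation set might be computed like this, with toy predictions purely for illustration:

```python
def accuracy(preds, labels):
    """Fraction of binary predictions that match the known labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def precision(preds, labels):
    """Of the positives we predicted, how many were truly positive."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    predicted_pos = sum(p == 1 for p in preds)
    return tp / predicted_pos if predicted_pos else 0.0

preds  = [1, 0, 1, 1, 0, 1]  # model outputs on the validation set
labels = [1, 0, 0, 1, 0, 0]  # known correct results
print(accuracy(preds, labels))   # 4/6 correct
print(precision(preds, labels))  # 2 of 4 predicted positives were right
```

Which metric matters depends on the asymmetry of payoffs: precision, recall and friends weight different kinds of wrong answer differently.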
If the model performs sufficiently well on this validation set, we deem it to do more good with right answers than it does harm with poor or dangerous answers, and may release it into the world to make entirely new predictions. In this sense, we’re exploiting an asymmetry in predictive payoffs, if we can capture upside when correct while being protected from any serious downsides of being incorrect.
Simulator process validation
As another predictive method, we can equally use the validation set approach for the products of explicit simulation. And we must necessarily do this at some level of complexity of simulated scenarios, scenarios that aren’t viable to explore any other way (or, again, of what value is the simulator?). We may have more demanding expectations of performance against a validation set for an explicit simulator than for ML but the concept remains the same.
As an explicit simulator is also an explicitly-coded piece of software, we can also test at more levels. With explicit simulation, we can validate the process as well as the product. Considering the practical test pyramid, the validation set may be thought of as the end-to-end scenario testing level at the tip. At the base of the pyramid, we can unit test every algorithm, data transformation and utility method in our solution. In between, we can test integrations of code and data.
Validating the process as well as the product gives us leading indicators of success, helps us predict and control for likely errors ahead of time and understand the applicability of predictions and hence the bounds we should place on inputs. It allows us to precisely control variables in the simulated environment and hyperparameters (such as random seeds) to control for the variability they would otherwise introduce into testing. It also helps us adapt rapidly and in small increments should any of these things change, especially as the predictive model requirements evolve. These are all helpful characteristics when dealing with VUCA.
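At the base of that pyramid, unit-testing a single simulation algorithm might look like the sketch below – a hypothetical Euler integration step under gravity, chosen only to illustrate process validation:

```python
import unittest

def step_gravity(height, velocity, dt=0.1, g=9.81):
    """One explicit Euler step of an object falling under gravity."""
    return height + velocity * dt, velocity - g * dt

class TestStepGravity(unittest.TestCase):
    def test_velocity_decreases_by_g_dt(self):
        _, v = step_gravity(height=100.0, velocity=0.0, dt=0.1)
        self.assertAlmostEqual(v, -0.981)

    def test_stationary_object_holds_height_for_one_step(self):
        h, _ = step_gravity(height=100.0, velocity=0.0, dt=0.1)
        self.assertAlmostEqual(h, 100.0)

unittest.main(argv=["ignored"], exit=False)
```

Each algorithm validated this way is a leading indicator: if the step function is wrong, we know before running a single end-to-end scenario.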
The takeaway from simulator process validation is that the more explicit components we bring in to LLM-based solutions, the more we can validate the solution process–with all the benefits this brings–as well as the GenAI product. We’ll explore this further in the next post.
Predictive and generative
Why all this talk about predictive models when LLMs are GenAI?
Generative is often presented as a new analytics paradigm, off the right of the established scale of descriptive, diagnostic, predictive, prescriptive. Indeed I argued this position myself in the pre-ChatGPT era of this-person-does-not-exist.com but mainly to get clients and my team thinking about new possibilities.
Generative applications may be thought of as expanding low-dimensional inputs to high-dimensional outputs. My project this-wheelie-does-not-exist turned a duration in seconds (1 dimension) into a 10 second trace of a bike’s upward rotation while doing a wheelie, at 1/10th second intervals (100 dimensions). In the text domain, LLMs can expand a few bullet points into a lengthy email.
Explicit simulation also does this. From a handful of initial and boundary conditions, we can generate a lengthy sequence of complex system states. The eventual prediction product is typically some summary or metrics derived from the set of all system states. For instance, in a car crash test simulating the motions of thousands of structural points over hundreds of time-steps, we ultimately want to know “are the occupants ok?” So we can also think of simulation as a generative paradigm.
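A toy version of this expansion, assuming a simple falling-object model rather than a real crash simulation: two initial conditions generate hundreds of system states, which are then summarised into the single metric we actually care about.

```python
def simulate_fall(h0, dt=0.01, g=9.81):
    """Expand two initial conditions (h0, v0 = 0) into a full trajectory."""
    states, h, v, t = [], h0, 0.0, 0.0
    while h > 0:
        states.append((t, h, v))  # one system state per time-step
        v -= g * dt
        h += v * dt
        t += dt
    return states

states = simulate_fall(20.0)       # low-dimensional input, many generated states
impact_speed = abs(states[-1][2])  # ...summarised into one prediction product
print(len(states), round(impact_speed, 1))
```

The trajectory is the generative output; the impact speed plays the role of "are the occupants ok?"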
The diagram below shows the equivalence of the generative capability of iterative simulation and LLM text generation.

The complex information content of the generative output from either simulations or LLMs comes from iterating–repeatedly applying a prediction, using the outputs from the previous step as inputs (and maybe adding some noise). The prediction rules generate emergent complexity. This is the same for simulating an object in motion or a series of next tokens. An object in motion is governed by Newton’s Laws. In the case of next tokens, the rules define how to sample from likely distributions learned from the data.
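The shared structure can be sketched as one feedback loop applied to two different prediction rules – a motion step for the physical case, and a toy bigram sampler standing in (very loosely) for a learned next-token distribution. All names and the tiny bigram table are illustrative assumptions:

```python
import random

def iterate(step, state, n):
    """Repeatedly apply a prediction rule, feeding each output back as input."""
    out = [state]
    for _ in range(n):
        state = step(state)
        out.append(state)
    return out

# Rule 1: an object in motion (height, velocity) under gravity.
def motion_step(s, dt=0.1, g=9.81):
    h, v = s
    return h + v * dt, v - g * dt

# Rule 2: a toy next-token rule -- sample from distributions learned from data.
bigrams = {"the": ["cat", "dog"], "cat": ["sat"], "dog": ["ran"],
           "sat": ["the"], "ran": ["the"]}
def token_step(tok, rng=random.Random(0)):
    return rng.choice(bigrams[tok])

trajectory = iterate(motion_step, (100.0, 0.0), 5)
tokens = iterate(token_step, "the", 5)
print(trajectory[-1])
print(" ".join(tokens))
```

Same loop, different rules: one governed by Newton's Laws, one by sampling from learned distributions.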
So generative in some formulations is the repeated application of predictive models with output feedback. The result is a prediction of higher dimensionality.
Multiple answers
Generative AI is often characterised as having multiple good enough answers. (I don’t like “endless right answers”: while technically correct, it’s disingenuous, as there are endless poor and dangerous answers too, and you can’t necessarily tell the difference a priori, see also.) Good enough implicitly depends on some evaluation criteria.
Predictive analytics more broadly, including simulation, is also concerned with multiple correct answers, as we deal with measurement uncertainty, inability to precisely model reality, execution variability, etc. These mean we need to consider sensitivity and stability around point predictions, so one prediction is rarely enough.
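One common way to examine that sensitivity is to perturb the measured input and look at the spread of predictions, rather than a single point. A minimal sketch, reusing a hypothetical projectile-range model:

```python
import math
import random
import statistics

def predict_range(angle_deg, v=20.0, g=9.81):
    """Explicit model: projectile range for a given launch angle."""
    return v * v * math.sin(math.radians(2 * angle_deg)) / g

rng = random.Random(1)
# Perturb the measured angle to reflect measurement uncertainty...
samples = [predict_range(30.0 + rng.gauss(0, 2.0)) for _ in range(1000)]
# ...so the product is a distribution around the point prediction, not a point.
print(round(statistics.mean(samples), 1), round(statistics.stdev(samples), 2))
```

The standard deviation of the outputs tells us how much trust to place in any single prediction at that operating point.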
Validating generative results
How do we know when a generative result is good (enough)? The answer for new generations is that we don’t at the time of generation (otherwise we wouldn’t need a generative model). Sometimes we’ll never know.
However, if we have a relatively low-cost means of validating an answer, then generating endless answers until we find a good-enough answer is a reasonable approach – an approach with wider applicability that is well established in search and optimisation techniques like hyperparameter sweeps, genetic algorithms and simulated annealing.
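The generate-until-good-enough loop can be sketched as below, with a random proposer standing in for any generative model and an arbitrary cheap predicate standing in for the validator – both purely illustrative:

```python
import random

def generate(rng):
    """Stand-in for any generative model: proposes a candidate answer."""
    return rng.randint(0, 99)

def is_good_enough(candidate):
    """A relatively low-cost validator encoding our evaluation criteria."""
    return candidate % 7 == 0 and candidate > 50

def generate_until_valid(max_tries=1000, seed=42):
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        candidate = generate(rng)
        if is_good_enough(candidate):
            return candidate, attempt
    raise RuntimeError("no good-enough answer within budget")

answer, tries = generate_until_valid()
print(answer, tries)
```

The economics hinge on the validator being much cheaper than the generator is wrong often: the same trade-off that governs hyperparameter sweeps and simulated annealing.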
The vibe approach is to make a human user the validator in an interactive loop. Fantastic for relatively easy problems that require only one or a handful of iterations. However, this doesn’t give us much confidence we can operationalise any vibed solution for a wider range of inputs or more complex problems. On the contrary, we may instead be told “you’re prompting it wrong”, while “right” answers may be cherry-picked.
The more iterations required for a satisfactory answer, the less likely that using a generative solution increases productivity, while simultaneously human users may feel compelled to try more iterations by the inherently variable reward scheme, likened by some to slot machine addiction.
So, for generative applications, we need approaches that scale evaluation to novel scenarios in production, as we saw for predictive applications above.
Evals
In either LLMs or generative simulation, to move beyond vibes, we also need a validation set, aka evals. These are somehow grounded in the real world, either by the judgement of human experts or by direct or indirect measurements of processes in systems. Some eval techniques, such as LLM-as-a-judge, use further LLMs to assess the outputs of the subject LLMs. Evals can be a costly component of the solution, and as a key part of working with LLMs, we’ll explore evals in more detail in future.
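The bare bones of an eval harness might look like the sketch below: a set of grounded cases, a grader, and an aggregate score. The keyword-matching grader, the toy model and the two cases are all illustrative stand-ins (in practice the grader might be a human expert or an LLM-as-a-judge):

```python
def grade(output, expected_keywords):
    """A simple programmatic grader: does the output cover the key facts?"""
    return all(k.lower() in output.lower() for k in expected_keywords)

def run_evals(model, eval_set):
    """Score a model over grounded eval cases; returns the pass rate."""
    results = [grade(model(case["prompt"]), case["expected_keywords"])
               for case in eval_set]
    return sum(results) / len(results)

# A toy 'model' and two grounded eval cases, purely for illustration.
model = lambda prompt: "Paris is the capital of France."
eval_set = [
    {"prompt": "Capital of France?", "expected_keywords": ["Paris"]},
    {"prompt": "Capital of Spain?", "expected_keywords": ["Madrid"]},
]
print(run_evals(model, eval_set))  # 0.5: one of two cases passes
```

This is the generative analogue of the validation set: the pass rate plays the role that accuracy played for classical ML.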
Taking stock
To conclude this piece, we have seen:
- Explicit simulation and machine learning (ML) are comparable predictive paradigms, in terms of their products, but not the process
- Predictions are useful but all models have their limitations
- LLMs are a form of ML, with benefits and limitations of ML, and the challenge of hallucinations
- To determine the usefulness of predictive model products at scale for novel scenarios, we evaluate the products in known scenarios
- With explicit simulation, we have an advantage in that we can also validate the process
- The generative paradigm can be thought of as an extension to the predictive paradigm
- Generative applications typically have many right answers
- But with some proportion of “wrong” answers from an LLM solution, we need a similar scalable mechanism for rigorously establishing their usefulness – evals.
Predicting next tokens
The follow-up on Working with LLMs as text simulators is a work-in-progress, but I am currently planning to tackle building with LLMs through the lenses I’ve historically taken with simulation and ML, including risk-aware iterative development in thin, vertical, end-to-end slices, the use of sound software engineering practices such as CI/CD with diverse testing strategies, architecture and compositing solutions, and effective team designs and organisational practices such as knowledge management. Maybe more – back soon!
