I’ve often written here about developing systems that leverage simulation. Simulation combining physical processes, information systems, and crowd behaviours. Simulations that support organisational decision making, customer experiences, and learning.
And in late 2025 we had a moment where more people started describing Large Language Models (LLMs) as simulators of text. Simulation of the text product, based on statistical reconstruction, it is important to note – not simulation of the human process of thinking and writing. But more on that later.

So how do we understand LLMs as text simulators? I’ll share my thinking on the similarities and differences, and I hope to come back in a future article to explore the implications for using LLMs and derivatives in our work and digital products. As LLMs become embedded into ever more workflows in ever more complex agentic architectures, text simulation seems like an interesting lens through which to understand where we might use them most effectively, while guarding against failure modes.
Explicit simulation vs machine learning
Both explicit simulation and Machine Learning (ML) are predictive techniques, in that they predict a possible result, outcome or future from inputs such as the historical or current state of the world. They are also sometimes complementary, for instance when simulated environments are used to generate data to train ML models, such as for self-driving vehicles.
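As a minimal sketch of that complementarity, the toy code below uses an explicit physics model (projectile range) to generate labelled training data, then "learns" a trivial nearest-neighbour predictor from it. The simulator, the noise level and the 1-NN model are all illustrative choices, not anything from a real pipeline:

```python
import math
import random

def simulate_range(angle_deg, v=20.0, g=9.81):
    """Explicit physics model: projectile range on flat ground."""
    theta = math.radians(angle_deg)
    return v * v * math.sin(2 * theta) / g

# Use the simulator to generate labelled training data for an ML model,
# with some noise standing in for real-world measurement error.
random.seed(0)
angles = [random.uniform(5, 85) for _ in range(200)]
train = [(a, simulate_range(a) + random.gauss(0, 0.5)) for a in angles]

def predict_1nn(angle_deg):
    """A minimal 'learned' model: 1-nearest-neighbour over the training set."""
    return min(train, key=lambda p: abs(p[0] - angle_deg))[1]

# Should land close to the analytic maximum range of ~40.8 m at 45 degrees.
print(round(predict_1nn(45.0), 1))
```

The learned model here predicts purely from correlations in the generated data; it has no concept of gravity, which is the distinction the surrounding text draws.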

Simulation is based on an explicit model of the world, a model that encodes distinct behaviours that we observe in the real world and that we can reason about. In contrast, ML learns a transformation (a model) that maximises, by whatever means, the number of correct results on its training data, which is generally difficult to reason about.
This learned model may be described as an opaque box. When probed or inspected, it may comprise things that we recognise as approximate world models, but they are typically fuzzy, incomplete, poorly factored (both inefficient and highly coupled) and often just conceptually wrong. Learned models typically do not reproduce with any fidelity what we understand to be the real process. Ultimately, ML predictions are based on correlations in data (that may be very useful, or may be spurious or unethical) rather than a model of cause and effect between entities in an environment.
Explicit simulation also has its limitations, namely that we may misunderstand the fundamental real world processes, or approximate them incorrectly in developing algorithms, in which case we won’t properly replicate the desired phenomena. We also won’t see phenomena that can’t emerge from the algorithmic paradigm we adopt, such as air resistance if we only simulate gravity.
The outputs or products of predictive models of either type are therefore imperfect. They are sometimes right and sometimes wrong. We will explore this below.
Even so, we find a lot of value in predictive models, in applications that can help us make sense of the future, when we can tolerate occasional faults. Predictive models can inform our plans and guide our operations, and in the process dramatically reduce time, reduce cost, increase safety and increase learning in many applications.
LLMs as machine learning
Returning to LLMs as text simulators, we recognise that LLMs are a special type of ML, trained on enormous volumes of text humans have generated, to be able to reproduce similar text in similar circumstances. As above, they simulate the product, but not the human process of generating text.
The text from a single LLM invocation will read like text a human might produce. Often the content will be suitable for the intended purpose. However, this will be a result of correlations in the data, not the result of a human reasoning and writing process. So a not insignificant fraction of the time, while the text will remain fluent, the content will not be suitable – inaccurate, irrelevant, discomforting or even dangerous – otherwise known as hallucinations.
Although they can be quite frequent, it can be difficult to detect hallucinations, due to framing, confirmation or automation bias, due to a lack of expertise in the relevant field, or due to the challenge of maintaining vigilance. So framing, expertise and vigilance are required to evaluate answers from LLMs. (Ironically, evaluation expertise may negate the need for using LLMs for predictions if we can otherwise encode this expertise in a valid frame, especially in a more reliable delivery mechanism that needs less vigilance.)
But hang on, you say, aren’t LLMs a type of generative AI? We’ll explore predictive and generative paradigms below, after we look a little closer at evaluation.
Evaluating predictive models
How do we know when a predictive model is right (enough)? The answer for new predictions is that we don’t know at the time of making the prediction (otherwise we wouldn’t need a predictive model). Sometimes we’ll never know.
However, we can evaluate any model against predictions with known results. Results are known either because we’re using historical instances, or we have an independent way of generating (largely) correct results, either through human expertise, real-world tests, or provably correct or at least more reliable models.

In classical ML, before making entirely new predictions that we may never be able to validate, the established approach is to run the model against a held-out “validation” data set–one that wasn’t part of the training data, but is representative of the range of operational inputs we expect.
The simplest validation metric for a binary prediction is accuracy. But there are many other metrics we might use, depending on the shape of the data.
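As a sketch, accuracy (and one alternative, precision) over a held-out validation set might be computed like this, with toy predictions purely for illustration:

```python
def accuracy(preds, labels):
    """Fraction of binary predictions that match the known labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def precision(preds, labels):
    """Of the positives we predicted, how many were truly positive."""
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    predicted_pos = sum(p == 1 for p in preds)
    return tp / predicted_pos if predicted_pos else 0.0

preds  = [1, 0, 1, 1, 0, 1]  # model outputs on the validation set
labels = [1, 0, 0, 1, 0, 0]  # known correct results
print(accuracy(preds, labels))   # 4/6 correct
print(precision(preds, labels))  # 2 of 4 predicted positives were right
```

Which metric matters depends on the asymmetry of payoffs: precision, recall and friends weight different kinds of wrong answer differently.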
If the model performs sufficiently well on this validation set, we deem it to do more good with right answers than it does harm with poor or dangerous answers, and may release it into the world to make entirely new predictions. In this sense, we’re exploiting an asymmetry in predictive payoffs, if we can capture upside when correct while being protected from any serious downsides of being incorrect.
Simulator process validation
As another predictive method, we can equally use the validation set approach for the products of explicit simulation. And we must necessarily do this at some level of complexity of simulated scenarios, scenarios that aren’t viable to explore any other way (or, again, of what value is the simulator?). We may have more demanding expectations of performance against a validation set for an explicit simulator than for ML but the concept remains the same.
As an explicit simulator is also an explicitly-coded piece of software, we can also test at more levels. With explicit simulation, we can validate the process as well as the product. Considering the practical test pyramid, the validation set may be thought of as the end-to-end scenario testing level at the tip. At the base of the pyramid, we can unit test every algorithm, data transformation and utility method in our solution. In between, we can test integrations of code and data.
Validating the process as well as the product gives us leading indicators of success, helps us predict and control for likely errors ahead of time and understand the applicability of predictions and hence the bounds we should place on inputs. It allows us to precisely control variables in the simulated environment and hyperparameters (such as random seeds) to control for the variability they would otherwise introduce into testing. It also helps us adapt rapidly and in small increments should any of these things change, especially as the predictive model requirements evolve. These are all helpful characteristics when dealing with VUCA.
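At the base of that pyramid, unit-testing a single simulation algorithm might look like the sketch below – a hypothetical Euler integration step under gravity, chosen only to illustrate process validation:

```python
import unittest

def step_gravity(height, velocity, dt=0.1, g=9.81):
    """One explicit Euler step of an object falling under gravity."""
    return height + velocity * dt, velocity - g * dt

class TestStepGravity(unittest.TestCase):
    def test_velocity_decreases_by_g_dt(self):
        _, v = step_gravity(height=100.0, velocity=0.0, dt=0.1)
        self.assertAlmostEqual(v, -0.981)

    def test_stationary_object_holds_height_for_one_step(self):
        h, _ = step_gravity(height=100.0, velocity=0.0, dt=0.1)
        self.assertAlmostEqual(h, 100.0)

unittest.main(argv=["ignored"], exit=False)
```

Each algorithm validated this way is a leading indicator: if the step function is wrong, we know before running a single end-to-end scenario.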
The takeaway from simulator process validation is that the more explicit components we bring in to LLM-based solutions, the more we can validate the solution process–with all the benefits this brings–as well as the GenAI product. We’ll explore this further in the next post.
Predictive and generative
Why all this talk about predictive models when LLMs are GenAI?
Generative is often presented as a new analytics paradigm, off the right of the established scale of descriptive, diagnostic, predictive, prescriptive. Indeed I argued this position myself in the pre-ChatGPT era of this-person-does-not-exist.com but mainly to get clients and my team thinking about new possibilities.
Generative applications may be thought of as expanding low-dimensional inputs to high-dimensional outputs. My project this-wheelie-does-not-exist turned a duration in seconds (1 dimension) into a 10 second trace of a bike’s upward rotation while doing a wheelie, at 1/10th second intervals (100 dimensions). In the text domain, LLMs can expand a few bullet points into a lengthy email.
Explicit simulation also does this. From a handful of initial and boundary conditions, we can generate a lengthy sequence of complex system states. The eventual prediction product is typically some summary or metrics derived from the set of all system states. For instance, in a car crash test simulating the motions of thousands of structural points over hundreds of time-steps, we ultimately want to know “are the occupants ok?” So we can also think of simulation as a generative paradigm.
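A toy version of this expansion, assuming a simple falling-object model rather than a real crash simulation: two initial conditions generate hundreds of system states, which are then summarised into the single metric we actually care about.

```python
def simulate_fall(h0, dt=0.01, g=9.81):
    """Expand two initial conditions (h0, v0 = 0) into a full trajectory."""
    states, h, v, t = [], h0, 0.0, 0.0
    while h > 0:
        states.append((t, h, v))  # one system state per time-step
        v -= g * dt
        h += v * dt
        t += dt
    return states

states = simulate_fall(20.0)       # low-dimensional input, many generated states
impact_speed = abs(states[-1][2])  # ...summarised into one prediction product
print(len(states), round(impact_speed, 1))
```

The trajectory is the generative output; the impact speed plays the role of "are the occupants ok?"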
The diagram below shows the equivalence of the generative capability of iterative simulation and LLM text generation.

The complex information content of the generative output from either simulations or LLMs comes from iterating–repeatedly applying a prediction, using the outputs from the previous step as inputs (and maybe adding some noise). The prediction rules generate emergent complexity. This is the same for simulating an object in motion or a series of next tokens. An object in motion is governed by Newton’s Laws. In the case of next tokens, the rules define how to sample from likely distributions learned from the data.
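The shared structure can be sketched as one feedback loop applied to two different prediction rules – a motion step for the physical case, and a toy bigram sampler standing in (very loosely) for a learned next-token distribution. All names and the tiny bigram table are illustrative assumptions:

```python
import random

def iterate(step, state, n):
    """Repeatedly apply a prediction rule, feeding each output back as input."""
    out = [state]
    for _ in range(n):
        state = step(state)
        out.append(state)
    return out

# Rule 1: an object in motion (height, velocity) under gravity.
def motion_step(s, dt=0.1, g=9.81):
    h, v = s
    return h + v * dt, v - g * dt

# Rule 2: a toy next-token rule -- sample from distributions learned from data.
bigrams = {"the": ["cat", "dog"], "cat": ["sat"], "dog": ["ran"],
           "sat": ["the"], "ran": ["the"]}
def token_step(tok, rng=random.Random(0)):
    return rng.choice(bigrams[tok])

trajectory = iterate(motion_step, (100.0, 0.0), 5)
tokens = iterate(token_step, "the", 5)
print(trajectory[-1])
print(" ".join(tokens))
```

Same loop, different rules: one governed by Newton's Laws, one by sampling from learned distributions.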
So generative in some formulations is the repeated application of predictive models with output feedback. The result is a prediction of higher dimensionality.
Multiple answers
Generative AI is often characterised as having multiple good enough answers. (I don’t like “endless right answers”: while technically correct, it’s disingenuous, as there are endless poor and dangerous answers too, and you can’t necessarily tell the difference a priori, see also.) Good enough implicitly depends on some evaluation criteria.
Predictive analytics more broadly, including simulation, is also concerned with multiple correct answers, as we deal with measurement uncertainty, inability to precisely model reality, execution variability, etc. These mean we need to consider sensitivity and stability around point predictions, so one prediction is rarely enough.
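One common way to examine that sensitivity is to perturb the measured input and look at the spread of predictions, rather than a single point. A minimal sketch, reusing a hypothetical projectile-range model:

```python
import math
import random
import statistics

def predict_range(angle_deg, v=20.0, g=9.81):
    """Explicit model: projectile range for a given launch angle."""
    return v * v * math.sin(math.radians(2 * angle_deg)) / g

rng = random.Random(1)
# Perturb the measured angle to reflect measurement uncertainty...
samples = [predict_range(30.0 + rng.gauss(0, 2.0)) for _ in range(1000)]
# ...so the product is a distribution around the point prediction, not a point.
print(round(statistics.mean(samples), 1), round(statistics.stdev(samples), 2))
```

The standard deviation of the outputs tells us how much trust to place in any single prediction at that operating point.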
Validating generative results
How do we know when a generative result is good (enough)? The answer for new generations is that we don’t at the time of generation (otherwise we wouldn’t need a generative model). Sometimes we’ll never know.
However, if we have a relatively low-cost means of validating an answer, then generating endless answers until we find a good-enough answer is a reasonable approach – an approach with wider applicability that is well established in search and optimisation techniques like hyperparameter sweeps, genetic algorithms and simulated annealing.
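The generate-until-good-enough loop can be sketched as below, with a random proposer standing in for any generative model and an arbitrary cheap predicate standing in for the validator – both purely illustrative:

```python
import random

def generate(rng):
    """Stand-in for any generative model: proposes a candidate answer."""
    return rng.randint(0, 99)

def is_good_enough(candidate):
    """A relatively low-cost validator encoding our evaluation criteria."""
    return candidate % 7 == 0 and candidate > 50

def generate_until_valid(max_tries=1000, seed=42):
    rng = random.Random(seed)
    for attempt in range(1, max_tries + 1):
        candidate = generate(rng)
        if is_good_enough(candidate):
            return candidate, attempt
    raise RuntimeError("no good-enough answer within budget")

answer, tries = generate_until_valid()
print(answer, tries)
```

The economics hinge on the validator being much cheaper than the generator is wrong often: the same trade-off that governs hyperparameter sweeps and simulated annealing.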
The vibe approach is to make a human user the validator in an interactive loop. Fantastic for relatively easy problems that require only one or a handful of iterations. However, this doesn’t give us much confidence we can operationalise any vibed solution for a wider range of inputs or more complex problems. On the contrary, we may instead be told “you’re prompting it wrong”, while “right” answers may be cherry-picked.
The more iterations required for a satisfactory answer, the less likely that using a generative solution increases productivity, while simultaneously human users may feel compelled to try more iterations by the inherently variable reward scheme, likened by some to slot machine addiction.
So, for generative applications, we need approaches that scale evaluation to novel scenarios in production, as we saw for predictive applications above.
Evals
In either LLMs or generative simulation, to move beyond vibes, we also need a validation set, aka evals. These are somehow grounded in the real world, either by the judgement of human experts or by direct or indirect measurements of processes in systems. Some eval techniques, such as LLM-as-a-judge, use further LLMs to assess the outputs of the subject LLMs. Evals can be a costly component of the solution, and as a key part of working with LLMs, we’ll explore evals in more detail in future.
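The bare bones of an eval harness might look like the sketch below: a set of grounded cases, a grader, and an aggregate score. The keyword-matching grader, the toy model and the two cases are all illustrative stand-ins (in practice the grader might be a human expert or an LLM-as-a-judge):

```python
def grade(output, expected_keywords):
    """A simple programmatic grader: does the output cover the key facts?"""
    return all(k.lower() in output.lower() for k in expected_keywords)

def run_evals(model, eval_set):
    """Score a model over grounded eval cases; returns the pass rate."""
    results = [grade(model(case["prompt"]), case["expected_keywords"])
               for case in eval_set]
    return sum(results) / len(results)

# A toy 'model' and two grounded eval cases, purely for illustration.
model = lambda prompt: "Paris is the capital of France."
eval_set = [
    {"prompt": "Capital of France?", "expected_keywords": ["Paris"]},
    {"prompt": "Capital of Spain?", "expected_keywords": ["Madrid"]},
]
print(run_evals(model, eval_set))  # 0.5: one of two cases passes
```

This is the generative analogue of the validation set: the pass rate plays the role that accuracy played for classical ML.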
Taking stock
To conclude this piece, we have seen:
- Explicit simulation and machine learning (ML) are comparable predictive paradigms, in terms of their products, but not the process
- Predictions are useful but all models have their limitations
- LLMs are a form of ML, with benefits and limitations of ML, and the challenge of hallucinations
- To determine the usefulness of predictive model products at scale for novel scenarios, we evaluate the products in known scenarios
- With explicit simulation, we have an advantage in that we can also validate the process
- The generative paradigm can be thought of as an extension to the predictive paradigm
- Generative applications typically have many right answers
- But with some proportion of “wrong” answers from an LLM solution, we need a similar scalable mechanism for rigorously establishing their usefulness – evals.
Predicting next tokens
The follow-up on Working with LLMs as text simulators is a work-in-progress, but I am currently planning to tackle building with LLMs through the lenses I’ve historically taken with simulation and ML, including risk-aware iterative development in thin, vertical, end-to-end slices, the use of sound software engineering practices such as CI/CD with diverse testing strategies, architecture and compositing solutions, and effective team designs and organisational practices such as knowledge management. Maybe more – back soon!
