
LLMs are lineage black holes

Data lineage is important to most organisations, even if they don’t make use of it. Systematically capturing the upstream provenance and downstream consumers of any piece of data is critical to trusting the utility of that data and understanding its impacts, at any scale beyond a handful of Excel spreadsheets.

The nature of lineage

When systematically captured, lineage is a bidirectional bifurcating network. While data flows one way through the graph, we can read dependencies both ways from any node. In general, each upstream source impacts multiple downstream targets, and each downstream target is dependent on multiple upstream sources.
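The bidirectional reading above can be sketched as a simple graph traversal. In Python, with a hypothetical edge list (all node names here are invented for illustration), the same edges answer both "what does this feed?" and "what feeds this?":

```python
from collections import defaultdict

# Hypothetical lineage edges: each pair means data flows source -> target.
EDGES = [
    ("raw_orders", "clean_orders"),
    ("raw_customers", "clean_customers"),
    ("clean_orders", "sales_report"),
    ("clean_customers", "sales_report"),
    ("clean_orders", "churn_features"),
]

def adjacency(edges, reverse=False):
    """Build a forward (downstream) or reverse (upstream) adjacency map."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[dst if reverse else src].add(src if reverse else dst)
    return graph

def reachable(graph, start):
    """All nodes transitively reachable from `start` (excluding itself)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

# Read the same graph both ways, from different nodes.
downstream = reachable(adjacency(EDGES), "raw_orders")
upstream = reachable(adjacency(EDGES, reverse=True), "sales_report")
```

Here `downstream` holds every node `raw_orders` could influence, and `upstream` every node that could influence `sales_report`. Which of those potentials are actual is exactly what static lineage can't tell us.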

Lineage shows the potential of one piece of data to influence another. Whether it actually does depends on the precise logic of combination, transformation and aggregation, and on the values of individual data points. As lineage is typically analysed statically, we get the potential, but not the actual, impact. As such, with a known set of nodes and full lineage tracking, we should read the resultant mapping as excluding the sources and targets that have no possibility of influencing, or being influenced by, any given node, rather than as including only those that have actually contributed to its values.

With transformations performed by SQL statements (eg counts), we might be able to get more clever about actual influence on any node with deeper static analyses, but there will always remain some cases that actually have to be evaluated to determine influence. Evaluation is expensive and may be intractable, so we stick with the static analysis of potential influence.
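As a toy illustration of that static analysis, the source tables of a SQL statement can often be pulled out without ever executing it. The regex below is only a sketch against a made-up query (real lineage tools use full SQL parsers; this pattern misses subqueries, CTEs and quoted identifiers):

```python
import re

def source_tables(sql: str) -> set[str]:
    """Naively extract table names that follow FROM/JOIN keywords."""
    return set(
        re.findall(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", sql, flags=re.IGNORECASE)
    )

sql = """
SELECT c.region, COUNT(*) AS n
FROM clean_orders o
JOIN clean_customers c ON o.customer_id = c.id
GROUP BY c.region
"""
```

For this query, `source_tables(sql)` yields the two input tables, telling us which nodes could influence the count, though not which rows actually did.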

With transformations performed by in-house trained Machine Learning (ML) models (eg propensity models), our ability to determine actual influence is based on the model’s interpretability, in general requiring evaluation. However, when we control all of the features, labels and hyperparameters in a structured way, we can still use static analysis to determine all the inputs and outputs–through both the training and the inference process–that have the potential for influence at any node.

And equivalently, we can say with certainty that certain inputs can’t influence our data.

Lineage in LLM space

In our data landscape, we have a manifold of data nodes, like points in the space-time continuum, that we can reach through lineage. Other nodes outside of our light-cone of lineage are physically impossible to reach.

Where the usefulness of this reachability property breaks down is where we incorporate third-party ML models that don’t provide any lineage, most notably closed-training-set and closed-source Large Language Models (LLMs). In these cases, we have no record of the specific inputs that went into training the model or that go into inference, but we know that a very wide range did.

We know LLMs are trained on a very wide range of data sources, from primary material in the commons, to copyrighted texts, to information scraped from proprietary sources, to harmful content that may in some cases be illegal, to their own or other models’ output (autophagy or distillation), to deliberately planted adversarial content (data poisoning).

Given this, we theoretically can’t exclude any reasonably accessible data source in the whole world from the lineage of any LLM. Once we transform data with an LLM, we’ve potentially mixed in any data we could imagine, with no regard to its provenance, accuracy, quality or safety. Notably, LLMs produce harmful and undesirable behaviour when trained on datasets containing even a small fraction of poisoned data.

So, like information beyond a black hole’s event horizon, no definitive lineage signal can reach a data observer downstream of an LLM. (However, like the film Event Horizon, using LLMs might bring back undesirable results from outside of our known data universe.)

Taming lineage black holes

LLMs, however, can be extremely useful in transforming data, so how might we practically tackle this lineage problem?

While there are many strategies, I think the single best approach is to push LLM transformations as far downstream as possible, so they only influence the smallest set of nodes. The further upstream the LLM transformation, the more lineage debt (as well as determinacy debt) we carry. Indeed we can start upstream for expediency and manage with the [technical] debt metaphor to push downstream over time.

Downstream nodes should be labelled as LLM-influenced. In practice, our downstream nodes will frequently escape the singularity, but there’s always a chance they don’t. That is, if the nodes have any degree of permanence at all; they may be ephemeral, in a user experience, scratchpad or agentic workflow, as we may see more disposable (or antifragile) patterns in AI systems.
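That labelling can be mechanical: treat LLM outputs as tainted and propagate the taint downstream through the lineage graph. A minimal sketch, assuming a hypothetical edge list and a known set of LLM-produced nodes:

```python
from collections import defaultdict

# Hypothetical pipeline edges (source -> target); docs_summaries is
# produced by an LLM summarisation step, the rest are conventional.
PIPELINE_EDGES = [
    ("docs_raw", "docs_summaries"),
    ("docs_summaries", "search_index"),
    ("clean_orders", "sales_report"),
]
LLM_OUTPUT_NODES = {"docs_summaries"}

def llm_influenced(edges, llm_outputs):
    """Return every node that is, or sits downstream of, an LLM output."""
    graph = defaultdict(set)
    for src, dst in edges:
        graph[src].add(dst)
    tainted, stack = set(llm_outputs), list(llm_outputs)
    while stack:
        for nxt in graph[stack.pop()]:
            if nxt not in tainted:
                tainted.add(nxt)
                stack.append(nxt)
    return tainted

tainted = llm_influenced(PIPELINE_EDGES, LLM_OUTPUT_NODES)
```

In this sketch `sales_report` stays untainted because no LLM sits upstream of it; pushing the LLM step further downstream is exactly what keeps the tainted set small.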

Until the point of LLM introduction, we focus on upstream data quality and governance to reduce other unknowns, and thus we increase reusability across multiple consuming data flows.

