The life-changing magic of tidying your data

After tackling data waste for some years now, I thought it would be fun to revisit the Marie Kondo approach to tidying up work in large organisations (from 10 years ago), and apply it to data too…

Surprise! Managing ~~work~~ data in a large organisation is a lot like keeping your belongings in check at home.

Get it wrong at home and you have mess and clutter. Get it wrong in the organisation and you have excessive ~~WIP~~ data inventory, retarding responsiveness, pulverising productivity, and eroding engagement. And inventory is just one of 7 wastes of data production…

(You can read the full introduction in the original life-changing magic of tidying your work and complete the substitutions for data if you wish…)

On the complexity of data storage systems

KonMari writes:

Most people realise that clutter is caused by too much stuff. But why do we have too much stuff? Usually it is because we do not accurately grasp how much we actually own. And we fail to grasp how much we own because our storage methods are too complex.

Our data storage methods are complex. The deepest data gravity wells are one or more generations of shared data platform. In low orbit are multiple domain-specific satellite data platforms providing analytics inside vendor products that support functions like sales, martech, etc, which also exert their own pulls. Customer-facing digital products weave idiosyncratic trajectories to implement their own analytic data caches, feature stores, etc. All of these major bodies are orbited by rings of Excel sheets, constantly colliding and fragmenting further.

This situation might be tractable if, as advocated in pure data mesh, every analytic data product published a consistent interface, maintained by a dedicated owner, in order to support a universal, trusted catalogue. However, only some of these systems have catalogues, and of those, there’s often no consistent metadata format or consensus on who owns the cataloguing of data to ensure we can grasp what we own.

Thus, any observer inside this system struggles to see the whole picture, and from a distance, crucial details that make decluttering actionable in such a tightly connected system are not visible.

On making things visible

KonMari observes that you cannot accurately assess how much stuff you have without seeing it all in one place. She recommends searching the whole house first, bringing everything to the one location, and spreading the items out on the floor to gain visibility.

You’ll probably never see all your data in one place. As above, this is an ideal we might approach but never reach. Even so, we can tackle the most important cases first, and we can segment responsibility into manageable chunks with a domain-aligned approach, limiting our exposure to dark data.

To ensure the most important data is visible at scale, we first need clear ownership of data, then maintenance of crucial metadata by owners. Our ownership model should consider the whole data supply chain from source data suppliers to intermediate data processors, to end-consumers of analytic data products, because unlike domestic belongings, data is processed and recombined inside organisations.

However, where data remains within one domain, or even one team, we may make its management in that context the responsibility of those owners, with clear guardrails around security, privacy, cost, etc, so visibility becomes a local concern.

On categories

KonMari observes that items in one category are stored in multiple different places, spread out around the house. Categories she identifies include clothes, books, etc. She contends that it’s not possible to assess what you want to keep and discard without seeing the sum of your belongings in each category. Consequently, she recommends thinking in terms of category, rather than place.

Thinking in categories in organisations means thinking about capabilities and value streams. We may store data in multiple systems, but we want to see our capabilities in one place, to understand duplication, but also legitimate specialisation.

Note we should not split categories along lines of functional specialisation (such as data science vs analytics engineering) as this only leads to handoffs that reduce quality and responsiveness as business context is lost.

Instead, we can use Team Topologies to ensure our data operating model surfaces categories, with platform teams or enabling teams delivering cross-cutting capabilities, while complicated subsystem teams provide custom capabilities across multiple value streams, which are specialised to end customers in stream-aligned teams. Effective topologies for data and ML teams covers these topologies in more detail.

On joy

KonMari writes:

The best way to choose what to keep and what to throw away is to … ask: ‘Does this spark joy?’ If it does, keep it. If not, throw it out.

When it’s too hard to work with data, it doesn’t spark joy, but dread.

The key things that make it hard to work with data are the wastes of overprocessing (complexity) and transportation (losing provenance), which reduce trust or increase the cognitive load of trusting data, and the waste of motion, which makes what should be simple tasks difficult. These wastes are often the root cause of defects in end-products, or lead to the supposition of defects, sparking further dread.

The more we strip back these wastes, the more we can spark joy in what remains.

On discarding first

KonMari observes that storage considerations interrupt the process of discarding. She recommends that discarding comes first, and storage comes second, and the activities remain distinct. If you start to think about where to put something before you have decided whether to keep or discard it, you will stop discarding.

Wastes of data production might spark dread instead of joy in many ways. But your path to joy is easier if you have less stuff to start with. So first, figure out what data to throw away without remediation.

The data to throw away is either overproduction (reports that are never viewed) or inventory (intermediate data stores with no end-product). Overproduction can be identified by asking end-users or by checking consumption data. Discard overproduction. Then, look for inventory in the form of data that has no consumers, based on lineage or queries. Discard the inventory, and repeat the process to progressively remove newly identified upstream inventory.

This should be an early stage on your journey to decluttering data, lest you make efficient that which should not be done at all.
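The discard-and-repeat loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: it assumes lineage is available as a simple mapping from each dataset to its direct consumers, and that consumption data has already told us which end-products are actually viewed. The function name `find_discardable` and all dataset names are hypothetical.

```python
from typing import Dict, List, Set


def find_discardable(consumers: Dict[str, Set[str]],
                     used_end_products: Set[str]) -> List[str]:
    """Return datasets to discard, in the order they become discardable.

    consumers: hypothetical lineage map, dataset -> its direct consumers.
    used_end_products: end-products that consumption data shows are viewed.
    """
    # Work on a copy so the caller's lineage map is left untouched.
    remaining = {name: set(down) for name, down in consumers.items()}
    discarded: List[str] = []
    while True:
        # Discardable: nothing left consumes it, and it is not a used end-product.
        candidates = sorted(
            name for name, down in remaining.items()
            if not down and name not in used_end_products
        )
        if not candidates:
            return discarded
        for name in candidates:
            discarded.append(name)
            del remaining[name]
        # Removing consumers may expose new upstream inventory next round.
        for down in remaining.values():
            down.difference_update(candidates)


# Illustrative lineage: only sales_report is actually viewed.
lineage = {
    "raw_orders": {"orders_clean"},
    "orders_clean": {"sales_report", "orders_staging"},
    "orders_staging": {"legacy_report"},
    "legacy_report": set(),
    "sales_report": set(),
    "raw_clicks": {"clicks_agg"},
    "clicks_agg": set(),
}
to_discard = find_discardable(lineage, used_end_products={"sales_report"})
print(to_discard)
```

In this example the first pass discards the unviewed `legacy_report` and `clicks_agg`; the second pass then discards `orders_staging` and `raw_clicks`, which were only feeding the things just removed. Everything on the path to `sales_report` is kept.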

On putting things away

KonMari observes that mess and clutter are a result of not putting things away. Consequently she recommends that storage systems should make it easy to put things away, not easy to get them out.

Putting data away is moving it from active use to cold storage or deleting it altogether. If you plan for how to put data away with every new use case, you can reduce the pain of bulk or surgical removal later. Source metadata about permitted uses combined with downstream lineage are the key ingredients for making it easy to take data out of service.
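As a rough sketch of how those two ingredients combine, the Python below walks downstream lineage from a source to find everything affected by a bulk removal, then filters by recorded purpose for a surgical one. It assumes, purely for illustration, that lineage is a mapping from each dataset to its direct children and that each derived dataset records a single purpose; the function names and dataset names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def downstream_of(source: str, lineage: Dict[str, List[str]]) -> Set[str]:
    """Bulk removal: every dataset derived, directly or indirectly, from source."""
    seen: Set[str] = set()
    queue = deque([source])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


def datasets_to_retire(source: str, withdrawn_use: str,
                       lineage: Dict[str, List[str]],
                       purpose: Dict[str, str]) -> List[str]:
    """Surgical removal: downstream datasets recorded against the withdrawn use."""
    return sorted(
        name for name in downstream_of(source, lineage)
        if purpose.get(name) == withdrawn_use
    )


# Illustrative metadata (assumed, not from any real catalogue).
lineage = {
    "crm_contacts": ["contacts_clean"],
    "contacts_clean": ["campaign_audience", "churn_features"],
    "churn_features": ["churn_scores"],
}
purpose = {
    "contacts_clean": "service",
    "campaign_audience": "marketing",
    "churn_features": "retention",
    "churn_scores": "retention",
}

bulk = downstream_of("crm_contacts", lineage)
surgical = datasets_to_retire("crm_contacts", "marketing", lineage, purpose)
print(sorted(bulk), surgical)
```

If the purpose of each derived dataset is recorded when the use case is created, the surgical case reduces to a lookup. Without that metadata, the only safe option is the bulk list.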

On letting things go

A client of KonMari’s comments:

Up to now, I believed it was important to do things that added to my life … I realised for the first time that letting go is even more important than adding.

KonMari observes that, beyond the mechanics of managing stuff (or data), there is a psychological cost of clutter. Her clients often report feeling constrained by perceived responsibility to stuff that brings them no joy. I suspect the same is true in the organisation: we fail to recognise and embrace possibilities because we are constrained by perceived responsibilities to data that ultimately has little or no value.

Imagine if we could throw off those shackles. That’s worth letting a few things go.