7 wastes of data production – when pipelines become sewers

I recently had the chance to present an updated version of my 7 wastes of data production talk at DataEngBytes Melbourne 2023. I think the talk was stronger this time around and I really appreciated all the great feedback from the audience.

Thanks to Peter Hanssens and the DEB crew for having me as part of an impressive speaker lineup and for putting on a great event.

Check out the video below and the slides.

The Pipes screensaver: a very appropriate thumbnail

For the earlier versions, see the original 7 wastes post and the 2021 LAST Conference version Data mesh: a lean perspective.

Outline

There’s a lot of ground to cover in 30 minutes with 7 wastes from both run and build lenses, plus 5 lean principles to address the waste. I’ll leave the summary here and encourage you to watch the video or read the slides if you want to know more.

| Waste | Run | Build |
|---|---|---|
| Overproduction | Unused products | Unused products |
| Inventory | Stored or processed data not used | Development work in progress |
| Overprocessing | Correcting poor quality data | Working with untrusted data |
| Transportation | Replication without reproducibility | Handoffs between teams |
| Motion | Manual intervention or finishing | Context switching |
| Waiting | Delays in taking action on business events | Delays due to handoffs or feedback lead time |
| Defects | Defects introduced into data at any point | Defects introduced into processing code |

7 wastes of data production in run and build

Perspectives edition #27

I was thrilled to contribute to Thoughtworks Perspectives edition #27: Power squared: How human capabilities will supercharge AI’s business impact. There are a lot of great quotes from my colleagues Barton Friedland and Ossi Syd in the article, and here’s one from me:

The ability to build or consume solutions isn’t necessarily going to be your differentiator – but the ability to integrate them into your processes and products in the best way is.

It’s also neat to be quoted in Português and Español!

22 rules of generative AI

Thinking about adopting, incorporating or building generative AI products? Here are some things to think about, depending on your role or roles.

I assume you’re bringing your own product idea(s) based on an understanding of an opportunity or problems for customers. These rules therefore focus on the solution space.

Solutions with generative AI typically involve creating, combining or transforming some kind of digital content. Digital content may mean text, code, images, sound, video, 3D, etc, for digital consumption, or it may mean digitized designs for real world products or services such as code (again), recipes, instructions, CAD blueprints, etc. Some of this may also be relevant for how you use other people’s generative AI tools in your own work.

Strategy and product management roles

1. Know what input you have to an AI product or feature that’s difficult to replicate. This is generally proprietary data, but it may be an algorithm tuned in-house, access to compute resources, or a particularly responsive deployment process, etc. This separates competitive differentiators from competitive parity features.

2. Interrogate the role of data. Do you need historical data to start, can you generate what you need through experimentation, or can you leverage your proprietary data with open source data, modelling techniques or SaaS products? Work with your technical leads to understand the multitude of mathematical and ML techniques available to ensure data adds the most value for the least effort.

3. Understand where to use open source or Commercial Off-The-Shelf (COTS) software for parity features, but also understand the risks of COTS including roadmaps, implementation, operations and data.

4. Recognise that functional performance of AI features is uncertain at the outset and variable in operation, which creates delivery risk. Address this by: creating a safe experimentation environment, supporting dual discovery (creating knowledge) and development (creating software) tracks with a continuous delivery approach, and – perhaps the hardest part – actually responding to change.

Design roles

5. Design for failure, and loss of vigilance in the face of rare failures. Failure can mean outputs that are nonsensical, fabricated, incorrect, or – depending on scope and training data – harmful.

6. Learn the affordances of AI technologies so you understand how to incorporate them into user experiences, and can effectively communicate their function to your users.

7. Study various emerging UX patterns. My quick take is that generative AI may be used in three ways: as a discrete tool with (considering #5) predictable results for the user, such as replacing the background in a photo; as a collaborator, reliant on a dialogue or back-and-forth iterative design or search process between the user and AI, such as ChatGPT; or as an author, producing a nearly finished work that the user then edits to their satisfaction (which comes with risk of subtle undetected errors).

8. Consider what role the AI is playing in the collaborator pattern – is it designer, builder, tester, or will the user decide? There is value in generating novel options to explore as a designer, in expediting complex workflows as a builder, and in verifying or validating solutions to some level of fidelity as a tester. However, for testing, remember you cannot inspect quality into a product, and consider building in quality from the start.

9. Design for explainability, to help users understand how their actions influence the output. (This overlaps heavily with #6)

10. More and more stakeholders will want to know what goes into their AI products. If you haven’t already, start on your labelling scheme for AI features, which may include: intended use, data ingredients and production process, warnings, reporting process, and so on, with reference to risk and governance below.

Data science and data engineering roles

11. Work in short cycles in multidisciplinary product teams to address end-to-end delivery risks.

12. Quantify the functional performance of systems, the satisfaction of guardrails, and confidence in these measures to support product decisions.

13. Make it technically easy and safe to work with and combine rich data.

14. Implement and automate a data governance model that enables delivery of data products and AI features to support the business strategy (i.e., a governance model that captures the concerns of other rules and stakeholders here).
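As a rough illustration of rule 12, one way to attach confidence to a functional performance or guardrail measure is to put an interval around an observed pass rate. The sketch below uses the Wilson score interval, which is one common choice rather than anything prescribed here. The function names, the roughly 95% level (z = 1.96) and the example threshold are all assumptions made for illustration.

```python
from __future__ import annotations

import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


def meets_threshold(successes: int, trials: int, threshold: float) -> bool:
    """Hypothetical decision rule: pass only if the lower bound clears the threshold."""
    lower, _ = wilson_interval(successes, trials)
    return lower >= threshold


if __name__ == "__main__":
    # Same observed pass rate (90%), different amounts of evidence.
    for passed, total in [(90, 100), (900, 1000)]:
        lower, upper = wilson_interval(passed, total)
        print(passed, total, round(lower, 3), round(upper, 3),
              meets_threshold(passed, total, threshold=0.85))
```

The same observed rate can support different product decisions depending on how many evaluations sit behind it, which is the sense in which confidence in a measure matters as much as the measure itself.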

Architecture and software engineering roles

15. Understand that each AI solution is narrow, but composable with other digital services. In this respect, treat each AI solution as a distinct service until a compelling case is made for consolidation. (Note that, as above, product management should be aware of how to make use of existing solutions.)

16. Consolidate AI platform services at the right level of abstraction. The implementation of AI services may be somewhat consistent, or it may be completely idiosyncratic depending on the solution requirements and available techniques. The right level of abstraction may be emergent and big up-front design may be risky.

17. Use continuous delivery for short feedback cycles and delivery that is both iterative – to reduce risk from knowledge gaps – and responsive – to reduce the risk of a changing world.

18. Continuous delivery necessitates a robust testing and monitoring strategy. Make use of test pyramids for both code and data for economical and timely quality assurance.
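To make the idea of a test pyramid for data in rule 18 a little more concrete, here is a minimal sketch with two layers: many cheap row-level checks at the base, and fewer batch-level checks above them. The `Order` record, its fields, and the thresholds are hypothetical and chosen only for illustration; a real pipeline would more likely use a dedicated data validation tool.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Order:
    """Hypothetical record type used only for this example."""
    order_id: str
    amount: float
    country: str


def row_errors(order: Order) -> list[str]:
    """Base of the pyramid: cheap checks that run on every row."""
    errors = []
    if not order.order_id:
        errors.append("missing order_id")
    if order.amount < 0:
        errors.append("negative amount")
    if len(order.country) != 2:
        errors.append("country is not a 2-letter code")
    return errors


def batch_errors(orders: list[Order], min_rows: int = 1,
                 max_invalid_share: float = 0.01) -> list[str]:
    """Next layer up: fewer, broader checks across a whole batch."""
    errors = []
    if len(orders) < min_rows:
        errors.append(f"expected at least {min_rows} rows, got {len(orders)}")
    ids = [o.order_id for o in orders]
    if len(set(ids)) != len(ids):
        errors.append("duplicate order_id")
    invalid = sum(1 for o in orders if row_errors(o))
    if orders and invalid / len(orders) > max_invalid_share:
        errors.append(f"{invalid} of {len(orders)} rows failed row-level checks")
    return errors


if __name__ == "__main__":
    batch = [Order("a", 10.0, "AU"), Order("a", -5.0, "NZ")]
    print(batch_errors(batch))
```

Row-level checks are fast enough to run on every commit and every load, while slower end-to-end comparisons (not shown) would sit at the top of the pyramid and run less often.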

Risk and governance roles

19. Privacy and data security are the foundation on which everything else is built.

20. Generative AI solutions, like other AI solutions, may also perpetuate harmful content, biases or correlations in their historical training data.

21. Understand that current generative AI solutions may be subject to some or all of the following legal and ethical issues, depending on their source data, training and deployment as a service: privacy, copyright or other violations in the collection of training data; outputs that plagiarise or create “digital forgeries” of training data; whether the aggregation and intermediation of individual creators at scale is monopoly behaviour, and whether original creators should be compensated; training data that may include harmful content (which may be replicated into harmful outputs); people who may have been exposed to harmful content in a moderation process; and the potentially substantial environmental costs of storing data and of compute for training and inference.

22. Develop strategies to address the further structural failure modes of AI solutions, such as: misalignment with user goals, deployment into ethically unsound applications, the issue of illusory progress where small gains may look promising but never cross the required threshold, the magnification of rare failures at scale and the resolution of any liability for those failures.

Conclusion

These are the types of role-based considerations I alluded to in Reasoning About Machine Creativity. The list is far from complete, and the reader would doubtless benefit from sources and references! I intended to write this post in one shot, which I did in 90 minutes while hitting the target 22 rules without significant editing, so I will return after some reflection. Let me know if these considerations are helpful in your roles.

Data Mesh Radio

I joined Scott Hirleman for an episode (#95) of the Data Mesh Radio podcast. Scott does great work connecting and educating the data mesh community, and we had fun talking about:

  • Fitness functions to define “what good looks like” for data mesh and guide the evolution of analytic data architecture and operating model
  • Team topologies as a system for organisational design that is sympathetic to data mesh
  • Driving a delivery program through use cases
  • Thin slicing and evolution of products

My episode is #95: Measuring Your Data Mesh Journey Progress with Fitness Functions.

Data mesh: a lean perspective

Data mesh can be understood as a response to lean wastes identified in data organisations. I paired with Ned Letcher to present this perspective at the LAST Conference 2021, which was much delayed due to COVID restrictions.

Lean wastes including overproduction, inventory, etc, may be concealed and made more difficult to address by centralised data systems and team architectures. Conversely, data mesh may make these wastes visible where they exist and provide mechanisms for reducing them.

The presentation is written up in this article, and the slides are also available.

Guiding the Evolution of Data Mesh with Fitness Functions

I presented this webinar with Zhamak Dehghani – see the recording Guiding the Evolution of Data Mesh with Fitness Functions. There was great engagement with the topic and we captured some questions and further thoughts on this mini-blog post, published a little later.

This presentation brought together the idea of architectural fitness functions from the book Building Evolutionary Architectures with the core data mesh principles and logical architecture.

Our thoughts around guiding fitness functions included the below. These high-level measurable objectives were supported by a range of proposed metrics. This table is a handy summary; check out the webinar for more.

| Principle | Guiding fitness functions |
|---|---|
| Domain Ownership | Scaling sources and consumers; Truthfulness; Domain autonomy; Reduced accidental complexity |
| Data as a Product | Serving users’ needs; Ease of discovery; Evaluation of quality; Service levels |
| Self-Serve Data Platform | Abstraction of complexity; Domain team autonomy; Protocols enable an ecosystem; Automation |
| Federated Computational Governance | Governance for common good; Degree of decentralisation; Interoperability; Increasing returns from scale |
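As a loose sketch of how one of these objectives might be turned into something measurable, the example below treats "Ease of discovery" as a fitness function with a metric and a target. The metric (share of data products with an owner, description and schema recorded), the 0.8 target and all class and function names are hypothetical; they are not the metrics proposed in the webinar.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class DataProduct:
    """Hypothetical catalogue entry used only for this example."""
    name: str
    has_owner: bool
    has_description: bool
    has_schema: bool


@dataclass(frozen=True)
class FitnessFunction:
    """An objective paired with a metric and the target it should meet."""
    objective: str
    metric: Callable[[Sequence[DataProduct]], float]
    target: float

    def evaluate(self, products: Sequence[DataProduct]) -> tuple[float, bool]:
        value = self.metric(products)
        return value, value >= self.target


def discoverable_share(products: Sequence[DataProduct]) -> float:
    """Assumed metric: share of products with owner, description and schema."""
    if not products:
        return 0.0
    documented = sum(
        1 for p in products
        if p.has_owner and p.has_description and p.has_schema
    )
    return documented / len(products)


ease_of_discovery = FitnessFunction("Ease of discovery", discoverable_share, target=0.8)

if __name__ == "__main__":
    catalogue = [
        DataProduct("orders", True, True, True),
        DataProduct("customers", True, True, True),
        DataProduct("clickstream", True, False, True),
    ]
    value, ok = ease_of_discovery.evaluate(catalogue)
    print(f"{ease_of_discovery.objective}: {value:.2f} (meets target: {ok})")
```

Running a check like this on a schedule gives a trend over time, which fits the idea of using fitness functions to guide evolution and not only to gate a single release.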

Scaling Change

Once upon a time, scaling production may have been enough to be competitive. Now, the most competitive organisations scale change to continually improve customer experience. How can we use what we’ve learned scaling production to scale change?

Metaphors for scaling

I recently presented a talk titled “Scaling Change”. In the talk I explore the connections between scaling production, sustaining software development, and scaling change, using metaphors, maths and management heuristics. The same model of change applies from organisational, marketing, design and technology perspectives. How can factories, home loans and nightclubs help us to think about and manage change at scale?

Read on with the spoiler post if you’d rather get right to the heart of the talk.