7 wastes of data production – when pipelines become sewers

I recently had the chance to present an updated version of my 7 wastes of data production talk at DataEngBytes Melbourne 2023. I think the talk was stronger this time around and I really appreciated all the great feedback from the audience.

Thanks to Peter Hanssens and the DEB crew for having me as part of an impressive speaker lineup and for putting on a great event.

The video will be coming very soon (I’ll add it here when it’s up) and you can check out the slides in the meantime.

For the earlier versions, see the original 7 wastes post and the 2021 LAST Conference version Data mesh: a lean perspective.

Outline

There’s a lot of ground to cover in 30 minutes with 7 wastes from both run and build lenses, plus 5 lean principles to address the waste. I’ll leave the summary here and encourage you to watch the video or read the slides if you want to know more.

Waste | Run | Build
Overproduction | Unused products | Unused products
Inventory | Stored or processed data not used | Development work in progress
Overprocessing | Correcting poor quality data | Working with untrusted data
Transportation | Replication without reproducibility | Handoffs between teams
Motion | Manual intervention or finishing | Context switching
Waiting | Delays in taking action on business events | Delays due to handoffs or feedback lead time
Defects | Defects introduced into data at any point | Defects introduced into processing code
7 wastes of data production in run and build

Privacy puzzles

I contributed a database reconstruction attack demonstration to the companion repository to the excellent book Practical Data Privacy by my colleague Katharine Jarmul.

My interest was piqued by my colleague Mitchell Lisle sharing the paper Understanding Database Reconstruction Attacks on Public Data by US Census Bureau authors Simson Garfinkel, John M. Abowd, and Christian Martindale. Mitchell and I collaborated on a pair of solutions using mathematical optimisation/satisfaction techniques. Check out Mitchell’s solution using the Z3 library. I used OR-Tools instead.

The notebook demonstrates that individual rows of a database may be reconstructed, even if only summary statistics are shared, by considering the constraints that the statistics place on possible values of the data. Constraints include mean and median for all numerical values globally and for various cohorts of records determined by class values.

Note that the intent of this notebook is not to compromise any private data, but to raise awareness of the potential for privacy breaches due to reconstruction attacks!
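To give a flavour of the technique, here’s a minimal sketch (not the notebook’s actual code, and with made-up statistics) of how a constraint solver such as OR-Tools CP-SAT can recover candidate rows that are consistent with a cohort’s published mean and median:

```python
# Minimal sketch: reconstruct a small cohort's ages from published statistics.
# The cohort size, mean and median below are hypothetical illustration values.
from ortools.sat.python import cp_model

n, mean_age, median_age = 5, 38, 30

model = cp_model.CpModel()
# Sorted age variables: enforcing the order makes the median a single variable.
ages = [model.NewIntVar(0, 115, f"age_{i}") for i in range(n)]
for i in range(n - 1):
    model.Add(ages[i] <= ages[i + 1])

# Mean constraint: the sum of ages is fixed by the published mean.
model.Add(sum(ages) == mean_age * n)
# Median constraint: with n odd, the middle (sorted) value is the median.
model.Add(ages[n // 2] == median_age)

solver = cp_model.CpSolver()
if solver.Solve(model) in (cp_model.OPTIMAL, cp_model.FEASIBLE):
    print([solver.Value(a) for a in ages])  # one consistent reconstruction
```

With more published statistics (per-cohort means, medians and counts), the feasible set shrinks rapidly, which is exactly what makes reconstruction attacks possible.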

Perspectives edition #27

I was thrilled to contribute to Thoughtworks Perspectives edition #27: Power squared: How human capabilities will supercharge AI’s business impact. There are a lot of great quotes from my colleagues Barton Friedland and Ossi Syd in the article, and here’s one from me:

The ability to build or consume solutions isn’t necessarily going to be your differentiator – but the ability to integrate them into your processes and products in the best way is.

It’s also neat to be quoted in Português and Español!

Electrifying the world with AI Augmented decision-making

I wrote an article about optimising the design of EV charging networks. It’s a story of work done by a team at Thoughtworks, demonstrating the potential of AI augmented decision-making (including some cool optimisation techniques), in this rapidly evolving but durably important space.

We were able to thread together these many [business problem, AI techniques, data sourcing, technology] concerns in such a short time because of our relentless focus — not on the code or the mathematics — but on generating value: namely making it easier for humans to make decisions according to business goals.

Humour me – DRY vs WRY

Don’t Repeat Yourself (DRY) is a tenet of software engineering, but – humour me – let’s consider some reasons Why to Repeat Yourself (WRY).

LEGO reuse lessons

In 2021, I wrote a series of posts analysing LEGO® data about parts appearing in sets to understand what it might tell us about reuse of software components in digital products. I’ve finally summarised the key findings that show both DRY and WRY forces at play. We’re strictly talking about reuse vs specialisation (not repetition), but I think the lessons on the reuse dynamic are relevant.

Exponential growth in volume

The total number of parts and sets ever created has grown exponentially over the years. The result is that in 2021, there were 10 times as many parts as 30 years ago, and about 5 times as many sets. Thus, even though parts can be re-combined to create new models, new parts are constantly introduced at an increasing rate.

Bar chart of new lego parts each year, with a line showing total parts. The vertical scale is logarithmic, and both new and total parts follow a straight line on the chart, indicating exponential growth

Read more in LEGO as a Metaphor for Software Reuse – Does the Data Stack Up?

Exponential decay in lifespan

While the oldest parts are 70 years old, only about 1/7 of all parts ever created are in active use in 2021 and fully 1/3 of parts have a lifespan of only one year. Over time, survival decays exponentially. In each of the first 5 years, 50% of parts don’t survive to the next year. Beyond that, remaining parts halve in number every seven years.

Chart of lego part lifespans. The vertical axis is logarithmic. The scattered points can be approximated with two linear segments, one for the first five years, showing a half-life of 0.9 years, and another for the remaining ~70 years, showing a half-life of 7.2 years

Read more in LEGO and Software – Lifespans.

Power-law distribution of reuse

Some parts are heavily reused in sets offered for sale, but the vast majority of parts are never reused or only reused a little, which can be approximated with a power law. Reuse is far more uneven than a typical 80/20 distribution: 80% of reuse instances are due to only 3% of parts, and 20% of parts account for 98% of reuse instances. At the other end of the spectrum, 60% of parts are used in only one set, and only 10% of parts appear in more than 10 sets.

Log-log scatter plot of count of part inclusion in sets. Two linear segments fitted to plot show power law approximation for reuse

Read more in LEGO and Software – Part Reuse.

Churn driven by growth and specialisation

Given the growth and specialisation profiles, total churn of parts approached 100% in 2020, whereas in the decade centred on 1990, it was only about 20%. High churn is consistent with a small base of heavily reused parts, and ever-increasing numbers of specialised parts with short lifespans.

Read more in LEGO and Software – Variety and Specialisation and LEGO and Software – Lifespans.

Part roles emerge from the reuse graph

We can understand more about the roles played by specialised and reused parts through analysis of the graph of connections between parts and sets, and identify new opportunities for recombination.

Network visualisation showing association rules between common parts

Read more in LEGO and Software – Part Roles.

Lessons for software

What would I take away for software?

Reusability of components doesn’t necessarily lead to reuse. The majority of reuse will come from a few components that perform fundamental roles. Focus on getting this right.

More – and more specialised – products may drive specialisation of components. Digital product lines are never static and we may expect some components to have short lifespans and churn heavily. Good development practices and loosely-coupled architectures allow teams to work with ephemeral and idiosyncratic components. However, ongoing review can still identify opportunities to harvest patterns and consolidate specialised components.

Note that, even when we produce multiple similar code artefacts, we may see effective reuse of higher-level approaches and concepts.

These aren’t prescriptive rules, but a reflection of the patterns in the data. There are more comprehensive observations in the individual articles. We should remember that reuse is not the primary aim of producing software, but a principle that supports better organisation towards sustainably responsive delivery.

Discussion of data relevance

Why is LEGO data relevant? In many conversations I’ve had about software reuse, LEGO is presented as a desirable model. This may be peculiar to me, but I think it is a fairly common conversation.

The number of possible mechanical couplings of just a handful of bricks is indeed enormous, but I wanted to understand how these components had actually been assembled into products that were sold to customers over some period of time. The data is sourced from the Rebrickable API. I’ve just taken part data at face value in this analysis; if something is recorded as a distinct part, I treat it as a distinct part. There may be better ways to translate the LEGO metaphor to software components.

Maybe there’s a generational factor in LEGO as a metaphor too; in the 1980s and 1990s, you would play with a much smaller and more stable base of active parts than the 2000s and 2010s, and that could shape your thinking. I’d love to hear feedback.

LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this site.

I did it my way – hand-rolled navigation with open spatial data

Sure, commercial maps app directions are great, but have you ever found the customisation options limited? What if you want to use bike paths and back streets when cycling, or avoid winding roads that might make backseat passengers car-sick on a road trip?

Example of cycling route options that are direct (B) and pleasant (A)

The paved route

OpenStreetMap and OpenRouteService do provide this type of functionality, and could be considered for use as-is or with further customisation. PostGIS and pgRouting provide capabilities if you bring your own data. Many dedicated apps support groups with particular mobility needs.

My way, off the highway

In researching these capabilities, however, and because I’m a fan of maps who wanted to understand the whole data transformation journey, I decided to hand-roll my own navigation solution using pyshp, numpy, scipy, and networkx, all visualised with matplotlib. The end result is far from polished, but it can ingest 1.1M road geometries for the Australian state of Victoria and generate a topological graph for routing within minutes, then use that map to generate turn-by-turn directions in real time.

See the source code and the brief write-up below if you’re interested.

Data

The solution uses data from the Vicmap Transport data set, which provides road centrelines for highways, streets, tracks, paths, etc, for all of Victoria and some bordering regions. The spatial features are augmented with 71 attributes useful for routing, including road names, permitted directions of travel, height limits, etc. I used a GDA2020 datum grid projection shapefile export. Pyshp provides a list of geometries and attributes via shapeRecords.
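A minimal ingestion sketch might look like the following; the shapefile name and attribute handling are illustrative rather than the actual Vicmap schema:

```python
# Sketch of reading road centrelines with pyshp; the file name below is a
# placeholder rather than the real Vicmap Transport export name.
import shapefile  # pyshp

reader = shapefile.Reader("TR_ROAD.shp")  # hypothetical shapefile name
fields = [f[0] for f in reader.fields[1:]]  # skip pyshp's DeletionFlag field

segments = []
for shape_rec in reader.shapeRecords():
    attrs = dict(zip(fields, shape_rec.record))
    # shape.points is the polyline as a list of (easting, northing) pairs
    segments.append((shape_rec.shape.points, attrs))

print(len(segments), "road segments loaded")
```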

Topology

Vicmap Transport road centrelines are collections of polylines. The endpoints of these polylines (aka road segments) helpfully coincide where we might infer a continuous stretch of road or an intersection. This allows us to describe how roads are connected with a graph.

Map of Bairnsdale township showing spatial location of roads overlaid with their network connectivity
Road network modeled as a graph in Bairnsdale, Victoria (data: Vicmap Transport)

Each endpoint will map to a node in the graph. The node may be unique to one road segment, or shared between multiple road segments if it’s at a junction. We find coincident endpoints for a shared node with scipy’s KDTree query_pairs method. The road segments with coincident endpoints then define the edges of the graph at this node. A directed graph can be defined using the direction code attribute of each segment (forward, reverse, or both directions).
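Sketching that construction in code (my paraphrase of the approach, continuing from the ingestion sketch, with a hypothetical direction-code attribute):

```python
# Sketch: snap coincident segment endpoints to shared nodes and build a
# directed routing graph. Assumes `segments` as loaded above; the direction
# attribute name and codes ('F', 'R', 'B') are illustrative placeholders.
import numpy as np
from scipy.spatial import KDTree
import networkx as nx

endpoints = np.array([pt for pts, _ in segments for pt in (pts[0], pts[-1])])

# Map endpoints that coincide (within tolerance) to a single node id.
# This is a simplistic pass; a proper union-find is more robust.
tree = KDTree(endpoints)
node_id = np.arange(len(endpoints))
for i, j in tree.query_pairs(r=0.01):  # 1cm tolerance in grid metres
    node_id[max(i, j)] = node_id[min(i, j)]

graph = nx.DiGraph()
for k, (pts, attrs) in enumerate(segments):
    u, v = node_id[2 * k], node_id[2 * k + 1]
    direction = attrs.get("DIRECTION", "B")  # hypothetical attribute name
    if direction in ("F", "B"):
        graph.add_edge(u, v, segment=k)
    if direction in ("R", "B"):
        graph.add_edge(v, u, segment=k)
```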

Routing

Two routes through Bairnsdale township overlaid on the network connectivity map

With a graph representation of the road network, we can find a route between any two nodes (if connected) with standard algorithms. The examples above and below use Dijkstra’s algorithm to find the shortest path based on edge weights that reflect our routing preferences. The orange route is “fewest hops” (count of road segments) and the green route is “shortest distance”. The geometric length of road segments is calculated in a post-processing pass over the ingested polyline data, and assigned as a weight to each edge.

Two routes through the wider Bairnsdale area overlaid on the network connectivity map
Routing options in the area surrounding Bairnsdale
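In networkx terms, the routing step is roughly the following sketch, assuming the graph built above and a length attribute per edge (the node ids here are hypothetical):

```python
# Sketch: shortest paths over the directed routing graph.
import networkx as nx

source, target = 12345, 67890  # hypothetical node ids

# "Fewest hops": every edge costs 1.
hops_route = nx.shortest_path(graph, source, target)

# "Shortest distance": Dijkstra over the geometric length weights.
distance_route = nx.shortest_path(graph, source, target, weight="length")
```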

Optimisation and scaling

My first spikes were hideously inefficient, but once the method was established, there was a lot of room for improvement. I addressed three major performance bottlenecks as I scaled from processing 50k road segments in 50 minutes, to 1.1M road segments in 30 seconds. These figures are for a 2017 MacBook Pro or a free Colab instance, which performed roughly similarly.

Processing stages and approaches, with end-to-end times while scaling from 50k to 1.1M segments:

Coincident segment endpoints
  • For loop accumulating unique ids with a distance test: 50 mins
  • numpy array calculation (argwhere distance < ε): 6 mins
  • scipy.spatial.KDTree.query_pairs: 2 mins

Shared node mapping
  • List comprehension elementwise map: broke Colab
  • numpy materialised mapping: < 2 mins

Directed edges (previously undirected)
  • For loop accumulating correctly directed edges case-wise and discarding topological duplicates: > 12 hrs
  • numpy vectorisation of the loop’s conditional logic: 30 sec

Stages of optimisation to support scaling

An additional ~30s is required to post-process the geometric length of segments as an attribute per edge, and I imagine similar times for other derived edge attributes; for instance, we might add finer-grained average cycling speed per segment, or traffic risk factors, etc.
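That post-processing pass might be sketched like this, reusing the segments and graph from the earlier sketches:

```python
# Sketch: geometric length of each road segment, computed from its polyline
# vertices and attached to the corresponding graph edges as a weight.
import numpy as np

# Length of every segment in one pass over its polyline vertices.
lengths = np.array([
    np.sum(np.hypot(*np.diff(np.asarray(pts), axis=0).T))  # metres in grid coords
    for pts, _ in segments
])

# Attach lengths to the routing graph as the 'length' edge weight.
for u, v, data in graph.edges(data=True):
    data["length"] = lengths[data["segment"]]
```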

For calculating routes (i.e., at inference time), it takes about 1-4s to find a shortest path, depending on the length of the route (using pure Python networkx). We can now find routes of over 950km in length, from Mildura in the state’s north-west to Mallacoota in the east.

Two long routes overlaid on sampled road data from all of Victoria and surrounds
Two routes from Mildura to Mallacoota

More latitude for navigation

We would like to be able to find the start and end nodes of a route from latitude and longitude. However, as the nodes in our routing graph are located by VicGrid grid coordinates (eastings and northings), we first need to “wrap” this planar grid around the roughly spherical Earth. While geometrically complex (see below), it’s easy to do this transformation back and forth between coordinate systems with pyproj.

A grid with axis-aligned arrow pairs at each point showing how space is squashed or stretched in 2D when transforming from eastings, northing to latitude, longitude
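A sketch of the round trip with pyproj; the EPSG codes for GDA2020 geographic and GDA2020 / Vicgrid are my assumption:

```python
# Sketch: convert between latitude/longitude and VicGrid eastings/northings.
# EPSG:7844 (GDA2020 geographic) and EPSG:7899 (GDA2020 / Vicgrid) are assumed.
from pyproj import Transformer

to_grid = Transformer.from_crs("EPSG:7844", "EPSG:7899", always_xy=True)
to_latlon = Transformer.from_crs("EPSG:7899", "EPSG:7844", always_xy=True)

easting, northing = to_grid.transform(144.9631, -37.8136)  # lon, lat (Melbourne CBD)
lon, lat = to_latlon.transform(easting, northing)          # back again
```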

Turn-by-turn navigation directions

With the start and end nodes located from latitude and longitude, a route can be calculated as above. Turn-by-turn directions can then be derived by considering the geometry of the road network at each intermediate node on the route, and what instructions users might require, for instance:

  • Determine the compass direction of travel along a road segment to initiate travel in the correct direction,
  • Calculate the angle between entry and exit directions at intersections to provide turn directions as left, right, straight, etc,
  • Use the geometric length of road segments to provide distance guidance,
  • Consolidate to a minimum set of directions by identifying where explicit guidance is not required (e.g., continuing straight on the same road), and
  • Render the instructions into an easily (?) human-consumable form, with natural language descriptions of appropriate precision.
['travel south west on Holly Court',
 'continue for 190m on Holly Court',
 'turn right into Marigold Crescent',
 'continue for 360m on Marigold Crescent',
 'go straight into Gowanbrae Drive',
 'continue for 420m on Gowanbrae Drive',
 'turn left into Gowanbrae Drive',
 'continue for 150m on Gowanbrae Drive',
 'turn left into Lanark Way',
 'continue for 170m on Lanark Way']

This constitutes a rather neat first approximation to commercial turn-by-turn directions, but I suspect it suffers in many edge cases, like roundabouts and slip lanes.
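For the turn classification step in particular, the bearing and angle calculation might be sketched as follows (an illustration of the idea, not the project’s code):

```python
# Sketch: classify a turn at an intersection from the entry and exit bearings.
import numpy as np

def bearing(p_from, p_to):
    """Compass-style bearing in degrees between two grid points."""
    dx, dy = p_to[0] - p_from[0], p_to[1] - p_from[1]
    return np.degrees(np.arctan2(dx, dy)) % 360  # 0 = grid north, clockwise

def turn_instruction(entry_bearing, exit_bearing):
    delta = (exit_bearing - entry_bearing + 180) % 360 - 180  # signed turn angle
    if abs(delta) < 30:                                       # threshold is illustrative
        return "go straight"
    return "turn right" if delta > 0 else "turn left"

print(turn_instruction(bearing((0, 0), (0, 100)), bearing((0, 100), (80, 180))))
# a north-then-north-east dogleg classifies as 'turn right'
```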

Next steps

With a drive-by look at the key elements, the road ahead to future “my way” hand-rolled navigation is clearer. An essential next step would be an interactive map interface. However, making this prototype roadworthy also likely needs more data wrangling under the hood (e.g., for cycling-specific data), a review of where to leverage existing open services, and polishing edge cases to a mirror finish.

22 rules of generative AI

Thinking about adopting, incorporating or building generative AI products? Here are some things to think about, depending on your role or roles.

I assume you’re bringing your own application(s) based on an understanding of an opportunity or problem for customers. These rules therefore focus on the solution space.

Solutions with generative AI typically involve creating, combining or transforming some kind of digital content. Digital content may mean text, code, images, sound, video, 3D, etc, for digital consumption, or it may mean digitized designs for real world products or services such as code (again), recipes, instructions, CAD blueprints, etc. Some of this may also be relevant for how you use other people’s generative AI tools in your own work.

Strategy and product management roles

1. Know what input you have to an AI product or feature that’s difficult to replicate. This is generally proprietary data, but it may be an algorithm tuned in-house, access to compute resources, or a particularly responsive deployment process, etc. This separates competitive differentiators from competitive parity features.

2. Interrogate the role of data. Do you need historical data to start, can you generate what you need through experimentation, or can you leverage your proprietary data with open source data, modelling techniques or SaaS products? Work with your technical leads to understand the multitude of mathematical and ML techniques available, to ensure data adds the most value for the least effort.

3. Understand where to use open source or Commercial Off-The-Shelf (COTS) software for parity features, but also understand the risks of COTS including roadmaps, implementation, operations and data.

4. Recognise that functional performance of AI features is uncertain at the outset and variable in operation, which creates delivery risk. Address this by: creating a safe experimentation environment, supporting dual discovery (creating knowledge) and development (creating software) tracks with a continuous delivery approach, and – perhaps the hardest part – actually responding to change.

Design roles

5. Design for failure, and loss of vigilance in the face of rare failures. Failure can mean outputs that are nonsensical, fabricated, incorrect, or – depending on scope and training data – harmful.

6. Learn the affordances of AI technologies so you understand how to incorporate them into user experiences, and can effectively communicate their function to your users.

7. Study various emerging UX patterns. My quick take: generative AI may be used as a discrete tool with (considering #5) predictable results for the user, such as replacing the background in a photo; as a collaborator, reliant on a back-and-forth iterative design or search dialogue between the user and AI, such as ChatGPT; or as an author, producing a nearly finished work that the user then edits to their satisfaction (which comes with the risk of subtle undetected errors).

8. Consider what role the AI is playing in the collaborator pattern – is it designer, builder, tester, or will the user decide? There is value in generating novel options to explore as a designer, in expediting complex workflows as a builder, and in verifying or validating solutions to some level of fidelity as a tester. However, for testing, remember you cannot inspect quality into a product, and consider building in quality from the start.

9. Design for explainability, to help users understand how their actions influence the output. (This overlaps heavily with #6)

10. More and more stakeholders will want to know what goes into their AI products. If you haven’t already, start on your labelling scheme for AI features, which may include: intended use, data ingredients and production process, warnings, reporting process, and so on, with reference to risk and governance below.

Data science and data engineering roles

11. Work in short cycles in multidisciplinary product teams to address end-to-end delivery risks.

12. Quantify the functional performance of systems, the satisfaction of guardrails, and confidence in these measures, to support product decisions.

13. Make it technically easy and safe to work with and combine rich data.

14. Implement and automate a data governance model that enables delivery of data products and AI features in support of the business strategy (i.e., a governance model that captures the concerns of the other rules and stakeholders here).

Architecture and software engineering roles

15. Understand that each AI solution is narrow, but composable with other digital services. In this respect, treat each AI solution as a distinct service until a compelling case is made for consolidation. (Note that, as above, product management should be aware of how to make use of existing solutions.)

16. Consolidate AI platform services at the right level of abstraction. The implementation of AI services may be somewhat consistent, or it may be completely idiosyncratic depending on the solution requirements and available techniques. The right level of abstraction may be emergent and big up-front design may be risky.

17. Use continuous delivery for short feedback cycles and delivery that is both iterative – to reduce risk from knowledge gaps – and responsive – to reduce the risk of a changing world.

18. Continuous delivery necessitates a robust testing and monitoring strategy. Make use of test pyramids for both code and data for economical and timely quality assurance.

Risk and governance roles

19. Privacy and data security are the foundation on which everything else is built.

20. Generative AI solutions, like other AI solutions, may also perpetuate harmful content, biases or correlations in their historical training data.

21. Understand that current generative AI solutions may be subject to some or all of the following legal and ethical issues, depending on their source data, training and deployment as a service:

  • privacy, copyright or other violations in the collection of training data,
  • outputs that plagiarise or create “digital forgeries” of training data,
  • whether the aggregation and intermediation of individual creators at scale is monopoly behaviour, and whether original creators should be compensated,
  • training data that may include harmful content (which may be replicated into harmful outputs),
  • people who may have been exposed to harmful content in a moderation process, and
  • the potentially substantial environmental costs of storing data and of the compute for training and inference.

22. Develop strategies to address the further structural failure modes of AI solutions, such as: misalignment with user goals, deployment into ethically unsound applications, the issue of illusory progress where small gains may look promising but never cross the required threshold, the magnification of rare failures at scale, and the resolution of any liability for those failures.

Conclusion

These are the type of role-based considerations I alluded to in Reasoning About Machine Creativity. The list is far from complete, and the reader would doubtless benefit from sources and references! I intended to write this post in one shot, which I did in 90 minutes while hitting the target 22 rules without significant editing, so I will return after some reflection. Let me know if these considerations are helpful in your roles.

Synthesising Semantle Solvers

Picking up threads from previous posts on solving Semantle word puzzles with machine learning, we’re ready to explore how different solvers might play along with people while playing the game online. Maybe you’d like to play speed Semantle against an artificially intelligent opponent, maybe you’d like a left-of-field hint on a tricky puzzle, or maybe it’s just fun to spectate at a cerebral robot battle.

Animation of a Semantle game from initial guess to completion

Substitute semantics

The solvers have a view of how words relate due to a similarity model that is encapsulated for ease of change. To date, we’ve used the same model as live Semantle, which is word2vec. But as this might be considered cheating, we can now also use a model based on the Universal Sentence Encoder (USE), to explore how the solvers perform with separated semantics.
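The encapsulation might be sketched as below; the interface and model-loading details are my paraphrase rather than the repository’s exact API, and the word2vec file path is a placeholder:

```python
# Sketch: two interchangeable similarity backends behind one small interface.
import numpy as np
from gensim.models import KeyedVectors
import tensorflow_hub as hub

class Word2VecSimilarity:
    def __init__(self, path="GoogleNews-vectors-negative300.bin"):  # placeholder path
        self.kv = KeyedVectors.load_word2vec_format(path, binary=True)

    def similarity(self, a, b):
        return float(self.kv.similarity(a, b))

class USESimilarity:
    def __init__(self):
        self.embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

    def similarity(self, a, b):
        va, vb = np.asarray(self.embed([a, b]))  # embed both words, cosine similarity
        return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
```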

Solver spec

To recap, the key elements of the solver ecosystem are now:

  • SimilarityModel – choice of word2vec or USE as above,
  • Solver methods (common to both gradient and cohort variants):
    • make_guess() – return a guess that is based on the solver’s current state, but don’t change the solver’s state,
    • merge_guess(guess, score) – update the solver’s state with information about a guess and a score,
  • Scoring of guesses by either the simulator or a Semantle game, where a game could also include guesses from other players.
Diagram illustrating elements of the solver ecosystem. Similarity model initialises solver state used to make guesses, which are scored by game and update solver state with scores. Other players can make guesses which also get scored

It’s a simplified reinforcement learning setup. Different combinations of these elements allow us to explore different scenarios.
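A sketch of how these elements interact in a basic self-play loop (the game/scoring object and the target-score check are my assumptions):

```python
# Sketch: one play loop tying the solver interface to a scoring source.
# `solver` implements make_guess()/merge_guess(); `game` scores guesses,
# whether it is the simulator or the live Semantle page.
def play(solver, game, max_guesses=200):
    for turn in range(max_guesses):
        guess = solver.make_guess()        # propose a guess; no state change yet
        score = game.score(guess)          # hypothetical scoring call
        solver.merge_guess(guess, score)   # update solver state with the result
        if score >= 100:                   # assuming the game reports 100 for the target
            return turn + 1                # number of guesses taken
    return None                            # did not finish within the budget
```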

Solver suggestions

Let’s look at how solvers might play with people. The base scenario friends is the actual history of a game played with people, completed in 109 guesses.

Word2Vec similarity

Solvers could complete a puzzle from an initial sequence of guesses from friends. In this particular configuration, both solvers generally better the friends result with ease when primed with the first 10 friend guesses.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Cohort finishes with fewest guesses, then gradient, then friends, with clear separation.

Solvers could instead make the next guess only, but based on the game history up to that point. Both solvers may permit a finish in slightly fewer guesses. The conclusion is that these solvers are good for hints, especially if they are followed!

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Cohort finishes with fewest guesses, then gradient, then friends, with marginal differences.

Maybe these solvers using word2vec similarity do have an unfair advantage though – how do they perform with a different similarity model? Using USE instead, I expected the cohort solver to be more robust than the gradient solver…

USE similarity

… but it seems that the gradient descent solver is more robust to a disparate similarity model, as one example of the completion scenario shows.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Gradient finishes with fewest guesses, then friends, then cohort, and the separation is clear.

The gradient solver also generally offers some benefit making a suggestion for just the next guess, but the cohort solver’s contribution is marginal at best.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Gradient finishes with fewest guesses, then friends, and cohort doesn't finish, but the differences are very minor.

These are of course only single instances of each scenario, and there is significant variation between runs. It’s been interesting to see this play out interactively, but a more comprehensive performance characterisation – with plenty of scope for understanding the influence of hyperparameters – may be in order.

Solver solo

The solvers can also play part or whole games solo (or with other players) in a live environment, using Selenium WebDriver to submit guesses and collect scores. The leading animation above is gradient-USE and the one below is a faster game using cohort-word2vec.

Animation of a Semantle game from initial guess to completion

So long

And that’s it for now! We have multiple solver configurations that can play online by themselves or with other people. They demonstrate how people and machines can collaborate to each bring their own strengths to solving problems; people with creative strategies and machines with a relentless ability to crunch through possibilities. They don’t spoil the fun of solving Semantle yourself or with friends, but they do provide new ways to play and to gain insight into how to improve your own game.

Postscript: seeing in space

Through all this I’ve considered various 3D visualisations of search through a semantic space with hundreds of dimensions. I’ve settled on the version below, illustrating a search for target “habitat” from first guess “megawatt”.

An animated rotating 3D view of a semi-regular collection of points joined by lines into a sequence. Some points are labelled with words. Represents high-dimensional semantic search in 3D.

This visualisation format uses cylindrical coordinates, broken out in the figure below. The cylinder (x) axis is the projection of each guess onto the line that connects the first guess to the target word. The cylindrical radius is the distance of each guess in embedding space from its projection on this line (cosine similarity seemed smoother than Euclidean distance here). The angle of rotation in cylindrical coordinates (theta) is the cumulative angle between the directions connecting guess n-1 to n and n to n+1. The result is an irregular helix expanding then contracting, all while twisting around the axis from first to last guess.

Three line charts on a row, with common x-axis of guess number, showing semi-regular lines, representing the cylindrical coordinates of the 3D visualisation. The left chart is x-axis, increasing from 0 to 1, middle is radius, from 0 to ~1 and back to 0, and right is angle theta, increasing from 0 to ~11 radians.
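The coordinates can be computed roughly as in the sketch below, assuming guesses is an array of the guess embeddings in order from first guess to target; for simplicity it uses plain Euclidean distance for the radius, where the post above notes a cosine-based variant looked smoother:

```python
# Sketch: cylindrical coordinates for visualising a guess sequence in 3D.
# `guesses` is assumed to be an (n, d) array of embeddings, first guess to target.
import numpy as np

def cylindrical_coords(guesses):
    axis = guesses[-1] - guesses[0]
    axis = axis / np.linalg.norm(axis)
    rel = guesses - guesses[0]

    x = rel @ axis                                             # projection onto the axis
    radius = np.linalg.norm(rel - np.outer(x, axis), axis=1)   # distance from the axis

    # Cumulative angle between successive step directions.
    steps = np.diff(guesses, axis=0)
    steps = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    turn = np.arccos(np.clip(np.sum(steps[:-1] * steps[1:], axis=1), -1, 1))
    theta = np.concatenate([[0, 0], np.cumsum(turn)])

    return x, radius, theta
```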

Second Semantle Solver

In the post Sketching Semantle Solvers, I introduced two methods for solving Semantle word puzzles, but I only wrote up one. The second solver here is based on the idea that the target word should appear in the intersection between the cohorts of possible targets generated by each guess.

Finding the semantle target through overlapping cohorts. Shows two intersecting rings of candidate words based on cosine similarity.

To recap, the first post:

  • introduced the sibling strategies side-by-side,
  • discussed designing for sympathetic sequences, so the solver can play along with humans, with somewhat explainable guesses, and
  • shared the source code and visualisations for the gradient descent solver.

Solution source

This post shares the source for the intersecting cohorts solver, including notebook, similarity model and solver class.

The solver is tested against the simple simulator for semantle scores from last time. Note that the word2vec model data for the simulator (and agent) is available at this word2vec download location.

Stylised visualisation of the search for a target word with intersecting  cohorts. Shows distributions of belief strength at each guess and strength and rank of target word

The solver has the following major features:

  1. A vocabulary, containing all the words that can be guessed,
  2. A semantic model, from which the agent can calculate the similarity of word pairs,
  3. The ability to generate cohorts of words from the vocabulary that are similar (in Semantle score) to a provided word (a guess), and
  4. An evolving strength of belief that each word in the vocabulary is the target.

In each step towards guessing the target, the solver does the following:

  1. Choose a word for the guess. The current choice is the word with the strongest likelihood of being the target, but it could equally be any other word from the solver’s vocabulary (which might help triangulate better), or it could be provided by a human player with their own suspicions.
  2. Score the guess. The Semantle simulator scores the guess.
  3. Generate a cohort. The guess and the score are used to generate a new cohort of words that would share the same score with the guess.
  4. Merge the cohort into the agent’s belief model. The score is added to the current belief strength for each word in the cohort, providing a proxy for likelihood for each word. The guess is also masked from further consideration.
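A condensed sketch of that state and update logic (my paraphrase; the repository’s solver class will differ in details, and the similarity-band parameter here loosely corresponds to the precision tuning discussed below):

```python
# Sketch: cohort generation and belief update for the intersecting-cohorts idea.
# `similarity_model` provides similarity(a, b); scores are on Semantle's 0-100 scale.
import numpy as np

class CohortSolver:
    def __init__(self, vocabulary, similarity_model, band=2.0):
        self.vocab = list(vocabulary)
        self.sim = similarity_model
        self.band = band                        # width of the similarity band
        self.belief = np.zeros(len(self.vocab)) # strength of belief per word
        self.masked = set()                     # words already guessed

    def make_guess(self):
        # Current choice: the unguessed word with the strongest belief.
        order = np.argsort(-self.belief)
        return next(self.vocab[i] for i in order if self.vocab[i] not in self.masked)

    def merge_guess(self, guess, score):
        # Cohort: words whose similarity to the guess is close to the reported score.
        for i, word in enumerate(self.vocab):
            if abs(self.sim.similarity(guess, word) * 100 - score) <= self.band:
                self.belief[i] += score         # proxy for likelihood of being the target
        self.masked.add(guess)                  # don't guess the same word again
```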

Show of strength

The chart below shows how the belief strength (estimated likelihood) of the target word gradually approaches the maximum belief strength of any word, as the target (which remains unknown until the end) appears in more and more cohorts.

Intersecting cohorts solver. Line chart showing the belief strength of the target word at each guess in relation to the maximum belief strength of remaining words.

We can also visualise the belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words.

Chart showing the cohort solver belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words

Superior solution?

The cohort solver can be (de)tuned to almost any level of performance by adjusting the parameters precision and recall, which determine the tightness of the similarity band and completeness of results from the generated cohorts. The gradient descent solver has potential for tuning parameters, but I didn’t explore this much. To compare the two, we’d therefore need to consider configurations of each solver. For now, I’m pleased that the two distinct sketches solve to my satisfaction!

Data Mesh Radio

I joined Scott Hirleman for an episode (#95) of the Data Mesh Radio podcast. Scott does great work connecting and educating the data mesh community, and we had fun talking about:

  • Fitness functions to define “what good looks like” for data mesh and guide the evolution of analytic data architecture and operating model
  • Team topologies as a system for organisational design that is sympathetic to data mesh
  • Driving a delivery program through use cases
  • Thin slicing and evolution of products

My episode is #95: Measuring Your Data Mesh Journey Progress with Fitness Functions.