Synthesising Semantle Solvers

Picking up threads from previous posts on solving Semantle word puzzles with machine learning, we’re ready to explore how different solvers might play along with people in the online game. Maybe you’d like to play speed Semantle against an artificially intelligent opponent, maybe you’d like a left-of-field hint on a tricky puzzle, or maybe it’s just fun to spectate at a cerebral robot battle.

Animation of a Semantle game from initial guess to completion

Substitute semantics

The solvers share a view of how words relate through a similarity model, which is encapsulated so it can easily be swapped. To date, we’ve used the same model as live Semantle, which is word2vec. But as this might be considered cheating, we can now also use a model based on the Universal Sentence Encoder (USE), to explore how the solvers perform with separated semantics.
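As a rough indication of the shape of this abstraction, a similarity model can reduce to a word-to-vector lookup plus cosine similarity. The class and attribute names below are my assumptions for illustration, not the actual source:

```python
import numpy as np

# Hypothetical sketch of the encapsulated similarity model; the real class and
# its loading details will differ. Embeddings could come from word2vec or USE.
class SimilarityModel:
    def __init__(self, embeddings: dict):
        self.embeddings = embeddings  # word -> vector

    def similarity(self, a: str, b: str) -> float:
        va, vb = self.embeddings[a], self.embeddings[b]
        return float(np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb)))
```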

Solver spec

To recap, the key elements of the solver ecosystem are now:

  • SimilarityModel – choice of word2vec or USE as above,
  • Solver methods (common to both gradient and cohort variants):
    • make_guess() – return a guess based on the solver’s current state, without changing that state,
    • merge_guess(guess, score) – update the solver’s state with information about a guess and a score,
  • Scoring of guesses by either the simulator or a Semantle game, where a game could also include guesses from other players.
Diagram illustrating elements of the solver ecosystem. Similarity model initialises solver state used to make guesses, which are scored by game and update solver state with scores. Other players can make guesses which also get scored

It’s a simplified reinforcement learning setup. Different combinations of these elements allow us to explore different scenarios.
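These interactions reduce to a small loop. Here is a minimal sketch, assuming a `game` object that scores guesses (the simulator or a live Semantle session) and an exact-match score of 100; everything beyond make_guess and merge_guess is an assumption:

```python
# Sketch only: the game interface and winning score are assumptions.
# The game could also feed in guesses made by other players via merge_guess.
def play(solver, game, max_guesses=200):
    for n in range(1, max_guesses + 1):
        guess = solver.make_guess()       # based on current state, no state change
        score = game.score(guess)         # scored by simulator or Semantle game
        if score >= 100:                  # assumed score for the target word
            return guess, n
        solver.merge_guess(guess, score)  # update solver state with the result
    return None, max_guesses
```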

Solver suggestions

Let’s look at how solvers might play with people. The base scenario friends is the actual history of a game played with people, completed in 109 guesses.

Word2Vec similarity

Solvers could complete a puzzle from an initial sequence of guesses from friends. In this particular configuration, both solvers comfortably better the friends result when primed with the first 10 friend guesses.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Cohort finishes with fewest guesses, then gradient, then friends, with clear separation.

Solvers could instead suggest only the next guess, based on the game history up to that point. Either solver may enable a finish in slightly fewer guesses. The conclusion is that these solvers make good hint generators, especially if their hints are followed!

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Cohort finishes with fewest guesses, then gradient, then friends, with marginal differences.

Maybe these solvers using word2vec similarity do have an unfair advantage though – how do they perform with a different similarity model? Using USE instead, I expected the cohort solver to be more robust than the gradient solver…

USE similarity

… but it seems that the gradient descent solver is more robust to a disparate similarity model, as one example of the completion scenario shows.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Gradient finishes with fewest guesses, then friends, then cohort, and the separation is clear.

The gradient solver also generally offers some benefit when suggesting just the next guess, but the cohort solver’s contribution is marginal at best.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Gradient finishes with fewest guesses, then friends, and cohort doesn't finish, but the differences are very minor.

These are of course only single instances of each scenario, and there is significant variation between runs. It’s been interesting to see this play out interactively, but a more comprehensive performance characterisation – with plenty of scope for understanding the influence of hyperparameters – may be in order.

Solver solo

The solvers can also play part or whole games solo (or with other players) in a live environment, using Selenium WebDriver to submit guesses and collect scores. The leading animation above is gradient-USE, and the one below is a faster game using cohort-word2vec.

Animation of a Semantle game from initial guess to completion
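As an illustrative sketch of how a solver could drive the live game with Selenium, something like the following would work; note the element selectors and score parsing below are assumptions about the Semantle page, not verified identifiers:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Illustrative only: the "guess" and "guesses" selectors are assumptions about
# the Semantle page structure, and the score parsing will need adjusting.
driver = webdriver.Chrome()
driver.get("https://semantle.com/")

def submit_guess(word: str) -> float:
    box = driver.find_element(By.ID, "guess")          # assumed input element id
    box.send_keys(word)
    box.send_keys(Keys.RETURN)
    # Read the latest score back from the guess table (assumed structure)
    rows = driver.find_elements(By.CSS_SELECTOR, "#guesses tr")
    return float(rows[0].find_elements(By.TAG_NAME, "td")[-1].text)
```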

So long

And that’s it for now! We have multiple solver configurations that can play online by themselves or with other people. They demonstrate how people and machines can collaborate to each bring their own strengths to solving problems; people with creative strategies and machines with a relentless ability to crunch through possibilities. They don’t spoil the fun of solving Semantle yourself or with friends, but they do provide new ways to play and to gain insight into how to improve your own game.

Postscript: seeing in space

Through all this I’ve considered various 3D visualisations of search through a semantic space with hundreds of dimensions. I’ve settled on the version below, illustrating a search for target “habitat” from first guess “megawatt”.

An animated, rotating 3D view of a semi-regular collection of points joined by lines into a sequence. Some points are labelled with words. Represents high-dimensional semantic search in 3D.

This visualisation format uses cylindrical coordinates, broken out in the figure below. The cylinder (x) axis is the projection of each guess onto the line that connects the first guess to the target word. The cylindrical radius is the distance of each guess in embedding space from its projection on this line (cosine similarity seemed smoother than Euclidean distance here). The angle of rotation in cylindrical coordinates (theta) is the cumulative angle between the directions connecting guess n-1 to n and n to n+1. The result is an irregular helix that expands then contracts, all while twisting around the axis from first to last guess.

Three line charts on a row, with common x-axis of guess number, showing semi-regular lines, representing the cylindrical coordinates of the 3D visualisation. The left chart is x-axis, increasing from 0 to 1, middle is radius, from 0 to ~1 and back to 0, and right is angle theta, increasing from 0 to ~11 radians.
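A sketch of how these coordinates might be computed from guess embeddings follows; the function name, array shapes, and the use of Euclidean radius (rather than the cosine variant mentioned above) are my assumptions for illustration:

```python
import numpy as np

def cylindrical_coords(guesses: np.ndarray, target: np.ndarray):
    """Sketch: guesses is (n, d) guess embeddings in order, target is (d,).
    Returns (x, r, theta) per guess, relative to the first-guess -> target axis."""
    span = target - guesses[0]
    length = np.linalg.norm(span)
    axis = span / length

    # x: projection of each guess onto the axis, scaled so the target sits at 1
    offsets = guesses - guesses[0]
    x = (offsets @ axis) / length

    # r: distance of each guess from its projection onto the axis (Euclidean here)
    projections = guesses[0] + np.outer(offsets @ axis, axis)
    r = np.linalg.norm(guesses - projections, axis=1)

    # theta: cumulative angle between successive guess-to-guess directions
    deltas = np.diff(guesses, axis=0)
    deltas /= np.linalg.norm(deltas, axis=1, keepdims=True)
    angles = np.arccos(np.clip(np.sum(deltas[:-1] * deltas[1:], axis=1), -1, 1))
    theta = np.concatenate([[0.0, 0.0], np.cumsum(angles)])

    return x, r, theta
```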

Second Semantle Solver

In the post Sketching Semantle Solvers, I introduced two methods for solving Semantle word puzzles, but I only wrote up one. The second solver, presented here, is based on the idea that the target word should appear in the intersection between the cohorts of possible targets generated by each guess.

Finding the semantle target through overlapping cohorts. Shows two intersecting rings of candidate words based on cosine similarity.

To recap, the first post:

  • introduced the sibling strategies side-by-side,
  • discussed designing for sympathetic sequences, so the solver can play along with humans, with somewhat explainable guesses, and
  • shared the source code and visualisations for the gradient descent solver.

Solution source

This post shares the source for the intersecting cohorts solver, including notebook, similarity model and solver class.

The solver is tested against the simple simulator for semantle scores from last time. Note that the word2vec model data for the simulator (and agent) is available at this word2vec download location.

Stylised visualisation of the search for a target word with intersecting cohorts. Shows distributions of belief strength at each guess and strength and rank of target word

The solver has the following major features:

  1. A vocabulary, containing all the words that can be guessed,
  2. A semantic model, from which the agent can calculate the similarity of word pairs,
  3. The ability to generate cohorts of words from the vocabulary that are similar (in Semantle score) to a provided word (a guess), and
  4. An evolving strength of belief that each word in the vocabulary is the target.

In each step towards guessing the target, the solver does the following (sketched in code after the list):

  1. Choose a word for the guess. The current choice is the word with the strongest likelihood of being the target, but it could equally be any other word from the solver’s vocabulary (which might help triangulate better), or it could be provided by a human player with their own suspicions.
  2. Score the guess. The Semantle simulator scores the guess.
  3. Generate a cohort. The guess and the score are used to generate a new cohort of words that would share the same score with the guess.
  4. Merge the cohort into the agent’s belief model. The score is added to the current belief strength for each word in the cohort, providing a proxy for likelihood for each word. The guess is also masked from further consideration.
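A minimal sketch of that loop follows. The class shape, the single band parameter, and the belief arrays are illustrative assumptions; the actual solver exposes the precision and recall parameters described further below:

```python
import numpy as np

# Hypothetical skeleton of the intersecting cohorts solver; only make_guess
# and merge_guess mirror the described interface, the rest is assumed.
class CohortSolver:
    def __init__(self, vocab, model, band=1.0):
        self.vocab = list(vocab)
        self.model = model                        # similarity model for word pairs
        self.band = band                          # width of the similarity band
        self.belief = np.zeros(len(self.vocab))   # strength of belief per word
        self.masked = np.zeros(len(self.vocab), dtype=bool)

    def make_guess(self):
        # Word with the strongest belief that hasn't already been guessed
        strengths = np.where(self.masked, -np.inf, self.belief)
        return self.vocab[int(np.argmax(strengths))]

    def merge_guess(self, guess, score):
        # Cohort: words that would score roughly the same against this guess
        for i, word in enumerate(self.vocab):
            if abs(self.model.similarity(guess, word) - score) <= self.band:
                self.belief[i] += score           # proxy for likelihood of being target
        if guess in self.vocab:
            self.masked[self.vocab.index(guess)] = True  # mask from further consideration
```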

Show of strength

The chart below shows how the belief strength (estimated likelihood) of the target word gradually approaches the maximum belief strength of any word, as the target (which remains unknown until the end) appears in more and more cohorts.

Intersecting cohorts solver. Line chart showing the belief strength of the target word at each guess in relation to the maximum belief strength of remaining words.

We can also visualise the belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words.

Chart showing the cohort solver belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words

Superior solution?

The cohort solver can be (de)tuned to almost any level of performance by adjusting the parameters precision and recall, which determine the tightness of the similarity band and completeness of results from the generated cohorts. The gradient descent solver has potential for tuning parameters, but I didn’t explore this much. To compare the two, we’d therefore need to consider configurations of each solver. For now, I’m pleased that the two distinct sketches solve to my satisfaction!

Data Mesh Radio

I joined Scott Hirleman for an episode (#95) of the Data Mesh Radio podcast. Scott does great work connecting and educating the data mesh community, and we had fun talking about:

  • Fitness functions to define “what good looks like” for data mesh and guide the evolution of analytic data architecture and operating model
  • Team topologies as a system for organisational design that is sympathetic to data mesh
  • Driving a delivery program through use cases
  • Thin slicing and evolution of products

My episode is #95 Measuring Your Data Mesh Journey Progress with Fitness Functions

Slackometer Hello World

Project Slackpose gives me one more excuse for hyperlocal exercise and number crunching in lockdown. Last time, I briefly touched on balance analysis. This time, I look at tracking slackline distance walked with my newly minted slackometer.

Inferring 3D Position

I’m working only with 2D pose data (a set of pixel locations for body joints) from my single camera angle, but I can infer something about my 3D location – distance from the camera (d) – using the pinhole camera model and further data such as:

  1. Camera lens & sensor geometry, plus known real world distance between pose keypoints (geometric approach)
  2. Consistent pose data (eg same person, similar pose) acquired at known distances from the camera (regression approach)

In this first iteration, shown in the notebook, I boil all the pose data down into a single feature: the vertical pixel distance (v) between the highest tracked point (eg, an eye) and the lowest tracked point (eg, a toe), and I base the calculation of distance d on this measure. The measure may be shorter than my full height, by the crown-to-eye distance and because of the typically lower pose when balancing on a slackline, or taller if an arm is raised overhead.

Body pose in pixels with vertical extent indicated
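For illustration, the feature might be computed along these lines; the keypoint tuple format and the confidence threshold are assumptions about the pose data:

```python
def vertical_extent(keypoints, min_confidence=0.3):
    """Vertical pixel distance v between the highest and lowest tracked points.
    Assumes keypoints is an iterable of (x_px, y_px, confidence) tuples."""
    ys = [y for _, y, c in keypoints if c >= min_confidence]
    return max(ys) - min(ys)
```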

Geometric Approach

The geometric approach uses a similar triangles formula: one triangle has sides lens-to-sensor distance (known) and object pixel height (v) scaled to sensor height (known); the other has sides lens-to-object distance (d) and object height (roughly known from my height, as above). The equation has the form d = F / v, where F is determined by those known or estimated factors. These geometry factors will be specific to each combination of physical camera device and slackline walker.
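A hedged sketch of that calculation is below; all constants are illustrative placeholders rather than my actual camera or height values:

```python
# Pinhole / similar triangles model: d = F / v, where F bundles the known factors.
# Numbers are placeholders, not the values used in the notebook.
FOCAL_LENGTH_MM = 4.25     # lens focal length
SENSOR_HEIGHT_MM = 4.8     # physical sensor height
IMAGE_HEIGHT_PX = 1080     # frame height in pixels
OBJECT_HEIGHT_M = 1.7      # approx. eye-to-toe height of the walker

F = FOCAL_LENGTH_MM * OBJECT_HEIGHT_M * IMAGE_HEIGHT_PX / SENSOR_HEIGHT_MM

def distance(v_px: float) -> float:
    """Distance from camera (m) for a pose with vertical extent v pixels."""
    return F / v_px
```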

Regression Approach

The regression approach uses pose data collected at known distances from the slackline anchor – 2m, 4m, 6m, and 8m – as shown in the image below. I marked the calibration distances and walked to each point, balancing there for a few seconds, then turned around and walked back through each point to collect data from both back and front views. This approach is consistent across cameras and walkers, and works with varied sets of calibration distances.

Person standing on a slackline with known distances marked

Plotting the value of v against the reciprocal of distance (1 / d), we see a fairly linear relationship. Fitting a line (with 0 intercept) to the data gives a value for 1 / F, which produces a value for F that is very close to that from the geometric approach – a neat result, or convenient choice of parameters!

charts showing correlation of vertical extent and reciprocal of distance from camera
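A sketch of that zero-intercept fit is shown below; the calibration numbers are made up for illustration:

```python
import numpy as np

# Fit 1/d = (1/F) * v through the origin; calibration values are placeholders.
d = np.array([2.0, 4.0, 6.0, 8.0])            # known distances from camera (m)
v = np.array([810.0, 405.0, 270.0, 204.0])    # measured vertical extents (px)

y = 1.0 / d
slope = np.sum(v * y) / np.sum(v * v)         # least squares with zero intercept
F = 1.0 / slope                                # compare with the geometric F
print(f"F ≈ {F:.0f} px·m")
```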

So we have two methods for calculating distance from pose data, which produce very similar results.

Does It Work?

The figures you see in the video above match up pretty well with reality. The video shows approximately 10m of accumulated distance, while in reality I started about 2m from the camera, walked to just under 8m from the camera, then returned almost to my starting point (say 11.5m maximum). The discrepancy is most likely explained by an under-estimate of peak distance (at ~7m), due to the decreased precision of pixel measures for distant objects, and by noise and pose-dependence in the vertical extent measure.

So this first iteration of the slackometer would be useful for approximating distance walked, and could possibly be improved with higher resolution video and by tracking more body segments, which may also reduce the dependence on smoothing. It would also be useful for comparing distances or speeds. The millimetre precision, however, is misleading; I just chose the unit so it looked better on the odometer display! (I spent some time tuning this for those cascading transitions…)

A mechanical odometer showing wheels cascading over the numbers 1998, 1999, 2000, 2001

There are certainly other ways you could track distance, from pacing out the line and keeping count of laps, to using image segmentation rather than pose estimation to calculate v, to alternative sensor setups and models, but it was fun to do it this way and it helped pass a few more days, and nights, in lockdown.

Light trace in a backyard; a long exposure photo at night that looks like daytime
slacklining at night under a full moon

7 Wastes of Data Production

I realised recently that this is one of the lenses through which I look at the data engineering world, but I had never expressed these (lean) wastes explicitly. This post might be useful for data engineers exploring lean concepts, or lean practitioners trying to make sense of data & analytics processes. This isn’t just a theoretical view; these wastes are real, and have a real impact on organisational success, which I try to quantify. These wastes also impact our ability to do good work, and to enjoy it!

When I talk about data production, I’m talking about building and running a factory that transforms data signals from the world into useful insights, improved operations and great experiences. This means connecting data sources from suppliers through transformations to consumers, who might be customers, team members, or partners.

Data production factory, showing bits flowing from supplier to consumer through a machine. Developers and operators change the factory and keep it running

In this post, I’ll look at lean wastes through the lens of building the factory and running the factory. Building the factory – modifying the processing pathways for data – is a software development exercise. Running the factory – propagating new inputs along processing pathways – is a manufacturing operations exercise, but for data rather than physical products, and hence the manufacturing machines are all software. NB. one team can look through multiple lenses – see data + dev + ops!

Lean wastes (muda) were originally defined with reference to physical manufacturing, though there are analogues for knowledge work, including Mary and Tom Poppendieck’s mapping to software development. So the translation is roughly:

  • the software development analogue for the build phase, and
  • a manufacturing analogue with bits rather than atoms for the run phase

Both are illustrated with examples specific to data. The drawings only show bits flowing through a running factory, but you can imagine ideas flowing through a development team as the equivalent for build.

I talk about data products as the end result of this development and manufacturing (or data production) activity, while also considering the complementary design or marketing perspective: what problems do these products solve for a consumer, and how well, regardless of how they are made? This brings us to…

Overproduction

Overproduction is delivering things consumers don’t need and haven’t asked for.

  • Build – Unnecessary products, over-designed products. For instance, a report that no-one will read, a 3D widget where a table will do, or a unified data model that doesn’t suit any one consumer.
  • Run – Unused products. A report that people stopped reading long ago. A data set no-one ever accesses.
Data production factory spewing bits that aren't consumed by a consumer

The consequence of overproduction is that productive capacity is consumed with no business impact. These products are useless and prevent us creating value.

Considering that some studies have found 50% of product features are rarely or never used, the cost of overproduction may be 50% of your data budget, yet many organisations have very limited visibility of overproduction in data.

Overproduction is with reference to finished goods, but until finished, they are …

Inventory

Inventory is partially completed work that causes a drain on resources while embodying little or no realisable value.

  • Build – Work in progress. Batching up development effort on data ingest activities, or platform features, without validating the use cases that motivate this data or functionality.
  • Run – Incomplete pipelines (data not connected to consumers). Data delivered to a data platform but not being used.
Data production factory accumulating bits inside the factory

The consequence of inventory is that no value is realised from effort to date, and as a result, unbounded effort may be expended before delivering value.

The cost that can be sunk into building and populating a data platform that is not connected to consumers (representing data production inventory) is, for all intents and purposes, unbounded. Without connecting to consumers, very large initiatives could reach their conclusion with substantial data inventory (which ironically, until close to that point, might be considered a success measure), but marginal to no business value delivered.

A possible cause of failing to deliver finished goods from inventory is the additional effort associated with …

Over-Processing

Over-Processing is doing more work than is necessary to deliver on an objective.

  • Build – Reinventing products, working with untrusted data. Duplicated, divergent reports. A unified data model that doesn’t suit any one consumer. Excessive logic to manage poor data quality.
  • Run – Correcting errors from upstream, propagating redundant data. Filling missing data and correcting schema violations with best guesses. Passing on data that isn’t valuable downstream.
Data production factory with many processing machines

The consequence of over-processing is the expenditure of unnecessary effort to realise value. This may feel like everything is harder than it should be.

Any code that exists to correct errors downstream of a source is over-processing, as is any duplicated reporting. Consider how much of this over-processing you may be doing, and what might change if you were to measure it.

Work must move between processing stages, but in doing so, we should minimise …

Transportation

Transportation is moving products around in a way that is costly and may cause damage.

  • Build – Handoffs between siloed teams. Source app team → ingest team → platform team → analytics team → consuming app team… each team loses context from the last.
  • Run – Data replication without reproducibility. Creating unnecessary backups because systems aren’t trusted. Copying Excel files. You can damage a digital copy by losing its provenance.
Data production factory where bits disappear and reappear at different locations

The consequence of transportation is expending further additional effort that may reduce quality. There’s no clear single place to find what you need, and the more it moves, the less you trust it.

How much time have you lost due to misunderstanding between teams, or to establishing the provenance of data? It may lower productivity, or may be catastrophic if auditability is sufficiently degraded. This is the cost of transportation in data production.

In addition to transportation, the act of processing may include unnecessary …

Motion

Motion is extra activity that doesn’t add to the product, creates opportunities for defects, and takes a toll on workers.

  • Build – Context switching. For example, a dedicated data ingest team working across multiple sources (work in progress), which may also frequently break (incident toil).
  • Run – Manual intervention or finishing of products. Copy this here, rename file X and save over there, run script Y, …
Data production factory with workers moving between multiple streams and stages in a stream

The consequence of motion is that work is complicated, in a way that is bad for people. Every job requires more actions than it should.

How much time and energy do you lose to picking up and putting down work (a cost that can increase dramatically as the number of concurrent tasks increases), including switching to manual intervention in data production? This is the cost of motion.

All of the above reduce the flow of build or run to an extent. Collectively and in interaction with batching up work, they cause …

Waiting

Waiting occurs when people or resources aren’t ready to pick up work as it arrives.

  • Build – Delays due to handoffs between siloed teams, long feedback cycles in development. Waiting for requirements or feedback on deliverables. Exacerbated in data-intensive applications by functionally-specialised handoffs (see Transportation), long-running batch jobs, and a promote-then-test approach.
  • Run – Lead time to discover data, and from business event to insight or action. No-one owns this data or can tell you definitively if it exists, what’s in it, and where to find it. Reports come on a fixed schedule set by processing capabilities, not business needs.
Data production factory with waiting spinners in the flow of bits

The consequence of waiting is that business value realisation is delayed, in an environment where value decays rapidly. We might summarise this as: it would have been nice to have this data yesterday.

Where hand-offs occur between teams, tasks may take 12 times as long to complete, as a median measure, and much longer in extreme cases. Cascading scheduled batch jobs with buffers and retries due to uncontrolled variability and quality issues can quickly add up to insights lead times measured in weeks.

The factors above contribute to and are also caused by the introduction of …

Defects

Defects are the failure to do something, or the failure to do it right.

  • Build – Defects in processing code. Query specifies ‘m’ for minutes instead of intended ‘M’ for month.
  • Run – Defects in data produced. People get the wrong emails.
Data production factory with some defective bits on input, a defective processing machine, and many defective bits on output

The consequence of defects is that the organisation increases risk exposure, while reducing consumer value delivered, and creating effort to remediate. Thus they have the potential to damage business and create yet more effort.

The cost of defects can be catastrophic, especially when related to personal information. If defects cause significant ongoing toil, reducing defects is a major lever for increasing productive capacity (eg, if defects are ~30% of capacity, the marginal improvement in productive capacity is ~50%).

Conclusion

We can see these wastes are inter-connected and sometimes mutually reinforcing. Look out for these wastes in your work with data; find your own examples. I have found recognising these various wastes and being able to quantify their potential impact helps identify and prioritise improvement efforts. There are approaches and solutions to reduce these wastes, but I won’t address any of those here. Instead I will just encourage you to take some time to understand the problem; there’s a lot you can do with knowledge of waste in data production to define and drive change for the better.

Thanks to Ned Letcher and Yekaterina Khomyakova for feedback on these wastes, which were included as part of their presentation on Data Mesh for the 2021 LAST Conference Melbourne.

Guiding the Evolution of Data Mesh with Fitness Functions

I presented this webinar with Zhamak Dehghani – see the recording Guiding the Evolution of Data Mesh with Fitness Functions. There was great engagement with the topic and we captured some questions and further thoughts on this mini-blog post, published a little later.

This presentation brought together the idea of architectural fitness functions from the book Building Evolutionary Architectures with the core data mesh principles and logical architecture.

Our thoughts on guiding fitness functions included the following high-level measurable objectives, grouped by data mesh principle and supported by a range of proposed metrics. The summary below is handy; check out the webinar for more.

  • Domain Ownership
    • Scaling sources and consumers
    • Truthfulness
    • Domain autonomy
    • Reduced accidental complexity
  • Data as a Product
    • Serving users’ needs
    • Ease of discovery
    • Evaluation of quality
    • Service levels
  • Self-Serve Data Platform
    • Abstraction of complexity
    • Domain team autonomy
    • Protocols enable an ecosystem
    • Automation
  • Federated Computational Governance
    • Governance for common good
    • Degree of decentralisation
    • Interoperability
    • Increasing returns from scale