Picking up threads from previous posts on solving Semantle word puzzles with machine learning, we’re ready to explore how different solvers might play along with people while playing the game online. Maybe you’d like to play speed Semantle against an artificially intelligent opponent, maybe you’d like a left-field hint on a tricky puzzle, or maybe it’s just fun to spectate at a cerebral robot battle.
Substitute semantics
The solvers have a view of how words relate due to a similarity model that is encapsulated for ease of change. To date, we’ve used the same model as live Semantle, which is word2vec. But as this might be considered cheating, we can now also use a model based on the Universal Sentence Encoder (USE), to explore how the solvers perform with separated semantics.
Solver spec
To recap, the key elements of the solver ecosystem are now:
SimilarityModel – choice of word2vec or USE as above,
Solver methods (common to both gradient and cohort variants):
make_guess() – return a guess that is based on the solver’s current state, but don’t change the solver’s state,
merge_guess(guess, score) – update the solver’s state with information about a guess and a score,
Scoring of guesses by either the simulator or a Semantle game, where a game could also include guesses from other players.
It’s a simplified reinforcement learning setup. Different combinations of these elements allow us to explore different scenarios.
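As a rough illustration of how these elements fit together, here is a minimal Python sketch. Only make_guess() and merge_guess(guess, score) come from the description above; FirstUnseenSolver, play, the primer argument and the toy scoring function are hypothetical stand-ins, and the sketch assumes the target word scores 100.

```python
from typing import Callable, Iterable, Optional, Protocol, Tuple


class Solver(Protocol):
    """The two solver methods described above; everything else here is hypothetical."""

    def make_guess(self) -> str: ...

    def merge_guess(self, guess: str, score: float) -> None: ...


class FirstUnseenSolver:
    """Placeholder strategy (not the gradient or cohort solver): guess the first unseen word."""

    def __init__(self, vocabulary: Iterable[str]) -> None:
        self.remaining = list(vocabulary)

    def make_guess(self) -> str:
        # Reads state only, so repeated calls return the same guess.
        return self.remaining[0]

    def merge_guess(self, guess: str, score: float) -> None:
        # A real solver would use the score; this placeholder only drops the guessed word.
        if guess in self.remaining:
            self.remaining.remove(guess)


def play(
    solver: Solver,
    score_fn: Callable[[str], float],
    primer: Iterable[str] = (),
    max_guesses: int = 100,
) -> Tuple[Optional[str], int]:
    """Prime the solver with other players' guesses, then let it guess until solved."""
    for guess in primer:
        solver.merge_guess(guess, score_fn(guess))
    for n in range(1, max_guesses + 1):
        guess = solver.make_guess()
        score = score_fn(guess)
        solver.merge_guess(guess, score)
        if score >= 100.0:  # assumption: the target scores 100
            return guess, n
    return None, max_guesses


vocab = ["apple", "berry", "cherry", "date"]
toy_score = lambda w: 100.0 if w == "cherry" else 0.0  # stand-in for simulator or live game
print(play(FirstUnseenSolver(vocab), toy_score, primer=["apple"]))  # ('cherry', 2)
```

Because make_guess() leaves the state untouched, the same loop can serve the hint scenario: ask for a guess, then merge whatever guess a person actually plays.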
Solver suggestions
Let’s look at how solvers might play with people. The base scenario, “friends”, is the actual history of a game played with people, completed in 109 guesses.
Word2Vec similarity
Solvers could complete a puzzle from an initial sequence of guesses from friends. Both solvers in this particular configuration generally easily better the friends result when primed with the first 10 friend guesses.
Solvers could instead make the next guess only, but based on the game history up to that point. Both solvers may permit a finish in slightly fewer guesses. The conclusion is that these solvers are good for hints, especially if they are followed!
Maybe these solvers using word2vec similarity do have an unfair advantage though – how do they perform with a different similarity model? Using USE instead, I expected the cohort solver to be more robust than the gradient solver…
USE similarity
… but it seems that the gradient descent solver is more robust to a disparate similarity model, as one example of the completion scenario shows.
The gradient solver also generally offers some benefit making a suggestion for just the next guess, but the cohort solver’s contribution is marginal at best.
These are of course only single instances of each scenario, and there is significant variation between runs. It’s been interesting to see this play out interactively, but a more comprehensive performance characterisation – with plenty of scope for understanding the influence of hyperparameters – may be in order.
Solver solo
The solvers can also play part or whole games solo (or with other players) in a live environment, using Selenium WebDriver to submit guesses and collect scores. The leading animation above is gradient-USE and below is a faster game using cohort-word2vec.
So long
And that’s it for now! We have multiple solver configurations that can play online by themselves or with other people. They demonstrate how people and machines can collaborate to each bring their own strengths to solving problems; people with creative strategies and machines with a relentless ability to crunch through possibilities. They don’t spoil the fun of solving Semantle yourself or with friends, but they do provide new ways to play and to gain insight into how to improve your own game.
Postscript: seeing in space
Through all this I’ve considered various 3D visualisations of search through a semantic space with hundreds of dimensions. I’ve settled on the version below, illustrating a search for target “habitat” from first guess “megawatt”.
This visualisation format uses cylindrical coordinates, broken out in the figure below. The cylinder (x) axis is the projection of each guess to the line that connects the first guess to the target word. The cylindrical radius is the distance of each guess in embedding space from its projection on this line (cosine similarity seemed smoother than Euclidean distance here). The angle of rotation in cylindrical coordinates (theta) is the cumulative angle between the directions connecting guess n-1 to n and n to n+1. The result is an irregular helix expanding then contracting, all while twisting around the axis from first to last guess.
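The coordinate mapping described above might be sketched as follows. This is an illustration under stated assumptions rather than the code behind the figures: the function name cylindrical_path is hypothetical, the radius uses plain Euclidean distance for simplicity (the post found cosine similarity smoother), and the turn at guess n is attributed to guess n+1.

```python
import numpy as np


def cylindrical_path(guesses, target):
    """Map ordered guess embeddings (n, dim) to cylindrical coordinates (x, radius, theta)."""
    guesses = np.asarray(guesses, dtype=float)
    target = np.asarray(target, dtype=float)
    origin = guesses[0]
    axis = target - origin
    unit = axis / np.linalg.norm(axis)  # assumes the first guess is not the target

    # Axial coordinate: projection of each guess onto the line from first guess to target.
    x = (guesses - origin) @ unit

    # Radius: distance of each guess from its projection on that line (Euclidean here).
    projections = origin + np.outer(x, unit)
    radius = np.linalg.norm(guesses - projections, axis=1)

    # Theta: cumulative angle between consecutive step directions.
    steps = np.diff(guesses, axis=0)
    theta = np.zeros(len(guesses))
    for i in range(1, len(steps)):
        a, b = steps[i - 1], steps[i]
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        cos_turn = np.clip(a @ b / denom, -1.0, 1.0) if denom > 0 else 1.0
        theta[i + 1] = theta[i] + np.arccos(cos_turn)
    return x, radius, theta


# Toy 2D example: three guesses, the last of which is the target.
x, radius, theta = cylindrical_path([[0, 0], [1, 1], [2, 0]], target=[2, 0])
```

For plotting in 3D, each point can then be placed at (x, radius * cos(theta), radius * sin(theta)).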
In the post Sketching Semantle Solvers, I introduced two methods for solving Semantle word puzzles, but I only wrote up one. The second solver here is based on the idea that the target word should appear in the intersection between the cohorts of possible targets generated by each guess.
A vocabulary, containing all the words that can be guessed,
A semantic model, from which the agent can calculate the similarity of word pairs,
The ability to generate cohorts of words from the vocabulary that are similar (in Semantle score) to a provided word (a guess), and
An evolving strength of belief that each word in the vocabulary is the target.
In each step towards guessing the target, the solver does the following:
Choose a word for the guess. The current choice is the word with the strongest likelihood of being the target, but it could equally be any other word from the solver’s vocabulary (which might help triangulate better), or it could be provided by a human player with their own suspicions.
Score the guess. The Semantle simulator scores the guess.
Generate a cohort. The guess and the score are used to generate a new cohort of words that would share the same score with the guess.
Merge the cohort into the agent’s belief model. The score is added to the current belief strength for each word in the cohort, providing a proxy for likelihood for each word. The guess is also masked from further consideration.
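The steps above can be sketched in a few lines of Python. This is a toy under stated assumptions, not the solver from the post: the class and function names are hypothetical, scores are taken to be 100 times cosine similarity, a single tolerance band stands in for the precision parameter, and the solver's own similarity model doubles as the simulator.

```python
import numpy as np


class CohortSolver:
    """Toy cohort solver: accumulate belief for words that would share each guess's score."""

    def __init__(self, words, vectors, tolerance=2.0):
        self.words = list(words)
        vectors = np.asarray(vectors, dtype=float)
        self.unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
        self.index = {w: i for i, w in enumerate(self.words)}
        self.belief = np.zeros(len(self.words))
        self.masked = np.zeros(len(self.words), dtype=bool)
        self.tolerance = tolerance  # half-width of the score band (assumed parameter)

    def similarity(self, word):
        """Semantle-style score of every vocabulary word against `word`."""
        return 100.0 * self.unit @ self.unit[self.index[word]]

    def make_guess(self):
        # Strongest belief among words not yet guessed; ties go to the earliest word.
        candidates = np.where(self.masked, -np.inf, self.belief)
        return self.words[int(np.argmax(candidates))]

    def merge_guess(self, guess, score):
        # Cohort: words whose similarity to the guess matches the observed score.
        cohort = np.abs(self.similarity(guess) - score) <= self.tolerance
        self.belief[cohort] += score
        self.masked[self.index[guess]] = True


def solve(solver, target, max_guesses=50):
    """Simulate a game; returns the number of guesses used, or None if unsolved."""
    target_scores = solver.similarity(target)
    for n in range(1, max_guesses + 1):
        guess = solver.make_guess()
        if guess == target:
            return n
        solver.merge_guess(guess, float(target_scores[solver.index[guess]]))
    return None


words = ["north", "east", "northwest", "northeast", "south"]
vectors = [[0, 1], [1, 0], [-1, 1], [1, 1], [0, -1]]  # made-up 2D "embeddings"
print(solve(CohortSolver(words, vectors), target="northeast"))  # 3
```

In the toy run, the first guess ("north") puts both "northwest" and "northeast" in its cohort; the second guess removes the ambiguity, and the third is the target.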
Show of strength
The chart below shows how the belief strength (estimated likelihood) of the target word gradually approaches the maximum belief strength of any word, as the target (which remains unknown until the end) appears in more and more cohorts.
We can also visualise the belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words.
Superior solution?
The cohort solver can be (de)tuned to almost any level of performance by adjusting the parameters precision and recall, which determine the tightness of the similarity band and completeness of results from the generated cohorts. The gradient descent solver has potential for tuning parameters, but I didn’t explore this much. To compare the two, we’d therefore need to consider configurations of each solver. For now, I’m pleased that the two distinct sketches solve to my satisfaction!
I collaborated with some colleagues to share our experiences with data mesh and how to frame the benefits for an executive audience, written up in an article titled The Business Case for Data Mesh.
Here’s the recording of my presentation on data mesh at the Data Engineering Melbourne Meetup, on 26 August 2021. We covered architecture, building blocks and more. Lots of great questions and discussion.
Thanks as always to organisers Harmeet Sokhi, Timothy Findlay, and Andrew Jones!
Project Slackpose gives me one more excuse for hyperlocal exercise and number crunching in lockdown. Last time, I briefly touched on balance analysis. This time, I look at tracking slackline distance walked with my newly minted slackometer.
Inferring 3D Position
I’m working only with 2D pose data (a set of pixel locations for body joints) from my single camera angle, but I can infer something about my 3D location – distance from the camera (d) – using the pinhole camera model and further data such as:
Camera lens & sensor geometry, plus known real world distance between pose keypoints (geometric approach)
Consistent pose data (eg same person, similar pose) acquired at known distances from the camera (regression approach)
In this first iteration shown in the notebook, I boil all the pose data down into a single feature: the vertical pixel distance (v) between the highest tracked point (eg, an eye) and the lowest tracked point (eg, a toe). I base the calculation of distance d on this measure. This measure may be shorter than my full height by crown-to-eye height and a typically lower pose when balancing on a slackline, or an elevated arm joint might make it taller.
Geometric Approach
The geometric approach uses a similar triangles formula where one triangle has sides lens-sensor distance (known) and object pixel height (v) scaled to sensor height (known), the other has lens-object distance (d) and object height (roughly known from my height, as above). The equation has the form d = F / v, where F is determined by those known or estimated factors. These geometry factors will be specific to each physical camera device and slackline walker combination.
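As a sketch of how the similar triangles relation yields d = F / v, the Python below derives F from camera and walker parameters. The function names and all numeric values are illustrative assumptions, not the values from the notebook.

```python
def vertical_extent(keypoints):
    """Pixel distance (v) between the highest and lowest tracked points.

    `keypoints` is a list of (x, y) pixel locations for one frame.
    """
    ys = [y for _, y in keypoints]
    return max(ys) - min(ys)


def geometry_factor(focal_length_mm, sensor_height_mm, image_height_px, object_height_m):
    """F in d = F / v, from similar triangles in the pinhole camera model.

    Height on sensor = v / image_height_px * sensor_height_mm, and
    height on sensor / focal_length_mm = object_height_m / d.
    """
    return focal_length_mm * object_height_m * image_height_px / sensor_height_mm


def distance_from_camera(v_px, F):
    """Distance in metres for a vertical extent of v_px pixels."""
    return F / v_px


# Illustrative values only: 4 mm lens, 3 mm sensor height, 1080 px frame, 1.6 m tracked height.
F = geometry_factor(4.0, 3.0, 1080, 1.6)
v = vertical_extent([(100, 200), (110, 776), (90, 500)])
print(round(distance_from_camera(v, F), 2))  # 4.0
```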
Regression Approach
The regression approach uses pose data collected at known distances from the slackline anchor – 2m, 4m, 6m, and 8m – as shown in the image below. I marked the calibration distances and walked to each point, balancing there for a few seconds, then turned around and walked back through each point to collect data from both back and front views. This approach is consistent across cameras and walkers, and works with varied sets of calibration distances.
Plotting the value of v against the reciprocal of distance (1 / d), we see a fairly linear relationship. Fitting a line (with 0 intercept) to the data gives a value for 1 / F, which produces a value for F that is very close to that from the geometric approach – a neat result, or convenient choice of parameters!
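A zero-intercept fit of this kind can be sketched with NumPy as below. The sketch regresses 1 / d on v so that the slope estimates 1 / F, which is one reading of the description above; the function name is hypothetical and the calibration values are synthetic and noise-free, generated from an assumed F of 2000 rather than taken from the real recordings.

```python
import numpy as np


def fit_geometry_factor(v_px, d_m):
    """Least-squares line through the origin for 1/d = (1/F) * v; returns F."""
    v = np.asarray(v_px, dtype=float)
    inv_d = 1.0 / np.asarray(d_m, dtype=float)
    slope = np.sum(v * inv_d) / np.sum(v * v)  # closed form for a zero-intercept fit
    return 1.0 / slope


d_cal = [2.0, 4.0, 6.0, 8.0]              # calibration distances in metres
v_cal = [2000.0 / d for d in d_cal]       # synthetic pixel extents, assuming F = 2000
print(round(fit_geometry_factor(v_cal, d_cal), 1))  # 2000.0
```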
So we have two methods for calculating distance from pose data, which produce very similar results.
Does It Work?
The figures you see in the video above match up pretty well with the reality. The video shows approximately 10m accumulated distance, while in reality I started about 2m from the camera, walked to just under 8m from the camera, then returned almost to my starting point (say 11.5m max). The discrepancy is most likely explained by an under-estimate of peak distance (at ~7m) due to decrease in precision of pixel measures for distant objects, and noise/pose-dependence in the vertical extent measure.
So this first iteration of the slackometer would be useful for approximating distance walked, and could possibly be improved with higher resolution video and by tracking more body segments, which may also reduce the dependence on smoothing. It would also be useful for comparing distances or speeds. The millimetre precision, however, is misleading; I just chose the unit so it looked better on the odometer display! (I spent some time tuning this for those cascading transitions…)
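For a sense of how per-frame distance estimates might be turned into an odometer reading, here is a minimal sketch. It assumes a simple moving average for the smoothing, which may differ from what the notebook does, and the function name and the walk data are made up for illustration.

```python
import numpy as np


def accumulated_distance(d_m, window=5):
    """Sum of absolute frame-to-frame changes in smoothed camera distance (metres)."""
    d = np.asarray(d_m, dtype=float)
    kernel = np.ones(window) / window
    smoothed = np.convolve(d, kernel, mode="valid")  # moving average, trims the ends
    return float(np.sum(np.abs(np.diff(smoothed))))


# Made-up, noise-free walk from 2 m out to 8 m and back again.
walk = [2, 3, 4, 5, 6, 7, 8, 7, 6, 5, 4, 3, 2]
print(accumulated_distance(walk, window=1))            # 12.0
print(round(accumulated_distance(walk, window=3), 2))  # 8.67
```

In this toy, smoothing trims the ends of the walk and flattens the turnaround point, so it trades noise suppression against some under-counting.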
There are certainly other ways you could track distance, starting from pacing out the line and keeping count of laps, to using image segmentation rather than pose estimation to calculate v, to alternative sensor setups and models, but it was fun to do it this way and helped pass a few more days, and nights, in lockdown.
I realised recently that this is one of the lenses through which I look at the data engineering world, but I had never expressed these (lean) wastes explicitly. This post might be useful for data engineers exploring lean concepts, or lean practitioners trying to make sense of data & analytics processes. This isn’t just a theoretical view; these wastes are real, and have a real impact on organisational success, which I try to quantify. These wastes also impact our ability to do good work, and to enjoy it!
When I talk about data production, I’m talking about building and running a factory that transforms data signals from the world into useful insights, improved operations and great experiences. This means connecting data sources from suppliers through transformations to consumers, who might be customers, team members, or partners.
In this post, I’ll look at lean wastes through the lens of building the factory and running the factory. Building the factory – modifying the processing pathways for data – is a software development exercise. Running the factory – propagating new inputs along processing pathways – is a manufacturing operations exercise, but for data rather than physical products, and hence the manufacturing machines are all software. NB. one team can look through multiple lenses – see data + dev + ops!
Lean wastes (muda) were originally defined with reference to physical manufacturing, though there are analogues for knowledge work, including Mary and Tom Poppendieck’s mapping to software development. So the translation is roughly:
the software development analogue for the build phase, and
a manufacturing analogue with bits rather than atoms for the run phase
Both are illustrated with examples specific to data. The drawings only show bits flowing through a running factory, but you can imagine ideas flowing through a development team as the equivalent for build.
I talk about data products as the end result of this development and manufacturing (or data production) activity, while also considering the complementary design or marketing perspective, i.e., what problems these products solve for a consumer, and how well, regardless of how they are made. This brings us to…
Overproduction
Overproduction is delivering things consumers don’t need and haven’t asked for.
Build
Unnecessary products, over-designed products: For instance, a report that no-one will read, a 3D widget where a table will do, or a unified data model that doesn’t suit any one consumer.
Run
Unused products: A report that people stopped reading long ago. A data set no-one ever accesses.
The consequence of overproduction is that productive capacity is consumed with no business impact. These products are useless and prevent us creating value.
Considering some studies have found 50% of product features are rarely or never used, the cost of overproduction may be 50% of your data budget, but many organisations have very limited visibility of overproduction in data.
Overproduction is with reference to finished goods, but until finished, they are …
Inventory
Inventory is partially completed work that causes a drain on resources while embodying little or no realisable value.
Build
Work in progress: Batching up development effort on data ingest activities, or platform features, without validating the use cases that motivate this data or functionality.
Run
Incomplete pipelines (data not connected to consumers): Data delivered to a data platform but not being used.
The consequence of inventory is that no value is realised from effort to date, and as a result, unbounded effort may be expended before delivering value.
The cost that can be sunk into building and populating a data platform that is not connected to consumers (representing data production inventory) is, for all intents and purposes, unbounded. Without connecting to consumers, very large initiatives could reach their conclusion with substantial data inventory (which ironically, until close to that point, might be considered a success measure), but marginal to no business value delivered.
A possible cause of failing to deliver finished goods from inventory is the additional effort associated with …
Over-processing
Over-processing is doing more work than is necessary to deliver on an objective.
Build
Reinventing products, working with untrusted data: Duplicated, divergent reports. A unified data model that doesn’t suit any one consumer. Excessive logic to manage poor data quality.
Run
Correcting errors from upstream, propagating redundant data: Filling missing data and correcting schema violations with best guesses. Passing on data that isn’t valuable downstream.
The consequence of over-processing is the expenditure of unnecessary effort to realise value. This may feel like everything is harder than it should be.
Any code that exists to correct errors downstream of a source is over-processing, as is any duplicated reporting. Consider how much of this over-processing you may be doing, and what might change if you were to measure this.
Work must move between processing stages, but in doing so, we should minimise …
Transportation
Transportation is moving products around in a way that is costly and may cause damage.
Build
Handoffs between siloed teams: Source app team → ingest team → platform team → analytics team → consuming app team… each team loses context from the last.
Run
Data replication without reproducibility: Creating unnecessary backups because systems aren’t trusted. Copying Excel files. You can damage a digital copy by losing its provenance.
The consequence of transportation is expending further additional effort that may reduce quality. There’s no clear single place to find what you need, and the more it moves, the less you trust it.
How much time have you lost due to misunderstanding between teams or to establishing the provenance of data? It may lower productivity or may be catastrophic if auditability is sufficiently degraded. This is the cost of transportation in data production.
In addition to transportation, the act of processing may include unnecessary …
Motion
Motion is extra activity which doesn’t add to the product, and additionally creates opportunities for defects, and takes a toll on workers.
Build
Context switching: For example, a dedicated data ingest team working across multiple sources (work in progress), which may also frequently break (incident toil).
Run
Manual intervention or finishing of products: Copy this here, rename file X and save over there, run script Y, …
The consequence of motion is that work is complicated, in a way that is bad for people. Every job requires more actions than it should.
How much time and energy do you lose to picking up and putting down work – this can increase dramatically as the number of concurrent tasks increases – including switching to manual intervention in data production? This is the cost of motion.
All of the above reduce the flow of build or run to an extent. Collectively and in interaction with batching up work, they cause …
Waiting
Waiting occurs when people or resources aren’t ready to pick up work as it arrives.
Build
Delays due to handoffs between siloed teams, long feedback cycles in development: Waiting for requirements or feedback on deliverables. Exacerbated in data-intensive applications by functionally-specialised handoffs (see Transportation), long running batch jobs, and a promote-then-test approach.
Run
Lead time to discover data, and from business event to insight or action: No-one owns this data or can tell you definitively if it exists, what’s in it, and where to find it. Reports come on a fixed schedule set by processing capabilities, not business needs.
The consequence of waiting is that business value realisation is delayed, in an environment where value decays rapidly. We might summarise this as: it would have been nice to have this data yesterday.
Where hand-offs occur between teams, tasks may take 12 times as long to complete, as a median measure, and much longer in extreme cases. Cascading scheduled batch jobs with buffers and retries due to uncontrolled variability and quality issues can quickly add up to insights lead times measured in weeks.
The factors above contribute to and are also caused by the introduction of …
Defects
Defects are the failure to do something, or the failure to do it right.
Build
Defects in processing code: Query specifies ‘m’ for minutes instead of intended ‘M’ for month.
Run
Defects in data produced: People get the wrong emails.
The consequence of defects is that the organisation increases risk exposure, while reducing consumer value delivered, and creating effort to remediate. Thus they have the potential to damage business and create yet more effort.
The cost of defects can be catastrophic, especially when related to personal information. If defects cause significant ongoing toil, reducing defects is a major lever for increasing productive capacity (eg, if defects are ~30% of capacity, the marginal improvement in productive capacity is ~50%).
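One way to read the capacity example is as simple arithmetic: if a share of capacity goes to defects, removing them entirely frees that share relative to what was productive before. The sketch below is my interpretation, with a hypothetical function name; on this reading, a 30% defect share gives an uplift of about 43%, the same order as the ~50% quoted.

```python
def capacity_uplift(defect_share):
    """Relative gain in productive capacity if all defect toil were eliminated."""
    productive = 1.0 - defect_share  # capacity currently spent on productive work
    return defect_share / productive


print(f"{capacity_uplift(0.30):.0%}")  # 43%
```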
Conclusion
We can see these wastes are inter-connected and sometimes mutually reinforcing. Look out for these wastes in your work with data; find your own examples. I have found recognising these various wastes and being able to quantify their potential impact helps identify and prioritise improvement efforts. There are approaches and solutions to reduce these wastes, but I won’t address any of those here. Instead I will just encourage you to take some time to understand the problem; there’s a lot you can do with knowledge of waste in data production to define and drive change for the better.
Thanks to Ned Letcher and Yekaterina Khomyakova for feedback on these wastes, which were included as part of their presentation on Data Mesh for the 2021 LAST Conference Melbourne.
Our thoughts around guiding fitness functions included the below. These high-level measurable objectives were supported by a range of proposed metrics. This table is a handy summary; check out the webinar for more.