I recently had the chance to present an updated version of my 7 wastes of data production talk at DataEngBytes Melbourne 2023. I think the talk was stronger this time around and I really appreciated all the great feedback from the audience.
Thanks to Peter Hanssens and the DEB crew for having me as part of an impressive speaker lineup and for putting on a great event.
The video will be coming very soon (I’ll add it here when it’s up) and you can check out the slides meantime.
There’s a lot of ground to cover in 30 minutes with 7 wastes from both run and build lenses, plus 5 lean principles to address the waste. I’ll leave the summary here and encourage you to watch the video or read the slides if you want to know more.
The notebook demonstrates that individual rows of a database may be reconstructed, even if only summary statistics are shared, by considering the constraints that the statistics place on possible values of the data. Constraints include mean and median for all numerical values globally and for various cohorts of records determined by class values.
Note that the intent is of this notebook is not to compromise any private data, but to raise awareness of the potential for privacy breaches due to reconstruction attacks!
I was thrilled to help kick-off the GenAI Network Melbourne meetup at their first meeting recently. I presented a talk titled Semantic hide and seek – a gentle introduction to embeddings, based on my experiments with Semantle, other representation learning, and some discussion of what it means to use Generative AI in developing new products and services. It was a pleasure to present alongside Rajesh Vasa from A2I2 at Deakin University.
Thanks to Ned, Orian, Scott, Alex, Leonard & co for organising. Looking forward to more fun events in this series!
In three years of the maths whimsy repo, I’ve covered a lot of ground, and got hands-on with many Python libraries for data science. I was particularly pleased when the idea to visualise multi-stage classification pipelines as Sankey diagrams was enthusiastically picked up by one client. I’m not sure I’ll find an application for which numpig chunking of large arrays is the best solution, but I remain quite fond of it. And I could spend all day watching the skeumorphic odometer visualisation created for the slackometer.
Afterwards, someone asked me about the times it hadn’t worked. I should have covered this too! The primary time I wasn’t satisfied that I had something to share was when I was experimenting with reinforcement learning to solve mazes with disconnected walls. I bit off more than I could chew, and while I went end-to-end with Deep Q-Learning, and then backfilled some bits of the training pipeline, I didn’t reach a satisfactory conclusion within the time I had available. I took this lesson forward to re-frame the problem into smaller, achievable chunks whenever it looked like I may have dived too deep on subsequent whimsies, and it has served me well since.
I wrote an article about optimising the design of EV charging networks. It’s a story of work done by a team at Thoughtworks, demonstrating the potential of AI augmented decision-making (including some cool optimisation techniques), in this rapidly evolving but durably important space.
We were able to thread together these many [business problem, AI techniques, data sourcing, technology] concerns in such a short time because of our relentless focus — not on the code or the mathematics — but on generating value: namely making it easier for humans to make decisions according to business goals.
Though but a footnote in the epic of coding with AI, I though it worth musing on my recent experience with Bard. Bard currently uses the LaMDA model, which is capable of generating code, but not optimised for it. The story might be different with Codey as protagonist (or is that antagonist?)
I didn’t produce any software I could use in this session, but I did learn that Bard was at its most useful when prompted with open questions or very specific concepts. I also learned that generated code may be sabotaged by plausible but incorrect domain logic and conceptual fault lines that occur between contexts.
Also, given that safetydave.net contributes 0.0000005% of tokens to the C4 dataset which makes up some of Bard’s training data, I thought I should see some return!
In which we meet a domain.
I started naively with a reasonably complex ask: how do I calculate the swept path of a vehicle in python?
I’ve done swept path analysis before, but in manufacturing, and I certainly don’t have a Python solution handy. Even if I only got the broad outlines of a solution, an answer or parts of an answer to this question would save a bunch of time.
This was far too wide an ask. I got code that looks like it’s for a curve that a vehicle might follow, but that’s just one part of the solution, and the more I looked at it, the more issues I found.
# The vehicle's position
x = np.array([0, 0])
y = np.array([0, 0])
# The vehicle's orientation
theta = 0
# The vehicle's velocity
v = 1
# The vehicle's wheelbase
w = 2
# Calculate the vehicle's position at each time step
for t in np.arange(0, 10, 0.1):
x += v * np.cos(theta) * t
y += v * np.sin(theta) * t
theta += v / w * t
First, vehicle is a word that identifies a whole domain of assumptions, including how steering works. But this domain logic is weird; assuming theta represents direction of travel, that’s not how steering works, in this universe at least. Export to Colab feature is nice, so let’s see how the code runs. Spoiler: it doesn’t.
As a user of Python, I could diagnose the runtime error and fix it (the initial values of x and y could be defined with dtype=float), but on closer inspection, the += operator overwrites rather than appends, and, wait a minute, x and y are treated both as vectors and scalars. As the domain logic is still bizarre, is there any purpose in fixing these conceptual faults in the code?
A secondary prompt to provide more guidance: I would like to consider the area swept out by the corners of the vehicle, can you update the solution?
This only resulted in more trigonometric prestidigitation; more domain dislogic. I use the term dislogic a bit like the term truthiness. There is some related logic to the code, but it is “apart” from the logic we need.
I tried much more explicit guidance by elaborating on domain concepts in the third prompt. This lead to more detailed but still fundamentally flawed and unusable solution. I decided this path wouldn’t lead to a happy ending.
Morals of Act 1
Don’t assume too much domain expertise. Bard has learned on generic public examples. Use your domain expertise to break the problem into smaller chunks.
Also, don’t expect conceptual consistency throughout. LLMs like Bard, as next-token predictors, don’t necessarily ensure conceptual consistency in their output.
In which I choose a path to follow.
I decided to focus on one part of the solution; getting the curve right. I reset Bard’s context.
I want a python function to create a curve between two points. The function arguments should be the start and end points and the tangent to the curve at the start and end points
Nice linear interpolation, shame about the tangents (which, while present as arguments, were totally ignored in the function body).
And the above could only be generated after fixing more errors preventing the code from running. The affordances of tuples and numpy.ndarray were confused, and the coordinates weren’t passed correctly to the plot method. The syntax was fine, but the code was riven with conceptual fault lines between contexts – what looked OK in one or other context in isolation caused problems when the contexts were brought together. The bugs were fairly obvious in this case, but in general could be subtle and difficult to detect.
Still, after minor adjustments, it’s a curve that meets some of the requirements. This is more useful than what we got in Act 1.
I augmented the initial prompt.
The curve tangent should match the tangents of the start and end point supplied as arguments. Please define points and tangents as numpy arrays
Now we have two nice linear interpolations, but the tangent is not the tangent of the curve! (It’s more like we have linearly interpolated the position and orientation of the vehicle.)
This code, presumably due to specifying numpy, worked straight out of the box.
Maybe I can establish the link between the tangent and the curve to which it is tangent?
In the create curve function, the tangent should be calculated as unit vector in the direction from one point to the next on the curve. This calculated tangent should equal the tangent argument at both the start and end of the curve.
Nope, that was too naively optimistic. I was forgetting the Morals of Act 1.
Let’s recall the Morals of Act 1, and try to link to concepts with better public support, such as Bézier curves.
The create curve function should be updated to return a bezier curve rather than a linear interpolation between start and end point
The result looked Bézier-ish, but the formula wasn’t transcribed faithfully, with the calculated points quickly racing away to infinity (well outside the convex hull of the control points).
# Calculate the point at the current index.
point = (1 - i / num_points)**3 * start_point + 3 * (1 - i / num_points)**2 * i / num_points * start_tangent + 3 * (1 - i / num_points) * (i / num_points)**2 * end_tangent + i**3 * end_point
However, I think we’re getting closer.
Time to reset Bard’s context again and go all in on Bézier.
I would like a python function to create a bezier curve
Please update the solution to plot the curve with matplotlib
And with a few manual tweaks that are becoming familiar…
… we finally get a curve with the desired properties from Take 1!
However, this is curve is not necessarily going to represent the path of a vehicle, which has a minimum radius of curvature (the vehicle’s turning circle) that a Bézier curve can’t guarantee. Based on the experience above, it’s unclear to me how I would incorporate the curvature requirement with further prompts.
Taking stock: after much prompting, I have a method to create a class of curve I can’t directly use in the solution. (And the code for which I ended up looking up anyway.)
Moral of Act 2
Don’t forget the Morals of Act 1.
In addition, being able to directly reference the key concepts in a generic domain, rather than trying to explain the logic, may be key to getting closer to a correct solution in a few shots.
If you’re not familiar enough with the key concepts to reference them directly, you might use a more open prompt initially, such as: How do you create a smooth curve? This might point you in the right direction…
There are many ways to create a smooth curve. Here are a few examples:
* Use a Bezier curve. A Bezier curve is a parametric curve that is defined by a set of control points. The curve is smooth because it is a polynomial function of the control points.
* ... [goes on to reference Splines, Fourier series and B-Splines] ...
Because we need 3 acts. In which I back up and broaden the domain.
Considering whether the word vehicle in the initial prompt had given Bard the wrong steer, I tried the more generic prompt: how do I calculate the swept path of an object in python?
This gave helpful introductory commentary on breaking the problem down, and a nearly usable solution.
# Define the circle's geometry.
points = np.array(...)
# Define the circle's motion.
path = np.array(...)
# Calculate the swept path.
swept_path = np.zeros((len(points), len(path)))
for i in range(len(points)):
for j in range(len(path)):
swept_path[i][j] = points[i] + path[j]
But one that still needed an expert review to ensure values ended up in the all the right places.
Below we can see different coloured “circles” drawn at each position in the path.
This is pretty trivial though – it’s just organised vector addition – did I need AI for that?
Moral of Act 3
Keeping it simple increases the chance of success, but you should balance this against whether a simple solution provides sufficient value.
Concluding the saga, for now
I tried to use Bard to deliver large chunks of a complex solution, rather than as a smarter autocomplete for finer details, or as an aid to understanding existing or proposed solutions. In the time I spent prompting Bard, I would have got further writing code directly. However, I have a lot of scope to improve my prompting.
With expertise in the domain and the code, I was able to diagnose and correct the issues in Bard’s solutions, but I suspect that someone who lacked one or both of those areas of expertise couldn’t recover quickly. In some respects, developing software is about recovering quickly from errors – we can’t avoid making mistakes, but we set up feedback loops to detect them quickly, and over time we become vigilant to more of the types of mistakes we are likely to make. Does an AI coding assistant like Bard help us recover quickly from mistakes? I didn’t actually ask Bard to help much in this session, so that question needs further work to resolve, possibly taking the angle of AI-aided test-first development.
What I did learn that Bard was at its most useful when prompted with open questions or very specific concepts with public data support. I also learned that generated code is likely to be sabotaged by domain dislogic and conceptual fault lines between contexts.
Over time, we’ll figure out how to make AI a better protagonist and antagonist in our coding stories; for me, this was an interesting way to introduce a new character.
Don’t Repeat Yourself (DRY) is a tenet of software engineering, but – humour me – let’s consider some reasons Why to Repeat Yourself (WRY).
LEGO reuse lessons
In 2021, I wrote a series of posts analysing LEGO® data about parts appearing in sets to understand what it might tell us about reuse of software components in digital products. I’ve finally summarised the key findings that show both DRY and WRY forces at play. We’re strictly talking about reuse VS specialisation (not repetition), but I think the lessons on the reuse dynamic are relevant.
Exponential growth in volume
The total number of parts and sets ever created has grown exponentially over the years. The result is that in 2021, there were 10 times as many parts as 30 years ago, and about 5 times as many sets. Thus, even though parts can be re-combined to create new models, new parts are constantly introduced at an increasing rate.
While the oldest parts are 70 years old, only about 1/7 of all parts ever created are in active use in 2021 and fully 1/3 of parts have a lifespan of only one year. Over time, survival decays exponentially. In each of the first 5 years, 50% of parts don’t survive to the next year. Beyond that, remaining parts halve in number every seven years.
Some parts are heavily reused in sets offered for sale, but the vast majority of parts are never reused or only reused a little, which can be approximated with a power law. Reuse is far more uneven than a typical 80/20 distribution: 80% of reuse instances are due to only 3% of parts, and 20% of parts account for 98% of reuse instances. At the other end of the spectrum, 60% of parts are used in only one set, and only 10% of parts appear in more than 10 sets.
Given the growth and specialisation profiles, total churn of parts approached 100% in 2020, whereas in the decade centred on 1990, it was only about 20%. High churn is consistent with a small base of heavily reused parts, and ever-increasing numbers of specialised parts with short lifespans.
Reusability of components doesn’t necessarily lead to reuse. The majority of reuse will come from a few components that perform fundamental roles. Focus on getting this right.
More – and more specialised – products may drive specialisation of components. Digital product lines are never static and we may expect some components to have short lifespans and churn heavily. Good development practices and loosely-coupled architectures allow teams to work with ephemeral and idiosyncratic components. However, ongoing review can still identify opportunities to harvest patterns and consolidate specialised components.
Note that, even when we produce multiple similar code artefacts, we may see effective reuse of higher-level approaches and concepts.
These aren’t prescriptive rules, but a reflection of the patterns in the data. There are more comprehensive observations in the individual articles. We should remember that reuse is not the primary aim of producing software, but a principle that supports better organisation towards sustainablyresponsive delivery.
Discussion of data relevance
Why is LEGO data relevant? In many conversations I’ve had about software reuse, LEGO is presented as a desirable model. This may be peculiar to me, but I think it is a fairly common conversation.
The number of possible mechanical couplings of just a handful of bricks is indeed enormous, but I wanted to understand how these components had actually been assembled into products that were sold to customers over some period of time. The data is sourced from the Rebrickable API. I’ve just taken part data at face value in this analysis; if something is recorded as a distinct part, I treat it as a distinct part. There may be better ways to translate the LEGO metaphor to software components.
Maybe there’s a generational factor in LEGO as a metaphor too; in the 1980s and 1990s, you would play with a much smaller and more stable base of active parts than the 2000s and 2010s, and that could shape your thinking. I’d love to hear feedback.
LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this site.
A little smarter, anyway. I didn’t expect to pickthisup again, but when I occasionally run the first generation solvers online, I’m often equal parts amused and frustrated by rare words thrown up that delay the solution – from amethystine to zigging.
The solvers used the first idea that worked; can we make some tweaks to make them smarter? The code is now migrated to its own new repo after outgrowing its old home.
I measure solver performance by running multiple trials of a solver configuration against the simulator for a variety of target words. This gives a picture of how often the solver typically succeeds within a certain number of guesses.
It turns out that the vocabulary to date based on english_words_set is a poor match for the most frequently used English words, according to unigram frequency data.
So we might expect that simply replacing the solver vocabulary would improve performance, and we also get word ranking from unigram_freq.
To improve the gradient solver I tried making another random guess every so often to avoid long stretches exploring local minima. But it didn’t make things better, and probably made them worse!
In response, I made each guess the most common local word to the extrapolated semantic location, rather than just the nearest word. Still no better, and trying both “improvements” together was significantly worse!
Ah well, experiments only fail if we fail to learn from them!
I think the noise inherent in a different semantic model, plus the existing random extrapolation distance, overwhelms the changes I tried. In better news, we see a major improvement from using unigram freq vocabulary, reducing the mean from 280 (with many searches capped at 500) to 198, approximately a 30% improvement.
Here we see that the data-centric (vocabulary) improvement had a far bigger impact than any model-centric (search algorithm) improvement that I had the patience to try (though I left a bunch of further todos). Maybe just guessing randomly from the top n words will be better again! ????
At least I’ve made a substantial dent in reducing those all-too-common guesses at rare words.
Sure commercial maps app directions are great, but have you ever found the customisation options limited? What if you want to use bike paths and back streets when cycling, or avoid winding roads that might make backseat passengers car-sick on a road trip?
The paved route
OpenStreetMap and OpenRouteService do provide this type of functionality, and could be considered for use as-is or with further customisation. PostGIS and pgRouting provide capabilities if you bring your own data. Many dedicated apps support groups with particular mobility needs.
My way, off the highway
In researching these capabilities however, and because I’m a fan of maps and I wanted to understand the whole data transformation journey, I decided to hand-roll my own navigation solution using pyshp, numpy, scipy, and networkx, all visualised with matplotlib. The end result is far from polished, but it can ingest 1.1M road geometries for the Australian state of Victoria, and generate a topological graph for routing within minutes, then use that map to to generate turn-by-turn directions in real time.
See the source code and the brief write-up below if you’re interested.
The solution uses data from the Vicmap Transport data set, which provides road centrelines for highways, streets, tracks, paths, etc, for all of Victoria and some bordering regions. The spatial features are augmented with 71 attributes useful for routing, including road names, permitted directions of travel, height limits, etc. I used a GDA2020 datum grid projection shapefile export. Pyshp provides a list of geometries and attributes via shapeRecords.
Vicmap Transport road centrelines are collections of polylines. The endpoints of these polylines (aka road segments) helpfully coincide where we might infer an continuous stretch of road or intersection. This allows us to describe how roads are connected with a graph.
Each endpoint will map to a node in the graph. The node may be unique to one road segment, or shared between multiple road segments if it’s at a junction. We find the coincident endpoints for a shared node with the method query pairs. The road segments with coincident endpoints then define the edges of the graph at this node. A directed graph can be defined using the direction code attribute of each segment (forward, reverse, or both directions).
With a graph representation of the road network, we can find a route between any two nodes (if connected) with standard algorithms. The examples above and below uses Dijkstra’s algorithm to find shortest path based on edge weights that reflect our routing preferences. The orange route is “fewest hops” (count of road segments) and the green route is “shortest distance”. Geometric length of road segments is calculated in a post-processing pass over the ingested polyline data, and assigned as a weight to each edge.
Optimisation and scaling
My first spikes were hideously inefficient, but once the method was established, there was a lot of room for improvement. I addressed three major performance bottlenecks as I scaled from processing 50k road segments in 50 minutes, to 1.1M road segments in 30 seconds. These figures represent a 2017 Macbook Pro, or free Colab instance, being roughly similar.
Processing stage and end-to-end times
Coincident segment endpoints
For loop accumulating unique ids with distance test
For loop accumulating correctly directed edges case-wise and discarding topological duplicates
> 12 hrs
numpy vectorisation of loop conditional logic
Stages of optimisation to support scaling
An additional ~30s is required to post-process geometric length of segments as an attribute per edge, and I imagine similar for other derived edge attributes. For instance, we might get finer grained average cycling speed per segment, or traffic risk factors, etc.
For calculating routes, (i.e., at inference time) it takes about 1-4s to find a shortest path, depending on the length of the route (using pure Python networkx). We can now find routes of over 950km length, from Mildura in the state’s north-west to Mallacoota in the east.
More latitude for navigation
We would like to be able to find the start and end nodes of a route from latitude and longitude. However, as the nodes in our routing graph are located by VicGrid grid coordinates (eastings and northings), we first need to “wrap” this planar grid around the roughly spherical Earth. While geometrically complex (see below), it’s easy to do this transformation back and forth between coordinate systems with pyproj
Turn-by-turn navigation directions
With the start and end nodes located from latitude and longitude, a route can be calculated as above. Then, turn-by-turn directions can then be derived by considering the geometry of the road network at each intermediate node on the route, and what instructions might be required by users, for instance:
Determine the compass direction of travel along a road segment to initiate travel in the correct direction,
Calculate the angle between entry and exit directions at intersections to provide turn directions as left, right, straight, etc,
Use the geometric length of road segments to provide distance guidance,
Consolidate to a minimum set of directions by identifying where explicit guidance is not required (e.g., continuing straight on the same road), and
Render the instructions into an easily (?) human-consumable form, with natural language descriptions of appropriate precision.
['travel south west on Holly Court',
'continue for 190m on Holly Court',
'turn right into Marigold Crescent',
'continue for 360m on Marigold Crescent',
'go straight into Gowanbrae Drive',
'continue for 420m on Gowanbrae Drive',
'turn left into Gowanbrae Drive',
'continue for 150m on Gowanbrae Drive',
'turn left into Lanark Way',
'continue for 170m on Lanark Way']
This constitutes a rather neat first approximation to to commercial turn-by-turn directions, but I suspect it suffers in many edge cases, like roundabouts and slip lanes.
With a drive-by look at the key elements, the road ahead to future “my way” hand-rolled navigation is clearer. An essential next step would be an interactive map interface. However, making this prototype roadworthy also likely needs more data wrangling under the hood (e.g., for cycling-specific data), a review of where to leverage existing open services, and polishing edge cases to a mirror finish.