7 wastes of data production – when pipelines become sewers

I recently had the chance to present an updated version of my 7 wastes of data production talk at DataEngBytes Melbourne 2023. I think the talk was stronger this time around and I really appreciated all the great feedback from the audience.

Thanks to Peter Hanssens and the DEB crew for having me as part of an impressive speaker lineup and for putting on a great event.

The video will be coming very soon (I’ll add it here when it’s up) and you can check out the slides in the meantime.

For the earlier versions, see the original 7 wastes post and the 2021 LAST Conference version Data mesh: a lean perspective.


There’s a lot of ground to cover in 30 minutes with 7 wastes from both run and build lenses, plus 5 lean principles to address the waste. I’ll leave the summary here and encourage you to watch the video or read the slides if you want to know more.

| Waste | Run | Build |
|---|---|---|
| Overproduction | Unused products | Unused products |
| Inventory | Stored or processed data not used | Development work in progress |
| Overprocessing | Correcting poor quality data | Working with untrusted data |
| Transportation | Replication without reproducibility | Handoffs between teams |
| Motion | Manual intervention or finishing | Context switching |
| Waiting | Delays in taking action on business events | Delays due to handoffs or feedback lead time |
| Defects | Defects introduced into data at any point | Defects introduced into processing code |

7 wastes of data production in run and build

Privacy puzzles

I contributed a database reconstruction attack demonstration to the companion repository to the excellent book Practical Data Privacy by my colleague Katharine Jarmul.

My interest was piqued by my colleague Mitchell Lisle sharing the paper Understanding Database Reconstruction Attacks on Public Data by US Census Bureau authors Simson Garfinkel, John M. Abowd, and Christian Martindale. Mitchell and I collaborated on a pair of solutions using mathematical optimisation/satisfaction techniques. Check out Mitchell’s solution using the Z3 library. I used OR-Tools instead.

The notebook demonstrates that individual rows of a database may be reconstructed, even if only summary statistics are shared, by considering the constraints that the statistics place on possible values of the data. Constraints include mean and median for all numerical values globally and for various cohorts of records determined by class values.
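To make this concrete, here’s a minimal pure-Python sketch (not the notebook’s OR-Tools model) showing how a handful of published statistics can pin down every row of a tiny database; the statistics and age range are invented for illustration:

```python
from itertools import combinations_with_replacement

# Invented published statistics for a 3-row database of ages:
#   overall mean = 36, overall median = 30,
#   and the single under-30 record has mean age 25.
# Enumerate all non-decreasing triples in a plausible age range and
# keep those consistent with every statistic.
candidates = [
    (a, b, c)
    for a, b, c in combinations_with_replacement(range(0, 116), 3)
    if a + b + c == 36 * 3                            # overall mean
    and b == 30                                       # median
    and [x for x in (a, b, c) if x < 30] == [25]      # under-30 cohort
]
print(candidates)  # a single consistent triple: full reconstruction
```

A constraint solver does the same search far more efficiently, but the principle is identical: each statistic eliminates candidate databases until only the true one remains.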

Note that the intent of this notebook is not to compromise any private data, but to raise awareness of the potential for privacy breaches due to reconstruction attacks!

A gentle introduction to embeddings at the inaugural GenAI Network Melbourne meetup

I was thrilled to help kick off the GenAI Network Melbourne meetup at their first meeting recently. I presented a talk titled Semantic hide and seek – a gentle introduction to embeddings, based on my experiments with Semantle, other representation learning, and some discussion of what it means to use Generative AI in developing new products and services. It was a pleasure to present alongside Rajesh Vasa from A2I2 at Deakin University.

Thanks to Ned, Orian, Scott, Alex, Leonard & co for organising. Looking forward to more fun events in this series!

Check out the slides.


Background on embeddings

Animated chart mosaic titled This wheelie does not exist. Shows a single dimension (duration) mapping to a 3D latent space which in turn generates a realistic looking 100 sample wheelie trace

The game Semantle and my solvers

  • About the game, and playing with friends
  • Live online solver demo!
  • Solver project aims: experiment with embeddings, automate solutions, explore how people and machines work together on problems
  • Modular solver design and search strategies, illustrated below
Diagram showing modular solver design including semantic model with limited vocabulary, informing search state that determines the next guess to make based on the game state, which can also be influenced by other players' guesses. The cohort and gradient search strategies are shown below.

Reflections on people and machines working together

Diagram Comparing the SECI model with naive automation (means all tacit stages are lost) and augmentation with machines that can help socialise to reinforce the cycle

Maths Whimsy with Python

At PyCon AU 2023 in Adelaide I delivered a talk titled Maths Whimsy with Python. It was a great chance to review a range of projects small and large I’ve already shared here. Check out the slides and video.

In three years of the maths whimsy repo, I’ve covered a lot of ground, and got hands-on with many Python libraries for data science. I was particularly pleased when the idea to visualise multi-stage classification pipelines as Sankey diagrams was enthusiastically picked up by one client. I’m not sure I’ll find an application for which numpig chunking of large arrays is the best solution, but I remain quite fond of it. And I could spend all day watching the skeuomorphic odometer visualisation created for the slackometer.

The talk dived into the square and weird planet gravity whimsy where it all started, the Lockdown Wheelie Project, the Sankey diagrams, and touched on Semantle and Asteroid Escape solvers, then reviewed the typical project pattern, and explored how all the mini projects collectively reinforced one another.

On reflection the lessons for me were:

  • Be driven by interest
  • Be opportunistic with energy
  • Take small steps in problem & solution complexity 
  • Small steps help deliver outputs
  • Learning will cascade
  • And real world applications will become apparent

Afterwards, someone asked me about the times it hadn’t worked. I should have covered this too! The primary time I wasn’t satisfied that I had something to share was when I was experimenting with reinforcement learning to solve mazes with disconnected walls. I bit off more than I could chew, and while I went end-to-end with Deep Q-Learning, and then backfilled some bits of the training pipeline, I didn’t reach a satisfactory conclusion within the time I had available. I took this lesson forward to re-frame the problem into smaller, achievable chunks whenever it looked like I may have dived too deep on subsequent whimsies, and it has served me well since.

Thanks to Ned for the picture!

Perspectives edition #27

I was thrilled to contribute to Thoughtworks Perspectives edition #27: Power squared: How human capabilities will supercharge AI’s business impact. There are a lot of great quotes from my colleagues Barton Friedland and Ossi Syd in the article, and here’s one from me:

The ability to build or consume solutions isn’t necessarily going to be your differentiator – but the ability to integrate them into your processes and products in the best way is.

It’s also neat to be quoted in Português and Español!

Electrifying the world with AI Augmented decision-making

I wrote an article about optimising the design of EV charging networks. It’s a story of work done by a team at Thoughtworks, demonstrating the potential of AI augmented decision-making (including some cool optimisation techniques), in this rapidly evolving but durably important space.

We were able to thread together these many [business problem, AI techniques, data sourcing, technology] concerns in such a short time because of our relentless focus — not on the code or the mathematics — but on generating value: namely making it easier for humans to make decisions according to business goals.

A coding saga with Bard

Though but a footnote in the epic of coding with AI, I thought it worth musing on my recent experience with Bard. Bard currently uses the LaMDA model, which is capable of generating code, but not optimised for it. The story might be different with Codey as protagonist (or is that antagonist?).

I didn’t produce any software I could use in this session, but I did learn that Bard was at its most useful when prompted with open questions or very specific concepts. I also learned that generated code may be sabotaged by plausible but incorrect domain logic and conceptual fault lines that occur between contexts.

Also, given that safetydave.net contributes 0.0000005% of tokens to the C4 dataset which makes up some of Bard’s training data, I thought I should see some return!

Act 1

In which we meet a domain.

Take 1

I started naively with a reasonably complex ask: how do I calculate the swept path of a vehicle in python?

A drawing of an overhead view of a car following a curved path

I’ve done swept path analysis before, but in manufacturing, and I certainly don’t have a Python solution handy. Even if I only got the broad outlines of a solution, an answer or parts of an answer to this question would save a bunch of time.

This was far too wide an ask. I got code that looks like it’s for a curve that a vehicle might follow, but that’s just one part of the solution, and the more I looked at it, the more issues I found.

# The vehicle's position
x = np.array([0, 0])
y = np.array([0, 0])

# The vehicle's orientation
theta = 0

# The vehicle's velocity
v = 1

# The vehicle's wheelbase
w = 2

# Calculate the vehicle's position at each time step
for t in np.arange(0, 10, 0.1):
    x += v * np.cos(theta) * t
    y += v * np.sin(theta) * t
    theta += v / w * t

First, vehicle is a word that identifies a whole domain of assumptions, including how steering works. But this domain logic is weird; assuming theta represents direction of travel, that’s not how steering works, in this universe at least. The Export to Colab feature is nice, so let’s see how the code runs. Spoiler: it doesn’t.

As a user of Python, I could diagnose the runtime error and fix it (the initial values of x and y could be defined with dtype=float), but on closer inspection, the += operator overwrites rather than appends, and, wait a minute, x and y are treated both as vectors and scalars. As the domain logic is still bizarre, is there any purpose in fixing these conceptual faults in the code?

Take 2

A secondary prompt to provide more guidance: I would like to consider the area swept out by the corners of the vehicle, can you update the solution?

This only resulted in more trigonometric prestidigitation; more domain dislogic. I use the term dislogic a bit like the term truthiness. There is some related logic to the code, but it is “apart” from the logic we need.

Take 3

I tried much more explicit guidance by elaborating on domain concepts in the third prompt. This led to a more detailed but still fundamentally flawed and unusable solution. I decided this path wouldn’t lead to a happy ending.

Morals of Act 1

Don’t assume too much domain expertise. Bard has learned on generic public examples. Use your domain expertise to break the problem into smaller chunks.

Also, don’t expect conceptual consistency throughout. LLMs like Bard, as next-token predictors, don’t necessarily ensure conceptual consistency in their output.

Act 2

In which I choose a path to follow.

Take 1

I decided to focus on one part of the solution; getting the curve right. I reset Bard’s context.

I want a python function to create a curve between two points. The function arguments should be the start and end points and the tangent to the curve at the start and end points

Nice linear interpolation, shame about the tangents (which, while present as arguments, were totally ignored in the function body).

A straight line chart

And the above could only be generated after fixing more errors preventing the code from running. The affordances of tuples and numpy.ndarray were confused, and the coordinates weren’t passed correctly to the plot method. The syntax was fine, but the code was riven with conceptual fault lines between contexts – what looked OK in one or other context in isolation caused problems when the contexts were brought together. The bugs were fairly obvious in this case, but in general could be subtle and difficult to detect.

Still, after minor adjustments, it’s a curve that meets some of the requirements. This is more useful than what we got in Act 1.

Take 2

I augmented the initial prompt.

The curve tangent should match the tangents of the start and end point supplied as arguments. Please define points and tangents as numpy arrays

Now we have two nice linear interpolations, but the tangent is not the tangent of the curve! (It’s more like we have linearly interpolated the position and orientation of the vehicle.)

A series of line segments. The start point of each line in the series follows a straight line. The end point direction from the start point rotates through 90 degrees from the first to the last line in the series

This code, presumably due to specifying numpy, worked straight out of the box.

Take 3

Maybe I can establish the link between the tangent and the curve to which it is tangent?

In the create curve function, the tangent should be calculated as unit vector in the direction from one point to the next on the curve. This calculated tangent should equal the tangent argument at both the start and end of the curve.

Nope, that was too naively optimistic. I was forgetting the Morals of Act 1.

Take 4

Let’s recall the Morals of Act 1, and try to link to concepts with better public support, such as Bézier curves.

The create curve function should be updated to return a bezier curve rather than a linear interpolation between start and end point

The result looked Bézier-ish, but the formula wasn’t transcribed faithfully, with the calculated points quickly racing away to infinity (well outside the convex hull of the control points).

# Calculate the point at the current index.
point = (1 - i / num_points)**3 * start_point + 3 * (1 - i / num_points)**2 * i / num_points * start_tangent + 3 * (1 - i / num_points) * (i / num_points)**2 * end_tangent + i**3 * end_point

However, I think we’re getting closer.
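For contrast, a faithful cubic Bézier evaluation is compact, and its points stay within the convex hull of the control points. This is my own reference sketch, not Bard’s output:

```python
import numpy as np

# Cubic Bezier: B(t) = (1-t)^3 p0 + 3(1-t)^2 t p1 + 3(1-t) t^2 p2 + t^3 p3
def bezier(p0, p1, p2, p3, num_points=100):
    t = np.linspace(0, 1, num_points)[:, None]
    return ((1 - t) ** 3 * p0 + 3 * (1 - t) ** 2 * t * p1
            + 3 * (1 - t) * t ** 2 * p2 + t ** 3 * p3)

curve = bezier(np.array([0.0, 0.0]), np.array([0.0, 1.0]),
               np.array([1.0, 1.0]), np.array([1.0, 0.0]))
print(curve[0], curve[-1])  # starts at p0, ends at p3
```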

Take 5

Time to reset Bard’s context again and go all in on Bézier.

I would like a python function to create a bezier curve


Please update the solution to plot the curve with matplotlib

And with a few manual tweaks that are becoming familiar…

Image of a diff between two code versions
Diff of Bard generated code (left) and my edits (right) to get the desired result

… we finally get a curve with the desired properties from Take 1!

A bezier curve with 4 control points

However, this curve is not necessarily going to represent the path of a vehicle, which has a minimum radius of curvature (the vehicle’s turning circle) that a Bézier curve can’t guarantee. Based on the experience above, it’s unclear to me how I would incorporate the curvature requirement with further prompts.

Taking stock: after much prompting, I have a method to create a class of curve I can’t directly use in the solution. (And I ended up looking up the code anyway.)

Moral of Act 2

Don’t forget the Morals of Act 1.

In addition, being able to directly reference the key concepts in a generic domain, rather than trying to explain the logic, may be key to getting closer to a correct solution in a few shots.

If you’re not familiar enough with the key concepts to reference them directly, you might use a more open prompt initially, such as: How do you create a smooth curve? This might point you in the right direction…

There are many ways to create a smooth curve. Here are a few examples:

* Use a Bezier curve. A Bezier curve is a parametric curve that is defined by a set of control points. The curve is smooth because it is a polynomial function of the control points.

* ... [goes on to reference Splines, Fourier series and B-Splines] ...

Act 3

Because we need 3 acts. In which I back up and broaden the domain.

Take 1

Considering whether the word vehicle in the initial prompt had given Bard the wrong steer, I tried the more generic prompt: how do I calculate the swept path of an object in python?

This gave helpful introductory commentary on breaking the problem down, and a nearly usable solution.

# Define the circle's geometry.
points = np.array(...)

# Define the circle's motion.
path = np.array(...)

# Calculate the swept path.
swept_path = np.zeros((len(points), len(path)))
for i in range(len(points)):
    for j in range(len(path)):
        swept_path[i][j] = points[i] + path[j]

But one that still needed an expert review to ensure values ended up in all the right places.

An image showing the diff between two versions of code
Diff of Bard generated code (left) and my edits (right) to get the desired result

Below we can see different coloured “circles” drawn at each position in the path.

5 overlapping squares (representing circles with very low resolution)

This is pretty trivial though – it’s just organised vector addition – did I need AI for that?
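For the record, that organised vector addition collapses to a single numpy broadcast; this is my own corrected sketch rather than Bard’s code:

```python
import numpy as np

# Sweep an object outline along a motion path: every object point is
# translated by every path position.
points = np.array([[0, 0], [1, 0], [1, 1], [0, 1]])   # object outline
path = np.array([[0, 0], [1, 1], [2, 2]])             # motion positions

# Broadcasting gives one copy of the outline per path position.
swept = path[:, None, :] + points[None, :, :]
print(swept.shape)  # (3, 4, 2): 3 positions x 4 points x 2 coordinates
```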

Moral of Act 3

Keeping it simple increases the chance of success, but you should balance this against whether a simple solution provides sufficient value.

Concluding the saga, for now

I tried to use Bard to deliver large chunks of a complex solution, rather than as a smarter autocomplete for finer details, or as an aid to understanding existing or proposed solutions. In the time I spent prompting Bard, I would have got further writing code directly. However, I have a lot of scope to improve my prompting.

With expertise in the domain and the code, I was able to diagnose and correct the issues in Bard’s solutions, but I suspect that someone who lacked one or both of those areas of expertise couldn’t recover quickly. In some respects, developing software is about recovering quickly from errors – we can’t avoid making mistakes, but we set up feedback loops to detect them quickly, and over time we become vigilant to more of the types of mistakes we are likely to make. Does an AI coding assistant like Bard help us recover quickly from mistakes? I didn’t actually ask Bard to help much in this session, so that question needs further work to resolve, possibly taking the angle of AI-aided test-first development.

What I did learn was that Bard was at its most useful when prompted with open questions or very specific concepts with public data support. I also learned that generated code is likely to be sabotaged by domain dislogic and conceptual fault lines between contexts.

Over time, we’ll figure out how to make AI a better protagonist and antagonist in our coding stories; for me, this was an interesting way to introduce a new character.

Humour me – DRY vs WRY

Don’t Repeat Yourself (DRY) is a tenet of software engineering, but – humour me – let’s consider some reasons Why to Repeat Yourself (WRY).

LEGO reuse lessons

In 2021, I wrote a series of posts analysing LEGO® data about parts appearing in sets to understand what it might tell us about reuse of software components in digital products. I’ve finally summarised the key findings that show both DRY and WRY forces at play. We’re strictly talking about reuse VS specialisation (not repetition), but I think the lessons on the reuse dynamic are relevant.

Exponential growth in volume

The total number of parts and sets ever created has grown exponentially over the years. The result is that in 2021, there were 10 times as many parts as 30 years ago, and about 5 times as many sets. Thus, even though parts can be re-combined to create new models, new parts are constantly introduced at an increasing rate.

Bar chart of new lego parts each year, with a line showing total parts. The vertical scale is logarithmic, and both new and total parts follow a straight line on the chart, indicating exponential growth

Read more in LEGO as a Metaphor for Software Reuse – Does the Data Stack Up?

Exponential decay in lifespan

While the oldest parts are 70 years old, only about 1/7 of all parts ever created are in active use in 2021 and fully 1/3 of parts have a lifespan of only one year. Over time, survival decays exponentially. In each of the first 5 years, 50% of parts don’t survive to the next year. Beyond that, remaining parts halve in number every seven years.

Chart of lego part lifespans. The vertical axis is logarithmic. The scattered points can be approximated with two linear segments, one for the first five years, showing a half-life of 0.9 years, and another for the remaining ~70 years, showing a half-life of 7.2 years

Read more in LEGO and Software – Lifespans.

Power-law distribution of reuse

Some parts are heavily reused in the sets offered for sale, but the vast majority of parts are never reused or only reused a little; the distribution can be approximated with a power law. Reuse is far more uneven than a typical 80/20 distribution: 80% of reuse instances are due to only 3% of parts, and 20% of parts account for 98% of reuse instances. At the other end of the spectrum, 60% of parts are used in only one set, and only 10% of parts appear in more than 10 sets.
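As a sketch of how such concentration can be measured (with invented counts, not the Rebrickable data):

```python
import numpy as np

# Invented reuse counts, sorted from most to least reused part:
# a few heavily reused parts and a long tail used only once.
uses = np.concatenate([
    np.full(3, 1000),   # heavily reused parts
    np.full(17, 50),    # moderately reused parts
    np.full(80, 1),     # long tail: appear in one set only
])
cumshare = np.cumsum(uses) / uses.sum()
parts_for_80pct = np.searchsorted(cumshare, 0.8) + 1
print(f"{parts_for_80pct} of {len(uses)} parts cover 80% of reuse")
```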

Log-log scatter plot of count of part inclusion in sets. Two linear segments fitted to plot show power law approximation for reuse

Read more in LEGO and Software – Part Reuse.

Churn driven by growth and specialisation

Given the growth and specialisation profiles, total churn of parts approached 100% in 2020, whereas in the decade centred on 1990, it was only about 20%. High churn is consistent with a small base of heavily reused parts, and ever-increasing numbers of specialised parts with short lifespans.

Read more in LEGO and Software – Variety and Specialisation and LEGO and Software – Lifespans.

Part roles emerge from the reuse graph

We can understand more about the roles played by specialised and reused parts through analysis of the graph of connections between parts and sets, and identify new opportunities for recombination.

Network visualisation showing association rules between common parts

Read more in LEGO and Software – Part Roles.

Lessons for software

What would I take away for software?

Reusability of components doesn’t necessarily lead to reuse. The majority of reuse will come from a few components that perform fundamental roles. Focus on getting this right.

More – and more specialised – products may drive specialisation of components. Digital product lines are never static and we may expect some components to have short lifespans and churn heavily. Good development practices and loosely-coupled architectures allow teams to work with ephemeral and idiosyncratic components. However, ongoing review can still identify opportunities to harvest patterns and consolidate specialised components.

Note that, even when we produce multiple similar code artefacts, we may see effective reuse of higher-level approaches and concepts.

These aren’t prescriptive rules, but a reflection of the patterns in the data. There are more comprehensive observations in the individual articles. We should remember that reuse is not the primary aim of producing software, but a principle that supports better organisation towards sustainably responsive delivery.

Discussion of data relevance

Why is LEGO data relevant? In many conversations I’ve had about software reuse, LEGO is presented as a desirable model. This may be peculiar to me, but I think it is a fairly common conversation.

The number of possible mechanical couplings of just a handful of bricks is indeed enormous, but I wanted to understand how these components had actually been assembled into products that were sold to customers over some period of time. The data is sourced from the Rebrickable API. I’ve just taken part data at face value in this analysis; if something is recorded as a distinct part, I treat it as a distinct part. There may be better ways to translate the LEGO metaphor to software components.

Maybe there’s a generational factor in LEGO as a metaphor too; in the 1980s and 1990s, you would play with a much smaller and more stable base of active parts than the 2000s and 2010s, and that could shape your thinking. I’d love to hear feedback.

LEGO® is a trademark of the LEGO Group of companies which does not sponsor, authorize or endorse this site.

Smarter Semantle Solvers

A little smarter, anyway. I didn’t expect to pick this up again, but when I occasionally run the first generation solvers online, I’m often equal parts amused and frustrated by rare words thrown up that delay the solution – from amethystine to zigging.

Animation of online semantle solution
An example solution with fewer than typical rare words guessed

The solvers used the first idea that worked; can we make some tweaks to make them smarter? The code is now migrated to its own new repo after outgrowing its old home.

Measuring smarts

I measure solver performance by running multiple trials of a solver configuration against the simulator for a variety of target words. This gives a picture of how often the solver typically succeeds within a certain number of guesses.
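In sketch form (with invented trial results, not my actual measurements), the success profile is just an empirical CDF over guesses-to-solve:

```python
import numpy as np

# Guesses taken to solve, one entry per trial (invented data).
guesses_per_trial = np.array([12, 35, 18, 240, 55, 90, 17, 500, 33, 61])

# Fraction of trials solved within n guesses (empirical CDF).
def solved_within(n, trials):
    return (trials <= n).mean()

for n in (50, 100, 500):
    print(f"solved within {n:>3} guesses: {solved_within(n, guesses_per_trial):.0%}")
```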

Chart showing cumulative distribution function curves for two solver configurations


It turns out that the vocabulary used to date, based on english_words_set, is a poor match for the most frequently used English words, according to unigram frequency data.

So we might expect that simply replacing the solver vocabulary would improve performance, and we also get word ranking from unigram_freq.

Semantic models

We’ll continue with Universal Sentence Encoder (USE) to ensure search strategies are robust to different semantic models.


To improve the gradient solver I tried making another random guess every so often to avoid long stretches exploring local minima. But it didn’t make things better, and probably made them worse!

In response, I made each guess the most common local word to the extrapolated semantic location, rather than just the nearest word. Still no better, and trying both “improvements” together was significantly worse!
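The “most common local word” tweak can be sketched like this (a simplified reconstruction; the vectors, ranks, and k are invented):

```python
import numpy as np

# Pick the most common word among the k nearest to an extrapolated
# semantic location (lower rank = more common word).
def next_guess(target_vec, vocab_vecs, vocab_ranks, k=5):
    dists = np.linalg.norm(vocab_vecs - target_vec, axis=1)
    nearest = np.argsort(dists)[:k]
    return nearest[np.argmin(vocab_ranks[nearest])]

vecs = np.array([[0.0, 0.1], [0.2, 0.0], [5.0, 5.0]])
ranks = np.array([40, 7, 1])  # word 2 is most common, but far away
print(next_guess(np.array([0.0, 0.0]), vecs, ranks, k=2))  # -> 1
```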

Ah well, experiments only fail if we fail to learn from them!

Vocabulary again

I think the noise inherent in a different semantic model, plus the existing random extrapolation distance, overwhelms the changes I tried. In better news, we see a major improvement from using unigram freq vocabulary, reducing the mean from 280 (with many searches capped at 500) to 198, approximately a 30% improvement.

Smarter still?

Here we see that the data-centric (vocabulary) improvement had a far bigger impact than any model-centric (search algorithm) improvement that I had the patience to try (though I left a bunch of further todos). Maybe just guessing randomly from the top n words will be better again!

At least I’ve made a substantial dent in reducing those all-too-common guesses at rare words.

I did it my way – hand-rolled navigation with open spatial data

Sure commercial maps app directions are great, but have you ever found the customisation options limited? What if you want to use bike paths and back streets when cycling, or avoid winding roads that might make backseat passengers car-sick on a road trip?

Example of cycling route options that are direct (B) and pleasant (A)

The paved route

OpenStreetMap and OpenRouteService do provide this type of functionality, and could be considered for use as-is or with further customisation. PostGIS and pgRouting provide capabilities if you bring your own data. Many dedicated apps support groups with particular mobility needs.

My way, off the highway

In researching these capabilities however, and because I’m a fan of maps and I wanted to understand the whole data transformation journey, I decided to hand-roll my own navigation solution using pyshp, numpy, scipy, and networkx, all visualised with matplotlib. The end result is far from polished, but it can ingest 1.1M road geometries for the Australian state of Victoria, and generate a topological graph for routing within minutes, then use that map to generate turn-by-turn directions in real time.

See the source code and the brief write-up below if you’re interested.


The solution uses data from the Vicmap Transport data set, which provides road centrelines for highways, streets, tracks, paths, etc, for all of Victoria and some bordering regions. The spatial features are augmented with 71 attributes useful for routing, including road names, permitted directions of travel, height limits, etc. I used a GDA2020 datum grid projection shapefile export. Pyshp provides a list of geometries and attributes via shapeRecords.


Vicmap Transport road centrelines are collections of polylines. The endpoints of these polylines (aka road segments) helpfully coincide where we might infer a continuous stretch of road or intersection. This allows us to describe how roads are connected with a graph.

Map of Bairnsdale township showing spatial location of roads overlaid with their network connectivity
Road network modeled as a graph in Bairnsdale, Victoria (data: Vicmap Transport)

Each endpoint will map to a node in the graph. The node may be unique to one road segment, or shared between multiple road segments if it’s at a junction. We find the coincident endpoints for a shared node with the KDTree query_pairs method. The road segments with coincident endpoints then define the edges of the graph at this node. A directed graph can be defined using the direction code attribute of each segment (forward, reverse, or both directions).
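A minimal sketch of that node-merging step (invented coordinates; the real pipeline processes millions of endpoints):

```python
import numpy as np
from scipy.spatial import KDTree

# Endpoints of two road segments; segment 1 starts where segment 0 ends.
endpoints = np.array([
    [0.0, 0.0], [1.0, 0.0],   # segment 0
    [1.0, 0.0], [2.0, 1.0],   # segment 1
])

# Pairs of endpoint indices that coincide (within tolerance r).
pairs = KDTree(endpoints).query_pairs(r=1e-9)

# Map coincident endpoints to a shared node id.
node_id = list(range(len(endpoints)))
for i, j in sorted(pairs):
    node_id[j] = node_id[i]

print(node_id)  # endpoints 1 and 2 now share node 1
```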


Two routes through Bairnsdale township overlaid on the network connectivity map

With a graph representation of the road network, we can find a route between any two nodes (if connected) with standard algorithms. The examples above and below use Dijkstra’s algorithm to find the shortest path based on edge weights that reflect our routing preferences. The orange route is “fewest hops” (count of road segments) and the green route is “shortest distance”. Geometric length of road segments is calculated in a post-processing pass over the ingested polyline data, and assigned as a weight to each edge.
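A toy version of the routing step, with a hypothetical five-node graph; changing the edge weight switches between “fewest hops” and “shortest distance”:

```python
import networkx as nx

# Directed road graph with per-edge lengths: the B route is fewer hops,
# the C-E route is shorter distance.
g = nx.DiGraph()
g.add_edge("A", "B", length=5.0)
g.add_edge("B", "D", length=5.0)
g.add_edge("A", "C", length=2.0)
g.add_edge("C", "E", length=2.0)
g.add_edge("E", "D", length=2.0)

fewest_hops = nx.shortest_path(g, "A", "D")                     # unweighted
shortest_dist = nx.shortest_path(g, "A", "D", weight="length")  # Dijkstra
print(fewest_hops)     # ['A', 'B', 'D']
print(shortest_dist)   # ['A', 'C', 'E', 'D']
```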

Two routes through the wider Bairnsdale area overlaid on the network connectivity map
Routing options in the area surrounding Bairnsdale

Optimisation and scaling

My first spikes were hideously inefficient, but once the method was established, there was a lot of room for improvement. I addressed three major performance bottlenecks as I scaled from processing 50k road segments in 50 minutes, to 1.1M road segments in 30 seconds. These figures are from a 2017 MacBook Pro and a free Colab instance, which performed roughly similarly.

| Processing stage and approach | End-to-end time |
|---|---|
| Coincident segment endpoints | |
| – For loop accumulating unique ids with distance test | 50 mins |
| – numpy array calculation (argwhere distance < e) and numpig | 6 mins |
| – scipy.spatial.KDTree.query_pairs | 2 mins |
| Shared node mapping | |
| – List comprehension elementwise map | Broke Colab |
| – numpy materialised mapping | < 2 mins |
| Directed edges (previously undirected) | |
| – For loop accumulating correctly directed edges case-wise and discarding topological duplicates | > 12 hrs |
| – numpy vectorisation of loop conditional logic | 30 sec |

Stages of optimisation to support scaling, from 50k segments (in 50 minutes) to 1.1M segments (in 30 seconds)
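The directed-edges vectorisation can be sketched as follows (a simplified reconstruction of the idea, not the pipeline code): boolean masks over the direction codes select forward, reverse, or both edges per segment, replacing the case-wise loop.

```python
import numpy as np

# One entry per segment: start node, end node, direction code.
starts = np.array([0, 1, 2, 3])
ends = np.array([1, 2, 3, 0])
direction = np.array(["F", "R", "B", "F"])  # forward / reverse / both

fwd = direction != "R"   # segments traversable start -> end
rev = direction != "F"   # segments traversable end -> start
edges = np.concatenate([
    np.stack([starts[fwd], ends[fwd]], axis=1),
    np.stack([ends[rev], starts[rev]], axis=1),
])
print(len(edges))  # 5 directed edges from 4 segments
```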

An additional ~30s is required to post-process geometric length of segments as an attribute per edge, and I imagine similar for other derived edge attributes. For instance, we might get finer grained average cycling speed per segment, or traffic risk factors, etc.

For calculating routes, (i.e., at inference time) it takes about 1-4s to find a shortest path, depending on the length of the route (using pure Python networkx). We can now find routes of over 950km length, from Mildura in the state’s north-west to Mallacoota in the east.

Two long routes overlaid on sampled road data from all of Victoria and surrounds
Two routes from Mildura to Mallacoota

More latitude for navigation

We would like to be able to find the start and end nodes of a route from latitude and longitude. However, as the nodes in our routing graph are located by VicGrid grid coordinates (eastings and northings), we first need to “wrap” this planar grid around the roughly spherical Earth. While geometrically complex (see below), it’s easy to do this transformation back and forth between coordinate systems with pyproj.

A grid with axis-aligned arrow pairs at each point showing how space is squashed or stretched in 2D when transforming from eastings, northing to latitude, longitude
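The transformation might look like this with pyproj, assuming VicGrid 2020 is EPSG:7899 (check the dataset’s metadata for the exact CRS):

```python
from pyproj import Transformer

# Round trip: longitude/latitude -> VicGrid eastings/northings -> back.
to_grid = Transformer.from_crs("EPSG:4326", "EPSG:7899", always_xy=True)
to_lonlat = Transformer.from_crs("EPSG:7899", "EPSG:4326", always_xy=True)

lon, lat = 144.96, -37.81                  # central Melbourne, roughly
east, north = to_grid.transform(lon, lat)
lon2, lat2 = to_lonlat.transform(east, north)
print(round(lon2, 6), round(lat2, 6))
```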

Turn-by-turn navigation directions

With the start and end nodes located from latitude and longitude, a route can be calculated as above. Turn-by-turn directions can then be derived by considering the geometry of the road network at each intermediate node on the route, and what instructions might be required by users, for instance:

  • Determine the compass direction of travel along a road segment to initiate travel in the correct direction,
  • Calculate the angle between entry and exit directions at intersections to provide turn directions as left, right, straight, etc,
  • Use the geometric length of road segments to provide distance guidance,
  • Consolidate to a minimum set of directions by identifying where explicit guidance is not required (e.g., continuing straight on the same road), and
  • Render the instructions into an easily (?) human-consumable form, with natural language descriptions of appropriate precision.
['travel south west on Holly Court',
 'continue for 190m on Holly Court',
 'turn right into Marigold Crescent',
 'continue for 360m on Marigold Crescent',
 'go straight into Gowanbrae Drive',
 'continue for 420m on Gowanbrae Drive',
 'turn left into Gowanbrae Drive',
 'continue for 150m on Gowanbrae Drive',
 'turn left into Lanark Way',
 'continue for 170m on Lanark Way']
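The intersection-angle step, for example, can be sketched as follows (my own simplification; the 30° straight-ahead threshold is an assumption):

```python
# Classify a turn from entry and exit bearings in degrees clockwise
# from north, normalising the difference to [-180, 180).
def turn_instruction(entry_bearing, exit_bearing):
    angle = (exit_bearing - entry_bearing + 180) % 360 - 180
    if abs(angle) < 30:
        return "go straight"
    return "turn right" if angle > 0 else "turn left"

print(turn_instruction(0, 90))    # heading north, exit east -> turn right
print(turn_instruction(90, 85))   # nearly straight -> go straight
print(turn_instruction(180, 90))  # heading south, exit east -> turn left
```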

This constitutes a rather neat first approximation to commercial turn-by-turn directions, but I suspect it suffers in many edge cases, like roundabouts and slip lanes.

Next steps

With a drive-by look at the key elements, the road ahead to future “my way” hand-rolled navigation is clearer. An essential next step would be an interactive map interface. However, making this prototype roadworthy also likely needs more data wrangling under the hood (e.g., for cycling-specific data), a review of where to leverage existing open services, and polishing edge cases to a mirror finish.