Getting serious about testing AI

I was delighted to contribute to a Thoughtworks’ insights article on AI testing in response to Forrester’s recent report It’s Time To Get Really Serious About Testing Your AI. The report rightly raises the importance of testing in AI systems and highlighted Thoughtworks’ Continuous Delivery for Machine Learning (CD4ML) approach. The response also discusses other important elements of testing, for instance controlling sources of variation.

A system designed to be testable enables you to identify all the sources of input variation — such as random seeds, prompt construction, LLM temperature and sampling parameters — and the more of these you can hold constant in a test environment, the more you should expect deterministic behavior.

Read the full article Time to get serious about AI testing

A gentle introduction to embeddings at the inaugural GenAI Nework Melbourne meetup

I was thrilled to help kick-off the GenAI Network Melbourne meetup at their first meeting recently. I presented a talk titled Semantic hide and seek – a gentle introduction to embeddings, based on my experiments with Semantle, other representation learning, and some discussion of what it means to use Generative AI in developing new products and services. It was a pleasure to present alongside Rajesh Vasa from A2I2 at Deakin University.

Thanks to Ned, Orian, Scott, Alex, Leonard & co for organising. Looking forward to more fun events in this series!

Check out the slides.


Background on embeddings

Animated chart mosaic titled This wheelie does not exist. Shows a single dimension (duration) mapping to a 3D latent space which in turn generates a realistic looking 100 sample wheelie trace

The game Semantle and my solvers

  • About the game, and playing with friends
  • Live online solver demo!
  • Solver project aims: experiment with embeddings, automate solutions, explore how people and machines work together on problems
  • Modular solver design and search strategies, illustrated below
Diagram showing modular solver design including semantic model with limited vocabulary, informing search state that determines the next guess to make based on the game state, which can also be influenced by other players' guesses. The cohort and gradient search strategies are shown below.

Reflections on people and machines working together

Diagram Comparing the SECI model with naive automation (means all tacit stages are lost) and augmentation with machines that can help socialise to reinforce the cycle

Perspectives edition #27

I was thrilled to contribute to Thoughtworks Perspectives edition #27: Power squared: How human capabilities will supercharge AI’s business impact. There are a lot of great quotes from my colleagues Barton Friedland and Ossi Syd in the article, and here’s one from me:

The ability to build or consume solutions isn’t necessarily going to be your differentiator – but the ability to integrate them into your processes and products in the best way is.

It’s also neat to be quoted in Português and Español!

A coding saga with Bard

Though but a footnote in the epic of coding with AI, I though it worth musing on my recent experience with Bard. Bard currently uses the LaMDA model, which is capable of generating code, but not optimised for it. The story might be different with Codey as protagonist (or is that antagonist?)

I didn’t produce any software I could use in this session, but I did learn that Bard was at its most useful when prompted with open questions or very specific concepts. I also learned that generated code may be sabotaged by plausible but incorrect domain logic and conceptual fault lines that occur between contexts.

Also, given that contributes 0.0000005% of tokens to the C4 dataset which makes up some of Bard’s training data, I thought I should see some return!

Act 1

In which we meet a domain.

Take 1

I started naively with a reasonably complex ask: how do I calculate the swept path of a vehicle in python?

A drawing of an overhead view of a car following a curved path

I’ve done swept path analysis before, but in manufacturing, and I certainly don’t have a Python solution handy. Even if I only got the broad outlines of a solution, an answer or parts of an answer to this question would save a bunch of time.

This was far too wide an ask. I got code that looks like it’s for a curve that a vehicle might follow, but that’s just one part of the solution, and the more I looked at it, the more issues I found.

# The vehicle's position
x = np.array([0, 0])
y = np.array([0, 0])

# The vehicle's orientation
theta = 0

# The vehicle's velocity
v = 1

# The vehicle's wheelbase
w = 2

# Calculate the vehicle's position at each time step
for t in np.arange(0, 10, 0.1):
    x += v * np.cos(theta) * t
    y += v * np.sin(theta) * t
    theta += v / w * t

First, vehicle is a word that identifies a whole domain of assumptions, including how steering works. But this domain logic is weird; assuming theta represents direction of travel, that’s not how steering works, in this universe at least. Export to Colab feature is nice, so let’s see how the code runs. Spoiler: it doesn’t.

As a user of Python, I could diagnose the runtime error and fix it (the initial values of x and y could be defined with dtype=float), but on closer inspection, the += operator overwrites rather than appends, and, wait a minute, x and y are treated both as vectors and scalars. As the domain logic is still bizarre, is there any purpose in fixing these conceptual faults in the code?

Take 2

A secondary prompt to provide more guidance: I would like to consider the area swept out by the corners of the vehicle, can you update the solution?

This only resulted in more trigonometric prestidigitation; more domain dislogic. I use the term dislogic a bit like the term truthiness. There is some related logic to the code, but it is “apart” from the logic we need.

Take 3

I tried much more explicit guidance by elaborating on domain concepts in the third prompt. This lead to more detailed but still fundamentally flawed and unusable solution. I decided this path wouldn’t lead to a happy ending.

Morals of Act 1

Don’t assume too much domain expertise. Bard has learned on generic public examples. Use your domain expertise to break the problem into smaller chunks.

Also, don’t expect conceptual consistency throughout. LLMs like Bard, as next-token predictors, don’t necessarily ensure conceptual consistency in their output.

Act 2

In which I choose a path to follow.

Take 1

I decided to focus on one part of the solution; getting the curve right. I reset Bard’s context.

I want a python function to create a curve between two points. The function arguments should be the start and end points and the tangent to the curve at the start and end points

Nice linear interpolation, shame about the tangents (which, while present as arguments, were totally ignored in the function body).

A straight line chart

And the above could only be generated after fixing more errors preventing the code from running. The affordances of tuples and numpy.ndarray were confused, and the coordinates weren’t passed correctly to the plot method. The syntax was fine, but the code was riven with conceptual fault lines between contexts – what looked OK in one or other context in isolation caused problems when the contexts were brought together. The bugs were fairly obvious in this case, but in general could be subtle and difficult to detect.

Still, after minor adjustments, it’s a curve that meets some of the requirements. This is more useful than what we got in Act 1.

Take 2

I augmented the initial prompt.

The curve tangent should match the tangents of the start and end point supplied as arguments. Please define points and tangents as numpy arrays

Now we have two nice linear interpolations, but the tangent is not the tangent of the curve! (It’s more like we have linearly interpolated the position and orientation of the vehicle.)

A series of line segments. The start point of each line in the series follows a straight line. The end point direction from the start point rotates through 90 degrees from the first to the last line in the series

This code, presumably due to specifying numpy, worked straight out of the box.

Take 3

Maybe I can establish the link between the tangent and the curve to which it is tangent?

In the create curve function, the tangent should be calculated as unit vector in the direction from one point to the next on the curve. This calculated tangent should equal the tangent argument at both the start and end of the curve.

Nope, that was too naively optimistic. I was forgetting the Morals of Act 1.

Take 4

Let’s recall the Morals of Act 1, and try to link to concepts with better public support, such as Bézier curves.

The create curve function should be updated to return a bezier curve rather than a linear interpolation between start and end point

The result looked Bézier-ish, but the formula wasn’t transcribed faithfully, with the calculated points quickly racing away to infinity (well outside the convex hull of the control points).

# Calculate the point at the current index.
point = (1 - i / num_points)**3 * start_point + 3 * (1 - i / num_points)**2 * i / num_points * start_tangent + 3 * (1 - i / num_points) * (i / num_points)**2 * end_tangent + i**3 * end_point

However, I think we’re getting closer.

Take 5

Time to reset Bard’s context again and go all in on Bézier.

I would like a python function to create a bezier curve


Please update the solution to plot the curve with matplotlib

And with a few manual tweaks that are becoming familiar…

Image of a diff between two code versions
Diff of Bard generated code (left) and my edits (right) to get the desired result

… we finally get a curve with the desired properties from Take 1!

A bezier curve with 4 control points

However, this is curve is not necessarily going to represent the path of a vehicle, which has a minimum radius of curvature (the vehicle’s turning circle) that a Bézier curve can’t guarantee. Based on the experience above, it’s unclear to me how I would incorporate the curvature requirement with further prompts.

Taking stock: after much prompting, I have a method to create a class of curve I can’t directly use in the solution. (And the code for which I ended up looking up anyway.)

Moral of Act 2

Don’t forget the Morals of Act 1.

In addition, being able to directly reference the key concepts in a generic domain, rather than trying to explain the logic, may be key to getting closer to a correct solution in a few shots.

If you’re not familiar enough with the key concepts to reference them directly, you might use a more open prompt initially, such as: How do you create a smooth curve? This might point you in the right direction…

There are many ways to create a smooth curve. Here are a few examples:

* Use a Bezier curve. A Bezier curve is a parametric curve that is defined by a set of control points. The curve is smooth because it is a polynomial function of the control points.

* ... [goes on to reference Splines, Fourier series and B-Splines] ...

Act 3

Because we need 3 acts. In which I back up and broaden the domain.

Take 1

Considering whether the word vehicle in the initial prompt had given Bard the wrong steer, I tried the more generic prompt: how do I calculate the swept path of an object in python?

This gave helpful introductory commentary on breaking the problem down, and a nearly usable solution.

# Define the circle's geometry.
points = np.array(...)

# Define the circle's motion.
path = np.array(...)

# Calculate the swept path.
swept_path = np.zeros((len(points), len(path)))
for i in range(len(points)):
    for j in range(len(path)):
        swept_path[i][j] = points[i] + path[j]

But one that still needed an expert review to ensure values ended up in the all the right places.

An image showing the diff between two versions of code
Diff of Bard generated code (left) and my edits (right) to get the desired result

Below we can see different coloured “circles” drawn at each position in the path.

5 overlapping squares (representing circles with very low resolution)

This is pretty trivial though – it’s just organised vector addition – did I need AI for that?

Moral of Act 3

Keeping it simple increases the chance of success, but you should balance this against whether a simple solution provides sufficient value.

Concluding the saga, for now

I tried to use Bard to deliver large chunks of a complex solution, rather than as a smarter autocomplete for finer details, or as an aid to understanding existing or proposed solutions. In the time I spent prompting Bard, I would have got further writing code directly. However, I have a lot of scope to improve my prompting.

With expertise in the domain and the code, I was able to diagnose and correct the issues in Bard’s solutions, but I suspect that someone who lacked one or both of those areas of expertise couldn’t recover quickly. In some respects, developing software is about recovering quickly from errors – we can’t avoid making mistakes, but we set up feedback loops to detect them quickly, and over time we become vigilant to more of the types of mistakes we are likely to make. Does an AI coding assistant like Bard help us recover quickly from mistakes? I didn’t actually ask Bard to help much in this session, so that question needs further work to resolve, possibly taking the angle of AI-aided test-first development.

What I did learn that Bard was at its most useful when prompted with open questions or very specific concepts with public data support. I also learned that generated code is likely to be sabotaged by domain dislogic and conceptual fault lines between contexts.

Over time, we’ll figure out how to make AI a better protagonist and antagonist in our coding stories; for me, this was an interesting way to introduce a new character.

Smarter Semantle Solvers

A little smarter, anyway. I didn’t expect to pick this up again, but when I occasionally run the first generation solvers online, I’m often equal parts amused and frustrated by rare words thrown up that delay the solution – from amethystine to zigging.

Animation of online semantle solution
An example solution with fewer than typical rare words guessed

The solvers used the first idea that worked; can we make some tweaks to make them smarter? The code is now migrated to its own new repo after outgrowing its old home.

Measuring smarts

I measure solver performance by running multiple trials of a solver configuration against the simulator for a variety of target words. This gives a picture of how often the solver typically succeeds within a certain number of guesses.

Chart showing cumulative distribution function curves for two solver configurations


It turns out that the vocabulary to date based on english_words_set is a poor match for the most frequently used English words, according to unigram frequency data.

So we might expect that simply replacing the solver vocabulary would improve performance, and we also get word ranking from unigram_freq.

Semantic models

We’ll continue with Universal Sentence Encoder (USE) to ensure search strategies are robust to different semantic models.


To improve the gradient solver I tried making another random guess every so often to avoid long stretches exploring local minima. But it didn’t make things better, and probably made them worse!

In response, I made each guess the most common local word to the extrapolated semantic location, rather than just the nearest word. Still no better, and trying both “improvements” together was significantly worse!

Ah well, experiments only fail if we fail to learn from them!

Vocabulary again

I think the noise inherent in a different semantic model, plus the existing random extrapolation distance, overwhelms the changes I tried. In better news, we see a major improvement from using unigram freq vocabulary, reducing the mean from 280 (with many searches capped at 500) to 198, approximately a 30% improvement.

Smarter still?

Here we see that the data-centric (vocabulary) improvement had a far bigger impact than any model-centric (search algorithm) improvement that I had the patience to try (though I left a bunch of further todos). Maybe just guessing randomly from the top n words will be better again! ????

At least I’ve made a substantial dent in reducing those all-too-common guesses at rare words.

22 rules of generative AI

Thinking about adopting, incorporating or building generative AI products? Here are some things to think about, depending on your role or roles.

I assume you’re bringing your own product idea(s) based on an understanding of an opportunity or problems for customers. These rules therefore focus on the solution space.

Solutions with generative AI typically involve creating, combining or transforming some kind of digital content. Digital content may mean text, code, images, sound, video, 3D, etc, for digital consumption, or it may mean digitized designs for real world products or services such as code (again), recipes, instructions, CAD blueprints, etc. Some of this may also be relevant for how you use other people’s generative AI tools in your own work.

Strategy and product management roles

1. Know what input you have to an AI product or feature that’s difficult to replicate. This is generally proprietary data, but it may be an algorithm tuned in-house, access to compute resources, or a particularly responsive deployment process, etc. This separates competitive differentiators from competitive parity features.

2. Interrogate the role of data. Do you need historical data to start, can you generate what you need through experimentation, or can you leverage your proprietary data with open source data, modelling techniques or SaaS products? Work with you technical leads to understand the multitude of mathematical and ML techniques available to ensure data adds the most value for the least effort.

3. Understand where to use open source or Commercial Off-The-Shelf (COTS) software for parity features, but also understand the risks of COTS including roadmaps, implementation, operations and data.

4. Recognise that functional performance of AI features is uncertain at the outset and variable in operation, which creates delivery risk. Address this by: creating a safe experimentation environment, supporting dual discovery (creating knowledge) and development (creating software) tracks with a continuous delivery approach, and – perhaps the hardest part – actually responding to change.

Design roles

5. Design for failure, and loss of vigilance in the face of rare failures. Failure can mean outputs that are nonsensical, fabricated, incorrect, or – depending on scope and training data – harmful.

6. Learn the affordances of AI technologies so you understand how to incorporate them into user experiences, and can effectively communicate their function to your users.

7. Study various emerging UX patterns. My quick take: generative AI may be used as a discrete tool with (considering #5) predictable results for the user, such as replacing the background in a photo, it may be used as a collaborator, reliant on a dialogue or back-and-forth iterative design or search process between the user and AI, such as ChatGPT, or it may be used as an author, producing a nearly finished work that the user then edits to their satisfaction (which comes with risk of subtle undetected errors).

8. Consider what role the AI is playing in the collaborator pattern – is it designer, builder, tester, or will the user decide? There is value in generating novel options to explore as a designer, in expediting complex workflows as a builder, and in verifying or validating solutions to some level of fidelity as a tester. However, for testing, remember you can not inspect quality into a product, and consider building in quality from the start.

9. Design for explainability, to help users understand how their actions influence the output. (This overlaps heavily with #6)

10. More and more stakeholders will want to know what goes into their AI products. If you haven’t already, start on your labelling scheme for AI features, which may include: intended use, data ingredients and production process, warnings, reporting process, and so on, with reference to risk and governance below.

Data science and data engineering roles

11. Work in short cycles in multidisciplinary product teams to address end-to-end delivery risks.

12. Quantify the functional performance of systems, the satisfaction of guardrails, and confidence in these measures for to support product decisions.

13. Make it technically easy and safe to work with and combine rich data.

14. Implement and automate a data governance model that enables delivery of data products and AI features to the support business strategy (i.e., a governance model that captures the concerns of other rules and stakeholders here).

Architecture and software engineering roles

15. Understand that each AI solution is narrow, but composable with other digital services. In this respect, treat each AI solution as a distinct service until a compelling case is made for consolidation. (Note that, as above, product management should be aware of how to make use of existing solutions.)

16. Consolidate AI platform services at the right level of abstraction. The implementation of AI services may be somewhat consistent, or it may be completely idiosyncratic depending on the solution requirements and available techniques. The right level of abstraction may be emergent and big up-front design may be risky.

17. Use continuous delivery for short feedback cycles and delivery that is both iterative – to reduce risk from knowledge gaps – and responsive – to reduce the risk of a changing world.

18. Continuous delivery necessitates a robust testing and monitoring strategy. Make use of test pyramids for both code and data for economical and timely quality assurance.

Risk and governance roles

19. Privacy and data security are the foundation on which everything else is built.

20. Generative AI solutions, like other AI solutions, may also perpetuate harmful content, biases or correlations in their historical training data.

21. Understand that current generative AI solutions may be subject to some or all of the following legal and ethical issues, depending on their source data, training and deployment as a service: privacy, copyright or other violation regarding collection of training data, outputs that plagiarise or create “digital forgeries” of training data, whether the aggregation and intermediation of individual creators at scale is monopoly behaviour and whether original creators should be compensated, that training data may include harmful content (which may be replicated into harmful outputs), that people may have been exposed to harmful content in a moderation process, and that storing data and the compute for training and inference may have substantial environmental costs.

22. Develop strategies to address the further structural failure modes of AI solutions, such as: misalignment with user goals, deployment into ethically unsound applications, the issue of illusory progress where small gains may look promising but never cross the required threshold, the magnification of rare failures at scale and the resolution of any liability for those failures.


These are the type of role-based considerations I alluded to in Reasoning About Machine Creativity. The list is far from complete, and the reader would doubtless benefit from sources and references! I intended to write this post in one shot, which I did in 90 minutes while hitting the target 22 rules without significant editing, so I will return after some reflection. Let me know if these considerations are helpful in your roles.

Reasoning About Machine Creativity

With the current interest in generative AI, I wanted to write a short post updating the framing I took in my older talk Reasoning About Machine Intuition (2017), which was intended for broad audiences to understand the impact and best application of AI solutions from multiple digital delivery perspectives.

Bicycles and automobiles share some features and are used for many of the same tasks, but have important differences that must be considered by transport planners. Recently, electric bikes have created another distinct mobility category that nonetheless shares some elements with existing categories. So it is with AI solutions. While AI may share some features of human intelligence and be suitable for some of the same tasks, understanding the differences is crucial for digital professionals to be able to reason about their capabilities and applicability.

Skip right ahead to machine creativity if you want! Also listen to my podcast on Creative AI for another perspective.

Machine intuition

Considering that products and features introduced in the ML boom of the late 2010s allowed sufficiently good decisions to be made on complex data without precisely specified rules (e.g., image classification), I chose to characterise these solutions as “machine intuition”, in order to highlight that their narrow artificial intelligences were most comparable to human intuition. However, important differences remain. And of course I used “reasoning” in the title to highlight the capability of human intelligence that wasn’t present in these solutions.

Diagram illustrating a large number of emojis feeding into a decision node

Similarities to human intuition

Opportunities, tasks or problems amenable to both approaches share these characteristics:

  • Good decisions will be made, based on ambiguous inputs, but mistakes will also be made,
  • The approach is useful if solutions make enough good decisions in aggregate for a given context, and the volume and nature of mistakes is tolerable,
  • The decisions may have limited explainability, even if explainability important,
  • The decisions are based on past experience and therefore subject to bias.

(NB there are many examples of particularly egregious, discriminatory and harmful mistakes that were not detected or considered prior to release of AI solutions, and that the understanding of what constitutes a mistake, in addition to whether the decision itself is structurally discriminatory, must consider many ethical dimensions.)

Differences from human intuition

If a machine intuition approach looks suitable based on the characteristics above, we must also consider the differences below:

  • The artificial intelligence remains narrow – it can only perform one specific task and only to the degree permitted by its training data. This is different to a human, who can easily generalise to a related task or accommodate new data. However, the same or similar data may be sliced multiple ways to support multiple related narrow tasks, and individual solutions are composable – maybe embarrassingly so – and composable with other digital services, all of which may substitute as a limited form of generality.
  • Machine intuition requires vastly more training instances (many more even than any human expert might see in a lifetime) and concomitantly more computing power than human intuition. NB. These training instances also must be presented in a specific format and are also typically labelled by humans! In contrast, human intuition may only need a handful of examples, and can fall back on reasoning or inference from related experience if direct intuition fails (generalisation again). However, machines may be trained on a volume of data that no human could consume, and any trained model can be reproduced and deployed almost infinitely, so at some scale, low variable cost may compensate for high fixed cost.
  • Machine intuition is possible at superhuman scales, in particular volume of data or requests, and speed of inference. For instance, translating all of Wikipedia in fractions of a second. Machine intuition may also exceed functional human performance at the relevant task, though effective measurement of this must carefully consider the task definition and potential for bias.
  • Machine intuition will fail in some proportion of predictions as a matter of course (though we assume this is manageable) and is also subject to weird/trivial (adversarial) failure modes, such as changing a single pixel, that humans are generally robust to. Mistakes at scale from a single centralised ML solution may also be less acceptable than the aggregate mistakes made by many independent humans.

Anyone involved in delivery of AI solutions should keep these basic factors in mind in order to reason about product and engineering concerns. There is more to consider, but this is a good starting point.

Machine creativity

Considering the current generative AI boom, I think of these solutions as “machine creativity” in order to highlight that their narrow artificial intelligences are most comparable to human creativity in a given medium. However, important differences remain.

Diagram illustrating a single creative spark generating a large number of emoji

Creativity for our purposes is taking some simple input and creating a complex output from the input, an output that also incorporates other ideas, knowledge and techniques beyond the input. That output may be almost any form of digital content, from natural language text, to code, to images, to music, to movies, to 3D scenes, to animated 3D movies. AI that is embodied or with access to manufacturing may also exhibit creativity in the real world, through the materialisation of digital designs.

Some applications of generative AI look more like search, databases, or even back-ends, but they are like our creative reference in that they produce complex outputs from simple inputs, and by similar mechanisms.

(NB legal and ethical issues remain to be resolved with respect to some current mechanisms available to machine creativity to incorporate external ideas, knowledge and techniques. These include: copyright and potential for plagiarism, safety of input and output content and safety of human moderators, attribution and compensation for original creators, and so on.)

Similarities to human creativity

Opportunities, tasks or problems amenable to both approaches share these characteristics:

  • There is not a single “right” answer, multiple answers will suffice and may even be desirable to generate valuable options to pursue,
  • Assessing the goodness of the outputs may include some degree of subjectivity,
  • There may be surprising or non-obvious elements in the output, and again this may be desirable, or risky, or both,
  • The process is likely iterative, with multiple rounds of review and editing.

Differences from human creativity include

If a machine creativity approach looks suitable based on an application being sympathetic to the characteristics above, we must also consider the differences below:

  • The artificial intelligence has no agency or intent in its creativity, it simply processes inputs to generate outputs that are likely or typical based on its training data, described as “next token prediction” (where a token is an element of text, or patch of an image, etc). This may also appear as misalignment or the generation of unsafe content, which can be difficult to detect or control currently.
  • The artificial intelligence has no logically consistent model of the world. The outputs it generates have a high probability of following the prompt, but are not necessarily logically consistent with the prompt or even internally consistent, which can lead to articulate but nonsensical, incorrect or harmful answers. (i.e., It’s also missing the reasoning which is absent from intuition.)
  • The artificial intelligence remains narrow. It performs one generative task but it does not subject the output to a reasoned review or critique, as might be performed by a human to detect error. However, it is again composable, and tests could be applied after the generative step, though these too are fallible. There are many examples of creative AI tool-chains being shared by human creators to support complex creative workflows.
  • Machine creativity also requires more training instances, but is similarly almost infinitely reproducible for creating outputs. Leveraging current tools which include third party training data, it is important to understand the provenance of those training instances – whether they were used with permission, whether they were curated in an ethical manner, and so on.
  • There is by default no explicit attribution of influences on an output, although this is an area of focus and may be improved directly in creative systems or by hybrid means.
  • Machine creativity is also possible at superhuman scales of speed and volume
  • Machine creativity is also subject to weird/trivial adversarial attacks, such as prompt injection


As I’ve been guided by the set of machine intuition considerations above for a number of years, this is the initial set of considerations that I will take forward when considering applications for machine creativity, though I will continue to review their relevance in light of future developments.

In future, I’d like to address out these considerations more specifically by the various roles in a digital delivery organisation, as per the original talk.

Nerfing along

NeRFs provide many benefits for 3D content: the rendering looks natural while the implementation is flexible. So I wanted to get hands on, and build myself a NeRF. I wanted to understand what’s possible to reproduce in 3D from just a spontaneous video capture. I chose a handheld holiday video from an old iPhone X while cycling on beautiful Maria Island.

Video taken while cycling on Maria Island

The camera moves along a fairly straight path, pointing a little right of the direction of travel. This contrasts with NeRFs or scans of objects, where the camera may do one or more full orbits of the object to get every perspective and thus produce seamless renders and clean models. I expect 3D generated from the video above will be missing some detail.

My aim was to build a NeRF from the video, render alternative camera paths, explore the generated geometry, and understand the application and limitation of the results. Here’s the view from one alternative camera path, which follows the original path at first, and then swings out to the side.

Alternative camera path rendered from NeRF

Worfklow overview

I used nerfstudio via their Colab notebook running on Colab Pro with GPU to render the final and intermediate products. The table below lists the major stages, tools and products.

Process video datans-process-data video via COLMAPImages of each frame (png) with inferred camera poses (json)
Train NeRFns-train nerfactoNeRF configuration data including final model checkpoint (ckpt)
Define camera pathsnerfstudio viewerCamera path definition based on keyframes (json)
Render videosns-renderNovel video of the NeRF scene (mp4)
Export geometryns-export pointcloudPoint cloud with surface colour and estimated normals (ply)
Consume geometryMeshlabVisualised pointcloud
nerfstudio workflow overview

For reference, I consumed about 3 Colab Pro “compute units” with one end-to-end train and render (6s 480p 60fps video), but including running the install steps (for transient runtimes) and doing multiple renders on different paths has consumed about 6 “compute units” per NeRF.

Workflow details

Here’s a more detailed walkthrough. There are lots of opportunities to improve.

Process video data

This stage produces a set of images from the video, corresponding to each requested frame, and uses COLMAP to infer the pose of each image. The video was 480p and 6s at 60fps. This processed data is suitable for training a NeRF. The result is visualised below in the nerfstudio viewer.

Posed video frames

I used the `sequential option for video but haven’t evaluated any speedup. I’m not having much luck with specifying the number of frames via the command line parameter either. The resultant files could be zipped and stored outside the Colab instance (locally or on Drive) for direct input to the training stage.

Train NeRF

The magic happens here. The nerfstudio viewer provides live exploration of the radiance field as it is progressively refined through training. The landscape was recognisable very early on in the training process and it was hard to discern improvements in the later stages (at least when using the viewer interactively).

The trained model can also be zipped and stored outside the Colab instance for direct input into later stages.

Define camera paths

I defined one camera path to initially follow the camera’s original trajectory and then deviate significantly to show alternative perspectives and test the limits of scene reconstruction. This path is shown below.

Deviating camera path

I also defined a second path that reversed the original camera trajectory. I downloaded these camera paths for reuse.

Render videos

Rendering the deviating path (video above), the originally visible details are recreated quite convincingly. Noise is visible when originally hidden details are exposed, and also generally around the edges of the frame. I would like to try videos from cameras with a wider field of view to see how much more of the scene they capture.

The second, reversed, path (below) also faithfully reconstructs visible objects, but with some loss of fidelity due to the reversed camera position, and displays more of noise outside the known scene.

Reversed camera path rendered from NeRF

Export geometry

I ran ns-export pointcloud and chose to add estimated normals to the export. I downloaded the ply file to work with it locally.

Consume geometry

Meshlab provides a nice visualisation of the point cloud out of the box, including the colour of each point and shading by estimated normal, as below.

Meshlab visualisation of exported point cloud

Meshlab provides a wide range of further processing tools, such as surface reconstruction. I also tried FreeCAD and Blender. Both imported and displayed the point cloud but I couldn’t easily tune the visualisation to look as good as above.

Next steps

I’d like to try some more videos, and explore how to better avoid noise artefacts in renders.

Synthesising Semantle Solvers

Picking up threads from previous posts on solving Semantle word puzzles with machine learning, we’re ready to explore how different solvers might play along with people while playing the game online. Maybe you’d like to play speed Semantle against an artificially intelligent opponent, maybe you’d like a left-of-field hint on a tricky puzzle, or maybe it’s just fun to spectate at a cerebral robot battle.

Animation of a Semantle game from initial guess to completion

Substitute semantics

The solvers have a view of how words relate due to a similarity model that is encapsulated for ease of change. To date, we’ve used the same model as live Semantle, which is word2vec. But as this might be considered cheating, we can now also use a model based on the Universal Sentence Encoder (USE), to explore how the solvers perform with separated semantics.

Solver spec

To recap, the key elements of the solver ecosystem are now:

  • SimilarityModel – choice of word2vec or USE as above,
  • Solver methods (common to both gradient and cohort variants):
    • make_guess() – return a guess that is based on the solver’s current state, but don’t change the solver’s state,
    • merge_guess(guess, score) – update the solver’s state with information about a guess and a score,
  • Scoring of guesses by either the simulator or a Semantle game, where a game could also include guesses from other players.
Diagram illustrating elements of the solver ecosystem. Similarity model initialises solver state used to make guesses, which are scored by game and update solver state with scores. Other players can make guesses which also get scored

It’s a simplified reinforcement learning setup. Different combinations of these elements allow us to explore different scenarios.

Solver suggestions

Let’s look at how solvers might play with people. The base scenario friends is the actual history of a game played with people, completed in 109 guesses.

Word2Vec similarity

Solvers could complete a puzzle from an initial sequence of guesses from friends. Both solvers in this particular configuration generally easily better the friends result when primed with the first 10 friend guesses.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Cohort finishes with fewest guesses, then gradient, then friends, with clear separation.

Solvers could instead make the next guess only, but based on the game history up to that point. Both solvers may permit a finish in slightly fewer guesses. The conclusion is that these solvers are good for hints, especially if they are followed!

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Cohort finishes with fewest guesses, then gradient, then friends, with marginal differences.

Maybe these solvers using word2vec similarity do have an unfair advantage though – how do they perform with a different similarity model? Using USE instead, I expected the cohort solver to be more robust than the gradient solver…

USE similarity

… but it seems that the gradient descent solver is more robust to a disparate similarity model, as one example of the completion scenario shows.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Gradient finishes with fewest guesses, then friends, then cohort, and the separation is clear.

The gradient solver also generally offers some benefit making a suggestion for just the next guess, but the cohort solver’s contribution is marginal at best.

Line chart comparing three irregular but increasing lines that represent the sequence of scores for guesses in a semantle game. The three lines are labelled friends, cohort, and gradient. Gradient finishes with fewest guesses, then friends, and cohort doesn't finish, but the differences are very minor.

These are of course only single instances of each scenario, and there is significant variation between runs. It’s been interesting to see this play out interactively, but a more comprehensive performance characterisation – with plenty of scope for understanding the influence of hyperparameters – may be in order.

Solver solo

The solvers can also play part or whole games solo (or with other players) in a live environment, using Selenium WebDriver to submit guesses and collect scores. The leading animation above is gradient-USE and a below is a faster game using cohort-word2vec.

Animation of a Semantle game from initial guess to completion

So long

And that’s it for now! We have multiple solver configurations that can play online by themselves or with other people. They demonstrate how people and machines can collaborate to each bring their own strengths to solving problems; people with creative strategies and machines with a relentless ability to crunch through possibilities. They don’t spoil the fun of solving Semantle yourself or with friends, but they do provide new ways to play and to gain insight into how to improve your own game.

Postscript: seeing in space

Through all this I’ve considered various 3D visualisations of search through a semantic space with hundreds of dimensions. I’ve settled on the version below, illustrating a search for target “habitat” from first guess “megawatt”.

An animated rotating 3D view of an semi-regular collection of points joined by lines into a sequence. Some points are labelled with words. Represents high-dimensional semantic search in 3D.

This visualisation format uses cylindrical coordinates, broken out in the figure below. The cylinder (x) axis is the projection of each guess to the line that connects the first guess to the target word. The cylindrical radius is the distance of each guess in embedding space from its projection on this line (cosine similarity seemed smoother than Euclidian distance here). The angle of rotation in cylindrical coordinates (theta) is the cumulative angle between the directions connecting guess n-1 to n and n to n+1. The result is an irregular helix expanding then contracting, all while twisting around the axis from first to lass guess.

Three line charts on a row, with common x-axis of guess number, showing semi-regular lines, representing the cylindrical coordinates of the 3D visualisation. The left chart is x-axis, increasing from 0 to 1, middle is radius, from 0 to ~1 and back to 0, and right is angle theta, increasing from 0 to ~11 radians.

Second Semantle Solver

In the post Sketching Semantle Solvers, I introduced two methods for solving Semantle word puzzles, but I only wrote up one. The second solver here is based the idea that the target word should appear in the intersection between the cohorts of possible targets generated by each guess.

Finding the semantle target through overlapping cohorts. Shows two intersecting rings of candidate words based on cosine similarity.

To recap, the first post:

  • introduced the sibling strategies side-by-side,
  • discussed designing for sympathetic sequences, so the solver can play along with humans, with somewhat explainable guesses, and
  • shared the source code and visualisations for the gradient descent solver.

Solution source

This post shares the source for the intersecting cohorts solver, including notebook, similarity model and solver class.

The solver is tested against the simple simulator for semantle scores from last time. Note that the word2vec model data for the simulator (and agent) is available at this word2vec download location.

Stylised visualisation of the search for a target word with intersecting  cohorts. Shows distributions of belief strength at each guess and strength and rank of target word

The solver has the following major features:

  1. A vocabulary, containing all the words that can be guessed,
  2. A semantic model, from which the agent can calculate the similarity of word pairs,
  3. The ability to generate cohorts of words from the vocabulary that are similar (in Semantle score) to a provided word (a guess), and
  4. An evolving strength of belief that each word in the vocabulary is the target.

In each step towards guessing the target, the solver does the following:

  1. Choose a word for the guess. The current choice is the word with the strongest likelihood of being the target, but it could equally be any other word from the solver’s vocabulary (which might help triangulate better), or it could be provided by a human player with their own suspicions.
  2. Score the guess. The Semantle simulator scores the guess.
  3. Generate a cohort. The guess and the score are used to generate a new cohort of words that would share the same score with the guess.
  4. Merge the cohort into the agent’s belief model. The score is added to the current belief strength for each word in the cohort, providing a proxy for likelihood for each word. The guess is also masked from further consideration.

Show of strength

The chart below shows how the belief strength (estimated likelihood) of the target word gradually approaches the maximum belief strength of any word, as the target (which remains unknown until the end) appears in more and more cohorts.

Intersecting cohorts solver. Line chart showing the belief strength of the target word at each guess in relation to the maximum belief strength of remaining words.

We can also visualise the belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words.

Chart showing the cohort solver belief strength across the whole vocabulary at each guess, and the path the target word takes in relation to these distributions, in terms of its absolute score and its rank relative to other words

Superior solution?

The cohort solver can be (de)tuned to almost any level of performance by adjusting the parameters precision and recall, which determine the tightness of the similarity band and completeness of results from the generated cohorts. The gradient descent solver has potential for tuning parameters, but I didn’t explore this much. To compare the two, we’d therefore need to consider configurations of each solver. For now, I’m pleased that the two distinct sketches solve to my satisfaction!