Rebooting AI Review

I was excited to read Rebooting AI (website), hoping to find inspiration and tools for doing things better. Here is the book in one great quote:

For now, we are in a kind of interregnum: narrow but networked intelligences with autonomy, but too little genuine intelligence to be able to reason about the consequences of that power.

There is a lot to like. Marcus & Davis clearly map out the history and landscape of AI challenges, plus plausible elements of future solutions. They provide useful tools for thinking about problems with partial solutions to intelligence, such as the “fundamental overattribution error” and the “illusory progress gap”. They show how current ML solutions based on big data can be opaque and brittle. They demonstrate how key attributes of human intelligence instead allow the development of rich cognitive models – such as how language and the real world work – and how solutions incorporating such models would address current shortcomings, enabling AI to tackle open-ended tasks. This is great material for a general reader.

Where I felt the book fell short was that it didn’t build many bridges between our current “narrow but networked intelligences” and the authors’ posited future state capabilities. The future state reads like Artificial General Intelligence (AGI) by another name, fleshed out by scenarios that are short on implementation detail. Though sometimes mundane from our current perspective, Arthur C. Clarke might describe them as “indistinguishable from magic”, and hence Rodney Brooks would say they are “no longer falsifiable”. We know there’s a massive chasm between current ML solutions and AGI, but I didn’t find much to close or bridge that chasm in Rebooting AI.

Some of these future capabilities are illustrated by domain-specific modelling techniques – like formal logic – that would be familiar to many computer science students. But I found this a little incongruous because these techniques have also failed to deliver on promises of realising intelligence, and have done no more to squash the “long tail of edge cases” than other narrow intelligences. Given the diverse facets of intelligence, maybe the paradigm of “narrow but networked intelligences” is the best way to achieve or approximate intelligence, or maybe it’s ultimately illusory progress, but these illustrations didn’t help me resolve that.

There is undeniable value in the current generation of ML solutions. How do we build on these? A detailed analysis of key avenues for short- to medium-term progress was lacking. For instance, starting with current ML solutions, the authors could have explored:

  • various designs of hybrid human-machine decision-making systems that augment human abilities while remaining resilient to new scenarios that stump machines;
  • transfer learning, few-shot learning and sophisticated representation learning such as transformers, which have the potential to increase the representational and reasoning power of solutions;
  • the role of ecosystem design and governance, including ongoing monitoring and data curation to correct issues (for instance bias testing, CD4ML, etc).

Instead, ML was stereotyped as fully automated, tabula rasa, and end-to-end.

Finally, to know things are getting better, we need the right baseline and measures. While the language examples clearly demonstrated superficial artificial understanding, and self-driving vehicles have some way to go, some issues raised were not assessed against incumbent human capabilities on narrow tasks in a like-for-like comparison, but rather against the posited capabilities of a future AGI system. I would agree that humans can individually reflect and introspect to recognise their mistakes, but it is still the case that, in operational scenarios, humans make mistakes just as artificial systems do. These operational mistakes are moderated by the wider ecosystem in which humans operate, in the same way that predictive inference is moderated by a wider human-machine ecosystem. I felt the core issue in some instances was structurally unsound or overly concentrated decision-making without proper governance, rather than whether or not mistakes were made, and this confounded the analysis. I would have liked to see these factors teased out so comparisons could be made in a way that would help to measure progress.

Marcus & Davis do lay out a helpful framework for building trust in AI systems, including stress testing, understanding costs of failures, building in modularity and maintainability, etc. This is good guidance, but it would be really helpful to see more detail or case studies under these headlines, with the specificity of other works like Weapons of Math Destruction and Made by Humans.

So, maybe I was hoping for “Refactoring AI” rather than “Rebooting AI”. The book certainly describes clearly the problems with the current state, and desirable characteristics of the future state. On balance, the technical arguments may indeed be sounder than my concerns. If you’re curious, I would encourage you to read it and draw your own conclusions. Ultimately, however, I’m disappointed because I didn’t leave inspired and equipped with new insight and new tools for improving AI today, tomorrow, and the day after.

The Lockdown Wheelie Project, Part 3

In Melbourne’s COVID-19 lockdown, I’ve wheelied over 17km. Not all at once, though.

Over three months, I’ve spent 90 minutes with my front wheel raised. I’d like to keep it up but, as lockdown has gradually relaxed and routines have changed, I’ve landed the wheelie project, for now.

Read the full article over on Medium at The Lockdown Wheelie Project, Part 3.

Data-Driven Responses to Changing Behaviour, auf Deutsch

I’m pleased to see the German translation of the article I wrote with Sue Visic now live on Digitale Welt magazine: Mit Datenanalyse schnell auf Nachfragewandel reagieren.

This is translated from the original article Data-driven responses to new patterns of customer behaviour, published on ThoughtWorks Insights, 16 April 2020.

More Sankey for Less Confusion?

Confusion matrices are essential for evaluating classifiers but, for some who are new to them, they can cause, well, confusion.

Sankey diagrams are an alternative way of representing matrix data, and I’ve found that some people who are new to matrix data – such as business domain experts who are not experienced data scientists – find them easier to understand. Also, some machine learning researchers find Sankey diagrams useful for analysing data and classifiers.

So, I have posted simple code for visualising classifier evaluation or comparisons as Sankey diagrams. Maybe it will be useful for others, as well as fun for me.

The code combines large portions of Plotly Sankey Diagrams with the essence of scikit-learn’s confusion matrix and lashings of list-comprehension code golf.
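
To give a flavour of the approach, here is a minimal sketch (my illustration, not the repository code) that renders a binary confusion matrix as a Plotly Sankey diagram, flowing from actual class to predicted class:

```python
# Minimal sketch: a binary confusion matrix as a Sankey diagram.
import plotly.graph_objects as go
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0, 1, 0, 1, 1]   # toy ground truth
y_pred = [0, 1, 1, 1, 0, 0, 1, 0, 1, 0]   # toy predictions

cm = confusion_matrix(y_true, y_pred)      # rows: actual class, columns: predicted class
labels = ["actual 0", "actual 1", "predicted 0", "predicted 1"]

# One link per confusion matrix cell: source = actual node, target = predicted node.
n_classes = cm.shape[0]
sources = [i for i in range(n_classes) for _ in range(n_classes)]
targets = [n_classes + j for _ in range(n_classes) for j in range(n_classes)]
values = cm.flatten().tolist()

fig = go.Figure(go.Sankey(
    node=dict(label=labels, pad=20),
    link=dict(source=sources, target=targets, value=values),
))
fig.show()
```

The same pattern generalises to more classes or to champion-challenger comparisons by adding nodes and links.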

The scenarios supported are:

  1. Evaluating a binary classifier against ground truth or as champion-challenger,
  2. Evaluating a multi-class classifier against ground truth or as champion-challenger,
  3. Comparing multiple versions of a binary classifier, for instance over time, or hyper-parameter sweeps, and
  4. Comparing multiple versions of a multi-class classifier.
Example confusion matrices as Sankey diagrams

See the code on GitHub.

Applying Software Engineering Practices to Data Science

I had fun recording this podcast on Applying Software Engineering Practices to Data Science with Zhamak Dehghani, Mike Mason and Danilo Sato.

The need for high quality information at speed has never been greater thanks to competition and the impact of the global pandemic. Here, our podcast team explores how data science is helping the enterprise respond: What new tools and techniques show promise? When does bias become a problem in data sets? What can DevOps teach data scientists about how to work?

ThoughtWorks Technology Podcasts

The Lockdown Wheelie Project, Part 2

I now have an AI coach for my wheelie project. Coach has seen over 1,500 of my wheelies, and reckons they can tell pretty quickly whether my effort will be wheelie good or bad. Coach also fits on my phone, so they come on rides when I want real-time advice.

Read the full article over on Medium at The Lockdown Wheelie Project, Part 2.

ML Interpretability with Ambient Visualisations

I produced some ambient visualisations as background to short talks on the topic of Interpreting the Black Box of ML from ThoughtWorks Technology Radar Volume 21. The talks were presented in breaks at the YOW Developer Conference.

Animation of linear to non-linear model selection

Here are my speaker notes.

Theme Intro

The theme I’m talking about is Interpreting the Black Box of ML.

It’s a theme because the radar has a lot of ML blips – those are the individual tools, techniques, languages and frameworks we track, and they all have an aspect of interpretability.

I’m going to talk first about Explainability as a First Class Model Concern.

Explainability as a First Class Model Concern

ML models make predictions. They take some inputs and predict an output, based on the data they’ve been trained on. Without careful thought, those predictions can be black boxes.

For example – predicting whether someone should be offered credit. A few people at the booth have mentioned this experience: “[the] black box algorithm thinks I deserve 20x the credit limit [my wife] does” – and the difficulty in getting an explanation from the provider [this was a relevant example at the time].

Elevated to a first class concern, however, ML predictions are interpretable and explainable to different degrees – it’s not actually a question of black box or white box, but many shades of grey.

Spectrum

Interpretable means people can reason about a model’s decision-making process in general terms, while explainable means people can understand the factors that led to a specific decision. People are important in this definition – a data scientist may be satisfied with the explanation that the model minimises total loss, while a declined credit applicant probably requires and deserves a reason code.

And those two extremes can anchor our spectrum – at one end we can explain a result as a general consequence of ML, at the other end explaining the specific factors that contributed to an individual decision.

Dimensions – What

As dimensions of explainability, we should consider:

  • The choice of modelling technique as intrinsically explainable
  • Model-agnostic explainability techniques
  • Whether global or just local interpretability is required

Considering model selection – a decision tree is intrinsically explainable – factors contribute sequentially to a decision. A generic deep neural network is not. However, in between, we can architect networks to use techniques such as embeddings, latent spaces or transfer learning, which create representations of inputs that are distinct and interpretable to a degree, but not always in human terms.
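
As an illustration of the intrinsically explainable end of the spectrum, here is a minimal sketch (my example, using scikit-learn, not something from the talk) that prints a shallow decision tree as human-readable rules:

```python
# Minimal sketch: a shallow decision tree can be read directly as rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_iris()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# export_text prints the sequence of feature thresholds that leads to each leaf.
print(export_text(tree, feature_names=list(data.feature_names)))
```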

And so model-specific explainability relies on the modelling technique, while model-agnostic techniques can be applied empirically to any model. We can create surrogate explainable models for any given model, such as a wide network paired with a deep network, and we can use ablation to explore the effect of changing inputs on a model’s decisions.
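
As a sketch of the model-agnostic idea, the snippet below (my illustration, not from the talk) trains a shallow decision tree as a global surrogate that imitates an opaque random forest’s predictions:

```python
# Minimal sketch: a global surrogate model for an opaque classifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

black_box = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# The surrogate learns to imitate the black box, not the original labels.
surrogate = DecisionTreeClassifier(max_depth=3).fit(X, black_box.predict(X))

# Fidelity: how often the surrogate agrees with the black box.
print("surrogate fidelity:", (surrogate.predict(X) == black_box.predict(X)).mean())
```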

For a given decision, we might only wish to understand how that decision would have been different had the inputs changed slightly. In this case we are only concerned about local interpretability and explainability, but not the model as a whole, and LIME is an effective technique.
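
Here is a minimal sketch of a local explanation, assuming the lime package is installed (illustrative only, not from the talk):

```python
# Minimal sketch: explain a single prediction with LIME.
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
model = RandomForestClassifier(random_state=0).fit(data.data, data.target)

explainer = LimeTabularExplainer(
    data.data,
    feature_names=data.feature_names,
    class_names=list(data.target_names),
    discretize_continuous=True,
)

# Which features pushed this one prediction in which direction?
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=4)
print(explanation.as_list())
```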

Reasons – Why

As broader business concerns, we should care about explainability because:

  • Knowledge management is crucial for organisations – an interpretable model, such as the Glasgow Coma Scale, may be valued more for people’s ability to use it than its pure predictive performance
  • We must comply with local laws, and it is in all stakeholders’ interests that we act ethically
  • And finally, models can always make mistakes, so a challenge process must be considered, especially as vulnerable people are disproportionately subject to automated decision making

And explainability is closely linked to ethics, and hence the rise of ethical bias testing.

Ethical Bias Testing

Powerful, but Concerning

There is rising concern that powerful ML models could cause unintentional harm. For example, a model could be trained to make profitable credit decisions by simply excluding disadvantaged applicants. So we’re seeing a growing interest in ethical bias testing that will help to uncover potentially harmful decisions, and we expect this field to evolve over time.

Measures

There are many statistical measures we can use to detect unfairness in models. These measures compare outcomes for privileged and unprivileged groups under the model. If we find a model is discriminating against an unprivileged group, we can apply various mitigations to reduce the inequality.  

  • Equal Opportunity Difference is the difference in true positive rates between an unprivileged group and a privileged group. A value close to zero is good.
  • Disparate Impact is the ratio of selection rates between the two groups. The selection rate is the number of individuals selected for the positive outcome divided by the total number of individuals in the group. The ideal value for this metric is 1. (Both measures are sketched in code below.)
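
Here is a minimal sketch of both measures, computed directly from predictions, ground truth and a binary protected attribute (the variable names and toy data are illustrative, not from any particular library):

```python
# Minimal sketch: Equal Opportunity Difference and Disparate Impact.
import numpy as np

def equal_opportunity_difference(y_true, y_pred, protected):
    """TPR of the unprivileged group minus TPR of the privileged group."""
    def tpr(mask):
        positives = (y_true == 1) & mask      # assumes each group has some positives
        return (y_pred[positives] == 1).mean()
    return tpr(protected == 0) - tpr(protected == 1)   # 0 = unprivileged, 1 = privileged

def disparate_impact(y_pred, protected):
    """Selection rate of the unprivileged group divided by that of the privileged group."""
    return (y_pred[protected == 0] == 1).mean() / (y_pred[protected == 1] == 1).mean()

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])      # toy labels
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 1])      # toy predictions
protected = np.array([0, 0, 0, 0, 1, 1, 1, 1])   # toy group membership

print(equal_opportunity_difference(y_true, y_pred, protected))  # close to 0 is good
print(disparate_impact(y_pred, protected))                      # close to 1 is good
```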

These are just two examples of more than 70 different metrics for measuring ethical bias. Choosing which measure or measures to use is itself an ethical decision, and is affected by your goals. For example, there is the choice between optimising for similarity of outcomes across groups or trying to optimise so that similar individuals are treated the same. If individuals from different groups differ in their non-protected attributes, these could be competing goals.

Correction

To correct for ethical bias or unfairness, mitigations can be applied to the data, to the process of generating the model, and to the output of the model.

  • Data can be reweighted to increase fairness, before the model is trained (sketched in code below).
  • While the model is being generated, it can be penalised for ethical bias or unfairness.
  • Or, after the model is generated, its output can be post-processed to remove bias.
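
As a sketch of the first option, the snippet below computes per-instance weights so that group membership and outcome look statistically independent, in the spirit of reweighing schemes such as Kamiran & Calders (my illustration, not a specific library API):

```python
# Minimal sketch: reweight data so group and label appear independent.
import numpy as np

y = np.array([1, 0, 1, 1, 0, 1, 0, 0])        # toy labels
group = np.array([0, 0, 0, 0, 1, 1, 1, 1])    # toy protected attribute

weights = np.empty(len(y), dtype=float)
for g in np.unique(group):
    for label in np.unique(y):
        mask = (group == g) & (y == label)
        # Expected proportion if group and label were independent,
        # divided by the observed proportion of this combination.
        expected = (group == g).mean() * (y == label).mean()
        observed = mask.mean()
        weights[mask] = expected / observed

# These weights can then be passed to most scikit-learn estimators
# via the sample_weight argument of fit().
print(weights)
```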

As with explainability, the process of removing ethical bias or improving fairness will likely reduce the predictive performance or accuracy of a model; however, there is a continuum of possible trade-offs.

What-if Tool

What is What if

I mentioned tooling is being developed to help with explainability and ethical bias testing, and you should familiarise yourself with these tools and the techniques they use. One example is the What-If Tool, launched by the Google PAIR lab – an interactive visual interface designed to help you dig into a model’s behaviour and understand more about the predictions it makes.

Features

You can do things like:

  •  Compare models to each other
  •  Visualize feature importance
  •  Arrange datapoints by similarity
  •  Test algorithmic fairness constraints
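
Here is a minimal sketch of launching the tool in a notebook, assuming the witwidget package; the feature names, toy examples and placeholder predict function are all illustrative:

```python
# Minimal sketch: launch the What-If Tool on a list of tf.Example records.
import tensorflow as tf
from witwidget.notebook.visualization import WitConfigBuilder, WitWidget

def make_example(features, label):
    # Pack a dict of numeric features plus a label into a tf.Example.
    feature_map = {
        name: tf.train.Feature(float_list=tf.train.FloatList(value=[value]))
        for name, value in features.items()
    }
    feature_map["label"] = tf.train.Feature(int64_list=tf.train.Int64List(value=[label]))
    return tf.train.Example(features=tf.train.Features(feature=feature_map))

examples = [
    make_example({"age": 39.0, "hours_per_week": 40.0}, 1),
    make_example({"age": 25.0, "hours_per_week": 20.0}, 0),
]

def predict_fn(examples_batch):
    # Placeholder: return [P(class 0), P(class 1)] for each example.
    return [[0.4, 0.6] for _ in examples_batch]

config = WitConfigBuilder(examples).set_custom_predict_fn(predict_fn)
WitWidget(config, height=600)  # renders the interactive tool in the notebook
```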

Risk

But by themselves, tools like this won’t give you explainability or fairness, and using them naively won’t remove the risk or minimise the damage done by a misapplied or poorly trained algorithm. They should be used by people who understand the theory and implications of the results. However, they can be powerful tools to help communicate, tell a story, make the specialised analysis more accessible, and hence motivate improved practice and outcomes.

CD4ML

The radar also mentions CD4ML for the second time – using Continuous Delivery practices for delivering ML solutions. CD in general encourages solutions to evolve in small steps, and the same is true for ML solutions. The benefit of this is that we can more accurately identify the reasons for any change in system behaviour when it results from small changes in design or data. So we also highlight CD4ML as a technique for addressing explainability and ethical bias.