Slackometer Hello World

Project Slackpose gives me one more excuse for hyperlocal exercise and number crunching in lockdown. Last time, I briefly touched on balance analysis. This time, I look at tracking slackline distance walked with my newly minted slackometer.

Inferring 3D Position

I’m working only with 2D pose data (a set of pixel locations for body joints) from my single camera angle, but I can infer something about my 3D location – distance from the camera (d) – using the pinhole camera model and further data such as:

  1. Camera lens & sensor geometry, plus known real world distance between pose keypoints (geometric approach)
  2. Consistent pose data (eg same person, similar pose) acquired at known distances from the camera (regression approach)

In this first iteration, shown in the notebook, I boil all the pose data down into a single feature: the vertical pixel distance (v) between the highest tracked point (eg, an eye) and the lowest tracked point (eg, a toe). I base the calculation of distance d on this measure. Note that v may understate my full height, because it excludes the crown-to-eye distance and because I typically crouch lower when balancing on a slackline, or overstate it, when a raised arm joint becomes the highest tracked point.
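As a minimal sketch of the vertical extent feature, assume each pose is a list of (x, y, confidence) keypoints in pixel coordinates. That layout, the function name and the confidence threshold are illustrative assumptions, not necessarily what the notebook uses.

```python
import numpy as np

def vertical_extent(keypoints, min_confidence=0.1):
    """Return v, the pixel distance between the highest and lowest tracked points.

    keypoints: rows of (x, y, confidence); this layout is an assumption.
    Keypoints below min_confidence are treated as undetected and ignored.
    """
    kp = np.asarray(keypoints, dtype=float)
    confident = kp[kp[:, 2] >= min_confidence]
    if len(confident) < 2:
        return float("nan")  # not enough joints to measure an extent
    ys = confident[:, 1]
    return float(ys.max() - ys.min())

# Made-up pose for illustration only
pose = [
    (310, 120, 0.9),  # eye
    (305, 260, 0.8),  # hip
    (300, 480, 0.7),  # toe
    (0, 0, 0.0),      # undetected joint, ignored
]
v = vertical_extent(pose)
```

Filtering on confidence matters because an undetected joint reported at (0, 0) would otherwise stretch the extent to the top of the frame.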

Body pose in pixels with vertical extent indicated

Geometric Approach

The geometric approach uses similar triangles. One triangle has sides lens-sensor distance (known) and object pixel height (v) scaled to sensor height (known); the other has sides lens-object distance (d) and object height (roughly known from my height, as above). The resulting equation has the form d = F / v, where F is determined by those known or estimated factors. These geometry factors are specific to each combination of physical camera device and slackline walker.
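The similar triangles relationship can be sketched as below. The parameter names and the example camera values are hypothetical, chosen only to make the arithmetic easy to follow; they are not the figures for my camera.

```python
def geometric_factor(focal_length_mm, sensor_height_mm, image_height_px, object_height_m):
    """Return F such that d = F / v, with d in metres and v in pixels.

    Derivation (pinhole model): the object's height on the sensor is
    v * sensor_height_mm / image_height_px, and by similar triangles
    (height on sensor) / focal_length = object_height / d.
    """
    return focal_length_mm * image_height_px * object_height_m / sensor_height_mm

def distance_from_extent(v_px, F):
    """Distance from the camera in metres for a vertical extent of v_px pixels."""
    return F / v_px

# Hypothetical camera and walker, for illustration only
F = geometric_factor(
    focal_length_mm=4.0,
    sensor_height_mm=4.8,
    image_height_px=1080,
    object_height_m=1.6,
)
d = distance_from_extent(360, F)  # about 4 m with these made-up values
```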

Regression Approach

The regression approach uses pose data collected at known distances from the slackline anchor – 2m, 4m, 6m, and 8m – as shown in the image below. I marked the calibration distances and walked to each point, balancing there for a few seconds, then turned around and walked back through each point to collect data from both back and front views. This approach is consistent across cameras and walkers, and works with varied sets of calibration distances.

Person standing on a slackline with known distances marked

Plotting the value of v against the reciprocal of distance (1 / d), we see a fairly linear relationship. Fitting a line (with 0 intercept) to the data gives a value for 1 / F, which produces a value for F that is very close to that from the geometric approach – a neat result, or convenient choice of parameters!
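A zero-intercept fit of this kind has a closed form. The sketch below regresses 1 / d on v, so the slope estimates 1 / F. The calibration numbers are synthetic and noise-free, invented for illustration, and the function name is an assumption.

```python
import numpy as np

def fit_inverse_factor(v_px, d_m):
    """Least-squares slope k of the line 1/d = k * v through the origin.

    For a zero-intercept fit the closed form is k = sum(x * y) / sum(x * x),
    with x = v and y = 1/d. The slope k estimates 1 / F.
    """
    v = np.asarray(v_px, dtype=float)
    inv_d = 1.0 / np.asarray(d_m, dtype=float)
    return float(np.sum(v * inv_d) / np.sum(v * v))

# Synthetic calibration data at the marked distances (not real measurements)
d_cal = [2.0, 4.0, 6.0, 8.0]
v_cal = [720.0, 360.0, 240.0, 180.0]

k = fit_inverse_factor(v_cal, d_cal)
F_fit = 1.0 / k  # recovers the F used to generate the synthetic data
```

With real pose data, each calibration distance contributes many noisy samples of v, and the same formula applies to the pooled samples.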

charts showing correlation of vertical extent and reciprocal of distance from camera

So we have two methods for calculating distance from pose data, which produce very similar results.

Does It Work?

The figures you see in the video above match up pretty well with the reality. The video shows approximately 10m accumulated distance, while in reality I started about 2m from the camera, walked to just under 8m from the camera, then returned almost to my starting point (say 11.5m max). The discrepancy is most likely explained by an under-estimate of peak distance (at ~7m) due to decrease in precision of pixel measures for distant objects, and noise/pose-dependence in the vertical extent measure.

So this first iteration of the slackometer would be useful for approximating distance walked, and could possibly be improved with higher resolution video and by tracking more body segments, which may also reduce the dependence on smoothing. It would also be useful for comparing distances or speeds. The millimetre precision, however, is misleading; I just chose the unit so it looked better on the odometer display! (I spent some time tuning this for those cascading transitions…)
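One way to turn per-frame distance estimates into an odometer reading is to smooth the series and sum the absolute frame-to-frame changes. This is a sketch under assumptions: the moving average and the window size are illustrative choices, not necessarily the smoothing used in the notebook.

```python
import numpy as np

def smooth(d, window=5):
    """Moving average over the distance series (window=1 leaves it unchanged)."""
    d = np.asarray(d, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(d, kernel, mode="valid")

def distance_walked(d, window=5):
    """Accumulated distance: sum of absolute changes in the smoothed series."""
    return float(np.sum(np.abs(np.diff(smooth(d, window)))))

# Idealised out-and-back walk from 2 m to 8 m and back (synthetic data)
out = np.linspace(2.0, 8.0, 61)
track = np.concatenate([out, out[::-1]])

total_raw = distance_walked(track, window=1)       # no smoothing
total_smoothed = distance_walked(track, window=5)  # smoothing rounds off the turn
```

On this noise-free track, smoothing shortens the total slightly because it flattens the peak at the turnaround. On noisy estimates the effect goes the other way without smoothing: frame-to-frame jitter adds spurious distance, which is why some smoothing is needed.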

A mechanical odometer showing wheels cascading over the numbers 1998, 1999, 2000, 2001

There certainly are other ways you could track distance, from pacing out the line and keeping count of laps, to using image segmentation rather than pose estimation to calculate v, to alternative sensor setups and models, but it was fun to do it this way and helped pass a few more days, and nights, in lockdown.

Light trace in a backyard; a long exposure photo at night that looks like daytime
slacklining at night under a full moon

7 Wastes of Data Production

I realised recently that this is one of the lenses through which I look at the data engineering world, but I had never expressed these (lean) wastes explicitly. This post might be useful for data engineers exploring lean concepts, or lean practitioners trying to make sense of data & analytics processes. This isn’t just a theoretical view; these wastes are real, and have a real impact on organisational success, which I try to quantify. These wastes also impact our ability to do good work, and to enjoy it!

When I talk about data production, I’m talking about building and running a factory that transforms data signals from the world into useful insights, improved operations and great experiences. This means connecting data sources from suppliers through transformations to consumers, who might be customers, team members, or partners.

Data production factory, showing bits flowing from supplier to consumer through a machine. Developers and operators change the factory and keep it running

In this post, I’ll look at lean wastes through the lens of building the factory and running the factory. Building the factory – modifying the processing pathways for data – is a software development exercise. Running the factory – propagating new inputs along processing pathways – is a manufacturing operations exercise, but for data rather than physical products, and hence the manufacturing machines are all software. NB. one team can look through multiple lenses – see data + dev + ops!

Lean wastes (muda) were originally defined with reference to physical manufacturing, though there are analogues for knowledge work, including Mary and Tom Poppendieck’s mapping to software development. So the translation is roughly:

  • the software development analogue for the build phase, and
  • a manufacturing analogue with bits rather than atoms for the run phase

Both are illustrated with examples specific to data. The drawings only show bits flowing through a running factory, but you can imagine ideas flowing through a development team as the equivalent for build.

I talk about data products as the end result of this development and manufacturing (or data production) activity, but I also consider the complementary design or marketing perspective – i.e., what problems do these products solve for a consumer, and how well (regardless of how they are made)? This brings us to…

Overproduction

Overproduction is delivering things consumers don’t need and haven’t asked for.

Build: Unnecessary products, over-designed products
For instance, a report that no-one will read, a 3D widget where a table will do, or a unified data model that doesn’t suit any one consumer.
Run: Unused products
A report that people stopped reading long ago. A data set no-one ever accesses.
Data production factory spewing bits that aren't consumed by a consumer

The consequence of overproduction is that productive capacity is consumed with no business impact. These products are useless and prevent us creating value.

Considering some studies have found 50% of product features are rarely or never used, the cost of overproduction may be 50% of your data budget, but many organisations have very limited visibility of overproduction in data.

Overproduction refers to finished goods, but until finished, they are …

Inventory

Inventory is partially completed work that causes a drain on resources while embodying little or no realisable value.

Build: Work in progress
Batching up development effort on data ingest activities, or platform features, without validating the use cases that motivate this data or functionality.
Run: Incomplete pipelines (data not connected to consumers)
Data delivered to a data platform but not being used.
Data production factory accumulating bits inside the factory

The consequence of inventory is that no value is realised from effort to date, and as a result, unbounded effort may be expended before delivering value.

The cost that can be sunk into building and populating a data platform that is not connected to consumers (representing data production inventory) is, for all intents and purposes, unbounded. Without connecting to consumers, very large initiatives could reach their conclusion with substantial data inventory (which ironically, until close to that point, might be considered a success measure), but marginal to no business value delivered.

A possible cause of failing to deliver finished goods from inventory is the additional effort associated with …

Over-Processing

Over-Processing is doing more work than is necessary to deliver on an objective.

Build: Reinventing products, working with untrusted data
Duplicated, divergent reports. A unified data model that doesn’t suit any one consumer. Excessive logic to manage poor data quality.
Run: Correcting errors from upstream, propagating redundant data
Filling missing data and correcting schema violations with best guesses. Passing on data that isn’t valuable downstream.
Data production factory with many processing machines

The consequence of over-processing is the expenditure of unnecessary effort to realise value. This may feel like everything is harder than it should be.

Any code that exists to correct errors downstream of a source is over-processing, as is any duplicated reporting. Consider how much of this over-processing you may be doing, and what might change if you were to measure this.

Work must move between processing stages, but in doing so, we should minimise …

Transportation

Transportation is moving products around in a way that is costly and may cause damage.

Build: Handoffs between siloed teams
Source app team → ingest team → platform team → analytics team → consuming app team… each team loses context from the last.
Run: Data replication without reproducibility
Creating unnecessary backups because systems aren’t trusted. Copying Excel files. You can damage a digital copy by losing its provenance.
Data production factory where bits disappear and reappear at different locations

The consequence of transportation is additional effort that may also reduce quality. There’s no clear single place to find what you need, and the more it moves, the less you trust it.

How much time have you lost due to misunderstanding between teams or to establishing the provenance of data? It may lower productivity, or it may be catastrophic if auditability is sufficiently degraded. This is the cost of transportation in data production.

In addition to transportation, the act of processing may include unnecessary …

Motion

Motion is extra activity that doesn’t add to the product, creates opportunities for defects, and takes a toll on workers.

Build: Context switching
For example, a dedicated data ingest team working across multiple sources (work in progress), which may also frequently break (incident toil).
Run: Manual intervention or finishing of products
Copy this here, rename file X and save over there, run script Y, …
Data production factory with workers moving between multiple streams and stages in a stream

The consequence of motion is that work is complicated, in a way that is bad for people. Every job requires more actions than it should.

How much time and energy do you lose to picking up and putting down work, including switching to manual intervention in data production? This loss can increase dramatically as the number of concurrent tasks increases. This is the cost of motion.

All of the above reduce the flow of build or run to an extent. Collectively and in interaction with batching up work, they cause …

Waiting

Waiting occurs when people or resources aren’t ready to pick up work as it arrives.

Build: Delays due to handoffs between siloed teams, long feedback cycles in development
Waiting for requirements or feedback on deliverables. Exacerbated in data-intensive applications by functionally-specialised handoffs (see Transportation), long running batch jobs, and a promote then test approach.
Run: Lead time to discover data, and from business event to insight or action
No-one owns this data or can tell you definitively if it exists, what’s in it, and where to find it. Reports come on a fixed schedule set by processing capabilities, not business needs.
Data production factory with waiting spinners in the flow of bits

The consequence of waiting is that business value realisation is delayed, in an environment where value decays rapidly. We might summarise this as: it would have been nice to have this data yesterday.

Where handoffs occur between teams, tasks may take 12 times as long to complete, as a median measure, and much longer in extreme cases. Cascading scheduled batch jobs with buffers and retries due to uncontrolled variability and quality issues can quickly add up to insights lead times measured in weeks.

The factors above contribute to and are also caused by the introduction of …

Defects

Defects are the failure to do something, or the failure to do it right.

Build: Defects in processing code
Query specifies ‘m’ for minutes instead of intended ‘M’ for month.
Run: Defects in data produced
People get the wrong emails.
Data production factory with some defective bits on input, a defective processing machine, and many defective bits on output
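To make the build example concrete, here is a sketch of the same kind of case slip using Python's standard datetime formatting. Note that the letters are the other way around in Python's format codes (%m is month, %M is minute) compared with the query syntax in the example above, but the failure mode is the same: the code runs without error and quietly produces the wrong value. The timestamp is made up.

```python
from datetime import datetime

# Made-up event timestamp: 15 March 2021, 10:45
event_time = datetime(2021, 3, 15, 10, 45)

# Intended: a year-month key for monthly reporting
intended = event_time.strftime("%Y-%m")   # "2021-03"

# Defect: one character in the wrong case gives year-minute instead
defective = event_time.strftime("%Y-%M")  # "2021-45"
```

Nothing fails at build time, so a defect like this only shows up as defective data downstream, unless a test checks the output against a known value.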

The consequence of defects is that the organisation increases risk exposure, while reducing consumer value delivered, and creating effort to remediate. Thus they have the potential to damage business and create yet more effort.

The cost of defects can be catastrophic, especially when related to personal information. If defects cause significant ongoing toil, reducing defects is a major lever for increasing productive capacity (eg, if defects consume ~30% of capacity, eliminating them lifts productive capacity from ~70% to ~100% of the total, a relative improvement of ~43%).
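The arithmetic behind this kind of estimate is simple enough to sketch. If a fraction w of total capacity goes to defect toil, the productive share is 1 - w, and removing the toil gives a relative gain of w / (1 - w). The function name is illustrative.

```python
def capacity_gain(defect_fraction):
    """Relative increase in productive capacity if defect toil is eliminated.

    defect_fraction: share of total capacity currently spent on defects (0 <= w < 1).
    Productive capacity goes from (1 - w) to 1, a relative gain of w / (1 - w).
    """
    if not 0 <= defect_fraction < 1:
        raise ValueError("defect_fraction must be in [0, 1)")
    return defect_fraction / (1 - defect_fraction)

gain_30 = capacity_gain(0.30)      # roughly 0.43
gain_third = capacity_gain(1 / 3)  # roughly 0.50
```

The gain grows faster than the waste itself: the more capacity is already lost to defects, the more each recovered point is worth relative to what remains productive.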

Conclusion

We can see these wastes are inter-connected and sometimes mutually reinforcing. Look out for these wastes in your work with data; find your own examples. I have found recognising these various wastes and being able to quantify their potential impact helps identify and prioritise improvement efforts. There are approaches and solutions to reduce these wastes, but I won’t address any of those here. Instead I will just encourage you to take some time to understand the problem; there’s a lot you can do with knowledge of waste in data production to define and drive change for the better.

Thanks to Ned Letcher and Yekaterina Khomyakova for feedback on these wastes, which were included as part of their presentation on Data Mesh for the 2021 LAST Conference Melbourne.

Guiding the Evolution of Data Mesh with Fitness Functions

I presented this webinar with Zhamak Dehghani – see the recording Guiding the Evolution of Data Mesh with Fitness Functions. There was great engagement with the topic and we captured some questions and further thoughts in this mini-blog post, published a little later.

This presentation brought together the idea of architectural fitness functions from the book Building Evolutionary Architectures with the core data mesh principles and logical architecture.

Our thoughts around guiding fitness functions included the below. These high-level measurable objectives were supported by a range of proposed metrics. This table is a handy summary; check out the webinar for more.

Domain Ownership
  • Scaling sources and consumers
  • Truthfulness
  • Domain autonomy
  • Reduced accidental complexity

Data as a Product
  • Serving users’ needs
  • Ease of discovery
  • Evaluation of quality
  • Service levels

Self-Serve Data Platform
  • Abstraction of complexity
  • Domain team autonomy
  • Protocols enable an ecosystem
  • Automation

Federated Computational Governance
  • Governance for common good
  • Degree of decentralisation
  • Interoperability
  • Increasing returns from scale
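To give a flavour of how one of these objectives could be backed by an automated check, here is a minimal sketch of a fitness function for ease of discovery. Everything in it is a hypothetical illustration: the catalogue fields, the metric (share of data products with complete metadata) and the threshold are assumptions, not metrics taken from the webinar.

```python
from dataclasses import dataclass

@dataclass
class DataProduct:
    """Hypothetical catalogue entry for a data product."""
    name: str
    has_description: bool
    has_owner: bool
    has_schema: bool

def discoverability(products):
    """Fraction of data products with complete catalogue metadata."""
    if not products:
        return 0.0
    complete = sum(
        p.has_description and p.has_owner and p.has_schema for p in products
    )
    return complete / len(products)

def fitness_check(products, threshold=0.8):
    """Return the score and whether it meets the (assumed) target threshold."""
    score = discoverability(products)
    return score, score >= threshold

# Made-up catalogue for illustration
catalogue = [
    DataProduct("orders", True, True, True),
    DataProduct("customers", True, False, True),
    DataProduct("shipments", True, True, True),
    DataProduct("returns", False, True, False),
]
score, passed = fitness_check(catalogue)
```

Run on a schedule or in a pipeline, a check like this turns a high-level objective into a number that can be tracked as the mesh evolves.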