Hard problems in highly agentic coding

Highly agentic coding with LLMs has great promise: automatically generating software to solve a wide range of problems. But it comes with its own hard problems to solve.

With my experience in product design search and optimisation, software development, robotics and manufacturing, it’s an area I’m very interested in understanding better. What I share here is my take – based on that experience – on the harder problems to solve for successful adoption of highly agentic coding. When I say highly agentic, I’m talking about going beyond demonstrated “word calculator” transformations used in discrete tasks, to having harnessed LLMs drive multiple significant feedback cycles in the SDLC with limited human involvement.

I usually like to talk about solutions to problems too. Some of these problems have existing solutions in software engineering practice, some solutions might already be here from the future, just not yet widely distributed, and others might be solved by you!

Problems are not a reason not to do something. I’ve often noted that good software engineering practice (like orienteering), is not based on avoiding errors entirely, but on anticipating and recovering quickly from the errors you will inevitably make.

May we can reconcile problems and progress with the adage (Gall’s Law) “a complex system that works is invariably found to have evolved from a simple system that worked”.

Maths problems

The space of computer programs has effectively infinite degrees of freedom – just add another line of code – and it’s highly non-linear – one misplaced ! character (logical not) will dramatically change the functionality. This means that any desirable program is only a very small change from uncountable undesirable programs, a change that may be difficult for agents (as for humans) to detect.

The agent (which from here I will call a robot) searches this space with a randomised approach, trying to land close to a desirable program with supplied information and pre-trained pattern matching, then iterating still closer with feedback. Given the high dimensionality and non-linearity, this is billions of times harder than dropping into the Himalayas in zero-visibility cloud with a rough mental map and trying to summit an 8,000m peak by touch (though LLMs do have billions of parameters).

But this space is also adversarial; the mountains are littered with bear traps. The whole information supply chain for this program search, from LLM brain, to software libraries, to research documents, to test data may contain difficult to detect prompt injections that guide the robot to also difficult to detect undesirable programs. Treating the space as random instead of adversarial will expose us to potential catastrophic failures. Since drafting this piece, NIST published proof that adversarial prompts are theoretically unsolvable, though may be practically ameliorated through continuous defence in depth. However, the transferability of open-box attacks remains problematic.

In certain common domains, the robot can make a good first guess based on supplied information and pre-training, but any novel, differentiating business or technical capability may be out of the robot’s training distribution, in which case the initial guess and subsequent iteration will be much less efficient, if effective at all, and possibly confidently wrong. While the search may be capable on occasion, its random nature makes it hard to assure reliability. While this is the mathematical landscape, as biased humans, we may conflate isolated success with general performance, and downplay these issues.

These fundamental maths problems underlie and amplify further technical, social-interaction and economic challenges.

Robotics problems

We’re trying to reward our robot to help it iterate faster to desirable programs (or punish it for doing the wrong thing), through directing it to use harness capabilities, write and pass tests, and so on, and as such, especially in this high-dimensional space, we’ll constantly face issues of reward design and reward hacking, where the robot satisfies the rewards but not the task for which the rewards were designed.

Long-term, deferred rewards are particularly hard to model, such as avoiding the accumulation of technical debt, or drift from the business model. Over longer time horizons and more complex problems (where we might now expect to see more benefit?), the robot gets less explicit direction and has to build more on its own prior output. Our robot is now searching for desirable programs based on dead reckoning, with all its accumulated errors. It should accumulate error slowly, I find it hard enough to locate the bed by dead reckoning when I’ve turned off the light!

Robotics or cybernetics control loops generally boil down to sense-plan-act-repeat (which again we might think of as solution search and optimisation). In the real world, sensors and actuators are imprecise and error prone. Hence we design the planning step to be robust to sensor failure and resilient to actuator failure. I don’t know how much we’ve approached coding robots from this perspective, as we call them agents and hence impute all sorts of characteristics like resilient agency. One particular concern is the haphazard approach to sensor fusion in which we smoosh the whole information supply chain into a prompt for an LLM, call it context engineering, and hope attention is all we need.

I’m also not sure we’ve agreed what the robots actuators should do and how this might vary by circumstances. Are we building spec-anchored, where the primary robot action is to author a diff to the existing code, and thus must have sensors to understand the existing codebase, or are we building spec-as-source, in which case the primary action is to regenerate the entire codebase, and sensing the existing, disposable codebase is redundant? To stress test a spec-as-source approach, are we confident regenerating “code” (more on this below) synchronously in the user interaction flow, rather than asynchronously in the development cycle?

Code robots might also actuate code changes and supporting actions via harnesses (fka makefiles), which do dual duty in defining the robot’s planning framework, and thus sensor fusion too. Further code robot loops might sense and act on these harness actuators, planners and sensors themselves, creating yet further feedback loops. In a world of perfect rewards and stable drift compensation, this is the self-improving code robot dream, but small deviations from this can rapidly magnify.

More broadly, control theory is the study of managing dynamical systems, and I’d love to see the applied to coding robots. I think the current trend towards harnesses and feedback loops is moving in the right direction, but I fear a lot of wheels undergoing painful reinvention without understanding of the systems fundamentals, including human factors below.

Code concepts

Clarifying these concepts, which may or may not present problems in themselves, will at least allow you to agree on the problems that follow.

Agentic coding shares some of the hallmarks of previous waves of no-code/low-code tools that made it easy to get started with a simple model of the world but ultimately crumbled under specific, precise, cross-cutting requirements that can only be expressed in a general coding language. Scaling the development of these systems can also only be managed by moving beyond content management to software engineering workflows. No/low-code may be great for distributed innovation and rapidly validating problem-solution fit, but at some level, we must mature to a full-code, full-engineering solution.

But what is “code”? Both instructions for a machine and a way to reason about a solution. We’ve long had DSLs, declarative and transpiled assets as part of our codebases – are specs just the latest incarnation? While specs in spec-anchored might be considered requirements documentation, specs in spec-as-source must be considered (low) code. We can envisage a point in spec-as-source development when complexity and specialisation force natural language specs to cede to general coding languages as a better model to reason about the solution. In my own strategic AI consulting, I started delivering Python notebooks to leaders when slide decks were not sufficiently expressive.

So don’t expect to be able to develop complex products with simple specs alone. Ultimately, it is likely your specs will become a defacto programming language. And the reason for this will be to maintain agility in the face of rising complexity (the very agility you enjoyed at the start).

Of course, any codebase needn’t follow one pattern exclusively, this is where architecture, modularity, and patterns such as durable core, ephemeral shell can help achieve the best from multiple concepts of code.

Workflow problems

I think a lot of these workflow problems could equally be characterised as illusory progress, or “easy start” problems.

First up, the classic 90-90 rule. It’s easy to have a code robot build a prototype or a local solution to a common class of problems – the first 90% of the work – but it can be infeasible or at least nonviable to scale this solution further – the second 90% of the work.

Because it’s easy to get started with code robots, we may work in bigger batches with poorer knowledge of the problem to be solved and longer user feedback cycles, charmed by our ability to spit out lines of code, PRs and features. This is counterproductive when the key problem is product-market fit, in which case, producing code was never the bottleneck, testing in the real world was. So we must make sure code robots accelerate the validation cycle (did we build the right thing) and not just verification cycle (did we build it right).

Easy starts can also lead to duplication and fragmentation, because it’s more effort to understand how to reuse existing solutions. This is fine if solutions are ephemeral but problematic if we have multiple durable contenders. Easy starts can delay the structural work required to make it easy to evolve (or continuously re-start) durable solutions over time. Now that starts are easier than ever, they highlight the deeper problems below of knowledge management and organisational alignment.

Bigger batches can lead to a product delivery death spiral. CI/CD practices remain primary in delivering value to customers and shortening all our robotic feedback loops. Martin Fowler’s 2000s observation that if you’re not doing continuous integration, you’re doing deferred integration, which is ultimately harder and more painful, is no less relevant in the 2020s.

Bigger batches lead to higher interaction costs, which grow quadratically as the changes become coupled. Easy code generation can in fact work against all the levers for managing technical debt: we screen fewer changes up front, invest less to socialise new changes (partly because we don’t understand the norms, as below), don’t evolve our architecture to seclude changes (as this lives in the know-how-know-why loop), and have less incentive to surrender the obsolete changes.

Easy, approximately correct code generation also works against Lean practices. Deming (crediting Harold F Dodge) famously warned: “Quality can not be inspected into a product or service; it must be built into it”. However, when we build before we think, we are doomed to try to inspect quality in later. There’s a whole essay about Lean and AI bubbling away.

A paradox here is that better coding robots can hide these easy start problems for longer, allowing us to make more illusory progress and be more deeply committed to the sunk cost of a big batch than we would be otherwise be. For a trip down memory lane, see how much more obvious these problems were in my coding saga with Bard.

In all our workflows, we should be thinking how we might stop starting and start finishing.

Human problems

Consider this quote from Our Robots, Ourselves (ORO):

Seth Teller, formerly an MIT roboticist, perceptively observed that urban driving consists of hundreds of “short-lived social contracts between people,” as we scan the streets, make eye contact, let people in and wave “thank you.”
David A Mindell. Our Robots, Ourselves: Robotics and the Myths of Autonomy

This 2015 observation about social challenges of self-driving cars seems very prescient after a decade of failed FSD promises and as we witness 2026 Waymos clog streets, interfere with emergency vehicles and pass stopped schoolbuses.

I’ve been thinking about robotic coding in the these terms too – less a mechanical problem of generating code and more as “a series of micro social interactions”, mediated by code but also other channels, to align and coordinate the behaviour of actors in the codebase.

I’m not convinced our current generation of coding robots are this socially aware.

ORO (it really is gold) also deals with the challenge of handing off a previously automated task to a human. If automation has failed, it’s often in a complex, ambiguous and uncertain situation, and it can be very difficult for a human to rapidly acquire the context they need to handle this situation. In the case of disengaging self-driving or auto-pilots, slow or incorrect building of context can be crucial and deadly. Mixing automated with human code means these handoffs happen very frequently in the course of development. The consequences of any individual handoff are obviously not as catastrophic, but without enough shared context the cumulative impact can take its toll on people and projects. ORO also describes the human challenge of maintaining vigilance to detect mistakes in systems that are often right.

Automated code generation can also be addictive, and AI experiences are optimised for engagement. Generative AI has often been likened to a slot machine – just one more prompt and we’ll win big! This might be fine and dandy for personal projects, but in a professional capacity, we need to critically assess the odds and our expected value across a development portfolio, not just the big flashing lights and sirens that accompany the occasional jackpot, lest the house win.

Generative AI is great at creating artefacts like code, but often the process of creation is as important as the artefact itself. If you want to contextualise, influence, adapt and evolve in a social environment, all of which might happen in a hall conversation at the drop of a hat, you need to understand at a deeper level the objectives and constraints driving any solution. This only comes from engaging deeply with the problem, where we can’t substitute AI for human effort. If we skip this step often enough, we lose our ability to assess the merits of any new artefact, and this is referred to as cognitive debt.

Thus we also see the expertise paradox in generative AI – that the skills required to identify a good solution may be the same as the skills required to develop a good solution – our era’s Dunning-Kruger – the converse being any specialised output looks plausible to a layperson.

This happens at the individual and also the organisational level. The Nonaka-Takeuchi (SECI) cycle shows how coding robots short-circuit the knowledge management cycle and may weaken it critically. By skipping the “know how” (to change our code), we miss the deeper insight or “know why” (we follow certain norms) that drives the next round of explicit “know what” (in future iterations of code). Yes we should externalise as much of our understanding as we can, which helps both humans and robots, but this is just one step in a cycle that collapses if we don’t treat it as a human-centric cycle. If we leave all the “know why” insights to coding robots, we fall victim to the expertise paradox.

These human problems make life harder (extra cognitive load, etc) and less rewarding (whither mastery, autonomy and purpose?) but also leave organisations at risk of critically degrading core competencies in technology development.

Economic problems

The risk of degraded core competencies is one, but not the only economic or strategic problem to be solved for success with highly agentic coding.

Michael Porter would also be disappointed if we concentrate suppliers for LLMs and hence cede our bargaining power. We should avoid undue dependency on any one LLM provider, by ensuring all our agentic tooling remains reasonably portable to alternative providers. As a business continuity concern too, we want LLM downtime to be minimally disruptive to our development process. And as above, the more human capability we retain, the more we can bargain with any LLM provider.

There is a risk inherent in regression to attractors with AI-driven solutions, which tend to reproduce the most common patterns in their training data. As above, software supply chains may become more economical for adversaries to exploit if more organisations are vulnerable. Solutions influenced or crafted by LLMs will be less sensitive to our unique circumstances, and hence less efficient, but also insidiously erode the value we might create through differentiation in the market. Finally, if we believe our development culture or practices to be a competitive advantage, we must also be careful that highly agentic coding does not commoditise culture or practice.

How much value does AI create itself and how much is as a result of repackaging human toil and ingenuity? The Stone Soup metaphor highlights that a lot of the value is derived from high quality data and tools used by AI, so a cynical response is “very little”. Even if this is the case, AI might nonetheless provide the spark for enhancing data and tools (indeed use of code robots is renewing interest in elements of demonstrated human-centric coding practices like XP, CD and DevOps) and be valuable in that regard as a catalyst rather than reactant. At a minimum, I think it’s now clear there is qualitative value in information search and “word calculator” text transformation applications of LLMs, even if not yet fully quantified compared to alternatives. As a final note on repackaging human toil and ingenuity, LLM providers may yet face an ethical and legal reckoning for IP appropriation and worker exploitation, which is at the very least an economic risk for them and their customers.

We’re yet to see the full cost of consuming LLM tokens. To date, they have been subsidised by investors (in sometimes circular arrangements) and disregard for the externalities they generate when operational in data centres, such as greenhouse gas emissions, water consumption and impact on communities. Artificially low cost has served to drive up usage across the board and drive it into uneconomical use cases. Coding robots are particularly high consumers of tokens as they iterate through multiple feedback loops (aka inference-time scaling). As major LLM providers prepare to IPO or otherwise recoup investments, we can all expect to pay more for tokens from these providers. At this point we’ll discover which use cases are really economical, and we hope we can gracefully back out of those that aren’t.

Finally, as both a mathematical and economic concern, distillation provides the means and access to input and output tokens the technical opportunity for platform providers to Sherlock product companies’ differentiating features (a strategy supermarket home brands have used for decades). Whether they have sufficient motivation may depend on how the growth story plays out.

So our final problem to consider is: whether adoption of highly agentic coding is a 1-way door? We could reach a point where it’s prohibitively expensive to continue consuming tokens AND prohibitively expensive to recover degraded core competencies. There’s a lot of economic value in avoiding that outcome.

Problem checklist

Here are the problems to acknowledge or think about solving en-route to highly agentic coding:

Maths
- High-dimensional, non-linear search space
- Adversarial space, theoretically unsolvable
- Rarity problem
- Capability != reliability
Robotics
- Reward hacking
- Long-term rewards
- Sensor, planner and actuator design
- Feedback loops and dynamical control
Code
- Relationship of spec and code
Workflow
- 90-90 rule aka “easy start” problem
- Validating product-market fit
- Ephemeral duplication or durable consolidation
- Big batch death spiral and technical debt
- Quality must be built in
- Better robots hide problems longer
Human
- Coding as micro social interactions
- Handing automated context to humans
- Addictive nature of Gen AI
- Expertise paradox
- Nonaka cycle preservation
Economic
- LLM supplier bargaining power
- Regression to attractors means vulnerability and commoditisation
- Stone soup assessment
- Ethical and legal assessment
- Strategy for increasing cost
- Distillation-based Sherlocking
- Avoiding 1-way doors

Footnote: Product design search and optimisation

What do I mean by this phrase? Product design is how we design products, and services. Design search is how we parameterise and explore all the possible, desirable, feasible or viable designs for products. Optimisation is how we find the product designs that produce the best outcomes by some measure, be it customer satisfaction, strength-to-weight ratio, or cost.