Mar 24, 2026

Is the Future of Software Pay-to-Win?

AUTHOR

James Wondrasek

You can trace the birth of the agentic coding hype machine back to the “ralph loop” in mid-2025, but it really took off in November with the release of Anthropic’s Opus 4.5, followed by OpenAI’s release of GPT-5.2 two weeks later. 

Developers started speaking of a qualitative difference in the models, an increase in intelligence and agency. Anthropic published articles about using Opus to write a compiler. Harness engineering replaced context engineering (which had replaced prompt engineering) as the new focus for getting the best performance out of coding agents.

People started talking about software factories, where agents churned away tirelessly implementing specs, and these specs were the new programming language: a step up in abstraction from coding, where you told the agent what to build and the agents turned your intentions into code.

But it appears that turning an 800 line document into 25,000 lines of code is a lot easier than turning 25,000 lines of code into a reliable, working application. 

Developers, and companies, are reporting hitting the limits of agents and returning to a more hands-on approach, but still supported by agents. Research is showing agents can’t be trusted to maintain codebases and then there is the nature of the data agents are trained on.

Yet the software factory still has its proponents. Is this difference in approach a skill issue or a budget issue?

The cracks appear in agentic coding

In March the Financial Times reported that Amazon was suffering from AI-related outages:

The online retail giant said there had been a “trend of incidents” in recent months, characterised by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.

Having such a high-profile tech organisation, which is also a provider of AI services, call out these issues made developers commiserating on X realise they weren’t the only ones having problems with coding agents.

There is some counter-intuitive math here: if an agent has a 95% chance of completing any single step in a process correctly, then after about 13 steps the odds that it completes the whole task successfully are down to a coin flip.

Agents now execute tens or even hundreds of steps from a single prompt.
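The arithmetic behind that claim is easy to check. Assuming independent steps, the chance that all n steps succeed is simply p to the power of n:

```python
# Compound success probability: if each step succeeds independently with
# probability p, the chance that all n steps succeed is p ** n.

def overall_success(p: float, n: int) -> float:
    """Probability that an agent completes n independent steps without error."""
    return p ** n

p = 0.95

# Find the first step count where the overall odds fall below a coin flip.
n = 1
while overall_success(p, n) >= 0.5:
    n += 1

print(f"after 13 steps: {overall_success(p, 13):.3f}")  # ~0.513
print(f"odds drop below 50% at step {n}")               # step 14
```

At 13 steps the success probability is about 51%, and it falls below 50% at step 14 – which is why a single prompt fanning out into hundreds of steps is a gamble.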

This “accumulation of errors” showed up in recent research from Alibaba, which tested 18 AI coding agents on 100 codebases, with each test running over 233 days. Every one of them failed.

The benchmark the researchers created, SWE-CI, measures code maintenance rather than single code fixes and involves 71 commits based on accumulated changes to the codebase.

SWE-CI shows that long-running code maintenance is still brittle for all current models. Even the best model, Claude Opus 4.6, broke code in 1 out of 4 runs, while the worst models broke code in 3 out of 4 runs.

Sturgeon’s Law meets code repositories

The explanations for failures in benchmarks and in actual products come down to the nature of agentic coding assistants and LLMs. One factor is how the LLMs are trained: on billions of lines of code, mostly from public repositories on sources like GitHub.

As Sturgeon’s Law would predict, most code is mediocre. And a large portion is simply bad – beginners’ projects, abandoned projects, early AI-generated slop, and so on. Training on public code is the latest version of computing’s “garbage in, garbage out” maxim.

LLMs have been trained to generate code that runs, but training them to write well-structured, maintainable code, especially in complex projects, is harder. While defects like typos are easy to detect and train away, qualitative and structural defects are not easily detected, and so cannot be trained away.
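A toy illustration of why: these two functions (invented here for the example) are functionally identical and pass the same tests, so any training signal based purely on whether the code runs cannot tell the well-structured one from the tangled one.

```python
def mean_clean(values):
    """Average of a list of numbers."""
    return sum(values) / len(values)

def mean_tangled(values):
    # Functionally identical, structurally worse: a manual loop,
    # trailing-underscore names, and needless mutable state.
    sum_ = 0
    len_ = 0
    for len_, v in enumerate(values, start=1):
        sum_ = sum_ + v
    return sum_ / len_

# Both produce the same result, so an outcome-based check is blind
# to the structural difference.
assert mean_clean([1, 2, 3]) == mean_tangled([1, 2, 3]) == 2.0
```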

The other factor is that coding agents do much less reasoning than people imagine. A recent paper demonstrated this: frontier models failed simple coding tests in languages that were functionally equivalent to popular languages like Python and JavaScript, but whose presence in the training data would be orders of magnitude smaller.

Even with few-shot examples and in-context learning (i.e. providing documentation), the models failed to write even simple programs – ones a human developer would find easy to write in a novel language under the same circumstances.

The dark factory approach to agentic coding

The term “dark factory” comes from a Chinese manufacturing trend. Certain industries have reached a level of automation such that entire factories are populated only by robots and so don’t need to be illuminated unless humans are present for maintenance. Thus “dark factories”.

The dark software factory, brought to wider attention by StrongDM and Dan Shapiro, works on the same idea, but with agents. You build your software factory so that humans aren’t part of the process. If you find yourself needing to participate, you stop and work out how to get an agent to do it on your behalf.

The key is validation. Whatever harness you use to drive your agents needs some way of testing the code being produced. If the code passes the tests, you don’t care what it looks like. Worried about performance? Test for it, and reject code that runs too slowly or uses too many resources. The agent will keep trying until it passes.
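That validate-and-retry loop can be sketched in a few lines. This is an illustration, not StrongDM’s actual harness: `generate` stands in for a call to any coding agent, and the doctest run stands in for whatever test suite, benchmark, or resource check the factory enforces.

```python
import pathlib
import subprocess
import sys
import tempfile

def run_tests(code: str) -> bool:
    """Write the candidate code to disk and run its embedded doctests.

    A real harness would invoke pytest, benchmarks, or resource-usage
    checks here; doctest keeps the sketch self-contained.
    """
    with tempfile.TemporaryDirectory() as tmp:
        module = pathlib.Path(tmp) / "candidate.py"
        module.write_text(code)
        result = subprocess.run(
            [sys.executable, "-m", "doctest", str(module)],
            capture_output=True,
        )
        return result.returncode == 0

def factory_loop(generate, max_attempts=5):
    """Keep asking the agent for code until it passes validation."""
    for attempt in range(max_attempts):
        candidate = generate(attempt)
        if run_tests(candidate):
            return candidate  # passed: we don't care what it looks like
    return None  # token budget spent, no passing candidate
```

The point of the design is that quality is defined entirely by the checks in `run_tests`: anything the checks don’t measure, the factory doesn’t care about.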

StrongDM used its own dark factory methods to build its agents’ harness – digital twins of every major application their code interacts with:

He had a Google Spreadsheet open. Columns, rows, formatting – it looked exactly like Sheets. Except the URL bar said localhost…

Gsuite was not alone. Slack was there, Jira, Okta… all running locally? A digital twin of the entire enterprise SaaS universe, there on that desktop, faithful enough that the Python client libraries couldn’t tell the difference. Jay confirmed that it was in fact what it looked like; he built it himself. It took a couple of weeks. He used their Dark Factory. [source]

This follows their philosophy:

Code must not be written by humans

Code must not be reviewed by humans

And an important enabler of this philosophy is their mantra: 

If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.

And this “$1,000 in tokens per day per engineer” could be what makes it work.

The handmade approach to agentic coding

Alongside the dark factory approach is the day-to-day experience of developers. There are plenty of conversations happening online about the limits in agent coding ability that they are running into. 

The consensus there is that agents are okay at coding but terrible at software engineering – which is not just the “big picture”, but the consistency and discipline required to keep software working and maintainable.

These developers are advocating working with smaller changes to the codebase, and even returning to the tab completion model popularised by Cursor in 2024.

Much like the SWE-CI benchmark showed models failing as changes accumulated, “slop creep” can set in when using agents hands-on in a codebase. The code can continue to work, and even pass tests, up to a point.

But without consistent human reviews it will deteriorate in quality until it finally fails and the agents themselves are unable to fix the errors they’ve created. 

And when it fails, the codebase has often evolved to a state where no-one understands it and debugging it is manual, slow and painful.

This is the reality Amazon ran into.

Is the future of software development splitting in two?

Two camps in agentic coding practices are the dark software factory and the hands-on engineer. Both camps are using the same models. Both camps believe in testing and validation. 

The dark software factory is reporting rapid delivery of software, while the hands-on engineer is seeing modest gains in productivity.

The dark software factory is spending “$1,000 on tokens today per human engineer” while the hands-on engineer is spending $200/month on a Claude Code or OpenAI Codex subscription. 

What will the future of software look like under these two regimes? Will one dominate in the long run? Will they each have their niche? Can you spend your way to competitiveness in the software market? 

At SoftwareSeni we lean more towards “hands-on engineer”. But we are always watching how software development is evolving and taking on the best practices as they become clear.

If you’d like to chat about software development or building businesses around software get in touch.
