The chart below provides a new take on the “SaaS-pocalypse” talk that has been happening recently. What the chart shows is an explosion in new apps occurring alongside no net increase in app usage. What that means for software-based businesses is what we will be discussing.

The chart comes out of a study by MIT’s Mert Demirer (and co-authors): “Writing Code vs Shipping Code – Productivity Effects Across Generations of AI Coding Tools”. They used GitHub data to track the evolution of AI usage across lines of code written, files edited, projects and features worked on, and actual software releases.
This span of activities let them compare AI coding tools from the first wave of auto-complete tools (eg Cursor and GitHub Copilot) through coding assistants to coding agents. What they found was that AI had increased productivity at the beginning of the process, with coders creating or editing 300% more files. But once they traced through the product development process to software releases, they found only a 30% increase.
The reason for this precipitous drop – humans in the loop. Amdahl’s Law says the overall speedup of a system is limited by the parts of it that are not sped up, and in the era of agentic coding we are responsible for all those slow steps. It is likely that for some software products the rate of software releases will not increase much.
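To see how a big gain at the start of the process shrinks by the time software ships, here is a minimal sketch of Amdahl’s Law using the study’s two headline figures. It assumes, purely for illustration, that “300% more files” means the coding step ran 4x faster and that a 30% increase in releases is a 1.3x overall speedup. The function names are ours, not the study’s.

```python
def overall_speedup(p: float, s: float) -> float:
    """Amdahl's Law: overall speedup when a fraction p of the work runs s times faster."""
    return 1.0 / ((1.0 - p) + p / s)


def accelerated_fraction(overall: float, s: float) -> float:
    """Invert Amdahl's Law: the fraction p that must have been sped up s times
    to produce the observed overall speedup."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / s)


# Illustrative assumptions: coding step 4x faster, releases 1.3x faster.
p = accelerated_fraction(overall=1.3, s=4.0)
print(f"Share of the release cycle that was accelerated: {p:.0%}")

# Even an infinitely fast coding step is capped by the part that was not sped up.
print(f"Ceiling on overall speedup: {1.0 / (1.0 - p):.2f}x")
```

Under those assumptions only about 31% of the release cycle was sped up, and even an infinitely fast coding step would cap the overall gain at roughly 1.44x.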
While line counts can increase at agentic speed, most processes in this world are still constrained by real world dynamics. You release features 30% faster, but your users aren’t moving 30% faster. Feedback trickles in, it’s contradictory, some users are seeing bugs. Whether or not the feature was a success will take time to resolve. And all the decisions that hang on that feature’s performance will have to wait.
These points where code meets the real, messy world are not going to go away and will continue to place an upper bound on how quickly we can progress.
And as that app chart shows, progressing quickly is not always progressing successfully.
A common talking point that was part of the SaaS-pocalypse was that anyone could take your SaaS and reproduce it in a weekend. At the same time, “prompt an app” services like Bolt.new and Base44 promised product development for everyone. It did feel like the competition was going to rise to an unprecedented level.
But what has happened in the app stores is very different to the narrative. More apps than ever were created, but very few were being used. Three broad factors, alone or together, might explain this: the apps work poorly (bad architecture), the apps look unappealing (bad design), no-one knows about them (bad marketing).
Another frequent talking point around the use of AI is “taste”. “Taste” is the knowledge and experience needed to extract professional results from what are supposed to be frontier-level intelligence models. If you don’t have a working knowledge of application architecture, can you get AI to build you an app that can grow and improve while handling edge cases and increased traffic? If you don’t understand UX and UI, how can you know if your app is well designed and easy for others to use? And if you don’t understand marketing, how do you think people find apps and are onboarded?
To the surprise of no-one, being able to prompt an app (and we’re including web-based services as well in this) is not the same as being able to run a software-based business. The rush of new competitors into the market is mostly a temporary inconvenience, but other vectors of competition have intensified.
For every 100 or 1000 new quick-and-dirty apps in a market there are going to be 2-3 apps built by very small teams who have built exactly the same app for an ex-employer in the past. These new agent-native teams have the ability to reach feature parity faster than bigger orgs with larger teams that haven’t fully integrated AI. They lack the incumbents’ customer relationships and vendor networks, but they can only gain and the customer count they need is much lower than everyone else’s.
The other competitor is the one that has gone all in on AI and is making it work. An example of this is Fin, which you once knew as Intercom. They changed names mid-May. Last June, Darragh Curran, CTO and Head of Engineering at Fin/Intercom announced a goal to 2x their productivity. In April 2026 he announced their success and gave an outline of how they did it. Yes, they are one of those notorious companies that burns through tokens.
Over 9 months Fin 3x-ed the number of PRs. This was their measure, number of PRs, which Curran admits is imperfect for many reasons but served them well.
That 3x is a 200% increase, more than six times the 30% increase in releases the MIT study found. But the MIT study was a survey across random GitHub projects. This was an R&D-focused tech company with an explicit goal to pursue.
They succeeded through continuous iteration and experimentation. They settled on a single provider – Anthropic. They built company-wide infrastructure for sharing and automatically updating Claude Skills, so any improvements propagated throughout the ~500 member team. They worked out what they could safely automate away (eg PRs with fewer than 20 lines of code) and built the infrastructure to make it safe.
This increase in PRs now represents an increase in speed across the company. Their revenue growth is accelerating. And in terms of competition, Curran says:
“we are able to say yes much more frequently, which translates into deals closing that would have been blocked, or accounts churning because we can’t support their evolving needs”
This is an incumbent moving at AI-native speeds but with reputation and relationships on their side.
This article was a little unfair. It started out making you feel better about the increase in competition AI has created, and has now probably left you worried about your existing competitors.
But, as they used to say, forewarned is forearmed. At SoftwareSeni we are constantly exploring best practices in software development across dozens of projects and in every corner of our operation.
If you ever want to chat about what we’re learning or if you’re interested in extending your team with developers trained in AI-assisted coding and embedded in an organisation dedicated to doing quality work with AI, get in touch.
SaaS Are Moving to Usage-based Pricing to Survive AI
The rise of AI agents is shifting how software services are consumed and so it is also changing how they need to be priced.
The traditional seat-based model relies on a simple idea: one human user equals one license.
Seat-based pricing doesn’t make sense in an economy with a growing population of agents that work 24/7 and outnumber the customers that run them against your APIs. More and more companies are following Mintlify’s lead – moving away from seats to usage-based pricing.
Usage-based pricing is a challenge to bolt on to an existing product. So we’re going to give you a quick overview of what you’ll be looking at, starting with – does usage-based pricing even make sense for your product?
The first step in deciding if you’re going to move to usage-based pricing (UBP to save typing) is to look at your industry and your competitors. Is it already common? Are others making the shift? If they are, adopting a similar structure can help you remain competitive. If they’re not, you might have a strategic window to capture some market share, provided your UBP can give potential customers more cost flexibility and lower adoption costs than your competitors can.
Next, you have to look at your product and ask yourself these three questions:
1. Can you actually break down usage into units?
What you’re going to be billing needs to relate directly to the value your customers are getting from your product. If you’re MailChimp you charge for sending emails and give templates away for free. Paying more money to send more emails makes sense to their customers.
Time can be a unit. If you’re AWS or Google you charge your customers based on the time they spend running their services on your hardware (plus many other things).
2. Can customers easily predict their usage requirements?
While customers want flexibility, they also hate unpredictable bills. This is why there is an interesting trend in enterprise software: multi-year contracts growing from 23% to 38% of agreements. They aren’t doing this for discounts, they’re locking in predictability, especially if AI-powered features are involved.
3. Are the usage and value of your products/services increasing?
If your customers are getting more value the more they use your tool, UBP allows your revenue to scale with their success.
Finally, consider your product. Do you need to develop entirely new features or services around your current offerings to make usage-based pricing make sense or are you ready to go?
If you decide to move forward on UBP, you need to choose the right pricing structure. There are four primary variations to consider:
Variable Pricing: This model is based entirely on consumption. The most common example today is token-based pricing for AI models, but it could also apply to the number of reports generated, resolved customer support complaints (a model used by Intercom), or fraud predictions. This is typically managed via a credit-based system. Customers are given a default monthly credit allowance to help smooth out your revenue fluctuations, and they can buy more credits as their consumption increases. This is how platforms like Lovable handle their billing.
Tiered Pricing: Under this model, different unit costs apply as usage passes specific thresholds. Tiered pricing allows customers to maintain control over their budgets while ensuring your margins remain positive. It also lets you offer volume discounts to your largest accounts. The downside is that too many tiers confuse buyers, while too few make the price jump between tiers painful.
Dynamic Pricing: In this structure, prices shift in real time based on market conditions and demand. Uber’s surge pricing is the most familiar example of this model.
Per-Feature Pricing: Similar to variable pricing, this model charges customers only for the specific features they activate and use, often tracked via a credit-based system. This gives your customers complete autonomy over their costs, allowing them to balance the price-to-value equation themselves.
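As a concrete illustration of one of these structures, here is a minimal sketch of graduated tiered pricing, where each unit is billed at the rate of the tier it falls into. The tier boundaries and unit prices are made-up numbers, and the function name is ours.

```python
from decimal import Decimal

# Hypothetical tiers: (upper bound of the tier in units, price per unit).
# None means the tier has no upper bound.
TIERS = [
    (1_000, Decimal("0.010")),
    (10_000, Decimal("0.008")),
    (None, Decimal("0.005")),
]


def tiered_charge(units: int, tiers=TIERS) -> Decimal:
    """Graduated tiers: each unit is billed at the rate of the tier it falls in."""
    if units < 0:
        raise ValueError("units must be non-negative")
    total = Decimal("0")
    lower = 0  # units already billed by earlier tiers
    for upper, unit_price in tiers:
        if units <= lower:
            break
        in_tier = (units if upper is None else min(units, upper)) - lower
        total += in_tier * unit_price
        if upper is None:
            break
        lower = upper
    return total


for usage in (500, 5_000, 20_000):
    print(usage, tiered_charge(usage))
```

With these example tiers a customer using 20,000 units pays less per unit than one using 500, which is the volume discount described above. Using `Decimal` rather than floats keeps the invoice arithmetic exact.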
Building the technical infrastructure to support usage-based pricing is a significant engineering challenge. You will face three distinct hurdles: metering, storing, and implementation.
1. Metering
You must record consumption data exactly where it happens, whether that is API requests, token counts, or raw compute time. The recommended approach is “fire-and-forget”: your application code should emit a single billable event to your metering system the moment a transaction occurs.
For continuous workloads, like long-running background jobs or continuous compute, tracking start and end events is risky because data can be lost if a system crashes mid-job. Instead, use a “heartbeat” approach, where the active workload sends a heartbeat record to the metering system at regular intervals.
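The two metering patterns can be sketched in a few lines. This is an in-memory stand-in, not a real metering client: the `Meter` class, the event fields and the 60-second heartbeat interval are all assumptions made for the example.

```python
import queue
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class UsageEvent:
    customer_id: str
    metric: str        # e.g. "api_request" or "compute_heartbeat"
    quantity: float
    timestamp: float   # seconds since the epoch
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)


class Meter:
    """Hypothetical metering client: emit() never blocks the calling code."""

    def __init__(self, max_buffer: int = 10_000) -> None:
        self._buffer = queue.Queue(maxsize=max_buffer)
        self.dropped = 0

    def emit(self, event: UsageEvent) -> None:
        try:
            self._buffer.put_nowait(event)   # fire-and-forget
        except queue.Full:
            self.dropped += 1                # count drops so they can be alerted on

    def drain(self) -> list[UsageEvent]:
        """Stand-in for a background sender shipping events to the metering system."""
        events = []
        while True:
            try:
                events.append(self._buffer.get_nowait())
            except queue.Empty:
                return events


HEARTBEAT_SECONDS = 60.0  # assumed interval


def heartbeat(meter: Meter, customer_id: str, now: float) -> None:
    """Called by a running job every HEARTBEAT_SECONDS; each beat bills one interval."""
    meter.emit(UsageEvent(customer_id, "compute_heartbeat", HEARTBEAT_SECONDS, now))


def billable_seconds(events: list[UsageEvent], customer_id: str) -> float:
    return sum(e.quantity for e in events
               if e.customer_id == customer_id and e.metric == "compute_heartbeat")
```

A job that crashes after three heartbeats is still billed for the 180 seconds it reported, whereas a start/end pair would have lost the whole run. The per-event `event_id` gives downstream systems something to de-duplicate on if an event is delivered twice.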
2. Storing and Processing
This is where you enter the world of data warehousing and complex stream processing. To provide real-time billing updates, you cannot simply dump records into a database and run batch jobs overnight; that leads to stale data. Most modern platforms rely on Kafka-based streaming architectures to process usage events in real time.
This data must be auditable. You cannot do this by halves: accuracy is paramount. If you are sharing revenue with partners or dealing with billing disputes, you need an audit trail that captures and tracks every single billable event.
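One way to picture an audit trail is an append-only ledger in which every entry carries a hash of the entry before it. The sketch below is our own illustration rather than how any particular billing platform works. The `UsageLedger` class and its fields are hypothetical, and a production system would persist this to durable storage.

```python
import hashlib
import json


class UsageLedger:
    """Append-only ledger; each entry's hash covers the previous entry's hash."""

    GENESIS = "0" * 64

    def __init__(self) -> None:
        self.entries: list[dict] = []
        self._seen_ids: set[str] = set()

    @staticmethod
    def _digest(body: dict) -> str:
        return hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()

    def append(self, event_id: str, customer_id: str, metric: str, quantity: int) -> bool:
        """Record an event once; a retried delivery with the same event_id is ignored."""
        if event_id in self._seen_ids:
            return False
        prev_hash = self.entries[-1]["hash"] if self.entries else self.GENESIS
        body = {"event_id": event_id, "customer_id": customer_id,
                "metric": metric, "quantity": quantity, "prev_hash": prev_hash}
        self.entries.append({**body, "hash": self._digest(body)})
        self._seen_ids.add(event_id)
        return True

    def verify(self) -> bool:
        """Recompute the chain; returns False if an entry was altered or reordered."""
        prev_hash = self.GENESIS
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev_hash or entry["hash"] != self._digest(body):
                return False
            prev_hash = entry["hash"]
        return True

    def total(self, customer_id: str, metric: str) -> int:
        return sum(e["quantity"] for e in self.entries
                   if e["customer_id"] == customer_id and e["metric"] == metric)
```

Because invoice totals are computed from the ledger rather than stored separately, a disputed bill can be traced back to the individual events behind it.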
Also, don’t treat usage data purely as an input for invoices. Your sales, product, customer success, and finance teams can all use access to this data. Real-time usage patterns can help sales spot upsell opportunities, help product teams track feature adoption, and help customer success flag accounts where dropping usage signals a risk of churn.
If you’re already tracking usage data for these purposes then lucky you, you’re already part way there.
3. Implementation
You do not have to build all of this billing infrastructure from scratch. There are lots of service providers out there already. You just need to instrument your code with those “fire-and-forget” messages to hit their databases and they give you all the other tooling you need.
Here are three examples:
Moesif: This platform specializes in metered API billing. It connects directly to API gateways like Kong or Tyk and integrates with payment processors to automate usage tracking and billing.
Stripe: In addition to standard subscription billing, Stripe supports complex usage catalogs (including models used by companies like Anthropic). They also offer an LLM proxy endpoint (in private preview), which automatically tracks token usage, applies your pricing markup, and handles invoicing in a single request.
Chargebee: This platform offers usage-based billing features designed to separate your raw usage data from your pricing logic, making it easy to run experiments and iterate on your pricing tiers.
While the benefits of UBP are clear, the transition introduces real business risks that you need to manage.
The biggest challenge is revenue fluctuation. Unlike flat-rate subscriptions, your MRR will go up and down. To mitigate this, most companies use a hybrid pricing model. For example, you might charge a base subscription of $20 per month that includes 1,000 credits, and then charge a variable rate for extra credit purchases.
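The hybrid example above works out as follows. The $20 base fee and 1,000 included credits come from the example. The price per extra credit is an assumed figure for illustration, and for simplicity extra credits are billed one at a time rather than in purchased packs.

```python
from decimal import Decimal

BASE_FEE = Decimal("20.00")            # from the example above
INCLUDED_CREDITS = 1_000               # from the example above
PRICE_PER_EXTRA_CREDIT = Decimal("0.03")  # assumed rate, for illustration only


def monthly_invoice(credits_used: int) -> Decimal:
    """Base subscription plus a variable charge for credits beyond the allowance."""
    extra = max(0, credits_used - INCLUDED_CREDITS)
    return BASE_FEE + extra * PRICE_PER_EXTRA_CREDIT


for used in (0, 400, 1_000, 1_500):
    print(used, monthly_invoice(used))
```

However little a customer uses, the invoice never falls below the base fee, which is what puts a floor under your MRR.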
Another major risk occurs when you bill for consumption in arrears (charging after the usage has occurred, similar to AWS or Google Cloud). If an AI agent or a poorly written loop in a client’s code runs out of control, it can burn through thousands of dollars of spend. This leads to “bill shock,” which can ruin customer relationships or even bankrupt a small client.
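A common defence against bill shock is a spend cap with an early warning. The sketch below is one hypothetical way to do it: the `SpendGuard` class, the 80% alert threshold and the return values are all our own choices.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class SpendGuard:
    """Hypothetical per-customer guard for usage billed in arrears."""
    monthly_cap: Decimal                   # hard limit chosen by the customer
    alert_at: Decimal = Decimal("0.8")     # warn at 80% of the cap (assumed default)
    spent: Decimal = Decimal("0")
    alerted: bool = False

    def record(self, cost: Decimal) -> str:
        """Return 'ok', 'alert' (first crossing of the threshold) or 'blocked'."""
        if self.spent + cost > self.monthly_cap:
            return "blocked"               # reject the request instead of accruing spend
        self.spent += cost
        if not self.alerted and self.spent >= self.monthly_cap * self.alert_at:
            self.alerted = True
            return "alert"
        return "ok"


guard = SpendGuard(monthly_cap=Decimal("100"))
for cost in (Decimal("50"), Decimal("30"), Decimal("10"), Decimal("20")):
    print(cost, guard.record(cost))
```

A runaway agent loop hits “blocked” at the cap rather than running up thousands of dollars. Whether to block outright or only alert is a product decision, since some customers would rather pay than have their service interrupted.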
Finally, remember that you are exposed to broader economic conditions. If your customers experience a business slowdown, their software usage—and your revenue—will drop alongside theirs. You may also face higher delinquency rates during market downturns.
Even if AI agents are driving the usage, the person signing the contract and paying the bill is still a human. Humans require transparency, predictability, and clear communication. If you transition to usage-based billing, you must provide your customers with the tools, dashboards, and real-time alerts they need to monitor and manage their own consumption.
AI agents are changing how SaaS are run and valued. The traditional seat-based model is on its way out. While building the infrastructure to support usage-based billing is complex, we should be relieved we have a challenge that is straightforwardly solvable.
The alternative, in a world where people talk about using AI to build inhouse versions of the services they pay for, is not having a business at all.
Agile in the Age of AI
Agile is how software is built. Its conceptualisation, its practices, its strategies have permeated software development, even in teams who are not Agile practitioners.
The aim of Agile was to keep the developers of software aligned with the users of the software. Alignment was maintained through feedback loops, sync points with the stakeholders, and the feedback loops were short: software was built incrementally and iteratively, and those increments were kept small so iteration could happen quickly and developers and stakeholders could never drift too far out of sync. This is what the Agile product stories and sprints grew out of, and technical practices like CI/CD developed to support them.
Now AI has broken Agile. Coding assistants and agents have changed the flow of software development: where time is spent, where costs are generated, where the sync points are.
We’re going to take a quick look at what AI is doing to Agile and what can be done to get the best of both tools. Rather than swapping between AI assistant, AI agent, coding agent, etc, we’re just going to call it AI.
Getting humans to iterate on code is expensive, so you want to get it right. You want your developers building the right pieces. This is why user stories, sprint planning, standups, ticket grooming and so on exist. It’s all done to reduce risk; the risk that your developers just spent weeks on the wrong code.
Getting AI to iterate on code is cheap and fast compared to getting humans to do it. It becomes the cheapest and fastest part of the process. A two week sprint can be completed in an hour or two. This changes project cadence, and project scheduling, and messes with the stakeholder sync points.
Do you have meetings every few hours to discuss the new feature implementation? Do you discuss forty new features at the next stakeholder meeting? What do you cover in your stand-ups?
AI shifts effort from coding to reviewing. Except the review schedule has been decoupled from human effort and timing. It is quite easy to instruct a few agents that go on to generate a constant flow of PRs, with each PR encompassing thousands of lines of changes across hundreds of files.
Review fatigue is real and results in developers skimming a fraction of the changes in a PR before accepting it. And they are going to accept it because they have been accepting PRs for weeks and their knowledge of the codebase is stale and getting back up to speed plus doing a proper review of the PR would take just as long as implementing the changes themselves.
The consequence of review fatigue is technical debt. Without pushback on its output, AI accumulates poorly architected code on top of poorly architected code. Eventually, an error occurs that overwhelms the context and the understanding of the AI. Slop can’t fix slop, as they say. And developers need to go back in and spend a schedule-breaking amount of time and effort to understand the codebase and implement fixes manually.
You can make AI work with Agile. You can tweak the methodology, you can adopt new tools, and you can apply some old-fashioned discipline.
The first step in tweaking the methodology is to rethink your sync points with stakeholders. What should the unit of work look like? When does it make sense to meet and review progress?
And your sync points will depend on how you define when a unit of work is done. What does done look like when AI is generating your code?
Done should be when all tests pass and all metrics are met. All the tests and all the metrics. Because AI is trained to pass tests to the exclusion of all else (that is Reinforcement Learning in a nutshell – learning to pass tests), it can’t be trusted in how it passes tests. It will take shortcuts, it will try to shift the bar it is supposed to be clearing. This means that for every metric you need a secondary metric that detects cheating.
Your code test coverage needs to be paired with mutation testing. Performance benchmarks need to be paired with realistic data fixtures so you’re not surprised when customers start hammering your product. Find a counter-test for every test.
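Here is a small, hand-rolled illustration of why coverage needs mutation testing as its counter-test. The functions and test suites are invented for the example. Dedicated mutation-testing tools generate and run the mutants automatically, while this sketch writes a single mutant by hand.

```python
def is_adult(age: int) -> bool:
    return age >= 18


def mutant_is_adult(age: int) -> bool:
    return age > 18          # mutation: ">=" replaced with ">"


def weak_suite(fn) -> bool:
    """Executes every line of fn (100% coverage) but never probes the boundary."""
    return fn(30) is True and fn(5) is False


def strong_suite(fn) -> bool:
    """Adds the boundary case, which is what distinguishes ">=" from ">"."""
    return weak_suite(fn) and fn(18) is True


def mutant_killed(suite) -> bool:
    """A suite kills the mutant if it passes on the original and fails on the mutant."""
    return suite(is_adult) and not suite(mutant_is_adult)


print("weak suite kills mutant:", mutant_killed(weak_suite))
print("strong suite kills mutant:", mutant_killed(strong_suite))
```

Both suites report full coverage, but only the one that tests the boundary notices when the code is quietly changed. A surviving mutant is the signal that a test passes without really checking anything.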
And once the tests pass and the metrics are met, then you still need to have humans do the review. This will be the cap on the unit of work and the limit on your throughput. It is an old fashioned “stitch in time saves nine” solution and one that should be bypassed only very grudgingly and only by deep testing and constant review (see, it’s inescapable) of the tools and processes you replace it with. And let’s admit it, the tools and processes you replace humans reviewing AI code with will always be AI reviewers of AI code.
Finally, consider what you’re going to cover in your daily scrums. Instead of individual status updates you want to be covering what reviewing needs to be done, what testing strategies are being put in place, what holes are showing up in your process that need patching.
This is a shift from building the product to managing the building of the product. And that shift is for everyone. Your developers will still write code, just much less. They will be managers of agents and monitors and arbiters of their agents’ outputs. AI flattens the workflow to two steps: Design → QA.
This is just a single step forward in what is a time of rapid constant change. While it does appear that AI abilities are plateauing, the software development industry is still evolving rapidly in its use of AI.
What is clear is that AI is not a super genius. Its decisions can’t be relied on, its code cannot be trusted, and we must verify, verify, verify. But sometimes we can use tools to handle that verification, and sometimes we can even use more AI.
And this is impacting where software developers spend their time, and where the bottlenecks are in building software products. It is an ongoing challenge to find the new best practices for the Agile development of software in the age of AI. We hope this gives you some ideas on where you can look to improve and optimise your practices.
The AI Job Replacement Calculator
| Parameter | Value | Notes |
| --- | --- | --- |
| Rack | GB200 NVL72 | 36 Grace CPUs + 72 Blackwell GPUs, liquid-cooled |
| GPUs per rack | 72 × B200 | |
| Serving nodes | 9 × 8-GPU | 8-GPU tensor-parallel group |
| IT power | 120 kW | 115 kW liquid + 5 kW air |
| PUE | 1.2 | Modern liquid-cooled facility |
| Facility power | 144 kW | 120 kW × 1.2 |
| Racks per GW | 6,944 | 1,000,000 kW ÷ 144 kW |
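The arithmetic behind the table can be reproduced in a few lines. All inputs are taken from the table. The per-gigawatt GPU and node counts at the end are our own derived figures, not values the table states.

```python
from decimal import Decimal

IT_POWER_KW = 120          # per rack, from the table
PUE = Decimal("1.2")       # from the table
GPUS_PER_RACK = 72         # from the table
GPUS_PER_NODE = 8          # from the table
KW_PER_GW = 1_000_000

facility_kw_per_rack = IT_POWER_KW * PUE              # IT power plus cooling overhead
racks_per_gw = int(KW_PER_GW // facility_kw_per_rack) # whole racks only
nodes_per_rack = GPUS_PER_RACK // GPUS_PER_NODE

# Derived figures (not in the table).
gpus_per_gw = racks_per_gw * GPUS_PER_RACK
nodes_per_gw = racks_per_gw * nodes_per_rack

print(f"Facility power per rack: {facility_kw_per_rack} kW")
print(f"Racks per GW: {racks_per_gw:,}")
print(f"Serving nodes per rack: {nodes_per_rack}")
print(f"GPUs per GW: {gpus_per_gw:,}")
print(f"Serving nodes per GW: {nodes_per_gw:,}")
```

Racks are rounded down because a partial rack cannot be powered, which is how 6,944.4 becomes the 6,944 in the table.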
It might be actual industry interest, or it might be the buying power of two corporate juggernauts’ marketing machines, but the new Starbucks app within OpenAI’s ChatGPT that lets you create a coffee based on “vibes” is generating a lot of media coverage.
For you, the interest is that the Starbucks app is an MCP App that lets customers access its products from inside the UI that more and more people are spending more and more time in. When in Rome…
We’re going to give you a quick overview of MCP Apps and help you decide if you should build one.
MCP is Model Context Protocol. It’s a standard announced, and open sourced, by Anthropic in November 2024 that provides a way to give AI access to APIs. It provides detailed descriptions of what API endpoints do and their expected data and return types. This is enough information for AI agents to understand how to request data, and the necessary information for the agent harness (ChatGPT, Claude Code, Cursor, Copilot, etc) to pass the data back and forth between the API and the agent.
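To make that concrete, here is a sketch of what a tool description and a call to it look like. The tool itself (`get_order_status`) is hypothetical, and the validator is a deliberately tiny subset of JSON Schema written for this example. The `name`/`description`/`inputSchema` fields and the `tools/call` method follow the MCP specification as we understand it, so check the current spec before relying on the details.

```python
import json

# Hypothetical tool definition in the shape MCP servers advertise:
# a name, a human-readable description, and a JSON Schema for the inputs.
TOOL = {
    "name": "get_order_status",
    "description": "Look up the current status of a customer order by its ID.",
    "inputSchema": {
        "type": "object",
        "properties": {
            "order_id": {"type": "string", "description": "Order identifier"},
        },
        "required": ["order_id"],
    },
}

JSON_TYPES = {"string": str, "integer": int, "boolean": bool, "object": dict}


def validate_arguments(tool: dict, arguments: dict) -> list[str]:
    """Tiny subset of JSON Schema checking: required keys and basic types."""
    schema = tool["inputSchema"]
    errors = [f"missing: {key}" for key in schema.get("required", [])
              if key not in arguments]
    for key, value in arguments.items():
        spec = schema["properties"].get(key)
        if spec is None:
            errors.append(f"unexpected: {key}")
        elif not isinstance(value, JSON_TYPES[spec["type"]]):
            errors.append(f"wrong type: {key}")
    return errors


def build_call(tool: dict, arguments: dict, request_id: int) -> str:
    """The JSON-RPC message a harness would send to invoke the tool."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool["name"], "arguments": arguments},
    })


print(validate_arguments(TOOL, {"order_id": "A-1001"}))
print(build_call(TOOL, {"order_id": "A-1001"}, request_id=1))
```

The description is what the agent reads to decide when to use the tool, and the schema is what lets the harness reject a malformed call before it ever reaches your API.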
MCP was an instant hit. Now you could connect AI to anything – databases, Stripe payments, coffee machines… Best practices in security and management followed and with the addition of proper authentication and permissioning, it has been embraced by enterprise. It provides the perfect interface for giving employees secure and monitored access to inhouse resources through the ubiquitous ChatGPT/Claude clients.
In October 2025 OpenAI announced their OpenAI Apps SDK. It was MCP with a UI added on. That UI is built and rendered like ordinary web content, and so can provide users with any kind of interface and feature you’d like.
In November 2025, the App extension to MCP was launched as a provisional addition to the standard. This wasn’t a competing standard. It was heavily aligned with OpenAI’s Apps SDK, and in January this year it became a part of the Model Context Protocol.
OpenAI’s Apps SDK still has its own proprietary elements. It lets your MCP App access local files, trigger checkout flows and use a bunch of other ChatGPT-specific APIs.
It is interesting that despite OpenAI’s Apps SDK providing checkout flows, the Starbucks MCP App opens their own website (or app if on a phone) to actually complete the order. This might be to avoid the inevitable fees that going through OpenAI’s payment gateway incurs. We can only wonder if this option will remain available or if they will pull an Apple and require all payments to go through their gateway so they can take their cut.
MCP Apps rely on the harness, whether it is ChatGPT, Claude, Claude Code, Copilot or something else. The harness is the middleman between everyone:
User ↔ Harness ↔ Agent
MCP App widget ↔ Harness ↔ Agent
MCP App widget ↔ Harness ↔ User
MCP server ↔ Harness ↔ Agent
MCP server ↔ Harness ↔ Widget
This is to ensure that MCP Apps are safe to use. The UI itself runs in a sandboxed iframe. It can only talk to the outside world, including the server that delivered it to the harness, via the harness. If it wants to load data, it calls a function in the harness. If it wants to provide the agent with fresh information for its context, it calls a function in the harness.
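The mediation described above can be sketched as a toy harness. Everything here is hypothetical: the class, the method names and the rule that a widget may only call tools on the server that delivered it are our own simplifications, not the actual MCP Apps API.

```python
class Harness:
    """Hypothetical harness: the only path between a sandboxed widget and anything else."""

    def __init__(self) -> None:
        self._servers = {}          # server name -> {tool name -> callable}
        self.agent_context = []     # updates widgets have pushed to the agent

    def register_server(self, name: str, tools: dict) -> None:
        self._servers[name] = tools

    def widget_call_tool(self, origin_server: str, tool: str, **arguments):
        """Assumed rule: a widget may only call tools on the server that delivered it."""
        tools = self._servers.get(origin_server, {})
        if tool not in tools:
            raise PermissionError(f"{tool!r} is not exposed by {origin_server!r}")
        return tools[tool](**arguments)

    def widget_update_context(self, origin_server: str, note: str) -> None:
        """The widget hands fresh information to the agent via the harness."""
        self.agent_context.append((origin_server, note))


harness = Harness()
harness.register_server("orders", {"list_orders": lambda status: [f"{status}-001"]})

# The widget loads data by asking the harness, never by calling the server directly.
print(harness.widget_call_tool("orders", "list_orders", status="open"))

# The widget tells the agent what the user is currently looking at.
harness.widget_update_context("orders", "user selected order open-001")
print(harness.agent_context)
```

Because every request passes through one choke point, the harness can log, rate-limit or refuse anything the widget asks for.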
Microsoft has some good examples of MCP Apps with complex UIs that allow the user to navigate information manually, while also asking the agent to take actions on the data.
The user can explore with the UI. The UI can send updates to the agent. The user can ask the agent to take actions based on what is happening in the UI. And if the MCP App provides the right tools, the agent can carry out actions that are reflected in the UI.
What an MCP App can do is limited by what tools (ie API access points) and data you expose to it.
Whether you should build one depends on your users and your product. Are your users heavy users of AI? Have you checked?
Is your product part of a broader process or workflow? Is it a “mission control” style system?
Does your product generate data of any kind that needs to be communicated outside of your product? Do you already have some export functionality?
Then your users might appreciate an MCP App.
The MCP group have all the documentation you need to build an MCP App. And they recommend you start by installing a skill in your AI coding agent and asking it to build it for you.
Of course, you need to give some thought and planning to authentication and permissions and payments and all the other complexities that make a product viable on top of how you are going to implement your UI.
You probably already have a UI. Can you retarget it to the MCP App interface easily?
One of the major shifts in building web-based products was the separation of apps into APIs and UIs. It meant your backend could drive an Android App as well as a website. Or an iPhone App. Or an AppleTV app.
MCP Apps are the next target for your APIs. And given the way AI is eating software the way software was supposed to eat the world, it might be the last target. At least the last target you need to write yourself.
Open Source Exploits And How To Protect Your Codebase From Them
Open source software has become the foundation that the Internet and contemporary software – from phone apps to SaaS – is built on. It’s a vast library of code created and maintained, mostly, by volunteers. It’s the world of software development’s greatest asset, and it is becoming its greatest risk.
Popular open source libraries can be used in millions of projects, even without the project owner’s knowledge. Open source libraries are built on open source libraries, which are often built on open source libraries. There’s a chain of dependency, and if any link is compromised, the attackers win.
What the attackers win is generally access to cryptocurrency, if you have any. It seems to be the motivating factor for lots of exploits, going by the payloads they install. But stealing credentials and taking over accounts to enable ransom bids on businesses is also on the cards.
We’re going to look at the two most recent open source exploits and how they were accomplished. Then we’ll give you the basic advice for staying safe while still being able to participate in and reap the advantages of open source software.
Axios is one of those foundational libraries that everyone uses because it simplifies common operations. Axios makes pulling data into browsers from servers more pleasant. It is part of the Node ecosystem, which means it’s part of the modern JavaScript/React/SaaS world. It’s used everywhere.
This exploit was performed through social engineering. The attackers cloned a real company – including deep fakes of individuals for video calls and a Slack workspace with channels containing chatter and links to LinkedIn posts. On March 30 this year they invited the maintainer in, then started a video call. The video call webpage announced it needed an update installed. And the maintainer allowed the update to run.
But it wasn’t an update. It was a RAT – a Remote Access Trojan. It grabbed his credentials and updated the Axios repository to include a new dependency – a third party library the attackers had developed and uploaded to the npmjs package repository, where every client of the Axios library would be able to download it.
The attackers’ library ran a script that downloaded another RAT to every machine that updated their Axios installation. This gave the attackers remote access and total control over those machines.
LiteLLM’s exploit was the result of an earlier successful exploit against Trivy, ironically a security scanner.
LiteLLM is a Python package rather than a Node package, and credentials accessed as a result of the Trivy compromise allowed attackers to access LiteLLM’s software publishing pipeline. This allowed them to add a credential stealer to the codebase on March 24 that would launch every time the library was accessed. The credential stealer grabbed everything from remote machine logins to cryptowallets.
LiteLLM makes it easy to connect to hundreds of AI models and providers. It is downloaded 3.4 million times a day (note – the bulk of these are automated downloads as part of testing and building software that uses LiteLLM). During the 46 minute period the exploit was live there were 46,996 downloads of the compromised software.
The exploit was found because it had a bug that resulted in any machine that downloaded it grinding to a halt within seconds. But there were still real consequences:
“AI hiring startup Mercor confirmed it was ‘one of thousands of companies’ affected by the LiteLLM supply-chain attack as the fallout from the Trivy compromise continues to spread. … The company’s admission follows claims by extortion crew Lapsus$ … that it stole 4 TB, including 939 GB of Mercor source code, plus other data, from the AI recruiting firm, and offered to sell the purloined files to the highest bidder.” — The Register, 2026-04-02
While there are bad elements out there working to take advantage of open source, they are outnumbered by the people working to make open source safer. The open source world is going through a transition period at the moment, mostly driven by AI changing what can be accomplished with software and how fast. But this works just as well for defense as it does for attack. Scanning for exploits and more secure practices for package sites is coming online.
In the meantime, while dedicated teams are working at detecting and blocking exploits as quickly as possible, there are basic steps you can take that will greatly reduce your exposure.
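One such step, offered here as our own illustration, is a cooldown on new releases: only adopt a version once it has been public for a while. The compromised LiteLLM release described above was live for 46 minutes, so even a short delay would have skipped it. The seven-day window, function names and version data below are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

MIN_RELEASE_AGE = timedelta(days=7)   # assumed policy; tune to your risk tolerance


def old_enough(published_at: datetime, now: datetime,
               min_age: timedelta = MIN_RELEASE_AGE) -> bool:
    """True once a release has been public long enough for problems to surface."""
    return now - published_at >= min_age


def newest_allowed(releases: dict[str, datetime], now: datetime) -> Optional[str]:
    """Pick the most recently published version that has passed the cooldown."""
    eligible = {version: ts for version, ts in releases.items() if old_enough(ts, now)}
    return max(eligible, key=eligible.get) if eligible else None


# Hypothetical release history for a dependency.
now = datetime(2026, 4, 10, tzinfo=timezone.utc)
releases = {
    "1.0.0": datetime(2026, 1, 5, tzinfo=timezone.utc),
    "1.1.0": datetime(2026, 3, 20, tzinfo=timezone.utc),
    "1.1.1": datetime(2026, 4, 9, tzinfo=timezone.utc),   # published yesterday
}
print(newest_allowed(releases, now))
```

Here the day-old 1.1.1 is passed over in favour of 1.1.0. A policy like this works best alongside pinning exact versions in a lockfile, so that upgrades happen when you choose rather than whenever a build runs.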
Software has never been easier to bring into existence. Sadly, this includes exploits as well as beneficial tools. Open source software is and will remain an essential resource for every software developer out there. This is why individuals and organisations are pouring resources and ingenuity into keeping it secure and safe, and they have the numbers on their side.
Even with the speed and power of AI coding agents no-one can expect to build and maintain every piece of the software stack their business relies on. But by moving a little bit slower and setting some sensible defaults for how third party software is incorporated into your codebase, you can reduce your risk to the minimum while still getting all the benefits participating in the open source ecosystem brings.
The Three Trends Shaping the Future of AI
When you look at how the frontier AI companies are talking and operating, the future looks like it will be filled with giant data centres within which all jobs are performed. It doesn’t appear to leave much for the rest of us.
Looking beyond OpenAI and Anthropic talking up their future prospects, there are some trends in software, hardware and AI that are bootstrapping off each other and pointing to a different direction things might go.
We’re going to look at these trends and show how they fit together and what it might mean, starting with software.
In May 2025 Claude Code became generally available. This agentic coding tool was one of the first tools to clearly demonstrate how capable models had become. Using the chat interface within a terminal (and the code open in an IDE), a developer could tell Claude to make some changes and then sit back and watch as the agent performed all the steps to find and read files, make edits and run tests. It might even do an online search if it needed more information.
The uptake of Claude Code was rapid, and subscriptions to it drove up Anthropic’s revenues. A whole new product category had been born. And it was a category that people were happy to pay for – unlike AI chat. Other model providers – Google, OpenAI – took notice and launched their own versions.
Anthropic continued to refine their models, training them to perform more consistently and diligently within the Claude Code “harness” and across the kinds of tasks developers needed completed.
In October of the same year Peter Steinberger was working on a tool to connect Claude Code to a WhatsApp chat so he could keep working from his phone when he was away from his computer. It was a simple middleman that ran commands from a WhatsApp chat in Claude Code and passed the results back to the chat.
Steinberger “discovered” that the Claude Code instance running back at home on his computer could use any command line tools it had available, and would even download and install tools if it needed them.
All the training Anthropic had performed to improve the coding workflow ability of the model had made it usable as a general agent.
Steinberger went on to build tools so Claude Code could access Google services, including mail and calendars, and the first “ClawdBot” was born. Three months later it was the fastest growing open source project in history, renamed to OpenClaw to avoid litigation, and Steinberger was hired by OpenAI.
Another new product category had been stumbled upon. And this one was not limited to developers. Ignoring all the security issues, people were using OpenClaw to run their businesses. People were installing OpenClaw for their parents. In China tech firms were having OpenClaw days where they helped consumers install it and get it connected to their services. OpenClaw as user lock-in.
Agents were no longer just for coders, they were for anyone. Agents were becoming the universal voice/chat interface to the complexities and drudgeries of the services everyone is forced to use.
Once a model is released, the provider continues to train it to improve performance. Training data is cumulative. The more training data you can collect over time (especially via free products where training on interactions is the price of using them) the more high quality data you can accumulate. Model providers have been collecting data from millions of users for a few years now.
At the same time, optimisation of model training is a global research focus in the field. As the data curation improves so do the techniques to make use of that data. The effects of this can be seen in the leap of ability in Claude Opus in late November 2025 and OpenAI’s GPT-5.2 a few weeks later. Developers reported both as the latest step change in model ability.
Post-training techniques and data curation are also benefitting open source models. These models are now only 3 to 6 months behind frontier models in performance and the intelligence gap between them has dropped to 5-7% according to the Artificial Analysis Intelligence Index.
Optimised training and curated data are resulting in small models that can be run on local hardware, like the Qwen3.5 9 Billion parameter model (released March 2026) that can beat GPT-4 on instruction following and reasoning.
The GPT-4-0613 model, though official numbers have never been released, was estimated to be a 1.7 Trillion parameter MOE model with 220 Billion parameters active at any one time. By total parameter count that is nearly 200x larger than Qwen3.5 9B.
Qwen3.5 also comes in 27B, 35B-A3B, 122B-A10B, 397B-A17B sizes, as well as smaller 0.8B, 2B, 4B parameter versions.
The 27B model is smarter than the MOE 35B-A3B, but the MOE model, due to having only 3 billion parameters active at a time, can be used in lower memory environments and respond faster.
There has also been a shift towards 4 bit parameters, both on the training side (via Nvidia NVFP4) and on the local model side (eg Unsloth’s Qwen3.5 4 bit quants). Normally model parameters are a 16 bit floating point number (eg 0.29847). This means a 9 billion parameter model would need 18 gigabytes (16 bits = 2 bytes, 2 bytes * 9 billion = 18GB) just to be loaded into memory. Quantised to 4 bits in a way that minimises loss of quality, it can be loaded into about 4.5GB of RAM.
The takeaway is that there exist smart, useful models that can match the resources available for compute.
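As a rough check on the arithmetic above, the weight footprint of a model is just its parameter count multiplied by the bytes used per parameter. The sketch below assumes that simple rule and ignores everything else that takes memory at runtime (context cache, activations, and any layers a real quantised file keeps at higher precision), so the numbers are best read as lower bounds. The function name is ours, for illustration only.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights alone, in GB (10^9 bytes)."""
    bytes_per_param = bits_per_param / 8
    # billions of parameters * bytes per parameter = billions of bytes = GB
    return params_billions * bytes_per_param


for bits in (16, 8, 4):
    print(f"9B parameters at {bits} bits: ~{weight_memory_gb(9, bits):.1f} GB")
```

The same function shows why mixture-of-experts sizes matter less for memory than for speed: a 35B-A3B model still has to hold all 35 billion parameters in memory, even though only 3 billion are active for each token.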
In 2017 Apple announced the A11 Bionic chip for the new iPhone X. It was a system-on-a-chip (SoC) that featured integrated CPU, memory, GPU, and a Neural Processing Unit (NPU). The NPU was used to power Face ID, computational photography (needed by the tiny camera), speech recognition and more.
Apple’s drive to bring more compute to the power-constrained environments of the iPhone and iPad translated nicely to their volume-constrained and slightly less power-constrained laptops. In November 2020 they launched the M1 – a laptop SoC that set new benchmarks in speed and efficiency, but was still just a higher-powered version of the chip powering the iPhone.
But in building a SoC that could support the iPhone’s best-in-class features, Apple ended up with an architecture that looks like it was designed for the AI boom. The memory tightly integrated with the processor instead of connected via copper traces running across circuit boards, the GPU and NPU cores – it all made running local AI models fast and efficient.
And because the memory architecture did not separate main memory from the memory available to the GPU and NPU, it meant M-series chips could run any model that could fit into the total memory available – up to 512GB on M3 Ultra chips.
Videos of developers running trillion parameter models on Apple hardware (Kimi K2.5 on M3 Ultra Mac Studios) made developers on PCs with 16 GB GPU cards take notice. So did CPU manufacturers. They have all started bringing out their own SoCs targeting local AI:
This move from designing and building general purpose CPUs to AI-optimised SoCs is a seismic shift in the marketplace. It’s bigger than PCs arriving with graphics capabilities or WiFi.
Nvidia is still a great believer in the RTX AI PC, which is simply the current PC+GPU setup. These are a step up in performance but come at a price, and that price currently limits the size of models that can be run on them.
Over the next few years, between hardware improvements and model intelligence increasing, most people’s needs for any kind of AI agent will be able to be met by a model running locally on their own machine.
Most people don’t build front ends for websites. They don’t need frontier coding models. And the people who do build front ends for websites: they have all the same administrative, scheduling and service navigation needs everyone else does.
Agents, the product that OpenAI, Anthropic and Nvidia want to sell to enterprise and everyone else, and what we expect to be the start of general AI usage spreading out into the world, will not be able to be kept locked behind a paywall.
While the giant data centres look like a bid to corner the supply of compute, and thus control the price of access to AI (not at all helped by Sam Altman stockpiling RAM), trends in small local model intelligence and improvements in the hardware they run on suggest choice and control might remain in the hands of individual users and businesses.
Is the Future of Software Pay-to-Win?
You can trace the birth of the agentic coding hype machine back to the “ralph loop” in mid-2025, but it really took off in November with the release of Anthropic’s Opus 4.5, followed by OpenAI’s release of GPT-5.2 two weeks later.
Developers started speaking of a qualitative difference in the models, an increase in intelligence and agency. Anthropic published articles about using Opus to write a compiler. Harness engineering replaced context engineering (which had replaced prompt engineering) as the new focus for getting the best performance out of coding agents.
People started talking about software factories, where agents churned away tirelessly implementing specs, and these specs were the new programming language: a step up in abstraction from coding, where you told the agent what to build and the agents turned your intentions into code.
But it appears that turning an 800 line document into 25,000 lines of code is a lot easier than turning 25,000 lines of code into a reliable, working application.
Developers, and companies, are reporting hitting the limits of agents and returning to a more hands-on approach, but still supported by agents. Research is showing agents can’t be trusted to maintain codebases and then there is the nature of the data agents are trained on.
Yet the software factory still has its proponents. Is this difference in approach a skill issue or a budget issue?
In March the Financial Times reported that Amazon was suffering from AI-related outages:
The online retail giant said there had been a “trend of incidents” in recent months, characterised by a “high blast radius” and “Gen-AI assisted changes” among other factors, according to a briefing note for the meeting seen by the FT.
Having such a high profile tech organisation, which is also a provider of AI services, call out these issues made developers commiserating on X realise they weren’t the only ones having problems with coding agents.
There is the counter-intuitive math that if an agent has a 95% chance of completing any step in a process correctly, then after about 13 steps you’re down to a coin flip if the agent is going to complete successfully at all.
Agents now execute tens or even hundreds of steps from a single prompt.
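The coin-flip figure is easy to verify. The sketch below assumes every step succeeds independently and with the same probability, which is a simplification (real agent steps are neither independent nor equally hard). The function names are illustrative.

```python
import math


def chain_success(p_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return p_step ** steps


def steps_to_coin_flip(p_step: float) -> float:
    """Number of steps at which the chance of an error-free run falls to 50%."""
    return math.log(0.5) / math.log(p_step)


print(f"Coin flip reached after ~{steps_to_coin_flip(0.95):.1f} steps")
for steps in (13, 14, 50, 100):
    print(f"{steps:>3} steps: {chain_success(0.95, steps):.1%} chance of no errors")
```

At 13 steps the chance of a clean run is about 51%, and at 14 it drops below half. Run the loop out to the tens or hundreds of steps agents now take and the chance of an error-free run under this simple model becomes very small, which is why validation between steps matters so much.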
This “accumulation of errors” showed up in recent research from Alibaba. They tested 18 AI coding agents on 100 codebases, each test running over 233 days. They all failed.
The benchmark the researchers created, SWE-CI, measures code maintenance rather than single code fixes and involves 71 commits based on accumulated changes to the codebase.
SWE-CI shows that long-running code maintenance is still brittle for all current models. Even the best model, Claude Opus 4.6, broke code in 1 out of 4 runs, while the worst models broke code in 3 out of 4 runs.
The explanations for failures in benchmarks and in actual products come down to the nature of agentic coding assistants and LLMs. One factor is the training of the LLMs. They are trained on billions of lines of code, mostly from public repositories from sources like GitHub.
Like everything, most code is mediocre. And a large portion is just bad – beginners’ projects, abandoned projects, early AI-generated slop, etc. Training based on public code is the latest version of computing’s “Garbage in = Garbage out” maxim.
LLMs have been trained to generate code that runs, but training them to write code that is well structured and maintainable, especially across complex projects, is harder. While typos in code are easy to detect and train away, qualitative and structural defects cannot be detected and thus cannot be trained away.
The other factor is that coding agents are doing much less reasoning than people imagine. This was demonstrated by a recent paper where frontier models failed to pass simple coding tests using languages that were functionally equivalent to popular languages like Python and Javascript, but whose presence in model training data would be orders of magnitude less.
Even with few-shot examples and in-context learning (i.e. providing documentation) the models failed to write even simple programs that a human developer would find easy to do with a novel language under the same circumstances.
The term “dark factory” comes from a Chinese manufacturing trend. Certain industries have reached a level of automation such that entire factories are populated only by robots and so don’t need to be illuminated unless humans are present for maintenance. Thus “dark factories”.
The dark software factory, brought to wider attention by StrongDM and Dan Shapiro, works on the same idea, but for agents. You build your software factory so humans aren’t present in the process. If you find you are needing to participate you stop and work out how you can get an agent to do it on your behalf.
The key is validation. Any harness you are using to drive your agents needs some way of testing the code being produced. If the code passes the tests you don’t care what it looks like. Worried about performance? Test it and reject it if it runs too slow or uses too many resources. The agent will keep trying until it passes.
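The shape of that validation loop can be sketched in a few lines. This is a minimal illustration, not StrongDM’s harness: `fake_agent` is a hypothetical stand-in for a real agent call, the single toy validator stands in for a full test suite, and `max_attempts` is our own assumption, added because in practice "keep trying" is bounded by a token budget.

```python
from typing import Callable, List, Optional

# A validator returns None when the candidate passes, or a message describing the failure.
Validator = Callable[[str], Optional[str]]


def run_until_valid(
    generate: Callable[[List[str]], str],
    validators: List[Validator],
    max_attempts: int = 5,
) -> Optional[str]:
    """Request candidates until one passes every validator or the budget runs out."""
    feedback: List[str] = []
    for _ in range(max_attempts):
        candidate = generate(feedback)
        feedback = []
        for check in validators:
            message = check(candidate)
            if message is not None:
                feedback.append(message)
        if not feedback:
            return candidate  # passed every check; nobody needs to read the code
    return None  # budget exhausted without a passing candidate


def fake_agent(feedback: List[str]) -> str:
    """Hypothetical stand-in for an agent: fixes its bug once it receives feedback."""
    if feedback:
        return "def add(a, b):\n    return a + b\n"
    return "def add(a, b):\n    return a - b\n"


def passes_tests(source: str) -> Optional[str]:
    """Toy validator: run the candidate and check one behaviour.
    A real harness would execute candidates in a sandbox, not with exec()."""
    namespace: dict = {}
    exec(source, namespace)
    return None if namespace["add"](2, 3) == 5 else "add(2, 3) should return 5"


accepted = run_until_valid(fake_agent, [passes_tests])
print(accepted)
```

Performance and resource limits fit the same pattern: each is just another validator that returns a failure message when the candidate is too slow or too hungry.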
StrongDM used their own dark factory methods to build their agents’ harness, which consists of digital twins of every major application their code interacts with:
He had a Google Spreadsheet open. Columns, rows, formatting – it looked exactly like Sheets. Except the URL bar said localhost…
Gsuite was not alone. Slack was there, Jira, Okta… all running locally? A digital twin of the entire enterprise SaaS universe, there on that desktop, faithful enough that the Python client libraries couldn’t tell the difference. Jay confirmed that it was in fact what it looked like; he built it himself. It took a couple of weeks. He used their Dark Factory. [source]
This follows their philosophy:
Code must not be written by humans
Code must not be reviewed by humans
And an important enabler of this philosophy is their mantra:
If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.
And this “$1,000 in tokens per day per engineer” could be what makes it work.
Alongside the dark factory approach is the day-to-day experience of developers. There are plenty of conversations happening online about the limits in agent coding ability that they are running into.
The consensus there is that agents are okay at coding, but terrible at software engineering. Software engineering being not just the “big picture” but the required consistency and discipline to keep software working and maintainable.
These developers are advocating working with smaller changes to the codebase, and even returning to the tab completion model popularised by Cursor in 2024.
Much like the SWE-CI benchmark showed models failing as multiple changes accumulated, “slop creep” can occur when using agents manually in a codebase. The code can still continue to work, and even pass tests, up to a point.
But without consistent human reviews it will deteriorate in quality until it finally fails and the agents themselves are unable to fix the errors they’ve created.
And when it fails, the codebase has often evolved to a state where no-one understands it and debugging it is manual, slow and painful.
This is the reality Amazon ran into.
Two camps in agentic coding practices are the dark software factory and the hands-on engineer. Both camps are using the same models. Both camps believe in testing and validation.
The dark software factory is declaring rapid delivery of software, while the hands-on engineer is seeing modest gains in productivity.
The dark software factory is spending “$1,000 on tokens today per human engineer” while the hands-on engineer is spending $200/month on a Claude Code or OpenAI Codex subscription.
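To put those two budgets on the same scale, here is the back-of-envelope arithmetic. The 21 working days per month is our assumption, not a figure from either camp.

```python
WORKING_DAYS_PER_MONTH = 21  # assumption for illustration

factory_per_day = 1_000        # dollars in tokens per engineer per day (dark factory mantra)
subscription_per_month = 200   # dollars per engineer per month (coding agent subscription)

factory_per_month = factory_per_day * WORKING_DAYS_PER_MONTH
ratio = factory_per_month / subscription_per_month

print(f"Dark factory: ${factory_per_month:,} per engineer per month")
print(f"Hands-on:     ${subscription_per_month:,} per engineer per month")
print(f"Ratio:        {ratio:.0f}x")
```

Under that assumption the dark factory is spending roughly a hundred times more per engineer, which is the gap the "pay-to-win" question turns on.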
What will the future of software look like under these two regimes? Will one dominate in the long run? Will they each have their niche? Can you spend your way to competitiveness in the software market?
At SoftwareSeni we lean more towards “hands-on engineer”. But we are always watching how software development is evolving and taking on the best practices as they become clear.
If you’d like to chat about software development or building businesses around software get in touch.
BMAD Method – Turning Vibe Coding Into Software Engineering
When Andrej Karpathy came up with the term vibe coding he was talking about coding that involved instructing an AI and never looking at the code.

As you can see, that tweet had 5.1 million views at one point. It was responsible for opening people’s eyes to how capable AI had become at coding and set off an inrush into app development.
No surprise, but there were thousands of people who wanted to build apps but didn’t know how to code. It felt like their time had finally come. They could forget no-code/low-code tools and get AI to write the code for them.
But the challenges of vibe coding quickly became apparent.

The issues with vibe coding were immediately apparent to developers. But the fact that AI was getting code 80% complete, and for simple things 100%, meant there was promise.
The goal became not to get AI to write entire apps, but to write as much of an app as possible.
The failures in vibe coding can be divided into model ability and information availability.
AI is now quite good at understanding code and writing code. The problem is context: the amount of information that an AI can successfully work with.
This includes instructions, any files it needs, results of its own thinking, research, and the file changes it makes as it works.
AI doesn’t know what a project is about unless that information has been added to its context.
It doesn’t know what database schema it should be following unless that information has been added to its context.
It doesn’t know anything about the API endpoints it should be using unless that information is in its context.
There is so much knowledge in a software developer’s head for even the smallest project. And it’s always more information than can fit in an AI’s context.
This drove the focus on “context engineering” – getting the right information into the AI’s context, and setting tasks that could be completed within the effective length of the AI’s context, as performance noticeably reduced as the context lengthened.
The BMAD Method, started and open sourced by senior engineer Brian Madison (thus the BMAD) and now developed by a team of contributors, is currently one of the most effective approaches to dealing with the challenges of AI-assisted software development.
It works by automating the generation of the detailed, modular documentation an AI needs to be effective across 4 main areas of development:
This is handled by treating the process as a series of workflows instead of one ongoing conversation. The workflows are handled by “agents”, which are focused prompts that target a particular outcome, such as a PRD, a system design doc, a task list for coding, a file of code, or the results of a test run.
Developing software becomes working with the agents at each stage of the process to specify, record and review the documentation at each stage of development.
The power of this approach is that the amount of documentation an AI needs is immense compared to what a human software developer requires, but you can use AI to create it. Or recreate it if new constraints or issues arise as development progresses.
And an AI agent is naturally relentless and unceasing in its requests for the required information to create all the necessary documents. It won’t take shortcuts or skip steps unless ordered to.
BMAD is by design modular to avoid the problems of overflowing context, and robust to interrupted workflows and restarts. Each module/workflow is designed to load just the documentation it needs, and the documentation they produce is designed to be as concise as possible while still serving its purpose. Where documents are long, BMAD can shard them so only the necessary portions need to be loaded into the AI’s context.
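To make the sharding idea concrete, here is a minimal sketch of splitting a long document by heading and loading only the sections a task needs. It is not BMAD’s implementation: the function names, the choice of level-2 markdown headings as shard boundaries, and the sample document are all our assumptions.

```python
import re
from typing import Dict, List


def shard_by_heading(markdown: str) -> Dict[str, str]:
    """Split a markdown document into sections keyed by their level-2 heading."""
    shards: Dict[str, str] = {}
    title = "preamble"
    lines: List[str] = []
    for line in markdown.splitlines():
        match = re.match(r"^## +(.*)$", line)
        if match:
            text = "\n".join(lines).strip()
            if text:
                shards[title] = text
            title = match.group(1).strip()
            lines = []
        else:
            lines.append(line)
    text = "\n".join(lines).strip()
    if text:
        shards[title] = text
    return shards


def load_context(shards: Dict[str, str], wanted: List[str]) -> str:
    """Build a context string from only the requested sections."""
    return "\n\n".join(f"## {name}\n{shards[name]}" for name in wanted)


doc = "# PRD\n\n## Auth\nUsers log in with email.\n\n## Billing\nMonthly invoices.\n"
shards = shard_by_heading(doc)
print(load_context(shards, ["Billing"]))
```

A coding task that only touches billing then loads one short section instead of the whole PRD, leaving more of the context window for the code itself.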
When you kick off the BMAD Method it will hold your hand every step of the way. It is designed to guide you through decision making based on standard software development planning, design and execution practices. You can “vibe code” it and get it to make all decisions for you (it even has a #YOLO mode) and never look at any of the documentation it produces. But that won’t work.
The devil is in the details, and software development is all details, and most of those details start out in your or your PM’s head. Or in your codebase templates and runbooks.
If you use BMAD Method diligently, and provide it with your templates and runbooks (at the appropriate point), and get it to research the answers you’re not sure of and answer the questions you are sure of and review the documents it produces and make it fix any issues you find, then it will work much better.
But it isn’t easy work. Even with an AI to ask questions and turn answers into documents, going from an idea for a software product to a full set of design, technical, and implementation documents is mentally challenging.
Normally this work is split across multiple people, each with specialised skill sets and knowledge. They have meetings. They cover whiteboards with sticky notes in different colours. BMAD Method can be run by a single person sitting at a laptop. But we find this doesn’t give the best results.
You want the relevant experts involved at each stage. AI has shifted the burden in software development from production to review, even for documentation. Review is where errors are spotted. When AI can work autonomously for hours, generating or changing hundreds of files, you want to catch all errors as early as you can. And it’s the experts that are best at this.
Once you’ve completed documentation with the BMAD Method, including generating epics and stories for agile development, you eventually reach the actual code generation.
In BMAD Method v6, currently in alpha but the recommended version to work with, they have integrated the lightweight, AI-friendly beads issue tracker into their code generation phases.
beads was created by Steve Yegge, ex-Amazon, ex-Google and a well known blogger in tech circles. He developed it as an antidote to what he called “inter-session amnesia”.
“Inter-session amnesia” is another side effect of the limited context that AI has to work with. They have limited memory and that memory is empty every time you start a new session. And if you fill up the context of a coding agent, the current strategy for most tools (eg Claude Code, OpenAI Codex, Google Gemini) is to “compact” the context by removing some items and summarising others. Most developers find this results in poor performance and the recommended practice is to never let a task reach the point of triggering compaction and to always start with an empty context.
This empty context means at the start of a task the AI needs to be told what to do. Using the beads issue tracker, the AI can be instructed to add issues to it, and to query it for any outstanding issues that need to be worked on.
Coupled with BMAD Method’s documentation of epics and stories to guide implementation, beads enables agentic coding assistants like Claude Code to run for longer and accomplish more.
Given instructions to log errors and other problems that occur, and given the tools to address them (linters, debuggers and test harnesses), this combination of BMAD Method and beads can greatly increase the amount of working, tested code agents can produce.
Of course the quality depends on the documentation (including QA requirements), and it still needs to be reviewed. But the reviews are less about “does this code work” and more about “does this code fulfill requirements”.
Vibe coding was only coined in February 2025.
Agentic coding has only been a “thing” since March 2025.
Spec driven development, of which BMAD Method is a flexible and modular implementation, has only been around since May 2025.
All these changes have happened in tandem with increasing model capabilities and the continued experimentation of developers trying to get the most out of them.
It may be that in 6 months there is a new paradigm for software development. At SoftwareSeni we will be ready to move on from BMAD Method when something better comes along.
But for now, we’re seeing how far and how fast we can take the BMAD Method to drive our projects forward.