Generative AI | Technology
Sep 29, 2025

Claude Cut Token Quotas In August – Will AI Coding Costs Keep Rising?

AUTHOR

James Wondrasek

From prompt boxes to pull requests

Twelve months ago, most teams were nudging GitHub Copilot for suggestions inside an editor. Now, agent-first tools pick up a ticket, inspect a repo, run tests, and open a pull request. GitHub’s own coding agent can be assigned issues and will work in the background, creating branches, updating a PR checklist, and tagging you for review when it’s done. The company has been rolling out improvements through mid-2025 as part of a public preview.

Third-party platforms are pushing the same end-to-end loop. Factory’s documentation describes “agent-driven development,” where the agent gathers context, plans, implements, validates, and submits a reviewable change. The pitch is not a smarter autocomplete; it’s a teammate that runs the inner loop.

This shift explains why a consumer-style subscription can’t last. In late July, Anthropic said a small slice of users were running Claude Code essentially nonstop and announced weekly rate limits across Pro and Max plans, effective August 28, 2025. Tech coverage noted cases where a $200/month plan drove “tens of thousands” of dollars in backend usage, and the company framed the change as protecting access from 24/7 agent runs and account sharing. 

The message is simple: agents turn spiky, prompt-driven sessions into continuous workloads. Prices and quotas are following suit.

Why provider pricing is moving

For the last two years, flat fees and generous quotas have worked as a growth hack. But compute spend dominates model economics. Andreessen Horowitz called the boom compute-bound and described supply as constrained relative to demand. In this environment, heavy users on flat plans are a direct liability. Once agents enter the mix, metering becomes a necessity.

That also changes how vendors justify price. If a workflow replaces part of a developer’s output, pricing gravitates toward a share of value rather than a token meter. The recent quota shift around Claude Code is one of the first visible steps in that direction. 

Why prices won’t float without limit

Open-source models put a ceiling on what providers can charge. DeepSeek-Coder-V2 reports GPT-4-Turbo-level results on code-specific benchmarks and expands language coverage and context length substantially over prior versions. Other models, like Qwen3-235B-A22B-Instruct-2507, GLM 4.5, and Kimi K2, show strong results across language, reasoning, and coding, with open-weight variants that teams can run privately. These are not perfect substitutes for every task, but they’re increasingly serviceable for everyday work.

Local serving stacks have also improved, but hardware remains expensive, particularly hardware with the memory and bandwidth needed to serve the largest open-weight models. The trend towards Mixture of Experts (MoE) models is reducing per-token compute requirements, while smaller models (70B parameters and below) are rapidly improving. Together, these trends make it realistic to move a large share of routine inference off the cloud meter.
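To make that concrete, here is a back-of-envelope sketch, assuming 4-bit quantisation and ignoring KV cache and runtime overhead: weights alone put a 235B-class model beyond a single prosumer GPU, while a 30B-class model fits comfortably.

```python
# Rough VRAM needed just to hold model weights. Illustrative
# assumptions only: 4-bit quantisation, no KV cache or overhead.
def weight_vram_gb(total_params_billions: float, bits_per_weight: float = 4.0) -> float:
    bytes_total = total_params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# Note MoE models still hold *all* experts in memory, even though only
# the active subset of parameters runs for each token.
for name, params in [("235B MoE", 235), ("70B dense", 70), ("30B MoE", 30)]:
    print(f"{name:>9} @ 4-bit: ~{weight_vram_gb(params):.0f} GB of weights")
```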

The part everyone shares: a finite pool of compute

The bigger constraint isn’t just what developers will pay, it’s what the market will pay for inference across all industries. As agent use spreads to finance, operations, legal, customer support, and back-office work, demand converges on the same GPU fleets. 

Analyses continue to describe access to compute, at a workable cost, as a primary bottleneck for AI products. In that setting, the price developers see will drift towards the price the most profitable agent workloads are prepared to pay. 

McKinsey’s superagency framing captures the shift inside companies: instead of a person asking for a summary, a system monitors inboxes, schedules meetings, updates the CRM, drafts follow-ups, and triggers next actions. That turns interactive usage into base-load compute.

There’s also a directional signal on agent capability. METR measured the length of tasks agents can complete with 50% reliability and found that “task horizon” has roughly doubled every seven months over several years. As tasks stretch from minutes to days, agents don’t just spike usage; they run continuously in the background, consuming compute. 
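A toy extrapolation shows why that compounds. Assuming, purely for illustration, that agents handle roughly one-hour tasks today and the seven-month doubling holds:

```python
# METR's trend: the task length agents finish at 50% reliability has
# roughly doubled every 7 months. The starting horizon is an assumption.
DOUBLING_MONTHS = 7
horizon_minutes = 60  # assume: ~1-hour tasks today, for illustration

for months_ahead in (0, 12, 24, 36):
    h = horizon_minutes * 2 ** (months_ahead / DOUBLING_MONTHS)
    print(f"+{months_ahead:>2} months: ~{h / 60:.1f}-hour tasks")
```

On those assumptions, three years out the 50%-reliability horizon sits around multi-day tasks, which is exactly the base-load profile described above.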

A clearer way to think about the next three years

In the near term, expect more quota changes, metering, and tier differentiation for agent-grade features. The Copilot coding agent’s rollout is a good reference point: it runs asynchronously in a cloud environment, opens PRs, and iterates via review comments. That’s not a coding assistant, that’s a service with an API bill.

As the market matures, usage will bifurcate. Long-horizon or compliance-sensitive work will sit on premium cloud agents where reliability and integrations matter. Routine or privacy-sensitive tasks will shift to local stacks on prosumer hardware. Most teams will mix the two, routing by difficulty and risk. The ingredients are already there: competitive open models, faster local runtimes, and agent frameworks that run in IDEs, terminals, CI, or headless modes. (arXiv, vLLM Blog, NVIDIA Developer, Continue Documentation)

Over the longer run, per-token costs will likely keep falling, while total spend rises as agents become part of normal operations—much like cloud spending grew even as VM prices dropped. The economics track outcomes, not tokens.

What to do now

First, stabilise access. If you rely on a proprietary provider for agent workflows and you have the scale, consider negotiating multi-year terms. Investigate alternative models like DeepSeek, GLM, and Kimi, and the third-party inference providers that serve them (e.g. via OpenRouter). The Claude Code decision shows consumer-style plans can change at short notice.
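OpenRouter in particular speaks the OpenAI-compatible chat API, so trialling an alternative model is mostly a change of base URL and model ID. A minimal sketch; the model ID here is one example from a catalogue that changes often:

```python
# Point a standard OpenAI-style client at OpenRouter to reach
# DeepSeek, GLM, Kimi and other models behind one API key.
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="sk-or-...",  # your OpenRouter key
)

resp = client.chat.completions.create(
    model="deepseek/deepseek-chat",  # example ID; check the catalogue
    messages=[{"role": "user", "content": "Refactor this function to be pure."}],
)
print(resp.choices[0].message.content)
```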

Second, stand up local inference servers. A single box with a modern GPU (or two, or four) will run strong open models like Qwen3-Coder-30B-A3B-Instruct. Measure what percentage of your usual tasks it handles before escalating to a frontier model.
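Most local stacks (vLLM, Ollama, llama.cpp's server) expose the same OpenAI-compatible endpoint, so the client side stays identical to the cloud case. A sketch, assuming a vLLM server started with `vllm serve Qwen/Qwen3-Coder-30B-A3B-Instruct` on its default port:

```python
# Query a local OpenAI-compatible server; no cloud meter involved.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

resp = local.chat.completions.create(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
    messages=[{"role": "user", "content": "Write a unit test for parse_config()."}],
)
print(resp.choices[0].message.content)
```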

Third, wire in routing. Tools like Continue.dev make it straightforward to default to local models and switch to the big providers only when needed.
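If you want the same local-first behaviour outside the IDE, the pattern is a few lines against any pair of OpenAI-compatible endpoints. A hedged sketch; the endpoints, model names, and escalation rule are all placeholders to adapt:

```python
# Local-first routing with frontier fallback. Escalate explicitly for
# hard tasks, or implicitly when the local box fails or times out.
from openai import OpenAI

local = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
frontier = OpenAI()  # reads OPENAI_API_KEY; swap in your provider

def complete(messages, escalate: bool = False) -> str:
    if not escalate:
        try:
            r = local.chat.completions.create(
                model="Qwen/Qwen3-Coder-30B-A3B-Instruct",
                messages=messages,
                timeout=30,
            )
            return r.choices[0].message.content
        except Exception:
            pass  # local server down or overloaded: fall through
    r = frontier.chat.completions.create(model="gpt-4o", messages=messages)
    return r.choices[0].message.content

print(complete([{"role": "user", "content": "Rename variable x to total."}]))
```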

Other tools, like aider, let you split coding between architect and editor models: a paid frontier model handles planning and architecture, while a local model (served through an OpenAI-compatible proxy such as LiteLLM) makes the actual code changes.

Finally, measure outcomes. Track bugs fixed, PRs merged, and lead time. That’s what should drive your escalation choices and budget approvals, not tokens burned.
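Even a small script over your Git host's export is enough to start. A minimal sketch; the records below are made-up stand-ins for whatever your API returns:

```python
# Track PR lead time (open to merge) as an outcome metric.
from datetime import datetime
from statistics import median

merged_prs = [  # assumed shape; substitute your real export
    {"opened": "2025-09-01T09:00", "merged": "2025-09-01T15:30"},
    {"opened": "2025-09-02T10:00", "merged": "2025-09-04T11:00"},
    {"opened": "2025-09-03T08:00", "merged": "2025-09-03T17:45"},
]

lead_hours = [
    (datetime.fromisoformat(p["merged"]) - datetime.fromisoformat(p["opened"]))
    .total_seconds() / 3600
    for p in merged_prs
]
print(f"median lead time: {median(lead_hours):.1f} h over {len(lead_hours)} PRs")
```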

Closing thought

Two things are happening at once. Providers are moving from growth pricing to profit. Open models and local runtimes are getting good enough to cap runaway bills. And over the top of both sits market-wide demand from agents in every function, all drawing from the same pool of compute.

Teams that treat agents as persistent services, secure predictable access, and run a hybrid local/cloud approach will keep costs inside a band as prices move. Teams that depend on unlimited plans will keep running into the same quota notices that landed last month.
