Imagine discovering a $130,000 monthly LLM bill with no alert having fired. It happens. In products without per-tenant token governance, bills compound invisibly until finance emails you a very direct question.
Token attribution — recording every LLM token consumed against the specific tenant, user, and task that generated it — is operational control infrastructure. The difference between a multi-tenant product with healthy gross margins and one spiralling into deficit is almost always whether this was in place before the first paying customer shipped.
Cost attribution must be built in before you scale, not retrofitted after scale has already eroded margin. Cost governance sits within the broader observability discipline, but it warrants its own treatment. Here’s what follows: four-layer token accounting, prompt cache accuracy, per-tenant attribution, and kill-switch enforcement.
How does a $130,000 monthly bill appear without a single alert firing?
Teams ship the first agent feature, defer attribution to “once we have traffic,” then spend a quarter retroactively joining logs to customer records trying to work out why gross margin ticked down four points. By then, the pricing conversation with the runaway customer is already awkward.
Three failure patterns come up again and again.
Averages hide distributions. Reporting cost-per-customer as an average papers over a long tail where 3% of tenants consume 60% of tokens. Aggregate gross margin looks fine. Individual tenant economics are collapsing underneath it.
Retry and tool-call loops go uncounted. Agents looping through failed tool calls burn hundreds of prompt-cached reads. If your telemetry only counts the user-facing request, the actual burn rate is invisible to you.
Free tiers are under-instrumented. Treat free-tier traffic as a lump sum and it can silently become the largest single line item on the provider invoice.
You wouldn’t ship a web service without rate limiting. Token cost governance belongs in the same category. Monitoring doesn’t stop the bill — enforcement does. And enforcement requires attribution.
What are the four token layers your current billing dashboard is almost certainly collapsing into one?
Provider invoices return a combined input token count. Most dashboards multiply total tokens by list price and report a number that is both wrong and uninformative. The right approach is what we’d call Four-Layer Token Accounting — four distinct categories per span, each with different cost characteristics and different optimisation levers.
Layer 1 — Prompt Layer (system prompt + user input + few-shot examples): The largest fixed cost per request. Track separately to spot prompt bloat over time.
Layer 2 — Tool Layer (tool schemas + tool-call results): The most common source of silent bloat in agent harnesses. An agent with 15 tools can inject several thousand tokens before a single word of user content is included.
Layer 3 — Memory Layer (RAG chunks + conversation history + scratchpad): Scales with session length. Track separately so RAG retrieval budgets can be tuned independently.
Layer 4 — Response Layer (completion tokens including reasoning): Priced 4–5× input tokens on most providers. The highest-unit-cost layer — and reducing reasoning depth on lower-stakes tasks gives the fastest cost reduction.
Think about a B2B inbox-triage agent running across 40 tenants at 81% gross margin — but six of those 40 tenants are consuming 57% of total token spend. Without four-layer attribution, the product team optimises the wrong prompt. With it, the account team has an upgrade list and engineering has a clear fix priority.
Track the four layers as separate counters per span at the harness layer — before packing the prompt, not after the API returns its combined total.
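A minimal sketch of what per-span counters can look like at the harness layer. The class name, field names, and the token figures below are illustrative, not a reference implementation:

```python
from dataclasses import dataclass

@dataclass
class SpanTokenCounters:
    """Illustrative per-span counters for four-layer token accounting."""
    prompt_tokens: int = 0    # Layer 1: system prompt + user input + few-shot
    tool_tokens: int = 0      # Layer 2: tool schemas + tool-call results
    memory_tokens: int = 0    # Layer 3: RAG chunks + history + scratchpad
    response_tokens: int = 0  # Layer 4: completion, including reasoning

    @property
    def input_tokens(self) -> int:
        # What the provider later reports as a single combined input figure
        return self.prompt_tokens + self.tool_tokens + self.memory_tokens

# Counters are incremented while packing the prompt, before the API call,
# so each layer stays attributable even though the provider returns one total.
span = SpanTokenCounters()
span.prompt_tokens += 1_200   # system prompt + user message
span.tool_tokens += 3_400     # e.g. 15 tool schemas injected by the harness
span.memory_tokens += 2_100   # retrieved RAG chunks
```

Incrementing at prompt-packing time is the design choice that matters: once the provider has collapsed everything into one input count, the per-layer split is unrecoverable.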
Why does prompt caching make your cost attribution wrong by 35–50% on agentic workloads?
The problem is naive cost attribution: sum all input tokens, multiply by list input price, report the number. On workloads with high cache hit rates, this over-reports spend by 35–50%.
On Anthropic Claude, cached-read tokens price at 10% of list input price. At an 80% cache hit rate the effective per-token input cost drops to 28% of list (0.8 × 0.10 + 0.2 × 1.0 = 0.28). The discount varies by provider:
- Anthropic Claude — Standard input: 1× list price. Cached read: 0.10×. Cached write: 1.25×.
- OpenAI GPT-4o — Standard input: 1× list price. Cached read: 0.50×. Cached write: 1.00×.
If you invoice under pass-through pricing with naive token counting, you’re over-charging clients. Migrate from OpenAI to Anthropic without updating your pricing model and it gets worse — Anthropic’s 10% cached-read rate versus OpenAI’s 50% means the correction factor is five times larger on the same workload.
The fix is straightforward: ingest cached-read and cached-write counters as separate values, price each at its provider-specific rate, and derive cost at query time from a pricing table. Pricing changes. Counters do not.
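Here is what that fix can look like as a sketch. The pricing table uses the multipliers from the list above; the dollar figures per million tokens are illustrative placeholders, not current list prices:

```python
# Illustrative pricing table: USD per 1M standard input tokens, with
# cache-tier multipliers from the provider comparison above.
PRICING = {
    "anthropic": {"input": 3.00, "cached_read": 0.10, "cached_write": 1.25},
    "openai":    {"input": 2.50, "cached_read": 0.50, "cached_write": 1.00},
}

def input_cost_usd(provider: str, uncached: int, cached_read: int,
                   cached_write: int) -> float:
    """Derive cost at query time from raw counters and a pricing table."""
    p = PRICING[provider]
    per_token = p["input"] / 1_000_000
    return (uncached * per_token
            + cached_read * per_token * p["cached_read"]
            + cached_write * per_token * p["cached_write"])

# 1M input tokens at an 80% cache hit rate on Anthropic tiering:
naive = 3.00                                          # all tokens at list
actual = input_cost_usd("anthropic", 200_000, 800_000, 0)  # 0.84
# 0.84 / 3.00 = 28% of list, matching the effective-rate figure above.
```

Because the function takes raw counters, a provider price change is a one-line edit to `PRICING` rather than a backfill of stored cost figures.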
A healthy agentic workload shows 70% cache hit rate or better. Below 40% usually means prompt drift or a cache TTL issue — fix those before optimising anything else.
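Those thresholds can be expressed as a trivial health check (function names are hypothetical; the 70% and 40% cut-offs come from the guidance above):

```python
def cache_hit_rate(cached_read: int, uncached: int) -> float:
    """Fraction of input tokens served from the prompt cache."""
    total = cached_read + uncached
    return cached_read / total if total else 0.0

def cache_health(cached_read: int, uncached: int) -> str:
    rate = cache_hit_rate(cached_read, uncached)
    if rate >= 0.70:
        return "healthy"
    if rate < 0.40:
        return "investigate"  # likely prompt drift or a cache TTL issue
    return "acceptable"
```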
How do you attribute token costs per tenant, per user, and per task at instrumentation time?
Attribution is only honest when tags are attached at request creation time. Retroactive tagging from logs misses tool-call sub-spans, retry chains, and background agent tasks — the difference between billing data you can defend in a contract dispute and billing data you cannot.
Three dimensions cover almost every business question you’ll need to answer.
Per-tenant answers “Is this customer profitable?” It’s the foundation of unit economics and renewal conversations. Every agent span must carry a tenant identifier — no exceptions.
Per-user answers “Who inside this account is driving consumption?” Essential for seat-priced products. Use a stable user identifier from your auth layer, not a session identifier that rotates.
Per-task answers “Which product surfaces are expensive?” Tag every agent run with a task identifier and route — inbox-triage, summary, report-gen. Prioritise optimisation against hard dollars, not feature instinct.
The standard hook here is the OpenTelemetry gen_ai.* semantic conventions, which standardise token attribution. Extend the standard token counters with your own dimensions for tenant, user, and task. Emit raw counters, not pre-computed cost figures — counters are immutable; pricing tables are not.
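A sketch of the resulting attribute set for one request span. The `gen_ai.usage.*` keys follow the OTel GenAI semantic conventions; the `app.*` namespace and the helper function are our own illustrative extension, not part of the standard:

```python
def span_attributes(tenant_id: str, user_id: str, task: str,
                    input_tokens: int, output_tokens: int) -> dict:
    """Build the attribute set attached at request creation time."""
    return {
        "gen_ai.usage.input_tokens": input_tokens,    # raw counter, not cost
        "gen_ai.usage.output_tokens": output_tokens,  # raw counter, not cost
        "app.tenant_id": tenant_id,  # mandatory on every agent span
        "app.user_id": user_id,      # stable auth-layer id, not a session id
        "app.task": task,            # e.g. "inbox-triage", "report-gen"
    }

attrs = span_attributes("acme-corp", "u_1842", "inbox-triage", 6_700, 420)
# In practice these land on the active OTel span, e.g.:
#   with tracer.start_as_current_span("llm.request") as span:
#       for key, value in attrs.items():
#           span.set_attribute(key, value)
```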
What is a spend cap and kill switch, and why is it the highest-leverage control you can add this week?
A spend cap is a daily token budget ceiling per tenant. A kill switch is the enforcement action that fires when that ceiling is crossed: rate-limit tightening, model downgrade, cached response serving, or hard pause. The spend cap is the policy. The kill switch is the mechanism.
Without enforcement, a single runaway tenant consuming 90% of your token budget is a mathematical certainty. Monitoring gives you the numbers. It does not stop the bill.
The three-tier enforcement stack works like this:
Tier 1 — Rate limits per tenant per minute. Set at 2–3× expected peak. Surface a clear HTTP 429 with a Retry-After header rather than silently queuing.
Tier 2 — Daily spend caps per tenant. Set at 1.5–3× contracted usage ceiling. On cap crossing: rate-limit tightening plus an alert to account management.
Tier 3 — Kill switch on spend z-score greater than 4. Auto-pauses the tenant and pages on-call. False-positive rate is low with a seven-day baseline.
When enforcement fires, graceful degradation beats a hard failure. Route to a cheaper model at 80% of cap, serve cached responses for read-heavy queries, and return a quota error with upgrade context when quality cannot be compromised. A clear quota error is a conversion event. A hard cut-off is a support ticket.
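The three tiers plus the 80%-of-cap degradation step can be sketched as one decision function. The thresholds come from the tiers above; the function shape, tenant baselines, and action names are illustrative assumptions:

```python
import statistics

def enforcement_action(today_spend: float, baseline: list[float],
                       daily_cap: float) -> str:
    """Pick an enforcement tier from today's spend vs a 7-day baseline.
    `baseline` is the tenant's last seven days of daily spend (USD)."""
    mean = statistics.mean(baseline)
    stdev = statistics.stdev(baseline)
    z = (today_spend - mean) / stdev if stdev else 0.0

    if z > 4:                           # Tier 3: auto-pause, page on-call
        return "kill_switch"
    if today_spend >= daily_cap:        # Tier 2: tighten limits, alert AM
        return "tighten_rate_limit"
    if today_spend >= 0.8 * daily_cap:  # degrade gracefully before the cap
        return "downgrade_model"
    return "allow"

baseline = [40.0, 42.0, 39.0, 41.0, 43.0, 40.0, 41.0]
```

Note the ordering: the z-score check runs first, so a statistically anomalous spike pauses the tenant even when it is still under the contractual cap.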
Why does retrofitting cost governance after you hit scale cost three times what building it in does?
Three compounding costs arrive when you defer.
Engineering cost. Re-instrumenting a production agent harness means touching every request path and updating span schemas across all downstream consumers. That’s a multi-sprint project on a mature codebase. It’s a two-day investment at birth.
Business cost. Every month without attribution means pricing and renewal decisions made on aggregates that hide distributions. Slow margin erosion is only visible in retrospect.
Trust cost. The first conversation with a runaway tenant you cannot explain with data is the hardest pricing conversation. Clients who receive incorrect invoices once rarely trust the billing model again.
The retrofitting paradox is this: attribution data is only valuable proportional to its history. Start in month six and you cannot answer “was this client profitable in month one?” That matters for pricing validation and contract renewals.
Here is what “built in” looks like before the first paying tenant ships:
- Instrument the four token layers as separate counters per span at the harness layer
- Attach tenant, user, and task identifiers to every LLM request at creation time
- Configure a daily spend cap per tenant with automated rate-limit tightening on cap crossing
- Set up a daily alert for per-tenant spend versus 7-day rolling baseline
- Ingest cached-read and cached-write token counters separately with provider-specific pricing applied
Two days of work. For tooling, open-source platforms such as Uptrace and Langfuse, which both support self-hosted cost monitoring, are the practical option for a 50–500 person company — no licensing cost.
Cost governance belongs to the production AI maturity framework as a foundation-layer control — in place before the first paying customer, reviewed quarterly, and connected to the renewal conversations that decide whether your LLM product is actually a business.
Frequently Asked Questions
What is token attribution in a multi-tenant LLM product?
It’s the process of recording every LLM token consumed — broken into prompt, tool, memory, and response layers — against the specific tenant, user, and task that generated it. The goal is to know the true cost of serving each customer, not the average across all of them.
Why can’t I just use my provider’s billing dashboard?
Provider dashboards show total spend, not per-tenant or per-task spend. They also apply a blended rate that ignores cache tier pricing. Useful for reconciliation, but not a substitute for in-product attribution.
What is the difference between a spend cap and a kill switch?
A spend cap is a daily token budget ceiling per tenant. A kill switch is the enforcement action when that ceiling is crossed. Policy vs mechanism. Both need to be in place before your first paying tenant ships.
How do I account for prompt caching in my cost attribution?
Ingest cached-read and cached-write counters separately. Price each at its provider-specific rate — Anthropic at 10% of list, OpenAI GPT-4o at 50%. Summing all input tokens at list price over-reports spend by 35–50% on cache-heavy workloads.
How do I set the right level for a per-tenant daily spend cap?
1.5–3× the tenant’s contracted usage ceiling. Below 1.5× generates too many false positives; above 3× leaves too much exposure. For bundled pricing, use 90th-percentile spend over a 30-day baseline. Review quarterly.
Is graceful degradation always better than a hard cut-off?
Yes, for most workloads. A cheaper model fallback or cached response serving preserves user experience while protecting margin. Reserve the hard quota error for workloads where quality cannot be compromised. Hard cut-off is a support ticket. Clear quota error with upgrade context is a conversion event.
How does this apply if I’m using both Anthropic and OpenAI?
Run separate counters and pricing calculations for each provider, then aggregate. Don’t blend them — cache tier differences (10% vs 50% for cached reads) mean blended rates will be wrong for every cache-hitting workload.
Who should own LLM cost governance?
Finance sees the bill; engineering controls the dials. Practically, it’s whoever sets the per-tenant spend caps. If that person hasn’t been named yet, the CTO should claim it — the margin consequence of not owning it is a surprise on the monthly invoice.
Do I need a paid observability platform?
No. Attribution instrumentation uses OpenTelemetry, which is free and open source. Uptrace and Langfuse both provide self-hosted cost dashboards at no licensing cost.
What is the minimum viable cost governance setup?
Five elements: (1) four token layers as separate counters per span; (2) tenant, user, and task identifiers on every request at creation time; (3) daily spend cap per tenant with automated rate-limit tightening; (4) daily alert for per-tenant spend versus 7-day baseline; (5) cached-read and cached-write counters with provider-specific pricing. One to two days of instrumentation work.