What's the right unit to bill on — tokens, requests, or conversations?

Tokens are the most defensible unit because they're what the LLM provider charges you. Requests are coarser and easier to game. Conversations conflate length and complexity. Best practice: bill in tokens, mark up by a margin you can defend, and show customers the breakdown so they trust the math.

How do I handle prompt caching in the bill?

Track input tokens, output tokens, cache-read tokens, and cache-write tokens as separate dimensions. Bill each at its provider rate. Anthropic's prompt cache reads are 10x cheaper than fresh input tokens; you can pass that savings through to your customer or keep it as margin — your call.

What about reasoning / thinking tokens?

Bill them separately too. Anthropic's `cache_creation_input_tokens` and `cache_read_input_tokens`, plus `output_tokens` and (for o-series / Claude with extended thinking) `reasoning_tokens` are all separate billing lines from the provider. Mirror that structure in your meter.

Deep dive

Per-tenant LLM cost accounting: meter, attribute, charge

If you're charging customers for an AI product, you need per-tenant token tracking. Here's the data model, the rate-card pattern, and the integration with Stripe Billing or your invoicing of choice.

By Dipankar Sarkar May 22, 2026 4 min read View raw .md

billing
LLM cost
per-tenant
Stripe
metering

If you’re charging customers for an AI product, the question “what does this customer cost me?” needs a precise answer. Per-tenant LLM cost accounting is the boring infrastructure that makes that precision possible.

This article: the data model, the rate-card pattern, the integration paths to invoicing systems.

What you need to capture

Every LLM call should write a billing row with:

timestamp — when the call happened.
tenant_id — which tenant made it.
model — which model (claude-opus-4-7, gpt-4o, etc.).
input_tokens — fresh input tokens (un-cached).
cache_read_tokens — input tokens served from prompt cache (cheap).
cache_write_tokens — input tokens written to the prompt cache (expensive, one-time).
output_tokens — completion tokens.
reasoning_tokens — extended-thinking / o-series reasoning tokens.
tool_calls — count of tool invocations triggered in this turn.
request_id — provider’s ID for the call (audit trail).
cost_usd — computed at write time, snapshotted against the rate card in effect.

The cost is computed at write time and stored, not recomputed on read. This is critical for audit: if you change your rate card in July, June’s reports still reflect June’s prices.

The rate card

The rate card is a YAML map from (model, token-class) → USD-per-1M-tokens. Each LLM provider you use needs a row; reasoning and cache classes are separate from base rates.

billing:
  currency: USD
  rate_card:
    "claude-opus-4-7":
      input:        15.00
      output:       75.00
      cache_read:    1.50
      cache_write:  18.75
      reasoning:    75.00
    "claude-sonnet-4-6":
      input:         3.00
      output:       15.00
      cache_read:    0.30
      cache_write:   3.75
    "gpt-4o":
      input:         2.50
      output:       10.00
    "gpt-5":
      input:         5.00
      output:       25.00

The rate card lives in the gateway’s admin-only config. Tenant overlays can’t override it — otherwise a tenant could quietly make their own bills cheaper.

Margin on top

You’re billing customers; you need a margin. Two patterns:

Markup at meter time: multiply the provider rate by 1.3x (or whatever) when computing cost_usd. The customer sees an aggregated cost that includes your margin; provider rate is invisible to them.

Pass-through + flat fee: bill exactly the provider rate, add a flat per-tenant per-month platform fee. Customers like the transparency; you take less variance risk.

The hybrid (5–15% markup + small platform fee) is common.

Quotas — the hard stop

Three knobs per tenant:

Tokens per day — sum of input + output + reasoning, reset at UTC midnight.
Cost per day (USD) — rate-card-driven, reset at UTC midnight.
Requests per minute — sliding window.

Exceed any and the gateway returns 429 Too Many Requests with Retry-After: <seconds>. This is the difference between “a runaway tenant blows up your AWS bill” and “a runaway tenant gets throttled at 5 minutes past midnight.”

Quotas live in the tenant’s admin-only config; the tenant can request an increase but can’t set their own.

Reports

A single CSV row per tenant per day, with the rate-card-snapshotted cost:

date,tenant,model,tokens_in,tokens_out,tokens_cached,reasoning_tokens,tool_calls,cost_usd
2026-06-03,acme,claude-opus-4-7,142500,38200,12300,5400,42,4.78
2026-06-03,acme,claude-sonnet-4-6,891200,201400,84300,0,128,5.92
2026-06-03,globex,claude-sonnet-4-6,512300,98400,32100,0,67,3.41

Aggregate to month-end for invoicing. The CSV columns are the contract — keep them stable so customer downstream pipelines don’t break when you add a new dimension.

Stripe Billing integration

If you’re invoicing via Stripe:

Create a Stripe Product per offering (“OpenClawMU Pro”, “OpenClawMU Team”).
Create a usage-based Price linked to a metered usage record.
Nightly cron: for each tenant, sum the day’s cost_usd, post to Stripe via usageRecords.create({ subscription_item, quantity: cost_cents, timestamp }).
Stripe handles the invoice and the credit card charge at month-end.

The mapping tenant → Stripe subscription lives in your customer database; the gateway only knows tenant IDs and dollar amounts.

// nightly cron, in your invoicing service
import Stripe from "stripe";
import { openclaw } from "./openclaw-client";

const stripe = new Stripe(process.env.STRIPE_KEY!);

for (const customer of await db.customers.findAll()) {
  const usage = await openclaw.billing.report(customer.tenantId, "yesterday");
  const costCents = Math.round(usage.total_cost_usd * 100);
  if (costCents > 0) {
    await stripe.subscriptionItems.createUsageRecord(
      customer.stripeSubItemId,
      { quantity: costCents, timestamp: usage.date_unix }
    );
  }
}

Showing customers their usage

Customers want to see what they’re paying for. Build a simple dashboard backed by the same CSV / API:

Daily tokens (input vs output vs cached, stacked).
Daily cost (with model breakdown).
Top 10 sessions by cost.
Quota progress (current period).

OpenClawMU’s control-plane API exposes GET /v1/tenants/<id>/usage?period=current-month returning the same data the CSV does, JSON-formatted. Front-end query, render charts, ship.

Audit-friendly trail

Every billing row carries the rate-card snapshot. If a customer disputes a charge, you can reproduce the math:

“On 2026-06-03 you used 142,500 input tokens of claude-opus-4-7 at $15/M = $2.138. Plus 38,200 output tokens at $75/M = $2.865. Plus 12,300 cached at $1.50/M = $0.018. Plus 5,400 reasoning at $75/M = $0.405. Plus our 10% margin = $5.97. Charge: $5.97.”

Customers respect that level of transparency. Plus your audit team likes it.

What not to do

Don’t sample. Recording one in ten LLM calls and extrapolating is a lawsuit waiting to happen.
Don’t bill on requests instead of tokens. Requests are a coarser unit; tokens are what your provider charges you for.
Don’t change the rate card retroactively. If you raise prices, snapshot the new rate going forward; historical reports stay at the old rate.
Don’t bury margin in opaque per-message fees. Customers find out, and they don’t appreciate it.
Don’t share LLM keys across tenants without a meter. Even if it’s “cheaper” in the short term, you’ve lost the cost attribution.

Build vs adopt

You can build all of this in a sprint. The interesting work is your invoicing UX, your customer dashboard, your pricing strategy. The plumbing (meter every LLM call, snapshot a rate card, emit a CSV) is identical across every multi-tenant AI gateway. OpenClawMU ships it; LiteLLM has a thinner version; you can absolutely roll your own.

The point is: don’t ship the product before you ship the meter. Customers without bills become very expensive customers very quickly.

Frequently asked

What's the right unit to bill on — tokens, requests, or conversations?: Tokens are the most defensible unit because they're what the LLM provider charges you. Requests are coarser and easier to game. Conversations conflate length and complexity. Best practice: bill in tokens, mark up by a margin you can defend, and show customers the breakdown so they trust the math.
How do I handle prompt caching in the bill?: Track input tokens, output tokens, cache-read tokens, and cache-write tokens as separate dimensions. Bill each at its provider rate. Anthropic's prompt cache reads are 10x cheaper than fresh input tokens; you can pass that savings through to your customer or keep it as margin — your call.
What about reasoning / thinking tokens?: Bill them separately too. Anthropic's `cache_creation_input_tokens` and `cache_read_input_tokens`, plus `output_tokens` and (for o-series / Claude with extended thinking) `reasoning_tokens` are all separate billing lines from the provider. Mirror that structure in your meter.