What is a multi-tenant AI gateway? The architecture, explained.

A multi-tenant AI gateway is the layer between your messaging channels and your LLM that isolates per-customer state. Here's how the pattern works, why it matters now, and what a defensible implementation looks like.

By Dipankar Sarkar, June 3, 2026

multi-tenant
AI gateway
architecture
infrastructure
LLM

A multi-tenant AI gateway is the layer that sits between your messaging channels (WhatsApp, Slack, your in-app chat widget) and your LLM provider, and that gives each customer their own isolated workspace inside a single deployment. Sessions, memory, sandboxes, channel pairings, cron jobs, and cost accounting are all scoped to a tenant boundary.

This article walks through the architecture: what a multi-tenant AI gateway actually does, what the alternatives are, why the pattern matters now, and what a defensible implementation looks like.

The problem the pattern solves

You’re building an AI product. The MVP works: one user, one chat, one assistant. To productize you need to support N customers from the same deployment. Each customer has their own conversation history, their own preferences, possibly their own LLM key, their own scheduled jobs, their own channel pairing (their WhatsApp number, their Slack workspace).

You can grow into this from three directions, none of them great:

Spin up a new server per customer. Works for 5 customers, breaks at 50.
Add tenant_id to every table and pray your code path doesn’t forget it. This is the classic mistake — it survives until the day the cache key, the cron schedule, or the file path doesn’t carry the ID and a customer sees another customer’s data.
Buy a closed-source bot platform. Solves it, but they take a margin forever and your customer data lives in their database.

The multi-tenant AI gateway is the fourth option: one process, structural isolation at the directory/sandbox boundary, no per-customer server cost, no platform fee.

What gets isolated

A defensible multi-tenant AI gateway isolates everything that bears state.

Sessions. Each tenant has its own chat-session store. A session ID is meaningful within a tenant but never across tenants.
Memory. Vector embeddings + content store, per tenant. Sharing memory across tenants would leak whatever one tenant’s users told the assistant to every other tenant.
Plugins / skills. Tenant A can install a custom tool without Tenant B seeing it.
Sandboxes. When the agent executes code, it does so in a sandbox whose root is the tenant’s directory. No cross-tenant filesystem access.
Cron jobs. Scheduled “remind me on Friday” tasks belong to the tenant that created them.
Channel credentials. Tenant A’s WhatsApp pairing and Tenant B’s Slack OAuth are stored separately, encrypted at rest.
Devices and nodes. Paired clients for distributed setups, per tenant.
Config overlay. Each tenant has a YAML overlay that can override model choice, max tokens, system prompt — but cannot override admin-only keys (API credentials, rate cards).

What’s shared across tenants is everything stateless: the Node process, the LLM HTTP client (calls tagged with tenant ID for metering), the channel adapter classes, the dispatcher logic.
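The config overlay rule can be sketched as a merge that skips admin-only keys. This is a minimal illustration, not the layout of any particular gateway: the key names (apiKey, rateCard) and the function applyTenantOverlay are hypothetical, and the overlay is assumed to have been parsed from YAML into a plain object already.

```typescript
// Hypothetical config shape: a flat key/value map parsed from YAML elsewhere.
type Config = Record<string, unknown>;

// Keys only the operator may set. These names are assumptions standing in
// for "API credentials, rate cards".
const ADMIN_ONLY_KEYS: ReadonlySet<string> = new Set(["apiKey", "rateCard"]);

// Merges a tenant overlay onto the base config, refusing admin-only keys.
function applyTenantOverlay(
  base: Config,
  overlay: Config,
): { config: Config; rejected: string[] } {
  const config: Config = { ...base };
  const rejected: string[] = [];
  for (const [key, value] of Object.entries(overlay)) {
    if (ADMIN_ONLY_KEYS.has(key)) {
      rejected.push(key); // keep the admin value; report the attempt to the caller
      continue;
    }
    config[key] = value;
  }
  return { config, rejected };
}

// Usage: the tenant may change the model but not the API key.
const baseConfig: Config = { model: "model-a", maxTokens: 1024, apiKey: "admin-secret" };
const tenantOverlay: Config = { model: "model-b", apiKey: "tenant-attempt" };
const merged = applyTenantOverlay(baseConfig, tenantOverlay);
console.log(merged.config["model"], merged.rejected);
```

Returning the rejected keys, rather than dropping them silently, gives the caller something to write to an audit log.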

The shape of the tenant boundary

The cleanest implementation makes the tenant ID structural — the root of the filesystem tree, the prefix of the auth token, the dimension of every billing row. Treating the tenant as a namespace rather than a column eliminates whole categories of bugs.

data/
├── tenants/
│   ├── acme/
│   │   ├── sessions/
│   │   ├── memory/
│   │   ├── plugins/
│   │   ├── sandbox/
│   │   ├── cron/
│   │   ├── channels/
│   │   └── config.yaml
│   ├── globex/
│   │   └── (same layout)
│   └── initech/
│       └── (same layout)
└── gateway.log

If every storage API accepts only a tenant-rooted path type, and the only way to construct one requires a tenant ID, then forgetting the tenant becomes a compile-time type error rather than a runtime data leak.
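One way to get that type error in TypeScript is a branded path type. The sketch below is an assumption about how this could be done, not a description of an existing gateway: TenantPath, tenantPath, describeSessionFile and the tenant ID pattern are all illustrative names and choices. It only covers construction; traversal checks on untrusted input are a separate concern.

```typescript
import * as path from "node:path";

// Branded string: a plain string is not assignable to TenantPath, so callers
// must go through tenantPath(), which requires a tenant ID.
type TenantPath = string & { readonly __brand: "TenantPath" };

// Assumed data root, following the directory tree shown in this section.
const DATA_ROOT = "data/tenants";

function tenantPath(tenantId: string, ...segments: string[]): TenantPath {
  // Assumed ID format: lowercase letters, digits and hyphens only.
  if (!/^[a-z0-9-]+$/.test(tenantId)) {
    throw new Error(`invalid tenant id: ${tenantId}`);
  }
  return path.posix.join(DATA_ROOT, tenantId, ...segments) as TenantPath;
}

// Storage helpers accept TenantPath only.
function describeSessionFile(file: TenantPath): string {
  return `session file at ${file}`;
}

const ok = describeSessionFile(tenantPath("acme", "sessions", "s1.json"));
console.log(ok);

// The next line would not compile, because a raw string is not a TenantPath:
// describeSessionFile("data/tenants/acme/sessions/s1.json");
```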

The token model

Each tenant gets a token. The token authenticates every inbound request — JSON-RPC, the OpenAI-compatible HTTP shim, channel webhooks, terminal WebSocket. Three properties matter:

Hashed at rest. Store SHA-256(token), never the plaintext. A gateway-disk compromise should not leak live tokens.
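Constant-time compared. Hash the presented token and compare the digests with crypto.timingSafeEqual, which requires equal-length buffers. A comparison that short-circuits on the first mismatched byte leaks timing information that an attacker can use to guess a token incrementally.
Tenant-prefixed. Token format tk_<tenant_id>_<32 hex chars> lets logs record only the tk_<tenant_id> prefix, so you can grep by tenant without ever writing the secret half to a log.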

Rotation should be a single command — tenants token rotate acme — and instant. There’s no recovery if you lose a token; rotate to issue a new one.
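The three properties and rotation fit in a few lines of Node's crypto module. This is a sketch under stated assumptions: the in-memory Map stands in for whatever store the gateway uses, the function names rotateToken and authenticate are hypothetical, and tenant IDs are assumed to contain only lowercase letters, digits and hyphens so the token can be split unambiguously.

```typescript
import { createHash, randomBytes, timingSafeEqual } from "node:crypto";

// In-memory stand-in for the token store: tenant ID -> SHA-256 digest of the live token.
const tokenHashes = new Map<string, Buffer>();

function sha256(value: string): Buffer {
  return createHash("sha256").update(value, "utf8").digest();
}

// Issues a new token and replaces any previous one.
// The plaintext is returned once and never stored.
function rotateToken(tenantId: string): string {
  const token = `tk_${tenantId}_${randomBytes(16).toString("hex")}`; // 16 bytes = 32 hex chars
  tokenHashes.set(tenantId, sha256(token));
  return token;
}

// Returns the tenant ID for a valid token, or null.
function authenticate(token: string): string | null {
  const match = /^tk_([a-z0-9-]+)_([0-9a-f]{32})$/.exec(token);
  const tenantId = match?.[1];
  if (!tenantId) return null;
  const stored = tokenHashes.get(tenantId);
  if (!stored) return null;
  // Both buffers are 32-byte digests, so timingSafeEqual cannot throw on a length mismatch.
  return timingSafeEqual(stored, sha256(token)) ? tenantId : null;
}

const first = rotateToken("acme");
console.log(authenticate(first)); // tenant ID while the token is live
const second = rotateToken("acme"); // rotation invalidates `first` immediately
console.log(authenticate(first), authenticate(second));
```

Because only the digest is kept, a lost token cannot be recovered, which matches the rotate-to-reissue rule above.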

Path-traversal protection

Every API that takes a path string is an opportunity for one tenant to escape into another. The defense is uniform:

Resolve the path with path.resolve against the tenant root.
Assert the resolved path is a descendant of the tenant root.
Reject symlinks that point outside the root.
Reject absolute paths in user input.

Apply this to: file_read / file_write tools, plugin loaders, sandbox mount specs, S3 backup target keys, S3 restore source keys, config overlay file paths. It’s repetitive code; it’s the most important repetitive code in the system.
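The four checks can live in one helper that every path-taking API calls. The sketch below is illustrative: resolveInTenant, assertInside and lexists are hypothetical names, and it assumes a POSIX-style filesystem. Note that path.resolve only normalizes the string, so symlinks are checked separately with fs.realpathSync. The check is also not atomic with the later file operation, so a real implementation would still need to consider races.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Throws unless `target` is `root` itself or a descendant of it.
function assertInside(root: string, target: string): void {
  const rel = path.relative(root, target);
  if (rel === ".." || rel.startsWith(".." + path.sep) || path.isAbsolute(rel)) {
    throw new Error(`path escapes tenant root: ${target}`);
  }
}

// True if something exists at p, including a dangling symlink.
function lexists(p: string): boolean {
  try {
    fs.lstatSync(p);
    return true;
  } catch {
    return false;
  }
}

// Resolves untrusted input inside tenantRoot, or throws.
function resolveInTenant(tenantRoot: string, userPath: string): string {
  if (path.isAbsolute(userPath)) {
    throw new Error("absolute paths are not allowed");
  }
  const root = fs.realpathSync(tenantRoot); // canonical root; must already exist
  const candidate = path.resolve(root, userPath);
  assertInside(root, candidate); // rejects ../ sequences

  // Follow symlinks on the deepest existing ancestor, so a symlinked parent
  // cannot redirect a not-yet-created file outside the root. A dangling
  // symlink makes realpathSync throw, which also rejects the path.
  let existing = candidate;
  while (!lexists(existing)) existing = path.dirname(existing);
  assertInside(root, fs.realpathSync(existing));
  return candidate;
}

// Usage against a throwaway directory tree.
const base = fs.mkdtempSync(path.join(os.tmpdir(), "gw-"));
const acmeRoot = path.join(base, "acme");
fs.mkdirSync(path.join(acmeRoot, "sessions"), { recursive: true });
fs.mkdirSync(path.join(base, "globex"));
console.log(resolveInTenant(acmeRoot, "sessions/s1.json"));
// resolveInTenant(acmeRoot, "../globex/config.yaml") would throw.
```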

Per-tenant cost accounting

If you’re charging customers, you need to know what each customer cost you. The gateway records a billing row for every LLM call: tenant, model, input tokens, output tokens, cached tokens, reasoning tokens, timestamp, rate-card snapshot.

The rate-card snapshot is critical — if you change your pricing later, historical reports still reflect what was billed at the time. Audit-friendly.

date,tenant,model,tokens_in,tokens_out,cost_usd
2026-06-03,acme,claude-opus-4-7,142500,38200,4.78
2026-06-03,acme,claude-sonnet-4-6,891200,201400,5.92

Pipe the CSV into a usage-based billing API (for example, Stripe Billing’s usage reporting) on a cron and you have the basis for automatic invoicing for your customers.
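A billing row with a rate-card snapshot can be sketched as follows. The interfaces, the function names and the prices are all assumptions made for illustration; the prices are placeholders, not real provider rates, and cached and reasoning tokens are left out to keep the example short.

```typescript
// Prices in USD per million tokens. Placeholder values only.
interface RateCard {
  inputPerMTok: number;
  outputPerMTok: number;
}

interface BillingRow {
  timestamp: string; // ISO 8601, UTC
  tenant: string;
  model: string;
  tokensIn: number;
  tokensOut: number;
  rateCard: RateCard; // snapshot of the prices in force when the call was made
  costUsd: number;
}

function recordCall(
  tenant: string,
  model: string,
  tokensIn: number,
  tokensOut: number,
  card: RateCard,
  now: Date,
): BillingRow {
  const costUsd =
    (tokensIn * card.inputPerMTok + tokensOut * card.outputPerMTok) / 1_000_000;
  // Copy the card so later price changes cannot alter this row.
  return { timestamp: now.toISOString(), tenant, model, tokensIn, tokensOut, rateCard: { ...card }, costUsd };
}

// Same column order as the CSV sample in this section.
function toCsvLine(row: BillingRow): string {
  return [
    row.timestamp.slice(0, 10),
    row.tenant,
    row.model,
    row.tokensIn,
    row.tokensOut,
    row.costUsd.toFixed(2),
  ].join(",");
}

const liveCard: RateCard = { inputPerMTok: 2, outputPerMTok: 10 };
const row = recordCall("acme", "example-model", 1_000_000, 100_000, liveCard, new Date("2026-06-03T12:00:00Z"));
liveCard.inputPerMTok = 4; // a later price change leaves the recorded row untouched
console.log(toCsvLine(row)); // 2026-06-03,acme,example-model,1000000,100000,3.00
```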

Quota enforcement

Three knobs, hard-stop semantics:

Tokens per day. Sum of input + output + reasoning, reset at UTC midnight.
Cost per day (USD). Rate-card-driven, reset at UTC midnight.
Requests per minute. Sliding-window count of inbound calls.

Exceed any quota and the gateway returns 429 Too Many Requests with a Retry-After header: the seconds until the next UTC midnight for the daily quotas, or until the oldest request leaves the window for the per-minute quota. Quotas are the difference between “a runaway tenant blows up your AWS bill” and “a runaway tenant gets throttled and pages you to investigate.”
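A hard-stop check for two of the three knobs might look like the sketch below. It is an in-memory illustration with hypothetical names (checkRequest, recordTokens, QuotaLimits); the cost-per-day quota is omitted because it would follow the same shape as the token quota, and a real gateway would need persistent, concurrency-safe counters.

```typescript
interface QuotaLimits {
  tokensPerDay: number;
  requestsPerMinute: number;
}

interface TenantUsage {
  day: string; // UTC date, YYYY-MM-DD
  tokensToday: number;
  requestTimes: number[]; // epoch ms of requests inside the last minute
}

type Decision =
  | { allowed: true }
  | { allowed: false; status: 429; retryAfterSeconds: number };

const usageByTenant = new Map<string, TenantUsage>();

function secondsUntilUtcMidnight(nowMs: number): number {
  const d = new Date(nowMs);
  const next = Date.UTC(d.getUTCFullYear(), d.getUTCMonth(), d.getUTCDate() + 1);
  return Math.ceil((next - nowMs) / 1000);
}

function checkRequest(tenant: string, limits: QuotaLimits, nowMs: number): Decision {
  const day = new Date(nowMs).toISOString().slice(0, 10);
  let usage = usageByTenant.get(tenant);
  if (!usage || usage.day !== day) {
    // New UTC day: reset the daily counter, keep the sliding window.
    usage = { day, tokensToday: 0, requestTimes: usage?.requestTimes ?? [] };
    usageByTenant.set(tenant, usage);
  }
  if (usage.tokensToday >= limits.tokensPerDay) {
    return { allowed: false, status: 429, retryAfterSeconds: secondsUntilUtcMidnight(nowMs) };
  }
  // Sliding window: drop timestamps older than 60 seconds, then count.
  usage.requestTimes = usage.requestTimes.filter((t) => nowMs - t < 60_000);
  if (usage.requestTimes.length >= limits.requestsPerMinute) {
    const oldest = usage.requestTimes[0] ?? nowMs;
    return { allowed: false, status: 429, retryAfterSeconds: Math.ceil((oldest + 60_000 - nowMs) / 1000) };
  }
  usage.requestTimes.push(nowMs);
  return { allowed: true };
}

// Called after each LLM call with input + output + reasoning tokens.
function recordTokens(tenant: string, tokens: number): void {
  const usage = usageByTenant.get(tenant);
  if (usage) usage.tokensToday += tokens;
}
```

Passing the current time in as nowMs, instead of calling Date.now() inside, keeps the function deterministic and easy to test.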

Why now

The pattern matters now because the underlying ingredients have become cheap and mature. LLM API costs are falling by roughly 4x per year. Channel SDKs (Baileys for WhatsApp, grammY for Telegram, Bolt for Slack) are mature. Sandboxing primitives (bubblewrap, Docker, gVisor) are battle-tested. The hard part used to be the LLM; now the hard part is the multi-tenant glue.

The teams that ship this glue cleanly will eat the bot-platform incumbents. The teams that don’t will end up paying the bot-platform margin or building a fragile in-house version.

What good looks like

A defensible multi-tenant AI gateway has:

Structural tenant isolation — per-tenant directories, not per-row columns.
Hashed token auth with constant-time comparison and rotation.
Sandboxed code execution per tool call, with the sandbox rooted in the tenant directory.
Path-traversal protection on every path-taking API.
Admin / tenant key separation — config overlay cannot override credentials.
Per-tenant quotas and cost accounting with rate-card snapshotting.
Backup/restore for tenant portability.
A web terminal for operator and tenant access (because someone always needs a shell).
An audit log capturing every state-changing operation.

Build it yourself in 6–8 weeks, or adopt one and ship the product on top. Either way, the multi-tenant AI gateway is the layer you need.

Frequently asked

What's the difference between a multi-tenant AI gateway and an LLM proxy like LiteLLM?

An LLM proxy multiplexes calls to different model providers and adds rate limiting / caching at the model API layer. A multi-tenant AI gateway adds the tenant boundary on top: isolated session/memory/sandbox state per tenant, plus channel adapters (WhatsApp, Slack, etc.), plus per-tenant cost accounting. The proxy is one layer; the gateway is a stack.

Can't I just put tenant_id in every query?

You can, and many teams start there. The problem is enforcing it everywhere: every path-taking API, every cache key, every cron schedule, every plugin install, every backup file. Multi-tenant gateways promote the tenant ID from 'an extra column' to 'the root of the directory tree' so the enforcement is structural, not vigilance.

Is multi-tenant the same as multi-account?

No. 'Account' usually means a billing entity with one or many users inside it. 'Tenant' in this article means an isolation boundary: sessions, memory, sandboxes, channels all separated. One account can map to one tenant (typical) or to many (e.g., parent org with project sub-tenants).