SaaS · AI · Architecture

Multi-Tenant SaaS Architecture for an AI Reporting Engine at 48K Req/Min

March 5, 2026 · 13 min read · By Yogendra Singh

Enterprise multi-tenancy sounds like a deployment detail until you're on a 2am call explaining to Tenant A why their reports are slow because Tenant B's finance team decided to regenerate a year of quarterly data at month-end. Noisy-neighbor problems are not edge cases in a shared SaaS platform — they're the defining architectural challenge. Everything in the design of this reporting engine was shaped by a single constraint: one tenant's workload cannot degrade another's, full stop.

The platform served 6 enterprise tenants — each with distinct data schemas, SLA tiers, and user counts — at a combined throughput of 48,000 requests per minute, sustaining 99.9% uptime over the trailing year. The engine combined conventional SQL reporting with AI-generated summaries and anomaly narratives powered by an LLM. That AI layer added a new class of multi-tenancy problems that purely relational systems don't face: prompt management, token budget allocation per tenant, and model cost attribution.

Tenant isolation: the three-layer model

We enforced tenant isolation at three independent layers, each covering a different failure mode. The first layer is data isolation: every tenant's data lives in a separate PostgreSQL schema within a shared cluster. Schema-per-tenant gives us namespace-level separation (each tenant has its own tables, rather than sharing tables behind row-level security policies) without the operational cost of a database-per-tenant. Cross-tenant data access does not happen by default: a query would have to explicitly set `search_path` to another tenant's schema or schema-qualify a table name, which the application never does. A nightly audit job verifies that no foreign keys or views cross schema boundaries.
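A minimal sketch of how a session could be pinned to a single tenant's schema. The `tenant_<slug>` naming convention and the function name are assumptions for illustration, not details of the system described here.

```python
import re

# Hypothetical convention: each tenant's schema is named "tenant_<slug>".
_SLUG = re.compile(r"[a-z][a-z0-9_]{0,40}")


def search_path_sql(tenant_slug: str) -> str:
    """Build the SET statement that pins a session to one tenant's schema.

    The slug is checked against a strict allow-list pattern because SQL
    identifiers cannot be passed as bound query parameters.
    """
    if not _SLUG.fullmatch(tenant_slug):
        raise ValueError(f"invalid tenant slug: {tenant_slug!r}")
    # Only the tenant's own schema is listed, so unqualified table names
    # cannot resolve to another tenant's tables.
    return f'SET search_path TO "tenant_{tenant_slug}"'
```

Running the returned statement once per checked-out connection, before any report query, keeps unqualified table names inside the tenant's own schema.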

The second layer is compute isolation: each tenant has a dedicated ECS Fargate task group with explicit CPU and memory limits. This is the key noisy-neighbor control. A tenant that generates a burst of report requests gets more tasks in their group (up to their configured autoscaling ceiling), but cannot consume resources from another tenant's task group. The autoscaling ceiling is part of the tenant's SLA contract — a Premium tenant gets a higher ceiling than a Standard tenant, and the infrastructure enforces it, not the application code.
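The ceiling logic can be sketched as a simple clamp. In the setup described above the limit is enforced by the infrastructure's autoscaling configuration, so this only illustrates the rule; the tier numbers, `requests_per_task`, and all names are made up for the example.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TierLimits:
    min_tasks: int
    max_tasks: int  # the autoscaling ceiling tied to the SLA tier


# Illustrative values only, not the real contract numbers.
TIERS = {
    "standard": TierLimits(min_tasks=1, max_tasks=4),
    "premium": TierLimits(min_tasks=2, max_tasks=12),
}


def desired_tasks(tier: str, queued_requests: int, requests_per_task: int = 50) -> int:
    """Scale with demand, but never beyond the tenant's own ceiling."""
    limits = TIERS[tier]
    needed = math.ceil(queued_requests / requests_per_task)
    return max(limits.min_tasks, min(needed, limits.max_tasks))
```

With these numbers, a burst of 1,000 queued requests asks for 20 tasks, but a Standard tenant is held at 4 and a Premium tenant at 12. Neither can borrow capacity from another tenant's group.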

The third layer is request-rate limiting enforced at the API gateway using a Redis token bucket, keyed by tenant ID. Each tenant has a configured requests-per-minute allowance. Requests that exceed the allowance receive a 429 with a `Retry-After` header. This layer protects the shared infrastructure (the API gateway itself, the shared Postgres cluster) from being overwhelmed by a single tenant's client making unbounded parallel requests.
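A sketch of the token bucket logic, using an in-memory dictionary as a stand-in for Redis so the example runs on its own. The class and method names are hypothetical, and a production version would need the refill-and-decrement step to be atomic in Redis.

```python
import math
from dataclasses import dataclass


@dataclass
class Bucket:
    tokens: float
    updated_at: float


class TenantRateLimiter:
    """Token bucket keyed by tenant ID (in-memory stand-in for Redis)."""

    def __init__(self, allowances: dict):
        self._allowances = allowances  # tenant_id -> requests per minute
        self._buckets = {}

    def check(self, tenant_id: str, now: float) -> tuple:
        """Return (http_status, retry_after_seconds) for one request at `now`."""
        capacity = self._allowances[tenant_id]
        rate = capacity / 60.0  # tokens refilled per second
        bucket = self._buckets.setdefault(tenant_id, Bucket(float(capacity), now))
        # Refill for the time elapsed since the last request, capped at capacity.
        elapsed = max(0.0, now - bucket.updated_at)
        bucket.tokens = min(float(capacity), bucket.tokens + elapsed * rate)
        bucket.updated_at = now
        if bucket.tokens >= 1.0:
            bucket.tokens -= 1.0
            return 200, 0
        # Seconds until one full token is available again.
        return 429, math.ceil((1.0 - bucket.tokens) / rate)
```

Because each tenant has its own bucket, one tenant exhausting its allowance has no effect on the status another tenant receives.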

The report execution engine: push-down, not pull-up

Report queries in enterprise analytics are notoriously expensive — GROUP BY across millions of rows, multi-table joins, window functions over rolling periods. The wrong instinct is to pull data into the application layer and process it there. We pushed as much computation as possible into PostgreSQL, using materialized views for the most common aggregations and refreshing them on a schedule keyed to each tenant's data ingestion cadence.

For ad-hoc queries that couldn't be served from a materialized view, we used a query planner layer that inspects the query parameters and decides whether to hit the live schema or a daily snapshot in S3 (via Athena for the heaviest historical queries). Most requests — especially month-to-date and quarter-to-date reports — can be served from the snapshot, which offloads the live database substantially. The query planner decision is logged and reviewed weekly; queries that repeatedly hit the live database when a snapshot would have sufficed are candidates for a new materialized view.
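The routing decision might look roughly like the sketch below. The `needs_same_day_data` flag and the rule built on it are assumptions, since the real planner inspects query parameters that are not spelled out here. Returning a reason alongside the choice makes the decision easy to log for the weekly review.

```python
from dataclasses import dataclass
from enum import Enum


class Source(Enum):
    MATERIALIZED_VIEW = "materialized_view"
    SNAPSHOT = "snapshot"  # daily snapshot in S3, queried via Athena
    LIVE = "live"          # the tenant's live PostgreSQL schema


@dataclass(frozen=True)
class ReportQuery:
    tenant_id: str
    report_type: str
    needs_same_day_data: bool  # hypothetical flag derived from query parameters


def choose_source(query: ReportQuery, materialized_reports: frozenset) -> tuple:
    """Return (source, reason) so every decision can be logged."""
    if query.report_type in materialized_reports:
        return Source.MATERIALIZED_VIEW, "precomputed aggregation exists"
    if not query.needs_same_day_data:
        return Source.SNAPSHOT, "daily snapshot is fresh enough"
    return Source.LIVE, "requires data newer than the last snapshot"
```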

LLM integration: prompt templates, tenant persona, and token budgets

The AI reporting layer generates narrative summaries and anomaly explanations from structured query results. The core challenge is that each tenant has a different vocabulary, domain context, and tolerance for hedging language. A fintech tenant wants precise, conservative language. A media company wants conversational summaries. A SaaS vendor wants benchmarks framed against industry percentiles. One-size-fits-all prompts produce mediocre output for everyone.

We solved this with a prompt template system stored in Postgres per tenant. Each template has a base section (common across all tenants), a persona section (tenant-specific tone and domain vocabulary), and a data section (populated at runtime with the query result). Templates are versioned — a new template version goes through a quality review process before becoming the default for that tenant. Rollback is a single SQL update. This structure made prompt engineering a product operation, not a deployment — the product team could iterate on templates without engineering involvement.
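As a rough illustration of the three-section structure, the sketch below assembles a prompt from a versioned template. The field names, the `default_versions` mapping, and the JSON rendering of the data section are assumptions made for the example.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PromptTemplate:
    tenant_id: str
    version: int
    base: str     # shared across all tenants
    persona: str  # tenant-specific tone and domain vocabulary


def active_template(templates: list, default_versions: dict, tenant_id: str) -> PromptTemplate:
    """Pick the tenant's default version; rollback means changing that pointer."""
    wanted = default_versions[tenant_id]
    for template in templates:
        if template.tenant_id == tenant_id and template.version == wanted:
            return template
    raise LookupError(f"no template v{wanted} for tenant {tenant_id!r}")


def render_prompt(template: PromptTemplate, rows: list) -> str:
    """Join base, persona, and a data section filled in at runtime."""
    data_section = "Data:\n" + json.dumps(rows, sort_keys=True, indent=2)
    return "\n\n".join([template.base, template.persona, data_section])
```

Keeping the default version as a pointer separate from the template rows is what makes a rollback a single update: the older template is still stored, and only the pointer moves.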

  • Token budget per tenant per request is configured in the tenant record and enforced at the LLM call site — no tenant can accidentally trigger a 100K-token generation job.
  • We cache LLM summaries aggressively: the cache key is (tenant_id, report_type, data_hash). Same data, same summary — we don't call the LLM twice.
  • Model selection is per-tenant: Premium tenants use GPT-4 class models; Standard tenants use a smaller, faster, cheaper model. Tenants can opt up.
  • Every LLM call is logged with input token count, output token count, model used, tenant ID, and report ID — this is the cost attribution system.
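The points above can be sketched together in one small service. The `call_llm` callable is a hypothetical stand-in for whatever client wraps the model provider, and the class and field names are illustrative rather than the production code.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class TenantLLMConfig:
    model: str              # chosen per tenant tier
    max_output_tokens: int  # per-request token budget


def cache_key(tenant_id: str, report_type: str, rows: list) -> tuple:
    """(tenant_id, report_type, data_hash): same data yields the same key."""
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    data_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return (tenant_id, report_type, data_hash)


class SummaryService:
    def __init__(self, configs: dict, call_llm):
        self._configs = configs
        # Hypothetical signature:
        # call_llm(model, prompt, max_tokens) -> (text, input_tokens, output_tokens)
        self._call_llm = call_llm
        self._cache = {}
        self.usage_log = []  # one record per real LLM call, for cost attribution

    def summarize(self, tenant_id, report_type, report_id, rows, prompt) -> str:
        key = cache_key(tenant_id, report_type, rows)
        if key in self._cache:
            return self._cache[key]  # cache hit: no LLM call, no cost
        cfg = self._configs[tenant_id]
        # The budget is applied at the call site, not trusted to the caller.
        text, tokens_in, tokens_out = self._call_llm(cfg.model, prompt, cfg.max_output_tokens)
        self.usage_log.append({
            "tenant_id": tenant_id,
            "report_id": report_id,
            "model": cfg.model,
            "input_tokens": tokens_in,
            "output_tokens": tokens_out,
        })
        self._cache[key] = text
        return text
```

Hashing a canonical JSON form (sorted keys, fixed separators) means two results with identical data but different key order still map to the same cache entry.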

Achieving 99.9% uptime: what it actually requires

99.9% uptime means roughly 8.8 hours of downtime budget per year (0.1% of 8,760 hours), or about 44 minutes per month. In practice, every deployment, every database migration, and every dependency upgrade eats into that budget if you're not careful. The two practices that mattered most were zero-downtime deployments (blue/green via ECS, with health checks that gate traffic shift) and database migrations that are always backward compatible with the previous application version.
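The downtime budget is simple arithmetic, worked through here for a 365-day year. The function name is illustrative.

```python
def downtime_budget_minutes(availability: float, period_hours: float) -> float:
    """Minutes of allowed downtime in a period at a given availability target."""
    return (1.0 - availability) * period_hours * 60.0


HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years

yearly_minutes = downtime_budget_minutes(0.999, HOURS_PER_YEAR)
monthly_minutes = yearly_minutes / 12

print(f"{yearly_minutes / 60:.2f} hours per year")  # 8.76 hours per year
print(f"{monthly_minutes:.1f} minutes per month")   # 43.8 minutes per month
```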

The backward-compatible migration rule sounds simple and is hard to follow in practice. Adding a column with a default is fine. Removing a column requires three separate deployments: first deploy the application version that stops reading and writing the column, then deploy the migration that drops it, then clean up dead code. This multi-step process for every schema change is tedious and occasionally frustrating for the team, but the alternative (5-minute maintenance windows that compound across 6 tenants' operational schedules) is worse.
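One way to make the rule checkable is to ask whether any application version still serving traffic references the column. The sketch below assumes a hypothetical mapping from live versions to the columns they use; how that mapping is produced is outside the example.

```python
def can_drop_column(column: str, columns_used_by_live_versions: dict) -> bool:
    """A drop is safe only if no version still serving traffic uses the column.

    During a blue/green rollout both the old and the new version are live,
    so both must be checked.
    """
    return all(column not in used for used in columns_used_by_live_versions.values())


# Mid-rollout: the previous version still uses the column, so the drop must wait.
mid_rollout = {"v41": {"id", "legacy_code"}, "v42": {"id"}}
# After the previous version is retired, the drop migration can ship.
after_rollout = {"v42": {"id"}}
```

Here `can_drop_column("legacy_code", mid_rollout)` is false and `can_drop_column("legacy_code", after_rollout)` is true, which mirrors the first two of the three deployments.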

Observability: per-tenant dashboards as a first-class product

Every Datadog metric in the system carries a `tenant_id` tag. This single decision (tagging all metrics with the tenant dimension at the instrumentation layer, not in a post-processing step) made it trivial to build per-tenant dashboards, per-tenant SLA alerting, and per-tenant cost reports. When a tenant opens a support ticket saying their reports are slow, we can pull up their specific p99 latency, queue depth, and LLM call duration in under a minute, which changes the support conversation entirely.
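A sketch of enforcing the tag at the instrumentation layer: a thin wrapper that refuses to emit a metric without a tenant. The `sink` callable is a hypothetical stand-in for the real metrics client, and the metric name in the usage example is made up.

```python
class TenantMetrics:
    """Wrapper that makes the tenant_id tag mandatory on every metric."""

    def __init__(self, sink):
        # Hypothetical signature: sink(name, value, tags)
        self._sink = sink

    def emit(self, name: str, value: float, tenant_id: str, **extra_tags) -> None:
        if not tenant_id:
            raise ValueError("tenant_id is required on every metric")
        # Tags use the "key:value" form; extra tags are sorted for stable output.
        tags = [f"tenant_id:{tenant_id}"]
        tags += [f"{key}:{val}" for key, val in sorted(extra_tags.items())]
        self._sink(name, value, tags)


# Usage with a list as the sink, to show what would be sent.
sent = []
metrics = TenantMetrics(lambda name, value, tags: sent.append((name, value, tags)))
metrics.emit("report.latency_ms", 412.0, tenant_id="acme", tier="premium")
```

Making `tenant_id` a required argument means a metric without the tenant dimension fails loudly at the call site instead of silently missing from a dashboard.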

Multi-tenancy is not just a deployment concern — it's a product promise. When a Premium tenant signs a contract with a 99.9% uptime SLA, that number has to be enforced in the infrastructure, not just hoped for in the application.

If you're building a multi-tenant SaaS product — especially one where AI-generated content is a core feature — the architectural decisions described here are the kind of thing that's much cheaper to get right at design time than to retrofit under load. Advisory engagements to review and stress-test SaaS architectures before launch are something I offer on a select basis. The patterns here have been validated in production across 6 enterprise tenants at meaningful scale.
