SureCiteAI: Building a Multi-Tenant Document Intelligence RAG SaaS

How I architected and shipped a production multi-tenant RAG platform as sole engineer: 8-stage retrieval pipeline, 4-layer tenant isolation, 3-tier LLM failover, per-query audit trace, 147ms average retrieval latency, 96.8% retrieval accuracy in production, and 0 hallucinated citations across 297 eval cases, 255 of them on public benchmarks including PatronusAI FinanceBench. Live at sureciteai.com.

By Nic Chin · 10 min read

Most "enterprise RAG" demos collapse the moment you put them in front of real customers. They work on one tenant. They work on clean PDFs. They hallucinate the moment a document is long, or scanned, or contains a table. They don't have billing. They don't have tenant isolation. They don't have a failover story when OpenAI has an outage at 2am. SureCiteAI is the opposite of that: a production multi-tenant document intelligence SaaS I built as sole architect, currently live at sureciteai.com, with an open-source codebase of 1,300+ commits and 470+ TypeScript files (as of April 2026). The product shipped under the working name DocsFlow until early 2026; the rename to SureCiteAI reflects the sharper positioning the project has grown into — verified citations as the core product guarantee. This case study covers the architecture, the decisions that made it production-ready, and the metrics it hits in production and on public benchmarks.

The Problem: Enterprise Document Chaos

The category SureCiteAI addresses is familiar. Teams accumulate hundreds of contracts, SOPs, policies, meeting notes, reports, and spec documents across Google Drive, SharePoint, email attachments, and local folders. When someone needs an answer — "what does this supplier contract say about liability caps?", "how do we handle GDPR erasure requests?", "what were the Q3 regional performance numbers?" — the answer lives inside a document someone has to find, open, and read. That friction costs hours per week per knowledge worker and produces inconsistent, memory-based answers.

A generic AI chatbot can't solve this. General-purpose LLMs don't know your documents, hallucinate confidently when they don't, can't cite sources, and — critically for any multi-user product — can't isolate one team's data from another's. The technical bar for a production answer is high: accurate retrieval across mixed document formats, verifiable source attribution for every claim, strict tenant isolation, resilient LLM infrastructure, and a real billing + admin story on top.

SureCiteAI is my attempt to hit that bar and ship it as a real SaaS product, not a demo.

Core Architecture: The 8-Stage RAG Pipeline

Every document in SureCiteAI flows through eight distinct stages on its way from upload to a cited answer, each instrumented and observable. The pipeline is: upload → OCR → parsing → chunking → embeddings → vector upsert → semantic query → LLM generation with source citations. Treating this as eight separable stages rather than one opaque "RAG step" is what makes production debugging tractable.
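To make "instrumented and observable" concrete, here is a minimal sketch of per-stage timing. The stage names come from the pipeline above; `runStage`, `StageTiming`, and the injectable clock are illustrative assumptions, not the SureCiteAI implementation.

```typescript
// Stage names follow the pipeline described above; everything else is illustrative.
const STAGES = [
  "upload", "ocr", "parsing", "chunking",
  "embeddings", "vector_upsert", "semantic_query", "llm_generation",
] as const;

type StageName = (typeof STAGES)[number];

interface StageTiming {
  stage: StageName;
  ms: number;
  ok: boolean;
}

// Run one stage, recording its duration and outcome even when it throws.
// `now` is injectable so the timing logic can be exercised with a fake clock.
function runStage<T>(
  stage: StageName,
  timings: StageTiming[],
  work: () => T,
  now: () => number = Date.now,
): T {
  const start = now();
  try {
    const result = work();
    timings.push({ stage, ms: now() - start, ok: true });
    return result;
  } catch (err) {
    timings.push({ stage, ms: now() - start, ok: false });
    throw err;
  }
}
```

Because a failing stage still leaves a timing record behind, a slow or broken stage shows up by name rather than as a generic pipeline failure.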

Ingestion, OCR, and Parsing

Users upload PDF, DOCX, XLSX, PPTX, image, and plain text files. Text-native formats are parsed directly. Images and scanned PDFs are routed through Gemini 2.0 Flash Vision OCR, which handles both printed and handwritten content and preserves spatial structure for tables and forms — a significant upgrade over traditional Tesseract-style OCR, which loses layout and often corrupts multi-column documents.

The parsed output is structured text with provenance metadata: which page, which paragraph, which cell of which table. That metadata is what enables source-attributed answers downstream — without it, you can extract text but you can't honestly cite it back.
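One way to picture that provenance metadata is as a small discriminated union attached to every parsed span. This is a sketch under assumptions: the field names and the `formatLocator` helper are hypothetical and only illustrate the page / paragraph / table-cell granularity described above.

```typescript
// Hypothetical shapes: field names are illustrative, not SureCiteAI's schema.
type Provenance =
  | { kind: "paragraph"; page: number; paragraph: number }
  | { kind: "tableCell"; page: number; table: number; row: number; column: number };

interface ParsedSpan {
  documentId: string;
  filename: string;
  text: string;
  provenance: Provenance;
}

// Render a human-readable locator that a citation can point back to.
function formatLocator(span: ParsedSpan): string {
  const p = span.provenance;
  switch (p.kind) {
    case "paragraph":
      return `${span.filename}, p. ${p.page}, para. ${p.paragraph}`;
    case "tableCell":
      return `${span.filename}, p. ${p.page}, table ${p.table}, row ${p.row}, col ${p.column}`;
  }
}
```

Carrying the locator with the text from the parsing stage onward means the generation stage never has to guess where a sentence came from.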

Chunking and Embedding

Chunking is the single highest-leverage RAG decision. SureCiteAI uses semantic-aware chunking that respects section boundaries and preserves local context around each chunk, rather than naively splitting on fixed token counts. Embeddings are generated with OpenAI text-embedding-3-small and written to Pinecone in a per-tenant namespace (more on that below).
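A minimal sketch of section-aware chunking, assuming the parser has already split the document into sections and paragraphs. The `Section` and `Chunk` types, the character budget, and the one-paragraph overlap are illustrative choices rather than the production settings.

```typescript
interface Section {
  heading: string;
  paragraphs: string[];
}

interface Chunk {
  heading: string;
  text: string;
}

// Pack paragraphs into chunks of roughly `maxChars`, never crossing a section
// boundary. When a chunk is flushed, its last paragraph is carried into the
// next chunk as overlap, so a chunk can exceed the budget by that overlap.
function chunkSections(sections: Section[], maxChars: number): Chunk[] {
  const chunks: Chunk[] = [];
  for (const section of sections) {
    let current: string[] = [];
    let size = 0;
    for (const para of section.paragraphs) {
      if (current.length > 0 && size + para.length > maxChars) {
        chunks.push({ heading: section.heading, text: current.join("\n\n") });
        current = current.slice(-1); // keep the last paragraph as local context
        size = current.reduce((n, p) => n + p.length, 0);
      }
      current.push(para);
      size += para.length;
    }
    if (current.length > 0) {
      chunks.push({ heading: section.heading, text: current.join("\n\n") });
    }
  }
  return chunks;
}
```

Resetting the buffer at each section means a chunk never mixes, say, the termination clause with the start of the payment terms, which is the failure mode of fixed-size splitting.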

Hybrid Search + Reciprocal Rank Fusion

Retrieval does not use vector search alone. SureCiteAI runs hybrid search: a dense vector search for semantic similarity (good at concepts and paraphrases) and a BM25 sparse vector search for exact-term matching (good at proper nouns, clause numbers, part codes, and direct quotes). The two ranked lists are then combined with Reciprocal Rank Fusion, which produces a more stable, more precise final ranking than either method alone. In practice this is what pushes retrieval accuracy from "good demo" territory into the 96.8% accuracy band measured in production.
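For readers who have not met it, Reciprocal Rank Fusion scores each item as the sum of 1 / (k + rank) over every ranked list it appears in. The sketch below is a generic version of that formula; the function name and the constant k = 60 (a commonly used default) are assumptions, not values taken from the SureCiteAI codebase.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Fuse several ranked lists of chunk ids (best first) into one ranking.
// Only ranks are used, so dense and sparse scores never need to be normalised.
function reciprocalRankFusion(rankings: string[][], k = 60): FusedResult[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score,
  );
}

// Example: "a" is ranked 1st by dense and 2nd by sparse search, so it ends up
// ahead of "c" (3rd and 1st), and both beat items found by only one method.
const fused = reciprocalRankFusion([
  ["a", "b", "c"], // dense ranking
  ["c", "a", "d"], // sparse (BM25) ranking
]);
```

The appeal of rank-based fusion is that cosine similarities and BM25 scores live on different scales, and ranks sidestep that mismatch entirely.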

Hierarchical Two-Stage Retrieval for Large Collections

Most RAG tutorials assume a handful of documents. SureCiteAI customers routinely exceed 20 documents per collection, and some push into the hundreds. A flat vector search over that much content surfaces "adjacent but wrong" chunks from unrelated documents. SureCiteAI solves this with a two-stage hierarchical retriever: it first ranks documents by summary similarity to the query, then performs the hybrid search only within the top-ranked documents. This dramatically reduces cross-document noise and gives answers that are grounded in the right document, not just a document that shares vocabulary with the question.
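The two-stage idea can be sketched in memory with plain cosine similarity. This is an illustration under assumptions: the types and `twoStageRetrieve` are hypothetical, and the second stage uses dense similarity alone as a stand-in for the hybrid search described above.

```typescript
type Vec = number[];

function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface DocSummary {
  docId: string;
  summaryVec: Vec;
}

interface ChunkVec {
  chunkId: string;
  docId: string;
  vec: Vec;
}

// Stage 1: shortlist documents by summary similarity.
// Stage 2: rank chunks, but only those belonging to shortlisted documents.
function twoStageRetrieve(
  query: Vec,
  docs: DocSummary[],
  chunks: ChunkVec[],
  topDocs: number,
  topChunks: number,
): string[] {
  const shortlist = new Set(
    docs
      .map((d) => ({ docId: d.docId, score: cosine(query, d.summaryVec) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, topDocs)
      .map((d) => d.docId),
  );
  return chunks
    .filter((c) => shortlist.has(c.docId))
    .map((c) => ({ chunkId: c.chunkId, score: cosine(query, c.vec) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topChunks)
    .map((c) => c.chunkId);
}
```

Note what the filter buys: a chunk from an unrelated document can have the highest raw similarity to the query and still be excluded, because its parent document never made the shortlist.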

Query Complexity Routing and LLM Generation

Not every query needs the most expensive model. A lightweight classifier labels each incoming query as simple, medium, or complex and routes it to the appropriate LLM tier. Simple lookups hit cheaper, faster models; complex multi-step reasoning goes to heavier models. Conversation memory is server-side and vague queries are reformulated by a lightweight LLM before retrieval — this is what keeps follow-up questions like "and what about the second one?" working correctly.
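As a rough illustration of the routing step, the sketch below labels a query with a cheap heuristic and maps the label to a tier. The heuristic, the word-count threshold, and the tier labels are all assumptions made for this example; the classifier in production is not specified here and may work quite differently.

```typescript
type Complexity = "simple" | "medium" | "complex";

// Hypothetical heuristic: long queries and reasoning verbs suggest more work.
function classifyQuery(query: string): Complexity {
  const words = query.trim().split(/\s+/).filter(Boolean).length;
  const needsReasoning = /\b(compare|difference|summari[sz]e|explain|why|trend)\b/i.test(query);
  if (needsReasoning && words > 12) return "complex";
  if (needsReasoning || words > 12) return "medium";
  return "simple";
}

// Placeholder tier labels, not real model identifiers.
const TIER_FOR: Record<Complexity, string> = {
  simple: "fast-tier",
  medium: "balanced-tier",
  complex: "heavy-tier",
};

function routeQuery(query: string): string {
  return TIER_FOR[classifyQuery(query)];
}
```

The important property is that classification is far cheaper than the call it gates, so a misroute costs little while a correct route saves a heavy-model invocation.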

Multi-Tenant Isolation: Four Layers, Zero Leakage

Every multi-tenant AI SaaS has the same terrifying failure mode: tenant A's document surfacing in tenant B's query. A single instance of that in the wild is an existential bug for the product. SureCiteAI defends against it with four overlapping layers, any one of which would block the attack on its own.

  1. Database layer: Supabase Row-Level Security on every table. Tenant ID is enforced at the Postgres level by RLS policies, so queries physically cannot return another tenant's rows regardless of what the application code asks for.
  2. Vector layer: Each tenant gets a separate Pinecone namespace. There is no "shared" index with a tenant-ID filter — tenants live in distinct namespaces, which makes cross-tenant vector leakage structurally impossible.
  3. Auth layer: Clerk session tokens carry tenant context, and middleware validates the tenant claim on every request before the handler even runs. Session manipulation doesn't get past the middleware layer.
  4. Routing layer: Per-tenant subdomains ({tenant}.sureciteai.com) with tenant resolution in edge middleware. The tenant is identified from the subdomain before any application code runs, closing the loop.

Defense in depth is boring to build and invisible to users when it works. It's also non-negotiable for any product that is going to touch a B2B customer's actual documents.
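The routing and auth layers can be sketched as two small checks that must agree before a handler runs. This is a simplified illustration: `tenantFromHost`, `authorizeRequest`, the reserved-subdomain list, and the `tenant_` namespace prefix are hypothetical, and the session claims are assumed to have been verified by the auth provider already.

```typescript
const ROOT_DOMAIN = "sureciteai.com";
const RESERVED_SUBDOMAINS = new Set(["www", "app", "api"]); // hypothetical list

// Resolve the tenant slug from a Host header, or null if it is not a tenant host.
function tenantFromHost(host: string): string | null {
  const hostname = (host.toLowerCase().split(":")[0] ?? "").trim();
  const suffix = "." + ROOT_DOMAIN;
  if (!hostname.endsWith(suffix)) return null;
  const sub = hostname.slice(0, -suffix.length);
  const validLabel = /^[a-z0-9](?:[a-z0-9-]*[a-z0-9])?$/.test(sub);
  if (!validLabel || RESERVED_SUBDOMAINS.has(sub)) return null;
  return sub;
}

interface SessionClaims {
  userId: string;
  tenantId: string;
}

interface TenantContext {
  tenantId: string;
  vectorNamespace: string;
}

// Reject the request unless the subdomain and the session claim name the same
// tenant; only then derive the per-tenant vector namespace.
function authorizeRequest(host: string, claims: SessionClaims | null): TenantContext {
  const tenantId = tenantFromHost(host);
  if (tenantId === null) throw new Error("unknown tenant host");
  if (claims === null || claims.tenantId !== tenantId) {
    throw new Error("tenant claim mismatch");
  }
  return { tenantId, vectorNamespace: `tenant_${tenantId}` };
}
```

Deriving the namespace from the validated tenant context, rather than from anything in the request body, is what keeps the vector layer aligned with the other three.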

LLM Resilience: Three-Tier Failover with Circuit Breaker

Single-provider dependencies break products. OpenAI has outages. Anthropic has rate limits. Google changes pricing on Gemini with minimal notice. SureCiteAI runs on a three-tier LLM failover chain with a circuit breaker pattern: Llama 3.3 70B as primary, GPT-4o-mini as secondary, Mixtral as tertiary, with Gemini 2.0 Flash as an emergency fallback if the entire primary chain is compromised. The circuit breaker prevents the system from hammering a failing provider and automatically re-tests availability on a backoff schedule.

The measured production result: 96% LLM generation success rate with a 12% fallback rate. That 12% is the share of queries that fell back to a secondary tier at least once, meaning that without the failover roughly one in eight user queries would have errored out or returned nothing. Resilience isn't theoretical.
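A compact sketch of the failover-plus-circuit-breaker pattern follows. The `CircuitBreaker` class, its threshold and doubling cooldown, and the `Provider` interface are assumptions chosen for illustration; the providers are passed in as plain async functions so no vendor SDK call is implied.

```typescript
// Opens after `threshold` consecutive failures, then allows one probe per
// cooldown window; every failed probe doubles the cooldown (simple backoff).
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private cooldownMs: number;
  private readonly threshold: number;
  private readonly baseCooldownMs: number;
  private readonly now: () => number;

  constructor(threshold: number, baseCooldownMs: number, now: () => number = Date.now) {
    this.threshold = threshold;
    this.baseCooldownMs = baseCooldownMs;
    this.cooldownMs = baseCooldownMs;
    this.now = now;
  }

  canAttempt(): boolean {
    if (this.openedAt === null) return true;
    return this.now() - this.openedAt >= this.cooldownMs;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
    this.cooldownMs = this.baseCooldownMs;
  }

  recordFailure(): void {
    this.failures += 1;
    if (this.failures < this.threshold) return;
    if (this.openedAt !== null) this.cooldownMs *= 2; // failed probe: back off further
    this.openedAt = this.now();
  }
}

interface Provider {
  name: string;
  generate: (prompt: string) => Promise<string>;
}

interface ChainLink {
  provider: Provider;
  breaker: CircuitBreaker;
}

// Try each tier in order, skipping any whose breaker is currently open.
async function generateWithFailover(
  prompt: string,
  chain: ChainLink[],
): Promise<{ text: string; provider: string; fallbackDepth: number }> {
  let depth = 0;
  for (const { provider, breaker } of chain) {
    if (breaker.canAttempt()) {
      try {
        const text = await provider.generate(prompt);
        breaker.recordSuccess();
        return { text, provider: provider.name, fallbackDepth: depth };
      } catch {
        breaker.recordFailure();
      }
    }
    depth += 1;
  }
  throw new Error("all providers unavailable");
}
```

Returning `fallbackDepth` with every answer is what makes a figure like the 12% fallback rate measurable: a depth above zero means the query was served by a lower tier.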

Auditability: Every Answer Is a Typed Trace

The fourth pillar — and the one that distinguishes SureCiteAI from generic RAG SaaS — is the auditability substrate. Every query in SureCiteAI emits a typed retrieval trace covering all seven retrieval substages (HyDE expansion, hybrid dense+sparse search, cross-encoder rerank, post-rerank grounding penalty, small-to-big sibling expansion, confidence calibration, citation verification) plus the LLM generation stage. The trace is opt-in per query and a pure no-op when disabled — production paths pay zero cost when a query doesn't request observability.
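The "typed, opt-in, no-op when disabled" shape can be sketched with a discriminated union and a shared do-nothing tracer. The event fields and the `createTracer` helper are hypothetical, and only three of the substages are modelled to keep the example short.

```typescript
// A subset of the substages, each with its own typed payload (fields are illustrative).
type TraceEvent =
  | { stage: "hybrid_search"; denseHits: number; sparseHits: number }
  | { stage: "rerank"; model: string; usedFallback: boolean }
  | { stage: "citation_verification"; cited: number; unmatched: number };

interface Tracer {
  record(event: TraceEvent): void;
  events(): readonly TraceEvent[];
}

// Shared no-op instance: recording does nothing and allocates nothing.
const NOOP_TRACER: Tracer = {
  record(): void {},
  events(): readonly TraceEvent[] {
    return [];
  },
};

function createTracer(enabled: boolean): Tracer {
  if (!enabled) return NOOP_TRACER;
  const buffer: TraceEvent[] = [];
  return {
    record(event: TraceEvent): void {
      buffer.push(event);
    },
    events(): readonly TraceEvent[] {
      return buffer;
    },
  };
}
```

Pipeline code calls `tracer.record(...)` unconditionally; whether anything is kept is decided once, when the tracer is created for the query.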

When a query does request observability — debug sessions, on-call investigations, eval runs — the trace is persisted to a service-role-only Supabase table (public.rag_trace_events, RLS-locked, no read policies) alongside the existing structured log line. Six saved leading-indicator queries are documented in the on-call runbook: suspect-refusal rate, citation no-match rate, reranker fallback rate, model fallback depth, p50/p95/p99 latency, and per-tenant verdict distribution. An operator goes from "something feels off" to "here's what changed" in under five minutes.

The citation verifier is the load-bearing component of this pillar. Every cited filename in a SureCiteAI answer is matched against the retrieved context with alias normalisation; mismatches are rewritten or the answer is held back entirely. On the most recent published benchmark run, the verifier held hallucinated citations to zero across all 297 cases: 255 on the public legal (CUAD), healthcare (openFDA), accounting (SEC EDGAR 10-K), and finance (PatronusAI FinanceBench) corpora, plus 42 on the internal suites. A second mechanism, evaluateSuspectRefusal, checks three independent signals (top-chunk score, abstention non-trigger, dominant-document presence) and flags any refusal that at least two of the three signals contradict. That converts a silent failure mode (the model refusing despite having the answer in context) into a queryable metric.
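Both mechanisms reduce to small, testable functions. The sketch below is an approximation under assumptions: the normalisation rules, the 0.5 score threshold, and the names `normaliseFilename`, `verifyCitations`, and `isSuspectRefusal` are illustrative and are not the production `evaluateSuspectRefusal` logic.

```typescript
// Alias normalisation: ignore case, file extension, and separator style.
function normaliseFilename(name: string): string {
  return name
    .toLowerCase()
    .replace(/\.[a-z0-9]+$/, "")
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

interface CitationCheck {
  verified: string[]; // cited names rewritten to the canonical retrieved filename
  unmatched: string[]; // cited names with no counterpart in the retrieved context
}

function verifyCitations(cited: string[], retrievedFilenames: string[]): CitationCheck {
  const canonical = new Map<string, string>();
  retrievedFilenames.forEach((f) => canonical.set(normaliseFilename(f), f));
  const verified: string[] = [];
  const unmatched: string[] = [];
  cited.forEach((c) => {
    const match = canonical.get(normaliseFilename(c));
    if (match === undefined) unmatched.push(c);
    else verified.push(match);
  });
  return { verified, unmatched };
}

interface RefusalSignals {
  topChunkScore: number;
  abstentionTriggered: boolean;
  dominantDocumentPresent: boolean;
}

// A refusal is suspect when at least two of the three signals say the
// retrieved context probably did contain the answer.
function isSuspectRefusal(signals: RefusalSignals, scoreThreshold = 0.5): boolean {
  const votes = [
    signals.topChunkScore >= scoreThreshold,
    !signals.abstentionTriggered,
    signals.dominantDocumentPresent,
  ];
  return votes.filter(Boolean).length >= 2;
}
```

Any entry in `unmatched` is the trigger for the rewrite-or-hold decision, and a `true` from `isSuspectRefusal` is what would be logged as a suspect refusal.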

On the prompt side, a custom-instructions-sanitiser blocks tenant-supplied prompt overrides that would invert grounding rules or instruct the model to refuse. Industry presets (vetted strings under our control) bypass the sanitiser; free-form tenant text does not. Together with the model-side failover, this is what makes the system safe to expose to tenant-configured personas without a security review on every change.

Benchmarks: Reproducible, Public, Honest About the Hard Cases

Most production RAG claims aren't falsifiable — "our system is 95% accurate" isn't paired with a corpus, a methodology, or a run artifact. SureCiteAI is graded against a six-suite golden eval, four of which are built on public corpora that anyone can re-run independently:

  • Legal — CUAD v1 (Contract Understanding Atticus Dataset, CC BY 4.0, Hendrycks et al., NeurIPS 2021). The standard academic benchmark for legal-contract understanding.
  • Healthcare — openFDA drug labels (public domain).
  • Accounting — SEC EDGAR 10-K filings (public domain).
  • Finance — PatronusAI FinanceBench (the standard public benchmark for retrieval-augmented financial QA: 150 ecologically-valid Q&A pairs across 84 SEC filings from 32 publicly-traded companies, Islam et al., NeurIPS 2023).
  • Plus internal real-estate and consulting suites for domain-specific retrieval behaviour.

The most recent published run (2026-04-27, Cohere rerank-3.5, grounding penalty enabled) scored 221/297 (74%) in aggregate with 0/297 hallucinated citations. Per-suite breakdown:

  • accounting (SEC 10-K): 35/35 (100%)
  • healthcare (openFDA): 35/35 (100%)
  • real_estate (internal): 32/35 (91%)
  • consulting (internal): 7/7 (100%)
  • finance (PatronusAI FinanceBench): 95/150 (63%) — with 96% retrieval hit-rate (144/150) and 0/150 hallucinations. The pass/retrieval gap is over-abstention (the safe failure mode), not fabrication; see BENCHMARKS.md §5.1 for the failure-mode breakdown.
  • legal (CUAD): 17/35 (49%) — see note below

The CUAD pass rate looks low next to the others, on purpose. CUAD is adversarial: a large fraction of cases test for clause categories that simply aren't in the contract being queried (does this license agreement contain a non-compete? a most-favoured-nation clause? a change-of-control trigger?). The correct behaviour on those cases is abstention, not fabrication — and the eval scores fabrication as zero credit. Picking a hard, published benchmark over a tame internal one was a deliberate trade: it costs us pass-rate in exchange for credibility. The system produced zero hallucinated citations on the legal suite even where it scored 49% on pass-rate — and zero hallucinations across all 297 cases including the 150-question FinanceBench external benchmark.

Quality metrics extend beyond pass/fail: per-suite calibration is reported as Expected Calibration Error, Brier score, and AUROC. The RAGAS metrics (faithfulness, answer relevancy, context precision, context recall) per Es et al. 2023 are implemented and runnable per-case. Methodology, suite definitions, full reproducibility instructions, and the raw run artifacts live in the public repository; see BENCHMARKS.md and benchmarks/runs/. The harness is one command (npm run eval:all) against the public corpora and reproduces the same scorecard.
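For reference, two of the calibration metrics are short enough to show in full. This is a generic sketch of the standard definitions (Brier score as the mean squared gap between confidence and outcome; Expected Calibration Error as the size-weighted gap between accuracy and mean confidence per bin). The `Prediction` type and the choice of 10 equal-width bins are assumptions, not the harness's actual code.

```typescript
interface Prediction {
  confidence: number; // model confidence in [0, 1]
  correct: boolean; // whether the graded answer passed
}

// Mean squared difference between confidence and the 0/1 outcome.
function brierScore(preds: Prediction[]): number {
  if (preds.length === 0) return 0;
  const total = preds.reduce(
    (acc, p) => acc + (p.confidence - (p.correct ? 1 : 0)) ** 2,
    0,
  );
  return total / preds.length;
}

// Weighted average, over equal-width confidence bins, of
// |accuracy in bin - mean confidence in bin|.
function expectedCalibrationError(preds: Prediction[], bins = 10): number {
  if (preds.length === 0) return 0;
  const stats = new Map<number, { n: number; confSum: number; hits: number }>();
  preds.forEach((p) => {
    const bin = Math.min(bins - 1, Math.floor(p.confidence * bins));
    const s = stats.get(bin) ?? { n: 0, confSum: 0, hits: 0 };
    s.n += 1;
    s.confSum += p.confidence;
    s.hits += p.correct ? 1 : 0;
    stats.set(bin, s);
  });
  let ece = 0;
  stats.forEach((s) => {
    ece += (s.n / preds.length) * Math.abs(s.hits / s.n - s.confSum / s.n);
  });
  return ece;
}
```

As a worked case: two answers at 0.9 confidence with one correct, plus two at 0.1 confidence with none correct, give a Brier score of 0.21 and an ECE of 0.25, with the overconfident 0.9 bin contributing most of the error.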

The Full SaaS Stack

SureCiteAI is a complete product, not an AI component. That means the stack extends well beyond the RAG pipeline:

  • Stripe subscription billing with per-tier enforcement and usage tracking at the query level, so pricing stays honest against actual infrastructure cost.
  • 6-step guided onboarding wizard that lets new tenants customize their AI persona (role, tone, business context, focus areas) without writing any prompts themselves.
  • Admin dashboard with real-time pipeline monitoring, security events, and API health checks — the operational surface I needed to run this confidently as a one-person company.
  • Per-tenant AI personality, industry-specific defaults, and configurable citation styles so the assistant feels native to each customer rather than generic.

Production Metrics

Production-traffic numbers, distinct from the public-benchmark eval numbers above:

  • 147ms average retrieval-stage latency (end-to-end semantic query avg ~784ms including generation).
  • 96.8% RAG retrieval accuracy on production traffic.
  • 85% cost reduction through intelligent caching and multi-provider model selection vs. a naive single-provider implementation.
  • 96% LLM generation success rate, with 12% of queries triggering at least one fallback along the way, which shows the failover chain is exercised, not decorative.
  • 1,300+ commits, 470+ TypeScript files, open source and public (as of April 2026).

Tech Stack

Next.js 15, TypeScript, Supabase (PostgreSQL with Row-Level Security), Pinecone, OpenAI Embeddings, Llama 3.3, GPT-4o-mini, Mixtral, Gemini 2.0 Flash Vision, Clerk, Stripe, Vercel. No framework magic; every component is a deliberate choice anchored to a specific production requirement.

The Takeaway

SureCiteAI is the reference implementation for what I mean by "production AI" when I work with clients: hybrid retrieval (not just vector search), hierarchical retrieval for scale, four-layer tenant isolation (not just an app-layer filter), multi-provider failover (not a single OpenAI key), full billing and admin tooling (not a demo dashboard), and real observability into every stage of the pipeline. Every one of those decisions is something I can transplant into a client's own product in weeks rather than quarters, because the hard problems are already solved here.

Related reading: For the deeper pattern library behind SureCiteAI's retrieval, see RAG architecture in production. For how this connects to broader multi-agent work, see Multi-Agent AI Systems Guide and the 20-agent trading ensemble case study. If you want the same architecture applied to your own documents, I take on a small number of fractional AI CTO engagements each quarter — or if you already have a codebase, you can get an instant system audit via SystemAudit. I also serve clients in Malaysia and Singapore.

About the Author

Nic Chin is an AI Architect and Fractional CTO who helps companies design and deploy production AI systems including RAG pipelines, multi-agent systems, and AI automation platforms. He has delivered enterprise AI solutions across the UK, US, and Europe, and provides AI consulting in Malaysia and Singapore.