DocsFlow: Building a Multi-Tenant Document Intelligence RAG SaaS
How I architected and shipped a production multi-tenant RAG platform as sole engineer — 8-stage retrieval pipeline, 4-layer tenant isolation, 3-tier LLM failover, 147ms responses, 96.8% accuracy. Live at docsflow.app.
Most "enterprise RAG" demos collapse the moment you put them in front of real customers. They work on one tenant. They work on clean PDFs. They hallucinate the moment a document is long, or scanned, or contains a table. They don't have billing. They don't have tenant isolation. They don't have a failover story when OpenAI has an outage at 2am. DocsFlow is the opposite of that: a production multi-tenant document intelligence SaaS I built as sole architect, currently live at docsflow.app, with an open-source codebase of 1,188 commits and 352 TypeScript files. This case study covers the architecture, the decisions that made it production-ready, and the metrics it hits in production.
The Problem: Enterprise Document Chaos
The category DocsFlow addresses is familiar. Teams accumulate hundreds of contracts, SOPs, policies, meeting notes, reports, and spec documents across Google Drive, SharePoint, email attachments, and local folders. When someone needs an answer — "what does this supplier contract say about liability caps?", "how do we handle GDPR erasure requests?", "what were the Q3 regional performance numbers?" — the answer lives inside a document someone has to find, open, and read. That friction costs hours per week per knowledge worker and produces inconsistent, memory-based answers.
A generic AI chatbot can't solve this. General-purpose LLMs don't know your documents, hallucinate confidently when they don't, can't cite sources, and — critically for any multi-user product — can't isolate one team's data from another's. The technical bar for a production answer is high: accurate retrieval across mixed document formats, verifiable source attribution for every claim, strict tenant isolation, resilient LLM infrastructure, and a real billing + admin story on top.
DocsFlow is my attempt to hit that bar and ship it as a real SaaS product, not a demo.
Core Architecture: The 8-Stage RAG Pipeline
Every document query in DocsFlow flows through eight distinct stages, each instrumented and observable. The pipeline is: upload → OCR → parsing → chunking → embeddings → vector upsert → semantic query → LLM generation with source citations. Treating this as eight separable stages rather than one opaque "RAG step" is what makes production debugging tractable.
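To make "instrumented and observable" concrete, here is a minimal sketch of how the eight stages can be modeled as typed, individually timed steps — the stage names come from the pipeline above, but the function and type names are illustrative, not DocsFlow's actual internals:

```typescript
// Each of the eight stages is a named, timed unit of work. A failure is
// attributable to exactly one stage instead of one opaque "RAG step".
type Stage =
  | "upload" | "ocr" | "parsing" | "chunking"
  | "embedding" | "upsert" | "query" | "generation";

interface StageResult { stage: Stage; ms: number; ok: boolean }

// Run stages in order, timing each one; stop at the first failure so
// the failing stage is the last entry in the trace.
async function runPipeline(
  stages: Array<[Stage, () => Promise<void>]>,
): Promise<StageResult[]> {
  const results: StageResult[] = [];
  for (const [stage, fn] of stages) {
    const t0 = Date.now();
    try {
      await fn();
      results.push({ stage, ms: Date.now() - t0, ok: true });
    } catch {
      results.push({ stage, ms: Date.now() - t0, ok: false });
      break; // later stages never ran — the trace says so explicitly
    }
  }
  return results;
}
```

The payoff is in debugging: when a query goes wrong, the trace tells you whether OCR, chunking, or retrieval is the culprit, rather than a single pass/fail on the whole pipeline.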
Ingestion, OCR, and Parsing
Users upload PDF, DOCX, XLSX, PPTX, image, and plain text files. Text-native formats are parsed directly. Images and scanned PDFs are routed through Gemini 2.0 Flash Vision OCR, which handles both printed and handwritten content and preserves spatial structure for tables and forms — a significant upgrade over traditional Tesseract-style OCR, which loses layout and often corrupts multi-column documents.
The parsed output is structured text with provenance metadata: which page, which paragraph, which cell of which table. That metadata is what enables source-attributed answers downstream — without it, you can extract text but you can't honestly cite it back.
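A rough sketch of what that provenance metadata can look like, and how it turns into a user-visible citation — the field names and formatter are assumptions for illustration, not DocsFlow's actual schema:

```typescript
// Provenance carried with every parsed unit of text: enough to point a
// reader back to the exact page, paragraph, or table cell it came from.
interface Provenance {
  documentId: string;
  page: number;                                          // 1-based page number
  paragraph?: number;                                    // paragraph index on the page
  table?: { index: number; row: number; col: number };   // cell address, if tabular
}

// Render the human-readable citation shown next to each claim in an answer.
function formatCitation(docTitle: string, p: Provenance): string {
  const loc =
    p.table !== undefined
      ? `table ${p.table.index}, row ${p.table.row}, col ${p.table.col}`
      : p.paragraph !== undefined
        ? `para. ${p.paragraph}`
        : "";
  return `${docTitle}, p. ${p.page}${loc ? `, ${loc}` : ""}`;
}
```

For example, a chunk extracted from a table cell cites as `"Q3 Report, p. 4, table 1, row 2, col 3"` — precise enough that a user can verify the claim in seconds.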
Chunking and Embedding
Chunking is the single highest-leverage RAG decision. DocsFlow uses semantic-aware chunking that respects section boundaries and preserves local context around each chunk, rather than naively splitting on fixed token counts. Embeddings are generated with OpenAI text-embedding-3-small and written to Pinecone in a per-tenant namespace (more on that below).
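To illustrate the difference from fixed-token splitting, here is a minimal section-aware chunker: it splits on markdown-style headings first and only subdivides a section when it exceeds a size budget, keeping the section heading attached to every piece as local context. This is a simplified stand-in (character counts instead of tokens, headings as the only boundary signal), not DocsFlow's production chunker:

```typescript
interface Chunk { heading: string; text: string }

// Split on section headings first; only subdivide oversized sections,
// and keep each piece tagged with its section heading for local context.
function chunkBySection(doc: string, maxChars = 1200): Chunk[] {
  const chunks: Chunk[] = [];
  let heading = "";
  let buf: string[] = [];
  const flush = () => {
    const text = buf.join("\n").trim();
    buf = [];
    if (!text) return;
    // Oversized sections are cut into fixed-size pieces, each still
    // carrying the heading it belongs to.
    for (let i = 0; i < text.length; i += maxChars) {
      chunks.push({ heading, text: text.slice(i, i + maxChars) });
    }
  };
  for (const line of doc.split("\n")) {
    if (/^#{1,6}\s/.test(line)) {
      flush();
      heading = line.replace(/^#+\s*/, "");
    } else {
      buf.push(line);
    }
  }
  flush();
  return chunks;
}
```

The point of the pattern: a clause under "Limitation of Liability" never gets welded to the tail of an unrelated section just because a token budget ran out mid-heading.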
Hybrid Search + Reciprocal Rank Fusion
Retrieval does not use vector search alone. DocsFlow runs hybrid search: a dense vector search for semantic similarity (good at concepts and paraphrases) and a BM25 sparse vector search for exact-term matching (good at proper nouns, clause numbers, part codes, and direct quotes). The two ranked lists are then combined with Reciprocal Rank Fusion, which produces a more stable, more precise final ranking than either method alone. In practice this is what pushes retrieval accuracy from "good demo" territory into the 96.8% accuracy band measured in production.
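Reciprocal Rank Fusion itself is small enough to show in full. The standard formulation scores each document as the sum of 1/(k + rank) over every ranked list it appears in, with k = 60 as in the original RRF paper; this sketch fuses any number of ranked lists of chunk IDs:

```typescript
// Fuse several ranked lists (dense, sparse, ...) into one ranking.
// score(id) = sum over lists of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

Because RRF only consumes ranks, not raw scores, it sidesteps the calibration problem of mixing cosine similarities with BM25 scores — an item that both retrievers rank highly rises to the top even when their score scales are incomparable.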
Hierarchical Two-Stage Retrieval for Large Collections
Most RAG tutorials assume a few dozen documents. DocsFlow customers routinely exceed 20 documents per collection, and some push into the hundreds. A flat vector search over that much content surfaces "adjacent but wrong" chunks from unrelated documents. DocsFlow solves this with a two-stage hierarchical retriever: it first ranks documents by summary similarity to the query, then performs the hybrid search only within the top-ranked documents. This dramatically reduces cross-document noise and gives answers that are grounded in the right document, not just a document that shares vocabulary with the question.
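The two-stage shape can be sketched in a few lines. Here `summaryScore` stands in for the summary-vs-query similarity and `chunkSearch` stands in for the hybrid search call — both are assumptions for illustration, not DocsFlow's actual API:

```typescript
interface Doc { id: string; summaryScore: number }
interface Hit { docId: string; chunkId: string; score: number }

// Stage 1: rank documents by summary similarity and keep the top N.
// Stage 2: run chunk-level (hybrid) search restricted to those documents.
function twoStageRetrieve(
  docs: Doc[],
  chunkSearch: (docIds: Set<string>) => Hit[],
  topDocs = 3,
): Hit[] {
  const candidates = new Set(
    [...docs]
      .sort((a, b) => b.summaryScore - a.summaryScore)
      .slice(0, topDocs)
      .map((d) => d.id),
  );
  // The filter is belt-and-braces: even if the underlying search ignores
  // the restriction, hits from non-candidate documents are dropped.
  return chunkSearch(candidates)
    .filter((h) => candidates.has(h.docId))
    .sort((a, b) => b.score - a.score);
}
```

The key effect: a chunk from an irrelevant document can score arbitrarily high on vocabulary overlap and still never reach the LLM, because its document failed the stage-1 gate.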
Query Complexity Routing and LLM Generation
Not every query needs the most expensive model. A lightweight classifier labels each incoming query as simple, medium, or complex and routes it to the appropriate LLM tier. Simple lookups hit cheaper, faster models; complex multi-step reasoning goes to heavier models. Conversation memory is server-side and vague queries are reformulated by a lightweight LLM before retrieval — this is what keeps follow-up questions like "and what about the second one?" working correctly.
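As a toy stand-in for the classifier, a few cheap heuristics (length, question count, reasoning keywords) already capture the shape of the routing decision. The thresholds, tier names, and tier-to-model mapping below are illustrative assumptions — the production version is a proper classifier, not this regex:

```typescript
type Tier = "simple" | "medium" | "complex";

// Heuristic complexity label: multi-question or long reasoning-style
// queries go to the heavy tier; everything short and factual stays cheap.
function classifyQuery(q: string): Tier {
  const words = q.trim().split(/\s+/).length;
  const questions = (q.match(/\?/g) ?? []).length;
  const reasoning = /\b(compare|why|explain|summarize|across|trend)\b/i.test(q);
  if (questions > 1 || (reasoning && words > 10)) return "complex";
  if (reasoning || words > 20) return "medium";
  return "simple";
}

// Illustrative tier-to-model mapping (model choice per tier is a product
// decision, not fixed here).
const MODEL_BY_TIER: Record<Tier, string> = {
  simple: "gpt-4o-mini",
  medium: "llama-3.3-70b",
  complex: "llama-3.3-70b",
};
```

Even a crude router like this pays for itself: a one-fact lookup never needs the token budget or latency of the heavy tier.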
Multi-Tenant Isolation: Four Layers, Zero Leakage
Every multi-tenant AI SaaS has the same terrifying failure mode: tenant A's document surfacing in tenant B's query. A single instance of that in the wild is an existential bug for the product. DocsFlow defends against it with four overlapping layers, any one of which would block the attack on its own.
- Database layer: Supabase Row-Level Security on every table. Tenant ID is enforced at the Postgres level by RLS policies, so queries physically cannot return another tenant's rows regardless of what the application code asks for.
- Vector layer: Each tenant gets a separate Pinecone namespace. There is no "shared" index with a tenant-ID filter — tenants live in distinct namespaces, which makes cross-tenant vector leakage structurally impossible.
- Auth layer: Clerk session tokens carry tenant context, and middleware validates the tenant claim on every request before the handler even runs. Session manipulation doesn't get past the middleware layer.
- Routing layer: Per-tenant subdomains ({tenant}.docsflow.app) with tenant resolution in edge middleware. The tenant is identified from the subdomain before any application code runs, closing the loop.
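The routing and auth layers can be sketched together: resolve the tenant from the Host header, then cross-check it against the tenant claim in the session. The helper names here are illustrative, not DocsFlow's actual middleware API:

```typescript
// Resolve the tenant slug from a per-tenant subdomain like acme.docsflow.app.
// Nested, empty, and reserved subdomains resolve to null (request rejected).
function tenantFromHost(host: string, baseDomain = "docsflow.app"): string | null {
  const suffix = `.${baseDomain}`;
  if (!host.endsWith(suffix)) return null;
  const sub = host.slice(0, -suffix.length);
  if (!sub || sub.includes(".") || sub === "www") return null;
  return sub;
}

// Defense in depth: the subdomain tenant must match the session's tenant
// claim, or the request never reaches a handler.
function authorize(host: string, sessionTenant: string): boolean {
  const tenant = tenantFromHost(host);
  return tenant !== null && tenant === sessionTenant;
}
```

Note that this check is redundant with RLS and the per-tenant vector namespaces — which is exactly the point: any single layer failing still leaves the others standing.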
Defense in depth is boring to build and invisible to users when it works. It's also non-negotiable for any product that is going to touch a B2B customer's actual documents.
LLM Resilience: Three-Tier Failover with Circuit Breaker
Single-provider dependencies break products. OpenAI has outages. Anthropic has rate limits. Google changes pricing on Gemini with minimal notice. DocsFlow runs on a three-tier LLM failover chain with a circuit breaker pattern: Llama 3.3 70B as primary, GPT-4o-mini as secondary, Mixtral as tertiary, with Gemini 2.0 Flash as an emergency fallback if the entire primary chain is compromised. The circuit breaker prevents the system from hammering a failing provider and automatically re-tests availability on a backoff schedule.
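The mechanics are simple enough to sketch. Here is a minimal circuit breaker plus an ordered failover chain: after a threshold of consecutive failures a provider's circuit opens and it is skipped until a cooldown elapses, at which point one trial request re-tests it. Thresholds and timings are illustrative defaults, not DocsFlow's tuned values:

```typescript
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  constructor(private threshold = 3, private cooldownMs = 30_000) {}
  canTry(now = Date.now()): boolean {
    if (this.failures < this.threshold) return true;  // circuit closed
    return now - this.openedAt >= this.cooldownMs;    // half-open: allow a re-test
  }
  onSuccess() { this.failures = 0; }
  onFailure(now = Date.now()) {
    this.failures++;
    if (this.failures >= this.threshold) this.openedAt = now; // (re)open
  }
}

// Walk the provider chain in priority order, skipping open circuits.
async function generateWithFailover<T>(
  providers: Array<{ name: string; call: () => Promise<T>; breaker: CircuitBreaker }>,
): Promise<{ provider: string; result: T }> {
  for (const p of providers) {
    if (!p.breaker.canTry()) continue;
    try {
      const result = await p.call();
      p.breaker.onSuccess();
      return { provider: p.name, result };
    } catch {
      p.breaker.onFailure();
    }
  }
  throw new Error("all providers unavailable");
}
```

The breaker is what separates this from naive retry loops: once a provider's circuit is open, subsequent queries skip it immediately instead of paying a timeout per request against a provider that is known to be down.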
The measured production result: 96% LLM generation success rate with a 12% fallback rate. That 12% represents the percentage of queries that fell back to a secondary tier at least once — meaning without the failover, roughly one in eight user queries would have errored out or returned nothing. Resilience isn't theoretical.
The Full SaaS Stack
DocsFlow is a complete product, not an AI component. That means the stack extends well beyond the RAG pipeline:
- Stripe subscription billing with per-tier enforcement and usage tracking at the query level, so pricing stays honest against actual infrastructure cost.
- 6-step guided onboarding wizard that lets new tenants customize their AI persona (role, tone, business context, focus areas) without writing any prompts themselves.
- Admin dashboard with real-time pipeline monitoring, security events, and API health checks — the operational surface I needed to run this confidently as a one-person company.
- Per-tenant AI personality, industry-specific defaults, and configurable citation styles so the assistant feels native to each customer rather than generic.
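As one concrete example of per-tier enforcement, a quota check at the query level can be as small as this — the tier names and limits below are invented for illustration, not DocsFlow's actual pricing:

```typescript
// Per-tier monthly query limits (illustrative values only).
const QUERY_LIMITS: Record<string, number> = {
  starter: 500,
  pro: 5_000,
  enterprise: Infinity,
};

// Checked before every query so usage stays honest against the billed plan.
// Unknown tiers deny by default rather than silently allowing traffic.
function canRunQuery(tier: string, usedThisPeriod: number): boolean {
  const limit = QUERY_LIMITS[tier] ?? 0;
  return usedThisPeriod < limit;
}
```

The deny-by-default on unknown tiers is the same defensive posture as the isolation layers: a billing misconfiguration should fail closed, not open.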
Production Metrics
These are measured numbers from production traffic, not synthetic benchmarks:
- 147ms average response time for retrieval alone; end-to-end semantic queries average ~784ms including LLM generation.
- 96.8% RAG retrieval accuracy.
- 85% cost reduction through intelligent caching and multi-provider model selection vs. a naive single-provider implementation.
- 96% LLM generation success rate, with the failover chain absorbing the remainder.
- 1,188 commits, 352 TypeScript files, open source and public.
Tech Stack
Next.js 15, TypeScript, Supabase (PostgreSQL with Row-Level Security), Pinecone, OpenAI Embeddings, Llama 3.3, GPT-4o-mini, Mixtral, Gemini 2.0 Flash Vision, Clerk, Stripe, Vercel. No framework magic; every component is a deliberate choice anchored to a specific production requirement.
The Takeaway
DocsFlow is the reference implementation for what I mean by "production AI" when I work with clients: hybrid retrieval (not just vector search), hierarchical retrieval for scale, four-layer tenant isolation (not just an app-layer filter), multi-provider failover (not a single OpenAI key), full billing and admin tooling (not a demo dashboard), and real observability into every stage of the pipeline. Every one of those decisions is something I can transplant into a client's own product in weeks rather than quarters, because the hard problems are already solved here.
Related reading: For the deeper pattern library behind DocsFlow's retrieval, see RAG architecture in production. For how this connects to broader multi-agent work, see Multi-Agent AI Systems Guide and the 20-agent trading ensemble case study. If you want the same architecture applied to your own documents, I take on a small number of fractional AI CTO engagements each quarter — or if you already have a codebase, you can get an instant system audit via SystemAudit. I also serve clients in Malaysia and Singapore.
Ready to discuss your AI project?
Book a free 30-minute discovery call to explore how AI can transform your business. Or if you already have a codebase, get an instant architecture report at SystemAudit.dev — no technical knowledge needed, results in 3 minutes.