System Design

Semantic Document Search Platform

Multi-tenant semantic search over 10M+ documents — hybrid dense/sparse retrieval, neural re-ranking, and a React UI with real-time result highlighting.

Qdrant · Sentence-Transformers · BM25 · FastAPI · React · Redis · Python · HuggingFace

Platform Architecture

Five layers form a complete semantic search system — from raw document storage up to the query API serving results.

  • Document Storage: S3 · PostgreSQL · Object Store
  • Indexing & Chunking: Parsing · Splitting · Dedup
  • Embedding Engine: Sentence-Transformers · GPU
  • Hybrid Retrieval: Qdrant · BM25 · Re-ranker
  • Query API: FastAPI · Redis · SSE

Retrieval Modes

The platform automatically selects and blends retrieval strategies based on query characteristics and tenant configuration.

Dense Retrieval

Semantic Vector Search

Example queries: "semantic meaning", "similar to"
  • Sentence-Transformers bi-encoder embeddings
  • Cosine similarity over HNSW index in Qdrant
  • Multilingual support via multilingual-e5-base
  • Handles paraphrases, synonyms, and related phrasings
Dominates for conceptual queries where the user knows the idea but not the exact phrasing.
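Under the hood, dense retrieval reduces to comparing embedding vectors by cosine similarity; the HNSW index in Qdrant approximates this comparison at scale. A minimal, dependency-free sketch of the scoring function (toy 3-d vectors stand in for real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings": a closer direction yields a higher score
query = [0.9, 0.1, 0.0]
doc_similar = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]

assert cosine_similarity(query, doc_similar) > cosine_similarity(query, doc_unrelated)
```

In production the vectors come from the Sentence-Transformers bi-encoder and the comparison is delegated to Qdrant rather than computed in Python.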
Sparse Retrieval

BM25 Keyword Search

Example queries: "exact phrase", "product code", "error message"
  • BM25Okapi inverted index with TF-IDF scoring
  • Exact token matching — no embedding overhead
  • Per-tenant stop-word and stemming configuration
  • Incremental index updates on document ingest
Dominates for precise keyword lookups — error codes, product IDs, or technical identifiers.
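The scoring behind BM25Okapi can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the standard formula (with the library's default-style k1/b parameters), not the rank_bm25 code itself:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """BM25Okapi-style score of one tokenized document against a query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)    # Okapi IDF
        f = tf[term]                                     # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    ["error", "code", "e404", "not", "found"],
    ["semantic", "search", "with", "embeddings"],
    ["product", "id", "lookup", "table"],
]
# An exact-token query favours the document containing the rare identifier
scores = [bm25_score(["e404"], d, corpus) for d in corpus]
assert scores[0] > scores[1]
```

Rare, exact tokens like error codes get a high IDF, which is why sparse retrieval wins for identifier lookups.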
Hybrid + Re-rank

RRF Fusion & Cross-Encoder

Example queries: "complex query", "mixed intent"
  • Reciprocal Rank Fusion merges dense + sparse sets
  • Cross-encoder re-ranker scores top-50 candidates
  • Full query–passage interaction for precision
  • Re-ranker threshold configurable per tenant
Default mode for general queries — captures recall from dense while preserving precision from sparse.
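Reciprocal Rank Fusion needs no score normalisation: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in. A minimal sketch (k = 60 is the commonly used constant; the constant used in this platform is an assumption):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # ranked by vector similarity
sparse = ["doc_c", "doc_a", "doc_d"]   # ranked by BM25
fused = rrf_fuse([dense, sparse])
assert fused[0] == "doc_a"  # 1st in dense, 2nd in sparse: highest fused score
```

The fused top-50 then goes to the cross-encoder for the final precision-oriented re-ordering.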

Core Components

Six specialized subsystems that together deliver production-grade semantic search at scale.

Document Ingestion

Multi-format parsing pipeline supporting PDF, HTML, and DOCX. Sentence-level chunking with configurable overlap. Content-hash deduplication prevents re-embedding unchanged documents.

multi-format parsing · sentence chunking · deduplication
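The two ingestion ideas above, overlapping sentence chunks and content-hash dedup, can be sketched as follows. Chunk size, overlap, and the hash function are illustrative assumptions, not the pipeline's actual configuration:

```python
import hashlib

def chunk_sentences(sentences: list[str], size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into overlapping chunks (size and overlap in sentences)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break
    return chunks

def content_hash(chunk: str) -> str:
    """Stable digest used to skip re-embedding unchanged chunks."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

seen: set[str] = set()

def needs_embedding(chunk: str) -> bool:
    """True only the first time this exact content is seen."""
    h = content_hash(chunk)
    if h in seen:
        return False
    seen.add(h)
    return True

sents = ["A.", "B.", "C.", "D.", "E."]
chunks = chunk_sentences(sents)            # ["A. B. C.", "C. D. E."]
assert all(needs_embedding(c) for c in chunks)
assert not needs_embedding(chunks[0])      # duplicate content is skipped
```

The overlap preserves context across chunk boundaries; the hash check is what makes re-ingesting an unchanged document a no-op.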

Embedding Pipeline

Sentence-Transformers models run batch GPU inference on document chunks. Incremental re-embedding triggers automatically when the active model checkpoint is updated in the model registry.

Sentence-Transformers · batch GPU inference · incremental re-embed

Vector Store

Qdrant with HNSW index delivers sub-100ms approximate nearest-neighbour search at 10M+ scale. Per-tenant collection namespaces ensure data isolation. Payload filters enable faceted search without post-processing.

Qdrant · HNSW index · per-tenant namespaces · payload filtering

BM25 Index

BM25Okapi inverted index built per tenant. Term-frequency store supports incremental document addition and deletion without full rebuild. Stop-word lists and stemming rules are configurable per corpus language.

BM25Okapi · incremental updates · stop-word tuning

Re-ranking Layer

Cross-encoder (ms-marco-MiniLM-L-6) re-scores the top-50 candidates from RRF fusion with full query–passage interaction. Score threshold is configurable — tenants can trade latency for precision.

cross-encoder · ms-marco-MiniLM · top-50 re-score

Search Frontend

React UI with real-time token highlighting powered by SSE streaming. Faceted sidebar filters on metadata payload fields. Infinite scroll pagination and keyboard-first navigation for power users.

React · real-time highlighting · facets · SSE streaming

Query Pipeline

Every search request fans out to parallel dense and sparse retrieval, then converges through fusion and re-ranking before results are streamed to the UI.

Query → Embed & Tokenize → Dense Search (dense branch) ‖ BM25 Search (sparse branch) → RRF Fusion → Re-rank → Highlight → Results
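The parallel fan-out to the dense and sparse branches maps naturally onto asyncio. A simplified sketch with stand-in search functions (the real branches call Qdrant and the BM25 index, and the merge step is RRF rather than the naive dedup shown here):

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    # Stand-in for the Qdrant vector search branch
    await asyncio.sleep(0.01)
    return ["doc_a", "doc_b"]

async def sparse_search(query: str) -> list[str]:
    # Stand-in for the BM25 keyword branch
    await asyncio.sleep(0.01)
    return ["doc_b", "doc_c"]

async def search(query: str) -> list[str]:
    # Fan out to both branches concurrently, then merge (order-preserving dedup)
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return list(dict.fromkeys(dense + sparse))

results = asyncio.run(search("example"))
assert results == ["doc_a", "doc_b", "doc_c"]
```

Because both branches await concurrently, query latency is bounded by the slower branch rather than the sum of the two.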

Multi-Tenancy & Scale

Hard tenant isolation and horizontal scalability without compromise on search quality.

Tenant Isolation

  • Qdrant collection-per-tenant namespaces — no shared index
  • Per-tenant query quotas enforced by Redis rate limiter
  • API key scoped to tenant ID — zero cross-tenant data bleed
  • Tenant-level index config: chunk size, re-ranker on/off, model selection
  • Tenant onboarding via admin API — no manual infra provisioning
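The per-tenant quota enforcement above uses a token bucket in Redis; the algorithm itself is simple enough to sketch in-process. An illustrative single-node version (the production limiter keeps this state in Redis so it holds across API replicas):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refill `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # steady-state requests per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)   # burst of 2, then 1 req/s
assert bucket.allow() and bucket.allow()     # burst consumed
assert not bucket.allow()                    # third immediate request rejected
```

Each tenant gets its own bucket keyed by tenant ID, so one tenant's burst cannot exhaust another's quota.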

Scale & Performance

  • Index Size: 10M+ docs
  • p50 Latency: ~30 ms
  • p95 Latency: ~90 ms
  • Throughput: 500 QPS

Key Tools & Resources

The primary libraries, databases, and frameworks powering this platform.

Qdrant

High-performance vector DB with payload filtering, multi-tenant namespaces, and HNSW indexing. Rust core, Python client.

qdrant.tech →

Sentence-Transformers

Pre-trained bi-encoder models for dense embeddings. Multilingual support, fast batch inference, easy HuggingFace Hub integration.

sbert.net →

BM25 (rank_bm25)

Python BM25Okapi implementation. Lightweight inverted index for sparse keyword retrieval with configurable k1/b parameters.

GitHub →

FastAPI

Async search API layer. Native SSE streaming for progressive result delivery. OpenAPI schema auto-generated for client SDKs.

fastapi.tiangolo.com →

React

Search UI with real-time token highlight rendering, faceted filters, and infinite scroll. SSE client for streaming search snippets.

react.dev →

Redis

Query result cache with TTL-based expiry, per-tenant rate limiting via Redis token-bucket, and session store for search history.

redis.io →

HuggingFace Hub

Model registry for embedding and re-ranker checkpoints with versioned deploys. Supports model pinning per tenant for reproducibility.

huggingface.co →

ms-marco-MiniLM

Cross-encoder re-ranker fine-tuned on MS MARCO. Scores query–passage pairs for precise re-ranking of the top-50 fusion candidates.

Model card →