System Design

Semantic Document Search Platform

Multi-tenant semantic search over 10M+ documents — hybrid dense/sparse retrieval, neural re-ranking, and a React UI with real-time result highlighting.

Qdrant · Sentence-Transformers · BM25 · FastAPI · React · Redis · Python · HuggingFace

Platform Architecture

Five layers form a complete semantic search system — from raw document storage up to the query API serving results.

  • Document Storage: S3 · PostgreSQL · Object Store
  • Indexing & Chunking: Parsing · Splitting · Dedup
  • Embedding Engine: Sentence-Transformers · GPU
  • Hybrid Retrieval: Qdrant · BM25 · Re-ranker
  • Query API: FastAPI · Redis · SSE

Retrieval Modes

The platform automatically selects and blends retrieval strategies based on query characteristics and tenant configuration.

Dense Retrieval

Semantic Vector Search

Example queries: "semantic meaning", "similar to"
  • Sentence-Transformers bi-encoder embeddings
  • Cosine similarity over HNSW index in Qdrant
  • Multilingual support via multilingual-e5-base
  • Handles paraphrases, synonyms, and related phrasings
Dominates for conceptual queries where the user knows the idea but not the exact phrasing.
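Under the hood, dense retrieval reduces to comparing embedding vectors by cosine similarity; the HNSW index in Qdrant approximates this comparison at scale. A minimal, dependency-free sketch of the scoring function (toy 3-d vectors stand in for real embeddings):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-d "embeddings": a closer direction yields a higher score
query = [0.9, 0.1, 0.0]
doc_similar = [0.8, 0.2, 0.1]
doc_unrelated = [0.0, 0.1, 0.9]

assert cosine_similarity(query, doc_similar) > cosine_similarity(query, doc_unrelated)
```

In production the vectors come from the Sentence-Transformers bi-encoder and the comparison is delegated to Qdrant rather than computed in Python.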
Sparse Retrieval

BM25 Keyword Search

Example queries: "exact phrase", "product code", "error message"
  • BM25Okapi inverted index with TF-IDF scoring
  • Exact token matching — no embedding overhead
  • Per-tenant stop-word and stemming configuration
  • Incremental index updates on document ingest
Dominates for precise keyword lookups — error codes, product IDs, or technical identifiers.
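The scoring behind BM25Okapi can be sketched in a few lines of plain Python. This is an illustrative re-implementation of the standard formula (with the library's default-style k1/b parameters), not the rank_bm25 code itself:

```python
import math
from collections import Counter

def bm25_score(query_terms, doc, corpus, k1=1.5, b=0.75):
    """BM25Okapi-style score of one tokenized document against a query."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    tf = Counter(doc)
    score = 0.0
    for term in query_terms:
        n = sum(1 for d in corpus if term in d)          # document frequency
        idf = math.log((N - n + 0.5) / (n + 0.5) + 1)    # Okapi IDF
        f = tf[term]                                     # term frequency in doc
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * len(doc) / avgdl))
    return score

corpus = [
    ["error", "code", "e404", "not", "found"],
    ["semantic", "search", "with", "embeddings"],
    ["product", "id", "lookup", "table"],
]
# An exact-token query favours the document containing the rare identifier
scores = [bm25_score(["e404"], d, corpus) for d in corpus]
assert scores[0] > scores[1]
```

Rare, exact tokens like error codes get a high IDF, which is why sparse retrieval wins for identifier lookups.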
Hybrid + Re-rank

RRF Fusion & Cross-Encoder

Example queries: "complex query", "mixed intent"
  • Reciprocal Rank Fusion merges dense + sparse sets
  • Cross-encoder re-ranker scores top-50 candidates
  • Full query–passage interaction for precision
  • Re-ranker threshold configurable per tenant
Default mode for general queries — captures recall from dense while preserving precision from sparse.
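Reciprocal Rank Fusion needs no score normalisation: each document's fused score is the sum of 1/(k + rank) over the ranked lists it appears in. A minimal sketch (k = 60 is the commonly used constant; the constant used in this platform is an assumption):

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense  = ["doc_a", "doc_b", "doc_c"]   # ranked by vector similarity
sparse = ["doc_c", "doc_a", "doc_d"]   # ranked by BM25
fused = rrf_fuse([dense, sparse])
assert fused[0] == "doc_a"  # 1st in dense, 2nd in sparse: highest fused score
```

The fused top-50 then goes to the cross-encoder for the final precision-oriented re-ordering.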

Core Components

Six specialized subsystems that together deliver production-grade semantic search at scale.

Document Ingestion

Multi-format parsing pipeline supporting PDF, HTML, and DOCX. Sentence-level chunking with configurable overlap. Content-hash deduplication prevents re-embedding unchanged documents.

multi-format parsing · sentence chunking · deduplication
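The two ingestion ideas above, overlapping sentence chunks and content-hash dedup, can be sketched as follows. Chunk size, overlap, and the hash function are illustrative assumptions, not the pipeline's actual configuration:

```python
import hashlib

def chunk_sentences(sentences: list[str], size: int = 3, overlap: int = 1) -> list[str]:
    """Group sentences into overlapping chunks (size and overlap in sentences)."""
    step = size - overlap
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + size]))
        if start + size >= len(sentences):
            break
    return chunks

def content_hash(chunk: str) -> str:
    """Stable digest used to skip re-embedding unchanged chunks."""
    return hashlib.sha256(chunk.encode("utf-8")).hexdigest()

seen: set[str] = set()

def needs_embedding(chunk: str) -> bool:
    """True only the first time this exact content is seen."""
    h = content_hash(chunk)
    if h in seen:
        return False
    seen.add(h)
    return True

sents = ["A.", "B.", "C.", "D.", "E."]
chunks = chunk_sentences(sents)            # ["A. B. C.", "C. D. E."]
assert all(needs_embedding(c) for c in chunks)
assert not needs_embedding(chunks[0])      # duplicate content is skipped
```

The overlap preserves context across chunk boundaries; the hash check is what makes re-ingesting an unchanged document a no-op.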

Embedding Pipeline

Sentence-Transformers models run batch GPU inference on document chunks. Incremental re-embedding triggers automatically when the active model checkpoint is updated in the model registry.

Sentence-Transformers · batch GPU inference · incremental re-embed

Vector Store

Qdrant with HNSW index delivers sub-100ms approximate nearest-neighbour search at 10M+ scale. Per-tenant collection namespaces ensure data isolation. Payload filters enable faceted search without post-processing.

Qdrant · HNSW index · per-tenant namespaces · payload filtering

BM25 Index

BM25Okapi inverted index built per tenant. Term-frequency store supports incremental document addition and deletion without full rebuild. Stop-word lists and stemming rules are configurable per corpus language.

BM25Okapi · incremental updates · stop-word tuning

Re-ranking Layer

Cross-encoder (ms-marco-MiniLM-L-6) re-scores the top-50 candidates from RRF fusion with full query–passage interaction. Score threshold is configurable — tenants can trade latency for precision.

cross-encoder · ms-marco-MiniLM · top-50 re-score

Search Frontend

React UI with real-time token highlighting powered by SSE streaming. Faceted sidebar filters on metadata payload fields. Infinite scroll pagination and keyboard-first navigation for power users.

React · real-time highlighting · facets · SSE streaming

Query Pipeline

Every search request fans out to parallel dense and sparse retrieval, then converges through fusion and re-ranking before results are streamed to the UI.

Query → Embed & Tokenize → Dense Search (dense branch) ‖ BM25 Search (sparse branch) → RRF Fusion → Re-rank → Highlight → Results
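The parallel fan-out to the dense and sparse branches maps naturally onto asyncio. A simplified sketch with stand-in search functions (the real branches call Qdrant and the BM25 index, and the merge step is RRF rather than the naive dedup shown here):

```python
import asyncio

async def dense_search(query: str) -> list[str]:
    # Stand-in for the Qdrant vector search branch
    await asyncio.sleep(0.01)
    return ["doc_a", "doc_b"]

async def sparse_search(query: str) -> list[str]:
    # Stand-in for the BM25 keyword branch
    await asyncio.sleep(0.01)
    return ["doc_b", "doc_c"]

async def search(query: str) -> list[str]:
    # Fan out to both branches concurrently, then merge (order-preserving dedup)
    dense, sparse = await asyncio.gather(dense_search(query), sparse_search(query))
    return list(dict.fromkeys(dense + sparse))

results = asyncio.run(search("example"))
assert results == ["doc_a", "doc_b", "doc_c"]
```

Because both branches await concurrently, query latency is bounded by the slower branch rather than the sum of the two.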

Multi-Tenancy & Scale

Hard tenant isolation and horizontal scalability without compromise on search quality.

Tenant Isolation

  • Qdrant collection-per-tenant namespaces — no shared index
  • Per-tenant query quotas enforced by Redis rate limiter
  • API key scoped to tenant ID — zero cross-tenant data bleed
  • Tenant-level index config: chunk size, re-ranker on/off, model selection
  • Tenant onboarding via admin API — no manual infra provisioning
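The per-tenant quota enforcement above uses a token bucket in Redis; the algorithm itself is simple enough to sketch in-process. An illustrative single-node version (the production limiter keeps this state in Redis so it holds across API replicas):

```python
import time

class TokenBucket:
    """Token-bucket rate limiter: refill `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate: float, capacity: int):
        self.rate = rate                  # steady-state requests per second
        self.capacity = capacity          # maximum burst size
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def allow(self) -> bool:
        """Consume one token if available; otherwise reject the request."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

bucket = TokenBucket(rate=1.0, capacity=2)   # burst of 2, then 1 req/s
assert bucket.allow() and bucket.allow()     # burst consumed
assert not bucket.allow()                    # third immediate request rejected
```

Each tenant gets its own bucket keyed by tenant ID, so one tenant's burst cannot exhaust another's quota.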

Scale & Performance

  • Index Size: 10M+ docs
  • p50 Latency: ~30 ms
  • p95 Latency: ~90 ms
  • Throughput: 500 QPS

Key Tools & Resources

The primary libraries, databases, and frameworks powering this platform.

Qdrant

High-performance vector DB with payload filtering, multi-tenant namespaces, and HNSW indexing. Rust core, Python client.

qdrant.tech →

Sentence-Transformers

Pre-trained bi-encoder models for dense embeddings. Multilingual support, fast batch inference, easy HuggingFace Hub integration.

sbert.net →

BM25 (rank_bm25)

Python BM25Okapi implementation. Lightweight inverted index for sparse keyword retrieval with configurable k1/b parameters.

GitHub →

FastAPI

Async search API layer. Native SSE streaming for progressive result delivery. OpenAPI schema auto-generated for client SDKs.

fastapi.tiangolo.com →

React

Search UI with real-time token highlight rendering, faceted filters, and infinite scroll. SSE client for streaming search snippets.

react.dev →

Redis

Query result cache with TTL-based expiry, per-tenant rate limiting via Redis token-bucket, and session store for search history.

redis.io →

HuggingFace Hub

Model registry for embedding and re-ranker checkpoints with versioned deploys. Supports model pinning per tenant for reproducibility.

huggingface.co →

ms-marco-MiniLM

Cross-encoder re-ranker fine-tuned on MS MARCO. Scores query–passage pairs for precise re-ranking of the top-50 fusion candidates.

Model card →