Architecture Overview

On-Premise LLM Platform

Private, air-gapped AI platform — full control over data, models, and access. No cloud dependency.

Ollama · vLLM · Keycloak · Docker · FAISS · Qdrant · Neo4j · MLflow · Prometheus · Guardrails AI · MCP

Platform Layer Stack

Four interdependent layers from bare-metal infrastructure up to end-user applications — all running on-premise.

  1. Infrastructure & Data: GPU Servers · Storage · Docker
  2. LLM Engine, RAG & Guardrails: Ollama · FAISS · Safety Rails
  3. API Gateway & AuthN/Z: Kong · Keycloak · RBAC · JWT
  4. Applications & Clients: Web UI · API Clients · Agents

Hard Requirements

Non-negotiable constraints that every on-premise LLM deployment must satisfy before going to production.

Network & Isolation

Air-Gap by Design

  • No outbound model API calls — all inference stays on-prem
  • Firewall egress rules block LLM SaaS endpoints by default
  • DMZ placement for any externally accessible inference endpoints
  • Internal DNS only — no public resolution for model endpoints
A single misconfigured egress rule can silently route prompts to a cloud provider. Verify with active traffic monitoring.
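The "verify with active traffic monitoring" advice can be partly automated with a probe that asserts known LLM SaaS endpoints are unreachable from inside the perimeter. A minimal sketch; the endpoint list is illustrative, not a vetted blocklist:

```python
import socket

# Illustrative cloud LLM endpoints that MUST be unreachable from an
# air-gapped deployment (examples only, not an exhaustive blocklist).
BLOCKED_ENDPOINTS = [
    ("api.openai.com", 443),
    ("api.anthropic.com", 443),
]

def egress_is_blocked(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port fails, as it should
    when DNS resolution or firewall egress rules block the endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection succeeded: the air gap is broken
    except OSError:
        return True  # DNS failure, refusal, or timeout: egress blocked

def verify_air_gap() -> list[str]:
    """Return any endpoints that are unexpectedly reachable."""
    return [f"{h}:{p}" for h, p in BLOCKED_ENDPOINTS
            if not egress_is_blocked(h, p)]
```

Running such a check on a schedule, and alerting when `verify_air_gap()` returns a non-empty list, catches the silent misconfiguration the note above warns about.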
Security & Compliance

Data Residency & Audit

  • GDPR data residency — all embeddings and logs stay in the defined region
  • SOC 2 audit logging — every prompt, model, user, and latency recorded
  • Prompt injection defence — input sanitisation and output validation layers
  • Data classification enforced — PII routed to restricted-access models only
Audit logs must be immutable and shipped to a SIEM. Retain logs for at least 90 days, the baseline most compliance frameworks expect.
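One way to make audit records tamper-evident before shipping them to the SIEM is to hash-chain them, so editing any entry invalidates every later one. A sketch; the field names, and the choice to log a prompt hash rather than the raw prompt, are design assumptions:

```python
import hashlib
import json
import time

def append_audit_record(log: list[dict], *, user: str, model: str,
                        prompt_sha256: str, latency_ms: float) -> dict:
    """Append a tamper-evident record: each entry embeds the hash of
    its predecessor, so any later modification breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "ts": time.time(), "user": user, "model": model,
        "prompt_sha256": prompt_sha256, "latency_ms": latency_ms,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)
    return record

def chain_is_intact(log: list[dict]) -> bool:
    """Verify the whole chain, as a SIEM-side integrity check would."""
    prev = "0" * 64
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True
```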
Hardware & Availability

GPU Sizing & HA

  • VRAM budget per model: 7B≈6GB, 13B≈12GB, 70B≈48GB (4-bit quant)
  • HA setup — at least two inference nodes with load balancing
  • Resource quotas per team/role — prevent GPU monopolisation
  • Cold-start SLA — model load time factored into p99 latency budget
VRAM requirements shown for GGUF Q4_K_M quantisation. Full-precision or larger models require proportionally more.
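The VRAM figures above follow from a simple rule of thumb: parameter count times quantised bits per weight, plus runtime overhead for the KV cache and buffers. A sketch, where the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are rough assumptions, not vendor specifications:

```python
def vram_estimate_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a quantised model.

    bits_per_weight=4.5 approximates GGUF Q4_K_M (mixed 4/6-bit);
    overhead=1.2 reserves ~20% for KV cache and runtime buffers.
    Both are rules of thumb for capacity planning, not exact figures.
    """
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weight_gb * overhead, 1)
```

For 7B, 13B, and 70B parameters this lands in the same range as the budget list above; long contexts grow the KV cache and push the overhead factor higher.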

Core Platform Components

Six capability areas that together form a production-grade on-premise LLM platform.

Model Serving

Self-hosted inference runtime with GPU acceleration, GGUF/GPTQ quantisation support, and a REST API compatible with OpenAI clients. Hot-swap models without downtime.

Ollama · vLLM · LM Studio · GGUF · GPTQ
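Because the serving runtime speaks the OpenAI wire format, existing client code only needs its base URL changed to point on-prem. A sketch against Ollama's OpenAI-compatible endpoint; the default port and model name are assumptions about a local install:

```python
import json
import urllib.request

# Ollama exposes an OpenAI-compatible API under /v1; this base URL
# assumes a default local install on port 11434.
OLLAMA_BASE = "http://localhost:11434/v1"

def build_chat_request(model: str, prompt: str,
                       temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat completion payload, so clients
    written for the OpenAI API work against the on-prem runtime."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(model: str, prompt: str) -> str:
    """Send the request to the local runtime and return the reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```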

API Gateway

Centralised request routing across multiple models with rate limiting, JWT validation, and usage metering per team. Routes traffic based on model capability requirements.

Kong · Traefik · Nginx

Auth & RBAC

Single sign-on via OIDC with LDAP/AD bridge. Role-scoped model access — e.g., only approved roles can query uncensored or high-capability models. Token-based API auth for service accounts.

Keycloak · OIDC · OAuth2 · LDAP
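Role-scoped model access reduces to a policy check against the roles carried in the validated JWT. A sketch; the role names, model tiers, and model list are invented for illustration, and in practice the roles would come from Keycloak realm-role claims:

```python
# Illustrative role -> allowed-model-tier policy (names are assumptions).
POLICY = {
    "analyst":   {"general"},
    "developer": {"general", "code"},
    "red-team":  {"general", "code", "uncensored"},
}

# Illustrative model -> tier classification.
MODEL_TIERS = {
    "llama3:8b": "general",
    "deepseek-coder:33b": "code",
    "dolphin-mixtral": "uncensored",
}

def can_query(roles: set[str], model: str) -> bool:
    """A user may query a model if any of their roles grants its tier.
    `roles` would be extracted from a signature-verified JWT."""
    tier = MODEL_TIERS.get(model)
    return tier is not None and any(
        tier in POLICY.get(role, set()) for role in roles)
```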

Knowledge & RAG

Vector stores for semantic document retrieval. RAG pipelines ingest internal knowledge bases, code repositories, and enterprise docs — keeping sensitive data entirely within the perimeter.

FAISS · Qdrant · Chroma · LangChain · LlamaIndex
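At its core, vector retrieval ranks documents by embedding similarity. A dependency-free sketch using cosine similarity over toy two-dimensional embeddings; real pipelines use FAISS or Qdrant indexes over learned embeddings, but the ranking logic is the same:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec: list[float],
             corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(corpus,
                    key=lambda doc_id: cosine(query_vec, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]
```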

Safety & Guardrails

Input/output validation layer intercepts harmful prompts, PII leakage, and policy violations before they reach end users. Output filtering prevents hallucinated credentials or confidential data exposure.

Guardrails AI · NeMo Guardrails
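The output-filtering idea can be approximated with typed redaction of detected spans before a response leaves the platform. A deliberately minimal sketch; production guardrails layer many more detectors (NER models, checksum validation, context rules) on top of patterns like these:

```python
import re

# Minimal illustrative patterns; not a complete PII taxonomy.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text
```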

Observability

Full-stack visibility into model performance, token throughput, error rates, and latency percentiles. Experiment tracking ties every inference call back to the model version and prompt template used.

MLflow · Prometheus · Grafana · Evidently AI
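The latency percentiles referenced above (including the p99 budget from the sizing constraints) can be computed with the nearest-rank method over recorded samples; monitoring stacks approximate the same quantity from histogram buckets:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of observations are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_report(latencies_ms: list[float]) -> dict:
    """Summarise a window of per-request latencies for dashboards."""
    return {"p50": percentile(latencies_ms, 50),
            "p95": percentile(latencies_ms, 95),
            "p99": percentile(latencies_ms, 99)}
```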

Data & Integration Layer

How the platform connects to live enterprise data without exporting sensitive content to external systems.

Enterprise Connectors via MCP

Model Context Protocol bridges the LLM to live enterprise data without copying it. Agents call tools in real time — no data duplication, no stale exports.

  • Confluence — knowledge base articles, technical docs
  • Jira — tickets, epics, sprints, status updates
  • SharePoint — document libraries, intranet pages
  • SQL Databases — structured queries via read-only connectors
  • REST APIs — any internal service with OpenAPI spec
MCP tool calls are logged and auditable — every data access by an agent is traceable to a user and session.
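The traceability guarantee can be sketched as a wrapper that records user and session around every tool invocation. The log schema and handler convention below are assumptions for illustration, not part of the MCP specification:

```python
import time

AUDIT_LOG: list[dict] = []

def audited_tool_call(tool: str, args: dict, *, user: str,
                      session: str, handler) -> object:
    """Invoke `handler(**args)` and record who accessed what, mirroring
    the per-call audit trail an MCP host would keep."""
    entry = {"ts": time.time(), "tool": tool, "args": args,
             "user": user, "session": session, "status": "ok"}
    try:
        return handler(**args)
    except Exception:
        entry["status"] = "error"
        raise
    finally:
        AUDIT_LOG.append(entry)
```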

Knowledge Store Options

Type            | When to use                              | Examples
Vector Store    | Semantic similarity, document Q&A        | FAISS, Qdrant, Weaviate, Chroma
Knowledge Graph | Entity relationships, multi-hop reasoning | Neo4j, Neptune, GraphDB
Hybrid          | Complex enterprise RAG                   | Neo4j + Qdrant, LangChain graph transformers
Enterprise Data → MCP / Embeddings → Vector / Graph DB → RAG Engine → LLM (On-Prem)

MLOps & Testing Pipeline

A four-phase lifecycle for managing, testing, deploying, and monitoring on-premise LLMs in production.

1. Track: MLflow experiment tracking, model registry, and prompt versioning. Log every fine-tune run with hyperparameters, eval metrics, and dataset lineage.
2. Test: Automated prompt regression suites, adversarial and red-team testing, output scoring. Tools: PromptFoo, DeepEval, custom test harnesses with CI gating.
3. Deploy: Ollama hot-swap, A/B routing in the API gateway, canary rollouts with traffic splitting, and automated rollback triggers on error-rate threshold breach.
4. Monitor: Prometheus metrics (latency, tokens/sec, error rates), Grafana dashboards, Evidently AI for drift detection. Alerts on performance degradation trigger retest cycles.

Key Open-Source Tooling

The primary open-source projects that make a production-grade on-premise LLM platform possible.

Ollama

Run LLMs locally with GGUF model management, a REST API compatible with OpenAI clients, and GPU acceleration on NVIDIA and Apple Silicon.

ollama.ai →

Keycloak

Open source identity and access management. SSO, OIDC, LDAP/AD bridge, RBAC, all fully self-hosted. No external IdP dependency.

keycloak.org →

Qdrant

High-performance vector search engine with filtering, payload indexing, and on-disk storage support. Fully self-hostable via Docker or Kubernetes.

qdrant.tech →

MCP Protocol

Open standard for connecting AI models to external tools and data sources. Enables agents to call enterprise systems without data duplication.

modelcontextprotocol.io →

MLflow

End-to-end ML lifecycle management — experiment tracking, model registry, prompt versioning, and deployment. Integrates with any training framework.

mlflow.org →

Grafana

Observability dashboards for metrics, logs, and traces. Pair with Prometheus for LLM latency, token throughput, and error rate monitoring dashboards.

grafana.com →

Guardrails AI

Output validation and safety rails for LLM responses. Schema enforcement, PII redaction, toxicity detection, and custom validators via a declarative RAIL spec.

guardrailsai.com →

Neo4j

Graph database for entity relationships and multi-hop reasoning. Used in hybrid RAG pipelines to capture structured knowledge that vector search alone cannot represent.

neo4j.com →