Architecture Overview

On-Premise LLM Platform

Private, air-gapped AI platform — full control over data, models, and access. No cloud dependency.

Ollama · vLLM · Keycloak · Docker · FAISS · Qdrant · Neo4j · MLflow · Prometheus · Guardrails AI · MCP

Platform Layer Stack

Four interdependent layers from bare-metal infrastructure up to end-user applications — all running on-premise.

  1. Infrastructure & Data: GPU Servers · Storage · Docker
  2. LLM Engine, RAG & Guardrails: Ollama · FAISS · Safety Rails
  3. API Gateway & AuthN/Z: Kong · Keycloak · RBAC · JWT
  4. Applications & Clients: Web UI · API Clients · Agents

Hard Requirements

Non-negotiable constraints that every on-premise LLM deployment must satisfy before going to production.

Network & Isolation

Air-Gap by Design

  • No outbound model API calls — all inference stays on-prem
  • Firewall egress rules block LLM SaaS endpoints by default
  • DMZ placement for any externally accessible inference endpoints
  • Internal DNS only — no public resolution for model endpoints
A single misconfigured egress rule can silently route prompts to a cloud provider. Verify with active traffic monitoring.
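The "verify with active traffic monitoring" advice can be partly automated with a probe that asserts known LLM SaaS endpoints are unreachable from inside the perimeter. A minimal sketch; the endpoint list is illustrative, not a vetted blocklist:

```python
import socket

# Illustrative cloud LLM endpoints that MUST be unreachable from an
# air-gapped deployment (examples only, not an exhaustive blocklist).
BLOCKED_ENDPOINTS = [
    ("api.openai.com", 443),
    ("api.anthropic.com", 443),
]

def egress_is_blocked(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port fails, as it should
    when DNS resolution or firewall egress rules block the endpoint."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return False  # connection succeeded: the air gap is broken
    except OSError:
        return True  # DNS failure, refusal, or timeout: egress blocked

def verify_air_gap() -> list[str]:
    """Return any endpoints that are unexpectedly reachable."""
    return [f"{h}:{p}" for h, p in BLOCKED_ENDPOINTS
            if not egress_is_blocked(h, p)]
```

Running such a check on a schedule, and alerting when `verify_air_gap()` returns a non-empty list, catches the silent misconfiguration the note above warns about.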
Security & Compliance

Data Residency & Audit

  • GDPR data residency — all embeddings and logs stay in the defined region
  • SOC 2 audit logging — every prompt, model, user, and latency recorded
  • Prompt injection defence — input sanitisation and output validation layers
  • Data classification enforced — PII routed to restricted-access models only
Audit logs must be immutable and shipped to a SIEM. Retain logs for at least 90 days, the baseline most compliance frameworks expect.
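One way to make audit records tamper-evident before shipping them to the SIEM is to hash-chain them, so editing any entry invalidates every later one. A sketch; the field names, and the choice to log a prompt hash rather than the raw prompt, are design assumptions:

```python
import hashlib
import json
import time

def append_audit_record(log: list[dict], *, user: str, model: str,
                        prompt_sha256: str, latency_ms: float) -> dict:
    """Append a tamper-evident record: each entry embeds the hash of
    its predecessor, so any later modification breaks the chain."""
    prev_hash = log[-1]["hash"] if log else "0" * 64
    record = {
        "ts": time.time(), "user": user, "model": model,
        "prompt_sha256": prompt_sha256, "latency_ms": latency_ms,
        "prev_hash": prev_hash,
    }
    record["hash"] = hashlib.sha256(
        json.dumps(record, sort_keys=True).encode()).hexdigest()
    log.append(record)
    return record

def chain_is_intact(log: list[dict]) -> bool:
    """Verify the whole chain, as a SIEM-side integrity check would."""
    prev = "0" * 64
    for rec in log:
        body = {k: v for k, v in rec.items() if k != "hash"}
        expected = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode()).hexdigest()
        if rec["prev_hash"] != prev or rec["hash"] != expected:
            return False
        prev = rec["hash"]
    return True
```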
Hardware & Availability

GPU Sizing & HA

  • VRAM budget per model: 7B≈6GB, 13B≈12GB, 70B≈48GB (4-bit quant)
  • HA setup — at least two inference nodes with load balancing
  • Resource quotas per team/role — prevent GPU monopolisation
  • Cold-start SLA — model load time factored into p99 latency budget
VRAM requirements shown for GGUF Q4_K_M quantisation. Full-precision or larger models require proportionally more.
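The VRAM figures above follow from a simple rule of thumb: parameter count times quantised bits per weight, plus runtime overhead for the KV cache and buffers. A sketch, where the ~4.5 effective bits for Q4_K_M and the 20% overhead factor are rough assumptions, not vendor specifications:

```python
def vram_estimate_gb(params_billion: float,
                     bits_per_weight: float = 4.5,
                     overhead: float = 1.2) -> float:
    """Rough VRAM needed to serve a quantised model.

    bits_per_weight=4.5 approximates GGUF Q4_K_M (mixed 4/6-bit);
    overhead=1.2 reserves ~20% for KV cache and runtime buffers.
    Both are rules of thumb for capacity planning, not exact figures.
    """
    weight_gb = params_billion * 1e9 * bits_per_weight / 8 / 1e9
    return round(weight_gb * overhead, 1)
```

For 7B, 13B, and 70B parameters this lands in the same range as the budget list above; long contexts grow the KV cache and push the overhead factor higher.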

Core Platform Components

Six capability areas that together form a production-grade on-premise LLM platform.

Model Serving

Self-hosted inference runtime with GPU acceleration, GGUF/GPTQ quantisation support, and a REST API compatible with OpenAI clients. Hot-swap models without downtime.

Ollama · vLLM · LM Studio · GGUF · GPTQ
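Because the serving runtime speaks the OpenAI wire format, existing client code only needs its base URL changed to point on-prem. A sketch against Ollama's OpenAI-compatible endpoint; the default port and model name are assumptions about a local install:

```python
import json
import urllib.request

# Ollama exposes an OpenAI-compatible API under /v1; this base URL
# assumes a default local install on port 11434.
OLLAMA_BASE = "http://localhost:11434/v1"

def build_chat_request(model: str, prompt: str,
                       temperature: float = 0.2) -> dict:
    """Build an OpenAI-style chat completion payload, so clients
    written for the OpenAI API work against the on-prem runtime."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

def chat(model: str, prompt: str) -> str:
    """Send the request to the local runtime and return the reply text."""
    req = urllib.request.Request(
        f"{OLLAMA_BASE}/chat/completions",
        data=json.dumps(build_chat_request(model, prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```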

API Gateway

Centralised request routing across multiple models with rate limiting, JWT validation, and usage metering per team. Routes traffic based on model capability requirements.

Kong · Traefik · Nginx

Auth & RBAC

Single sign-on via OIDC with LDAP/AD bridge. Role-scoped model access — e.g., only approved roles can query uncensored or high-capability models. Token-based API auth for service accounts.

Keycloak · OIDC · OAuth2 · LDAP
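Role-scoped model access reduces to a policy check against the roles carried in the validated JWT. A sketch; the role names, model tiers, and model list are invented for illustration, and in practice the roles would come from Keycloak realm-role claims:

```python
# Illustrative role -> allowed-model-tier policy (names are assumptions).
POLICY = {
    "analyst":   {"general"},
    "developer": {"general", "code"},
    "red-team":  {"general", "code", "uncensored"},
}

# Illustrative model -> tier classification.
MODEL_TIERS = {
    "llama3:8b": "general",
    "deepseek-coder:33b": "code",
    "dolphin-mixtral": "uncensored",
}

def can_query(roles: set[str], model: str) -> bool:
    """A user may query a model if any of their roles grants its tier.
    `roles` would be extracted from a signature-verified JWT."""
    tier = MODEL_TIERS.get(model)
    return tier is not None and any(
        tier in POLICY.get(role, set()) for role in roles)
```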

Knowledge & RAG

Vector stores for semantic document retrieval. RAG pipelines ingest internal knowledge bases, code repositories, and enterprise docs — keeping sensitive data entirely within the perimeter.

FAISS · Qdrant · Chroma · LangChain · LlamaIndex
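At its core, vector retrieval ranks documents by embedding similarity. A dependency-free sketch using cosine similarity over toy two-dimensional embeddings; real pipelines use FAISS or Qdrant indexes over learned embeddings, but the ranking logic is the same:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec: list[float],
             corpus: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(corpus,
                    key=lambda doc_id: cosine(query_vec, corpus[doc_id]),
                    reverse=True)
    return ranked[:k]
```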

Safety & Guardrails

Input/output validation layer intercepts harmful prompts, PII leakage, and policy violations before they reach end users. Output filtering prevents hallucinated credentials or confidential data exposure.

Guardrails AI · NeMo Guardrails
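The output-filtering idea can be approximated with typed redaction of detected spans before a response leaves the platform. A deliberately minimal sketch; production guardrails layer many more detectors (NER models, checksum validation, context rules) on top of patterns like these:

```python
import re

# Minimal illustrative patterns; not a complete PII taxonomy.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN":   re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace detected PII spans with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label} REDACTED]", text)
    return text
```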

Observability

Full-stack visibility into model performance, token throughput, error rates, and latency percentiles. Experiment tracking ties every inference call back to the model version and prompt template used.

MLflow · Prometheus · Grafana · Evidently AI
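The latency percentiles referenced above (including the p99 budget from the sizing constraints) can be computed with the nearest-rank method over recorded samples; monitoring stacks approximate the same quantity from histogram buckets:

```python
import math

def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of observations are less than or equal to it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

def latency_report(latencies_ms: list[float]) -> dict:
    """Summarise a window of per-request latencies for dashboards."""
    return {"p50": percentile(latencies_ms, 50),
            "p95": percentile(latencies_ms, 95),
            "p99": percentile(latencies_ms, 99)}
```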

Data & Integration Layer

How the platform connects to live enterprise data without exporting sensitive content to external systems.

Enterprise Connectors via MCP

Model Context Protocol bridges the LLM to live enterprise data without copying it. Agents call tools in real time — no data duplication, no stale exports.

  • Confluence — knowledge base articles, technical docs
  • Jira — tickets, epics, sprints, status updates
  • SharePoint — document libraries, intranet pages
  • SQL Databases — structured queries via read-only connectors
  • REST APIs — any internal service with OpenAPI spec
MCP tool calls are logged and auditable — every data access by an agent is traceable to a user and session.
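The traceability guarantee can be sketched as a wrapper that records user and session around every tool invocation. The log schema and handler convention below are assumptions for illustration, not part of the MCP specification:

```python
import time

AUDIT_LOG: list[dict] = []

def audited_tool_call(tool: str, args: dict, *, user: str,
                      session: str, handler) -> object:
    """Invoke `handler(**args)` and record who accessed what, mirroring
    the per-call audit trail an MCP host would keep."""
    entry = {"ts": time.time(), "tool": tool, "args": args,
             "user": user, "session": session, "status": "ok"}
    try:
        return handler(**args)
    except Exception:
        entry["status"] = "error"
        raise
    finally:
        AUDIT_LOG.append(entry)
```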

Knowledge Store Options

Type            | When to use                              | Examples
Vector Store    | Semantic similarity, document Q&A        | FAISS, Qdrant, Weaviate, Chroma
Knowledge Graph | Entity relationships, multi-hop reasoning | Neo4j, Neptune, GraphDB
Hybrid          | Complex enterprise RAG                   | Neo4j + Qdrant, LangChain graph transformers
Enterprise Data → MCP / Embeddings → Vector / Graph DB → RAG Engine → LLM (On-Prem)

MLOps & Testing Pipeline

A four-phase lifecycle for managing, testing, deploying, and monitoring on-premise LLMs in production.

1. Track: MLflow experiment tracking, model registry, and prompt versioning. Log every fine-tune run with hyperparameters, eval metrics, and dataset lineage.
2. Test: Automated prompt regression suites, adversarial and red-team testing, output scoring. Tools: PromptFoo, DeepEval, custom test harnesses with CI gating.
3. Deploy: Ollama hot-swap, A/B routing in the API gateway, canary rollouts with traffic splitting, and automated rollback triggers on error-rate threshold breach.
4. Monitor: Prometheus metrics (latency, tokens/sec, error rates), Grafana dashboards, Evidently AI for drift detection. Alerts on performance degradation trigger retest cycles.

Key Open-Source Tooling

The primary open-source projects that make a production-grade on-premise LLM platform possible.

Ollama

Run LLMs locally with GGUF model management, a REST API compatible with OpenAI clients, and GPU acceleration on NVIDIA and Apple Silicon.

ollama.ai →

Keycloak

Open source identity and access management. SSO, OIDC, LDAP/AD bridge, RBAC, all fully self-hosted. No external IdP dependency.

keycloak.org →

Qdrant

High-performance vector search engine with filtering, payload indexing, and on-disk storage support. Fully self-hostable via Docker or Kubernetes.

qdrant.tech →

MCP Protocol

Open standard for connecting AI models to external tools and data sources. Enables agents to call enterprise systems without data duplication.

modelcontextprotocol.io →

MLflow

End-to-end ML lifecycle management — experiment tracking, model registry, prompt versioning, and deployment. Integrates with any training framework.

mlflow.org →

Grafana

Observability dashboards for metrics, logs, and traces. Pair with Prometheus for LLM latency, token throughput, and error rate monitoring dashboards.

grafana.com →

Guardrails AI

Output validation and safety rails for LLM responses. Schema enforcement, PII redaction, toxicity detection, and custom validators via a declarative RAIL spec.

guardrailsai.com →

Neo4j

Graph database for entity relationships and multi-hop reasoning. Used in hybrid RAG pipelines to capture structured knowledge that vector search alone cannot represent.

neo4j.com →