Live Demo

Vision AI — Fruit Classifier

Upload a fruit photo and compare two fundamentally different AI approaches: a zero-shot CLIP classifier running in real time via ONNX, and a multimodal large language model generating a detailed free-text description.

CLIP ViT-B/32 Zero-shot ONNX Runtime FastAPI Ollama Vision

Input

Click to upload or drag & drop

JPG, PNG, WEBP — max 10 MB

Preview

Results

Upload a fruit photo
and click Classify

Ask Ollama Vision

Local LLM
qwen3-vl:8b
Upload an image, then click "Ask Ollama" to get a free-text description from the vision model.

CLIP ViT-B/32 — Zero-shot

OpenAI's vision-language model matches image embeddings against natural-language fruit prompts at runtime. No fruit-specific training — labels are plain text like "a photo of a mango".
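The zero-shot matching can be sketched as follows. This is a minimal illustration, not the app's actual code: the embeddings are random stand-ins for real CLIP encoder outputs, and the prompt list is an example of the plain-text labels described above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between a vector and each row of a matrix."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

# Plain-text labels -- no fruit-specific training required.
prompts = ["a photo of a mango", "a photo of a banana", "a photo of an apple"]

rng = np.random.default_rng(0)
# Stand-ins: in the real app these come from the CLIP text and image encoders.
text_embeddings = rng.normal(size=(len(prompts), 512))
image_embedding = text_embeddings[1] + 0.1 * rng.normal(size=512)  # "banana-like"

scores = cosine_similarity(image_embedding, text_embeddings)
best = prompts[int(np.argmax(scores))]  # label whose embedding is closest to the image
```

The classifier never sees fruit-specific training data; swapping the label set is just a matter of editing the prompt strings.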

Quantized ONNX Runtime

Both CLIP encoders are served as INT8 quantized ONNX models (~150 MB total). Text embeddings are pre-computed once at startup — only the visual encoder (~30 ms) runs per request.
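The startup/per-request split described above can be illustrated with stub encoders (hypothetical stand-ins for the two ONNX sessions; the call counter exists only to make the pattern visible):

```python
import numpy as np

CALLS = {"text": 0, "image": 0}

def text_encoder(prompts):
    """Stand-in for the ONNX text encoder session."""
    CALLS["text"] += 1
    return np.random.default_rng(0).normal(size=(len(prompts), 512))

def image_encoder(rng):
    """Stand-in for the quantized visual encoder (~30 ms per call)."""
    CALLS["image"] += 1
    return rng.normal(size=512)

# Startup: embed every label prompt exactly once and keep the matrix in memory.
TEXT_EMB = text_encoder(["a photo of a mango", "a photo of a banana"])

def classify(rng):
    """Per-request path: one visual-encoder run plus a cheap matrix product."""
    img = image_encoder(rng)
    scores = TEXT_EMB @ img / (np.linalg.norm(TEXT_EMB, axis=1) * np.linalg.norm(img))
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
for _ in range(3):
    classify(rng)  # text encoder is never touched again
```

However many requests arrive, the text encoder runs once; only the visual encoder sits on the hot path.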

Non-fruit Rejection

A dedicated "not a fruit" sentinel label — averaged from multiple reject prompts — prevents false positives. If its cosine similarity score wins, the result is rejected before returning any fruit label.
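The sentinel logic can be sketched like this; the reject prompts are illustrative and the encoder is again a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)

def embed(prompts):
    """Stand-in for the CLIP text encoder."""
    return rng.normal(size=(len(prompts), 512))

fruit_prompts = ["a photo of a mango", "a photo of a banana"]
reject_prompts = ["a photo of an object", "a photo of a person", "a random photo"]

fruit_emb = embed(fruit_prompts)
# Sentinel: average several reject prompts into one "not a fruit" embedding.
sentinel = embed(reject_prompts).mean(axis=0)

labels = fruit_prompts + ["not a fruit"]
emb = np.vstack([fruit_emb, sentinel])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def classify(image_embedding):
    scores = emb @ (image_embedding / np.linalg.norm(image_embedding))
    winner = int(np.argmax(scores))
    if labels[winner] == "not a fruit":
        return None  # rejected before any fruit label is returned
    return labels[winner]

# An image embedding near the sentinel is rejected...
result = classify(sentinel + 0.05 * rng.normal(size=512))
# ...while a fruit-like embedding gets its label.
result2 = classify(fruit_emb[0])
```

Averaging several reject prompts makes the sentinel less sensitive to the exact wording of any single "not a fruit" phrase.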

Two approaches — one image

The same photo is processed by two fundamentally different systems. Here is how they differ:

CLIP Classifier
  • Model: CLIP ViT-B/32 visual encoder (quantized ONNX)
  • Method: Zero-shot — cosine similarity between image and text label embeddings
  • Output: Top-3 ranked fruit labels with confidence scores
  • Latency: ~30–80 ms on CPU — near real-time, edge-deployable
  • Strengths: Extremely fast, no GPU required, explicit non-fruit rejection
  • Limitations: Closed label vocabulary, less nuanced on rare varieties
Ollama Vision AI
  • Model: qwen3-vl:8b multimodal large language model
  • Method: Vision-language generation — image tokens fused with text context
  • Output: Free-text description: fruit name, colour, ripeness, features
  • Latency: 3–10 seconds — full LLM inference on remote GPU
  • Strengths: Rich contextual understanding, open-ended output, handles ambiguity
  • Limitations: High latency, requires a capable GPU, overkill for simple classification