Live Demo

Vision AI — Fruit Classifier

Upload a fruit photo and compare two fundamentally different AI approaches: a zero-shot CLIP classifier running in real time via ONNX, and a multimodal large language model generating a detailed free-text description.

CLIP ViT-B/32 Zero-shot ONNX Runtime FastAPI Ollama Vision

Input

Click to upload or drag & drop

JPG, PNG, WEBP — max 10 MB

Preview

Results

Upload a fruit photo
and click Classify

Ask Ollama Vision

Local LLM
qwen3-vl:8b
Upload an image, then click "Ask Ollama" to get a free-text description from the vision model.

CLIP ViT-B/32 — Zero-shot

OpenAI's vision-language model matches image embeddings against natural-language fruit prompts at runtime. No fruit-specific training — labels are plain text like "a photo of a mango".
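The zero-shot matching can be sketched as follows. This is a minimal illustration, not the app's actual code: the embeddings are random stand-ins for real CLIP encoder outputs, and the prompt list is an example of the plain-text labels described above.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between a vector and each row of a matrix."""
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return b @ a

# Plain-text labels -- no fruit-specific training required.
prompts = ["a photo of a mango", "a photo of a banana", "a photo of an apple"]

rng = np.random.default_rng(0)
# Stand-ins: in the real app these come from the CLIP text and image encoders.
text_embeddings = rng.normal(size=(len(prompts), 512))
image_embedding = text_embeddings[1] + 0.1 * rng.normal(size=512)  # "banana-like"

scores = cosine_similarity(image_embedding, text_embeddings)
best = prompts[int(np.argmax(scores))]  # label whose embedding is closest to the image
```

The classifier never sees fruit-specific training data; swapping the label set is just a matter of editing the prompt strings.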

Quantized ONNX Runtime

Both CLIP encoders are served as INT8 quantized ONNX models (~150 MB total). Text embeddings are pre-computed once at startup — only the visual encoder (~30 ms) runs per request.
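The startup/per-request split described above can be illustrated with stub encoders (hypothetical stand-ins for the two ONNX sessions; the call counter exists only to make the pattern visible):

```python
import numpy as np

CALLS = {"text": 0, "image": 0}

def text_encoder(prompts):
    """Stand-in for the ONNX text encoder session."""
    CALLS["text"] += 1
    return np.random.default_rng(0).normal(size=(len(prompts), 512))

def image_encoder(rng):
    """Stand-in for the quantized visual encoder (~30 ms per call)."""
    CALLS["image"] += 1
    return rng.normal(size=512)

# Startup: embed every label prompt exactly once and keep the matrix in memory.
TEXT_EMB = text_encoder(["a photo of a mango", "a photo of a banana"])

def classify(rng):
    """Per-request path: one visual-encoder run plus a cheap matrix product."""
    img = image_encoder(rng)
    scores = TEXT_EMB @ img / (np.linalg.norm(TEXT_EMB, axis=1) * np.linalg.norm(img))
    return int(np.argmax(scores))

rng = np.random.default_rng(2)
for _ in range(3):
    classify(rng)  # text encoder is never touched again
```

However many requests arrive, the text encoder runs once; only the visual encoder sits on the hot path.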

Non-fruit Rejection

A dedicated "not a fruit" sentinel label — averaged from multiple reject prompts — prevents false positives. If its cosine similarity score wins, the result is rejected before returning any fruit label.
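The sentinel logic can be sketched like this; the reject prompts are illustrative and the encoder is again a random stand-in:

```python
import numpy as np

rng = np.random.default_rng(3)

def embed(prompts):
    """Stand-in for the CLIP text encoder."""
    return rng.normal(size=(len(prompts), 512))

fruit_prompts = ["a photo of a mango", "a photo of a banana"]
reject_prompts = ["a photo of an object", "a photo of a person", "a random photo"]

fruit_emb = embed(fruit_prompts)
# Sentinel: average several reject prompts into one "not a fruit" embedding.
sentinel = embed(reject_prompts).mean(axis=0)

labels = fruit_prompts + ["not a fruit"]
emb = np.vstack([fruit_emb, sentinel])
emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def classify(image_embedding):
    scores = emb @ (image_embedding / np.linalg.norm(image_embedding))
    winner = int(np.argmax(scores))
    if labels[winner] == "not a fruit":
        return None  # rejected before any fruit label is returned
    return labels[winner]

# An image embedding near the sentinel is rejected...
result = classify(sentinel + 0.05 * rng.normal(size=512))
# ...while a fruit-like embedding gets its label.
result2 = classify(fruit_emb[0])
```

Averaging several reject prompts makes the sentinel less sensitive to the exact wording of any single "not a fruit" phrase.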

Two approaches — one image

The same photo is processed by two fundamentally different systems. Here is how they differ:

CLIP Classifier
  • Model: CLIP ViT-B/32 visual encoder (quantized ONNX)
  • Method: Zero-shot — cosine similarity between image and text label embeddings
  • Output: Top-3 ranked fruit labels with confidence scores
  • Latency: ~30–80 ms on CPU — near real-time, edge-deployable
  • Strengths: Extremely fast, no GPU required, explicit non-fruit rejection
  • Limitations: Closed label vocabulary, less nuanced on rare varieties
Ollama Vision AI
  • Model: qwen3-vl:8b multimodal large language model
  • Method: Vision-language generation — image tokens fused with text context
  • Output: Free-text description: fruit name, colour, ripeness, features
  • Latency: 3–10 seconds — full LLM inference on remote GPU
  • Strengths: Rich contextual understanding, open-ended output, handles ambiguity
  • Limitations: High latency, requires a capable GPU, overkill for simple classification