Upload a fruit photo and compare two fundamentally different AI approaches: a zero-shot CLIP classifier running in real time via ONNX, and a multimodal large language model generating a detailed free-text description.
Input
Click to upload or drag & drop
JPG, PNG, WEBP — max 10 MB
Results
Upload a fruit photo
and click Classify
OpenAI's CLIP vision-language model compares the image embedding against natural-language fruit prompts at runtime. No fruit-specific training is needed: labels are plain text like "a photo of a mango".
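A minimal sketch of the zero-shot idea, using random vectors as stand-ins for real CLIP encoder outputs (the prompt list mirrors the plain-text labels above; the 512-dim embeddings and the encoders themselves are assumptions, not the app's actual model):

```python
import numpy as np

# Labels are just text prompts; in CLIP they are embedded by the text encoder
# into the same vector space as images.
prompts = ["a photo of a mango", "a photo of a banana", "a photo of an apple"]

def normalize(v):
    # L2-normalize so dot products equal cosine similarity.
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-ins for encoder outputs (random here; real ones come from CLIP).
rng = np.random.default_rng(0)
text_emb = normalize(rng.normal(size=(len(prompts), 512)))
image_emb = normalize(rng.normal(size=512))

# Zero-shot classification: cosine similarity against every prompt, best wins.
scores = text_emb @ image_emb
best = prompts[int(np.argmax(scores))]
```

Because labels are free text, adding a new fruit means adding one prompt string, with no retraining.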
Both CLIP encoders are served as INT8 quantized ONNX models (~150 MB total). Text embeddings are pre-computed once at startup — only the visual encoder (~30 ms) runs per request.
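The startup-time caching pattern described above can be sketched as follows. The class and the stand-in text encoder are hypothetical; in the real service each encoder call would be an INT8 ONNX Runtime session, not a hash-seeded random vector:

```python
import hashlib
import numpy as np

class ZeroShotClassifier:
    def __init__(self, prompts, encode_text):
        # Done once at startup: embed and L2-normalize every label prompt,
        # so no text-encoder work happens per request.
        emb = np.stack([encode_text(p) for p in prompts])
        self.text_emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        self.prompts = prompts

    def classify(self, image_emb):
        # Per request: only the visual encoder produced image_emb;
        # scoring the cached prompts is a single matrix-vector product.
        image_emb = image_emb / np.linalg.norm(image_emb)
        scores = self.text_emb @ image_emb
        return self.prompts[int(np.argmax(scores))], scores

# Hypothetical stand-in for the text encoder: a deterministic pseudo-embedding.
def fake_encode_text(prompt):
    seed = int.from_bytes(hashlib.sha256(prompt.encode()).digest()[:8], "big")
    return np.random.default_rng(seed).normal(size=64)

clf = ZeroShotClassifier(["a photo of a mango", "a photo of a banana"],
                         fake_encode_text)
label, scores = clf.classify(np.random.default_rng(1).normal(size=64))
```

Moving the text encoder out of the request path is what keeps per-request latency down to the visual encoder's ~30 ms.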
A dedicated "not a fruit" sentinel label, averaged from multiple reject prompts, guards against false positives: if its cosine similarity wins, the image is rejected and no fruit label is returned.
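The sentinel mechanism can be sketched like this, with tiny hand-picked vectors in place of real embeddings (function names and the 3-dim vectors are illustrative assumptions):

```python
import numpy as np

def build_sentinel(reject_embs):
    # Average several reject-prompt embeddings into one sentinel vector,
    # then re-normalize so it competes on cosine similarity like any label.
    m = np.mean(reject_embs, axis=0)
    return m / np.linalg.norm(m)

def classify_with_reject(image_emb, fruit_embs, fruit_labels, sentinel):
    image_emb = image_emb / np.linalg.norm(image_emb)
    fruit_scores = fruit_embs @ image_emb
    # If the sentinel outscores every fruit prompt, reject before labeling.
    if sentinel @ image_emb >= fruit_scores.max():
        return None
    return fruit_labels[int(np.argmax(fruit_scores))]

# Toy example: orthogonal unit vectors make the outcome easy to follow.
fruit_labels = ["mango", "banana"]
fruit_embs = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
sentinel = build_sentinel(np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]))

rejected = classify_with_reject(np.array([0.0, 0.0, 1.0]),
                                fruit_embs, fruit_labels, sentinel)  # None
accepted = classify_with_reject(np.array([1.0, 0.1, 0.0]),
                                fruit_embs, fruit_labels, sentinel)  # "mango"
```

Averaging several differently worded reject prompts makes the sentinel less sensitive to any single phrasing.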
The same photo is processed by two fundamentally different systems. Here is how they differ: