CLIP ViT-B/32 (ONNX)
This repository contains the ONNX-exported version of OpenAI’s CLIP model (ViT-B/32), optimized for inference using ONNX Runtime. It supports fast image-text similarity and zero-shot classification without requiring PyTorch or TensorFlow.
Model Details
- Base Model: openai/clip-vit-base-patch32
- Export Format: ONNX
- Architecture: Vision Transformer (ViT-B/32)
- File Size: ~600 MB (FP32 version)
- Use Case: Zero-shot classification, image-text similarity, and retrieval.
Quantized Models
In addition to the standard model.onnx (FP32), this repo provides multiple quantized variants to reduce memory usage and improve inference speed:
File | Precision | Approx. Size |
---|---|---|
model_fp16.onnx | FP16 | ~303 MB |
model_quantized.onnx | INT8/Hybrid | ~153 MB |
model_q4.onnx | 4-bit | ~189 MB |
model_q4f16.onnx | 4-bit + FP16 | ~125 MB |
model_bnb4.onnx | Bits-and-Bytes 4-bit | ~181 MB |
model_uint8.onnx | 8-bit | ~152 MB |
Note: Quantized models may have slightly lower accuracy but offer better performance and smaller size. Use them with the same ONNX Runtime API.
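As a minimal sketch (assuming the quantized files sit next to model.onnx under the onnx/ folder of this repo), switching to a variant only changes the filename passed to hf_hub_download:

```python
from huggingface_hub import hf_hub_download
import onnxruntime as ort

repo_id = "sayantan47/clip-vit-b32-onnx"
# Any filename from the table above works here; model_q4f16.onnx is the smallest.
# The onnx/ prefix is assumed to mirror the layout used for model.onnx.
quant_path = hf_hub_download(repo_id=repo_id, filename="onnx/model_q4f16.onnx")
session = ort.InferenceSession(quant_path, providers=["CPUExecutionProvider"])
```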
How to Use
1. Install Dependencies
```bash
pip install onnxruntime transformers huggingface_hub pillow numpy
```
2. Load the Model and Processor
```python
from huggingface_hub import hf_hub_download
from transformers import CLIPProcessor
from PIL import Image
import onnxruntime as ort
import numpy as np

# Download the ONNX model from the Hugging Face Hub
repo_id = "sayantan47/clip-vit-b32-onnx"
onnx_model_path = hf_hub_download(repo_id=repo_id, filename="onnx/model.onnx")

# Load ONNX Runtime session
session = ort.InferenceSession(onnx_model_path, providers=["CPUExecutionProvider"])

# Load processor (tokenizer + image preprocessing)
processor = CLIPProcessor.from_pretrained(repo_id)

# Example input
image = Image.open("example.jpg")
texts = ["a dog", "a cat"]

# Preprocess: tokenize the texts and resize/normalize the image
inputs = processor(text=texts, images=image, return_tensors="np", padding=True)
# The exported graph expects int64 token ids, so cast any int32 arrays
inputs = {k: (v.astype(np.int64) if v.dtype == np.int32 else v) for k, v in inputs.items()}

# Run inference; the first output is logits_per_image
outputs = session.run(None, inputs)
logits_per_image = outputs[0]

# Softmax over the text dimension gives per-label probabilities
probs = np.exp(logits_per_image) / np.exp(logits_per_image).sum(-1, keepdims=True)
print("Probabilities:", probs)
```
Applications
- Zero-Shot Classification: Classify images by comparing them to textual descriptions.
- Image Similarity: Compare embeddings between two images or between images and text (see the sketch after this list).
- Search Engines: Use as the backbone for image-text retrieval systems.
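For the similarity and retrieval use cases, the raw embeddings are more convenient than the logits. A minimal sketch, reusing the session and inputs from the example above and assuming the Optimum export exposes text_embeds and image_embeds outputs (check session.get_outputs() to confirm the names):

```python
import numpy as np

# Map output names to arrays so we don't rely on positional order.
names = [o.name for o in session.get_outputs()]
out = dict(zip(names, session.run(None, inputs)))

image_embeds = out["image_embeds"]  # (num_images, 512) for ViT-B/32
text_embeds = out["text_embeds"]    # (num_texts, 512)

# L2-normalize, then a dot product gives cosine similarity.
image_embeds = image_embeds / np.linalg.norm(image_embeds, axis=-1, keepdims=True)
text_embeds = text_embeds / np.linalg.norm(text_embeds, axis=-1, keepdims=True)
similarity = image_embeds @ text_embeds.T  # (num_images, num_texts)
print(similarity)
```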
ONNX Runtime Performance
- CPU-only: Works out of the box with onnxruntime on CPUs.
- GPU: To use CUDA, install onnxruntime-gpu and ensure you have CUDA 12 and cuDNN 9 installed (see the snippet below for selecting the CUDA provider):

```bash
pip install onnxruntime-gpu
```
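A minimal sketch of creating a GPU session; ONNX Runtime falls back to the CPU provider if CUDA is unavailable:

```python
import onnxruntime as ort

# onnx_model_path comes from the hf_hub_download call in the example above.
session = ort.InferenceSession(
    onnx_model_path,
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # shows which providers are actually active
```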
Export Command Used
The model was exported using Hugging Face Optimum with:
```bash
python -m optimum.exporters.onnx --model=openai/clip-vit-base-patch32 onnx_model/
```
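After exporting (or downloading), the graph's input and output names can be inspected directly with ONNX Runtime; the names in the comments are what a CLIP export typically produces, so treat them as an assumption to verify rather than a guarantee:

```python
import onnxruntime as ort

# Path matches the output directory used in the export command above.
session = ort.InferenceSession("onnx_model/model.onnx", providers=["CPUExecutionProvider"])
print([i.name for i in session.get_inputs()])   # e.g. input_ids, pixel_values, attention_mask
print([o.name for o in session.get_outputs()])  # e.g. logits_per_image, logits_per_text, ...
```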