Scalable Multi-Modal Vision - ONNX optimized SigLIP and related foundation models
Rhys Williams
Hey there! It’s been a while; I got lost down a startup rabbit hole, but I promise a bunch of videos are on their way. You’re not going to believe where Mr-B is at now: he’s tall and programmable with Blender. More on that soon!
Now that I have some time on my hands, I went down a rabbit hole and created a project to keep track of and unify a bunch of optimized foundation models under one easy-to-use Python package. So far I have the following up and running:
- SigLIP as an FP16 ONNX representation for super fast zero-shot image classification - quantized model support is on its way
- Automatic pre- and post-processing switching - choosing CLIP as the model type falls back to cosine similarity with softmax, whereas SigLIP uses its full graph with a SciPy-based sigmoid output activation.
- Manual mode with exposed image and text encoders for each model - SigLIP also has its pre-pooled hidden output available for analysis, etc.
- An ONNX Segment Anything representation - the plan is to have each CLIP/SigLIP model’s saliency map feed directly as a multi-point prompt to SAM and its variants. More on this soon!
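To make the CLIP-vs-SigLIP switching concrete, here’s a minimal sketch of the two post-processing paths on some made-up similarity logits (the logits and shapes are illustrative, not the package’s actual internals): CLIP scores compete through a softmax, while SigLIP treats each label as an independent sigmoid probability.

```python
import numpy as np
from scipy.special import softmax, expit  # expit is SciPy's sigmoid

# Hypothetical image/text similarity logits:
# one image scored against three candidate labels
logits = np.array([[2.0, 0.5, -1.0]])

# CLIP-style output: labels compete, probabilities sum to 1
clip_probs = softmax(logits, axis=-1)

# SigLIP-style output: each label gets an independent
# probability in (0, 1), so several labels can score high at once
siglip_probs = expit(logits)

print(clip_probs)    # sums to 1 across labels
print(siglip_probs)  # independent per-label scores
```

The practical upshot is that SigLIP’s sigmoid scores are meaningful on their own, which is handy when the candidate label set is open-ended.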
The plan is to wrap a TensorRT backend behind the same usage down the road, and tie that into ChromaDB for super fast search - but for now, check out the example Gradio app. It’s still a little clunky, but the results are pretty impressive! I’d imagine this would be great for lightweight RAG too!
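For a feel of what that search step amounts to, here’s a hedged NumPy sketch of brute-force cosine-similarity lookup over precomputed image embeddings - the random embeddings and dimensions are stand-ins, and a vector store like ChromaDB would replace the brute-force part:

```python
import numpy as np

# Stand-in embeddings: 1000 "images" and one "text query",
# as if produced by the exposed image/text encoders
rng = np.random.default_rng(0)
gallery = rng.normal(size=(1000, 768)).astype(np.float32)
query = rng.normal(size=(768,)).astype(np.float32)

# L2-normalise so a dot product equals cosine similarity
gallery /= np.linalg.norm(gallery, axis=1, keepdims=True)
query /= np.linalg.norm(query)

# Brute-force nearest neighbours; a vector DB does this at scale
scores = gallery @ query
top5 = np.argsort(scores)[::-1][:5]
print(top5, scores[top5])
```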
Check it out here: https://github.com/rhysdg/vision-at-a-clip - pun definitely intended
Sources:
CLIP - https://openai.com/index/clip/
SigLIP - https://arxiv.org/abs/2303.15343
Segment Anything - https://github.com/facebookresearch/segment-anything
Chroma DB - https://www.trychroma.com/
TensorRT - https://developer.nvidia.com/tensorrt-getting-started
ONNX - https://onnx.ai/
...
https://www.youtube.com/watch?v=pmxh_Aas2OI