[ICCVW 25] LLaVA-MORE: A Comparative Study of LLMs and Visual Backbones for Enhanced Visual Instruction Tuning
Updated Aug 8, 2025 - Python
Command-line tool for extracting DINO, CLIP, and SigLIP2 features for images and videos
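Features extracted with such a tool (DINO, CLIP, or SigLIP2 embeddings) are typically compared via cosine similarity for retrieval or deduplication. The CLI itself is the repo's own; the comparison step can be sketched in plain Python (the function name is illustrative):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Identical vectors score 1.0; orthogonal ones score 0.0.
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))  # → 1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # → 0.0
```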
deepfake-detector-model-v1 is a vision-language encoder model fine-tuned from siglip2-base-patch16-512 for binary deepfake image classification. It is trained to detect whether an image is real or generated using synthetic media techniques. The model uses the SiglipForImageClassification architecture.
Facial-Emotion-Detection-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for facial-emotion classification.
Watermark-Detection-SigLIP2 is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for binary image classification. It is trained to detect whether an image contains a watermark or not, using the SiglipForImageClassification architecture.
SigLIP2 is a vision-language encoder model fine-tuned from google/siglip2-base-patch16-224
Human-Action-Recognition is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for multi-class human action recognition. It uses the SiglipForImageClassification architecture to predict human activities from still images.
Age-Classification-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to predict the age group of a person from an image using the SiglipForImageClassification architecture.
Fire-Detection-Siglip2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to detect fire, smoke, or normal conditions using the SiglipForImageClassification architecture.
Augmented-Waste-Classifier-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for waste classification.
Coral-Health is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify coral reef images into two health conditions using the SiglipForImageClassification architecture.
siglip2-mini-explicit-content is an image classification vision-language encoder model fine-tuned from siglip2-base-patch16-512 for a single-label classification task. It is designed to classify images into categories related to explicit, sensual, or safe-for-work content using the SiglipForImageClassification architecture.
Anime-Classification-v1.0 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify anime-related images using the SiglipForImageClassification architecture.
This repository offers tools and guidance for fine-tuning the Siglip2 Vision Transformer (ViT) model. It includes scripts and best practices to adapt the model for custom datasets and tasks. Designed for researchers and developers, it ensures efficient fine-tuning and optimal performance for vision-based applications.
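Adapting the model to a custom dataset mainly means swapping the classification head's label space. A minimal sketch of the label mappings that fine-tuning scripts typically pass alongside `num_labels` (the class names here are hypothetical, and the `from_pretrained` call is shown only as a comment since it downloads a checkpoint):

```python
classes = ["cat", "dog", "bird"]  # hypothetical custom classes
label2id = {c: i for i, c in enumerate(classes)}
id2label = {i: c for c, i in label2id.items()}

# With transformers installed, the head would then be re-initialized, e.g.:
# model = SiglipForImageClassification.from_pretrained(
#     "google/siglip2-base-patch16-224",
#     num_labels=len(classes),
#     id2label=id2label,
#     label2id=label2id,
#     ignore_mismatched_sizes=True,
# )
print(id2label[0])  # → cat
```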
Painting-126-DomainNet is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify paintings into 126 domain categories using the SiglipForImageClassification architecture
x-bot-profile-detection is a SigLIP2-based classification model designed to detect profile authenticity types on social media platforms (such as X/Twitter). It categorizes a profile image into four classes: bot, cyborg, real, or verified. Built on google/siglip2-base-patch16-224.
nsfw-image-detection is a vision-language encoder model fine-tuned from siglip2-base-patch16-256 for multi-class image classification. Built on the SiglipForImageClassification architecture, the model is trained to identify and categorize content types in images, especially for explicit, suggestive, or safe media filtering.
Classify handwritten digits (0-9).
Fashion-Mnist-SigLIP2 is an image classification vision-language encoder model fine-tuned from google/siglip2-base-patch16-224 for a single-label classification task. It is designed to classify images into Fashion-MNIST categories using the SiglipForImageClassification architecture.
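All of the classifiers above share the same inference pattern: the processor yields pixel values, SiglipForImageClassification emits raw logits, and softmax plus argmax selects a label from the model config's `id2label` mapping. The decoding step can be sketched in plain Python (the two-class mapping below is hypothetical):

```python
import math

def decode_logits(logits, id2label):
    """Softmax over raw logits, then map the argmax index to its label.

    Mirrors the usual post-processing for a SiglipForImageClassification
    head; `id2label` comes from the loaded checkpoint's model config.
    """
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return id2label[best], probs[best]

# Example with a hypothetical two-class watermark detector:
label, score = decode_logits([2.0, 0.5], {0: "watermark", 1: "no watermark"})
print(label)  # → watermark
```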