Quantitative Decoding & Modeling of Pet Sentiments & Instincts using a Hybrid CNN-ViT Approach on Visual Data
We developed an end-to-end pet sentiment analysis pipeline combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to quantitatively decode and model pet emotions from facial images. Inspired by the needs of veterinary care for domestic animals and the condition of street dogs in South Asian cities, the project also aims to reduce the need for a professional pet trainer in households. This repository covers:
- Dataset preparation (download, augmentation, splitting)
- Hybrid architecture combining local (ResNet-50) and global (ViT) feature extraction
- Training with mixed-precision, scheduling, and performance tracking
- Evaluation and visualization of results
- Model: In the hybrid CNN-ViT model, we did not use a standard pre-defined ViT model (like vit_b_16). Instead, we built a custom transformer encoder on top of the CNN backbone.
- UI: Streamlit web app for real-time predictions; run it locally from your terminal with `streamlit run streamlit_maths/app.py` (see the sketch below for what such an app involves).
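For orientation, here is a minimal sketch of what an inference app like `streamlit_maths/app.py` might look like. The class names, preprocessing, and model-loading details are assumptions for illustration, not the repository's exact code.

```python
# Hypothetical Streamlit inference sketch (not the exact streamlit_maths/app.py).
import streamlit as st
import torch
from PIL import Image
from torchvision import transforms

EMOTIONS = ["angry", "happy", "relaxed", "sad"]  # assumed labels for the four emotion classes

@st.cache_resource
def load_model():
    # Assumes the .pth file stores the full model object; if it is a state_dict,
    # instantiate the hybrid CNN-ViT first and call load_state_dict instead.
    model = torch.load("pet_sentiment_model.pth", map_location="cpu")
    model.eval()
    return model

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

st.title("Pet Sentiment Analysis (Hybrid CNN-ViT)")
uploaded = st.file_uploader("Upload a pet face image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Input image", use_column_width=True)
    with torch.no_grad():
        probs = torch.softmax(load_model()(preprocess(image).unsqueeze(0)), dim=1).squeeze(0)
    st.write(f"Predicted emotion: **{EMOTIONS[int(probs.argmax())]}** ({probs.max().item():.1%})")
```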
We use the Oxford-IIIT Pet Dataset (37 breeds, ~7.3 K images) from the Visual Geometry Group at Oxford.
- Download: Oxford-IIIT Pet Dataset
- Original images: ~7,368
- Augmented images: ~29,000 (via `utilities/augmentation.py`)
- Split: 70% train (~20 K), 15% validation (~4 K), 15% test (~4 K) (via `utilities/split.py`; a sketch follows this list)
- The original veterinary-classified datasets used for augmentation are available here: Original Datasets
- The augmented image datasets, processed and ready for splitting, are available here: Augmented Datasets
- The final split datasets used for model training are available here: Split Datasets
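Below is a minimal sketch of how a per-class 70/15/15 split could be performed. The directory layout, file pattern, and implementation details here are assumptions for illustration, not the exact contents of `utilities/split.py`.

```python
# Hypothetical sketch of a per-class 70/15/15 split (not the exact utilities/split.py).
import random
import shutil
from pathlib import Path

def split_dataset(input_dir: str, output_dir: str, seed: int = 42) -> None:
    random.seed(seed)
    for class_dir in sorted(Path(input_dir).iterdir()):   # one sub-folder per emotion class
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpg"))           # assumed image extension
        random.shuffle(images)
        n = len(images)
        bounds = {
            "train": (0, int(0.70 * n)),
            "val":   (int(0.70 * n), int(0.85 * n)),
            "test":  (int(0.85 * n), n),
        }
        for split, (lo, hi) in bounds.items():
            dest = Path(output_dir) / split / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for img in images[lo:hi]:
                shutil.copy2(img, dest / img.name)

if __name__ == "__main__":
    split_dataset("augmented_data", "augmented_data_split")
```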
The trained model weights are too large for this repository. You can download the `pet_sentiment_model.pth` file from the following link:
Download Model Weights from Google Drive
If you can't access the CNN_ViT Jupyter notebook, here is the Google Colab link:
Google Colab notebook: CNN-ViT Notebook
The model backbone is ResNet-50 (without its final two layers) that outputs a feature map of size 7×7×2048. A 1×1 convolution projects this to 7×7×768, which is flattened and prepended with a learnable CLS token. The resulting sequence of length 50 (1 CLS + 49 patches) is processed by a 4-layer, 8-head Transformer encoder (d_model = 768). Finally, the CLS output is fed into a linear classifier over four emotion classes.
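As an illustration of the description above, here is a hedged PyTorch sketch of how such a hybrid module could be assembled. The pretrained weights, positional embedding, and GELU feed-forward activation are assumptions; the layer shapes follow the text, but this is not the repository's exact model code.

```python
# Illustrative sketch of the hybrid CNN-ViT described above (assumptions noted; not the repo's exact code).
import torch
import torch.nn as nn
from torchvision import models

class HybridCNNViT(nn.Module):
    def __init__(self, num_classes: int = 4, d_model: int = 768,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # assumed pretrained backbone
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc -> (B, 2048, 7, 7)
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)           # 1x1 conv -> (B, 768, 7, 7)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, 50, d_model))    # assumed positional embedding (1 CLS + 49 patches)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(self.backbone(x))                 # (B, 768, 7, 7)
        tokens = feats.flatten(2).transpose(1, 2)           # (B, 49, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, 768)
        seq = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.classifier(self.encoder(seq)[:, 0])     # CLS output -> 4 emotion logits
```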
Refer to the diagram below for a 3D-style view of the hybrid CNN–ViT flow:
```bash
git clone https://github.com/aamodpaudel/CNN-ViT.git
cd CNN-ViT
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Augment the dataset:

```bash
python utilities/augmentation.py \
  --input_dir classified_images \
  --output_dir augmented_data \
  --target_count 7000   # per class
```

Split into train/validation/test sets:

```bash
python utilities/split.py \
  --input_dir augmented_data \
  --output_dir augmented_data_split
```

Train the model:

```bash
python train.py \
  --data_dir augmented_data_split \
  --epochs 50 \
  --batch_size 32 \
  --lr 1e-4
```

Training outputs:
- Model saved as `pet_sentiment_model.pth`
- Metrics: `training_metrics.json`
- Plots: `epoch_vs_accuracy.png`, `epoch_vs_loss.png`, `activation_function_gelu.png`
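For reference, the mixed-precision training with scheduling and performance tracking mentioned above roughly follows the pattern below. This is a hedged sketch; the AdamW optimizer, cosine schedule, and metrics format are assumptions rather than the exact contents of `train.py`.

```python
# Hypothetical mixed-precision training loop sketch (not the exact train.py).
import json
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, epochs=50, lr=1e-4, device="cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr)                          # assumed optimizer
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)   # assumed schedule
    scaler = torch.cuda.amp.GradScaler()
    history = {"train_loss": [], "val_acc": []}

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():            # mixed-precision forward pass
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            running_loss += loss.item() * images.size(0)
        scheduler.step()

        # track validation accuracy per epoch
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        history["train_loss"].append(running_loss / len(train_loader.dataset))
        history["val_acc"].append(correct / total)

    with open("training_metrics.json", "w") as f:
        json.dump(history, f, indent=2)
```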
| Split | Accuracy |
|---|---|
| Training | 0.999 |
| Validation | 0.851 |
| Test | 0.842 |
- Anatomy-aware few-shot learning for diverse species beyond pets
- Audio–visual multimodal sentiment tracking
- Advanced augmentation strategies and stronger regularization
Contributions welcome! Open issues or submit PRs for features and improvements.
MIT License. See LICENSE for details.

