Quantitative Decoding & Modeling of Pet Sentiments & Instincts using a Hybrid CNN-ViT Approach on Visual Data
We developed an end-to-end pet sentiment analysis pipeline combining Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) to quantitatively decode and model pet emotions from facial images. Inspired by the needs of veterinary care for domestic animals and the condition of street dogs in South Asian cities, the project also aims to reduce the need for a professional pet trainer in households. This repository covers:
- Dataset preparation (download, augmentation, splitting)
- Hybrid architecture combining local (ResNet-50) and global (ViT) feature extraction
- Training with mixed-precision, scheduling, and performance tracking
- Evaluation and visualization of results
- Model: In the hybrid CNN-ViT model, we did not use a standard pre-defined ViT model (like vit_b_16). Instead, we built a custom transformer encoder on top of the CNN backbone.
- UI: Streamlit web app for real-time predictions; run it locally from your terminal with `streamlit run streamlit_maths/app.py` (see the sketch below for what such an app involves).
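For orientation, here is a minimal sketch of what an inference app like `streamlit_maths/app.py` might look like. The class names, preprocessing, and model-loading details are assumptions for illustration, not the repository's exact code.

```python
# Hypothetical Streamlit inference sketch (not the exact streamlit_maths/app.py).
import streamlit as st
import torch
from PIL import Image
from torchvision import transforms

EMOTIONS = ["angry", "happy", "relaxed", "sad"]  # assumed labels for the four emotion classes

@st.cache_resource
def load_model():
    # Assumes the .pth file stores the full model object; if it is a state_dict,
    # instantiate the hybrid CNN-ViT first and call load_state_dict instead.
    model = torch.load("pet_sentiment_model.pth", map_location="cpu")
    model.eval()
    return model

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

st.title("Pet Sentiment Analysis (Hybrid CNN-ViT)")
uploaded = st.file_uploader("Upload a pet face image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    image = Image.open(uploaded).convert("RGB")
    st.image(image, caption="Input image", use_column_width=True)
    with torch.no_grad():
        probs = torch.softmax(load_model()(preprocess(image).unsqueeze(0)), dim=1).squeeze(0)
    st.write(f"Predicted emotion: **{EMOTIONS[int(probs.argmax())]}** ({probs.max().item():.1%})")
```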
We use the Oxford-IIIT Pet Dataset (37 breeds, ~7.3 K images) from the Visual Geometry Group at Oxford.
- Download: Oxford-IIIT Pet Dataset
- Original images: ~7,368
- Augmented images: ~29,000 (via `utilities/augmentation.py`)
- Split: 70% train (~20 K), 15% validation (~4 K), 15% test (~4 K) (via `utilities/split.py`; a sketch follows this list)
- The original veterinary-classified datasets used for augmentation are available here: Original Datasets
- The augmented image datasets, processed and ready for splitting, are available here: Augmented Datasets
- The final split datasets used for model training are available here: Split Datasets
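Below is a minimal sketch of how a per-class 70/15/15 split could be performed. The directory layout, file pattern, and implementation details here are assumptions for illustration, not the exact contents of `utilities/split.py`.

```python
# Hypothetical sketch of a per-class 70/15/15 split (not the exact utilities/split.py).
import random
import shutil
from pathlib import Path

def split_dataset(input_dir: str, output_dir: str, seed: int = 42) -> None:
    random.seed(seed)
    for class_dir in sorted(Path(input_dir).iterdir()):   # one sub-folder per emotion class
        if not class_dir.is_dir():
            continue
        images = sorted(class_dir.glob("*.jpg"))           # assumed image extension
        random.shuffle(images)
        n = len(images)
        bounds = {
            "train": (0, int(0.70 * n)),
            "val":   (int(0.70 * n), int(0.85 * n)),
            "test":  (int(0.85 * n), n),
        }
        for split, (lo, hi) in bounds.items():
            dest = Path(output_dir) / split / class_dir.name
            dest.mkdir(parents=True, exist_ok=True)
            for img in images[lo:hi]:
                shutil.copy2(img, dest / img.name)

if __name__ == "__main__":
    split_dataset("augmented_data", "augmented_data_split")
```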
The trained model weights are too large for this repository. You can download the `pet_sentiment_model.pth` file from the following link:
Download Model Weights from Google Drive
If you can't access the CNN_ViT Jupyter notebook, here is the Google Colab link:
Google Colab notebook: CNN-ViT Notebook
The model backbone is ResNet-50 (without its final two layers) that outputs a feature map of size 7×7×2048. A 1×1 convolution projects this to 7×7×768, which is flattened and prepended with a learnable CLS token. The resulting sequence of length 50 (1 CLS + 49 patches) is processed by a 4-layer, 8-head Transformer encoder (d_model = 768). Finally, the CLS output is fed into a linear classifier over four emotion classes.
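As an illustration of the description above, here is a hedged PyTorch sketch of how such a hybrid module could be assembled. The pretrained weights, positional embedding, and GELU feed-forward activation are assumptions; the layer shapes follow the text, but this is not the repository's exact model code.

```python
# Illustrative sketch of the hybrid CNN-ViT described above (assumptions noted; not the repo's exact code).
import torch
import torch.nn as nn
from torchvision import models

class HybridCNNViT(nn.Module):
    def __init__(self, num_classes: int = 4, d_model: int = 768,
                 num_layers: int = 4, num_heads: int = 8):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)  # assumed pretrained backbone
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool + fc -> (B, 2048, 7, 7)
        self.proj = nn.Conv2d(2048, d_model, kernel_size=1)           # 1x1 conv -> (B, 768, 7, 7)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos_embed = nn.Parameter(torch.zeros(1, 50, d_model))    # assumed positional embedding (1 CLS + 49 patches)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=num_heads, dim_feedforward=4 * d_model,
            activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feats = self.proj(self.backbone(x))                 # (B, 768, 7, 7)
        tokens = feats.flatten(2).transpose(1, 2)           # (B, 49, 768)
        cls = self.cls_token.expand(x.size(0), -1, -1)      # (B, 1, 768)
        seq = torch.cat([cls, tokens], dim=1) + self.pos_embed
        return self.classifier(self.encoder(seq)[:, 0])     # CLS output -> 4 emotion logits
```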
Refer to the diagram below for a 3D-style view of the hybrid CNN–ViT flow:
```bash
git clone https://github.com/aamodpaudel/CNN-ViT.git
cd CNN-ViT
python3 -m venv venv
source venv/bin/activate
pip install -r requirements.txt
```

Augment the dataset:

```bash
python utilities/augmentation.py \
  --input_dir classified_images \
  --output_dir augmented_data \
  --target_count 7000   # per class
```

Split into train/validation/test sets:

```bash
python utilities/split.py \
  --input_dir augmented_data \
  --output_dir augmented_data_split
```

Train the model:

```bash
python train.py \
  --data_dir augmented_data_split \
  --epochs 50 \
  --batch_size 32 \
  --lr 1e-4
```

Training outputs:
- Model saved as `pet_sentiment_model.pth`
- Metrics: `training_metrics.json`
- Plots: `epoch_vs_accuracy.png`, `epoch_vs_loss.png`, `activation_function_gelu.png`
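For reference, the mixed-precision training with scheduling and performance tracking mentioned above roughly follows the pattern below. This is a hedged sketch; the AdamW optimizer, cosine schedule, and metrics format are assumptions rather than the exact contents of `train.py`.

```python
# Hypothetical mixed-precision training loop sketch (not the exact train.py).
import json
import torch
from torch import nn, optim

def train(model, train_loader, val_loader, epochs=50, lr=1e-4, device="cuda"):
    model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr)                          # assumed optimizer
    scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs)   # assumed schedule
    scaler = torch.cuda.amp.GradScaler()
    history = {"train_loss": [], "val_acc": []}

    for epoch in range(epochs):
        model.train()
        running_loss = 0.0
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            with torch.cuda.amp.autocast():            # mixed-precision forward pass
                loss = criterion(model(images), labels)
            scaler.scale(loss).backward()
            scaler.step(optimizer)
            scaler.update()
            running_loss += loss.item() * images.size(0)
        scheduler.step()

        # track validation accuracy per epoch
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.size(0)
        history["train_loss"].append(running_loss / len(train_loader.dataset))
        history["val_acc"].append(correct / total)

    with open("training_metrics.json", "w") as f:
        json.dump(history, f, indent=2)
```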
| Split | Accuracy |
|---|---|
| Training | 0.999 |
| Validation | 0.851 |
| Test | 0.842 |
- Anatomy-aware few-shot learning for diverse species beyond pets
- Audio–visual multimodal sentiment tracking
- Advanced augmentation strategies and stronger regularization
Contributions welcome! Open issues or submit PRs for features and improvements.
MIT License. See LICENSE for details.

