This is a WIP - I'll likely change the architecture significantly.
This repository contains code and training utilities for building a track-level classifier of invasive species (e.g., possums, rodents, cats, hedgehogs) using thermal camera trap videos from The Cacophony Project.
The goal is to support conservation by automatically detecting and classifying animals in thermal footage, enabling better monitoring and control of invasive species.
We use the public NZ Thermal Wildlife dataset:
- 121,190 thermal videos, recorded mostly at night across New Zealand.
- Each video is annotated with one or more tracks (continuous movement).
- Tracks include:
- bounding boxes (
points
) - frame indices
- species labels (45 categories, including “false positive”).
- bounding boxes (
- We use only the non-filtered mp4s (not the background-subtracted ones).
- Data layout (after download):
nz_thermal_data/ videos/ # <id>.mp4 individual-metadata/ # <id>_metadata.json with tracks + points new-zealand-wildlife-thermal-imaging.json
See the official dataset page for download instructions.
We build a track-level classifier:
- Backbone: DINOv3 ViT-B/16, loaded from a local checkpoint (
dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth
). - Frozen backbone (optionally unfreeze last N blocks).
- Temporal pooling:
- Mean+Max pooling (
meanmax
, default), or - GRU head (
gru
).
- Mean+Max pooling (
- Outputs per-track species classification logits.
Normalization:
- Grayscale frames are replicated to 3 channels.
- Resized to 224×224.
- Normalized with ImageNet mean/std (
0.485, 0.456, 0.406
/0.229, 0.224, 0.225
) to match DINOv3 pretraining.
We use Conda for system deps (Python, PyTorch, CUDA) and Poetry for Python deps.
conda env create -f env.yml
conda activate invasive-classifier
pip install poetry
poetry install
Key Python deps:
timm
(ViT + DINOv3 models)torchvision
opencv-python-headless
matplotlib
,pandas
,tqdm
,tensorboard
Run training:
python invasive_classifier/train/train.py
This will:
- Build the dataset + dataloaders.
- Load DINOv3 backbone with local weights.
- Train a temporal classifier head.
- Log training progress with tqdm and TensorBoard.
- Save checkpoints under
artifacts/
.
CFG = dict(
root="PATH/TO/nz_thermal_data",
local_ckpt="dinov3_vitb16_pretrain_lvd1689m-73cec8be.pth",
out="artifacts",
logdir="runs/invasive",
epochs=10,
batch_size=8,
num_workers=8,
num_samples=24,
size=224,
lr_head=1e-4,
lr_backbone=5e-5,
weight_decay=0.05,
grad_clip=1.0,
unfreeze_last_n_blocks=0,
use_weighted_sampler=True,
)
- Debug crops are saved to
artifacts/debug_plots/
and logged to TensorBoard. - Class F1 scores are written per eval step into JSON under
artifacts/
.
This repository is for research purposes. Dataset: Community Data License Agreement (permissive variant).
See The Cacophony Project for details.