(WIP) PyTorch implementations of popular datasets and models for remote sensing tasks (Change Detection, Image Super Resolution, Land Cover Classification/Segmentation, Image Captioning, Audio-visual Recognition, etc.) for various Optical (Sentinel-2, Landsat, etc.) and Synthetic Aperture Radar (SAR) (Sentinel-1) sensors.
# pypi
pip install torch-rs
# pypi with training extras
pip install 'torch-rs[train]'
# latest
pip install git+https://github.com/isaaccorley/torchrs
# latest with extras
pip install 'git+https://github.com/isaaccorley/torchrs.git#egg=torch-rs[train]'
- PROBA-V Multi-Image Super Resolution
- ETCI 2021 Flood Detection
- FAIR1M - Fine-grained Object Recognition
- ADVANCE - Audiovisual Aerial Scene Recognition
- OSCD - Onera Satellite Change Detection
- S2Looking - Satellite Side-Looking Change Detection
- LEVIR-CD+ - LEVIR Change Detection+
- S2MTCP - Sentinel-2 Multitemporal Cities Pairs Change Detection
- RSVQA LR - Remote Sensing Visual Question Answering Low Resolution
- RSVQAxBEN - Remote Sensing Visual Question Answering BigEarthNet
- RSICD - Remote Sensing Image Captioning Dataset
- Sydney Captions
- UC Merced (UCM) Captions
- RESISC45 - Remote Sensing Image Scene Classification
- EuroSAT
- SAT-4 & SAT-6
The PROBA-V Super Resolution Challenge dataset is a Multi-image Super Resolution (MISR) dataset of images taken by the ESA PROBA-Vegetation satellite. The dataset contains sets of unregistered 300m low resolution (LR) images which can be used to generate single 100m high resolution (HR) images for both Near Infrared (NIR) and Red bands. In addition, Quality Masks (QM) for each LR image and Status Masks (SM) for each HR image are available. The PROBA-V contains sensors which take imagery at 100m and 300m spatial resolutions with 5 and 1 day revisit rates, respectively. Generating high resolution imagery estimates would effectively increase the frequency at which HR imagery is available for vegetation monitoring.
The dataset can be downloaded (0.83GB) using scripts/download_probav.sh
and instantiated below:
from torchrs.transforms import Compose, ToTensor
from torchrs.datasets import PROBAV
transform = Compose([ToTensor()])
dataset = PROBAV(
split="train", # or 'test'
band="RED", # or 'NIR'
x = dataset[0]
x: dict(
lr: low res images (t, 1, 128, 128)
qm: quality masks (t, 1, 128, 128)
hr: high res image (1, 384, 384)
sm: status mask (1, 384, 384)
t varies by set of images (minimum of 9)
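Because t differs between image sets, samples cannot be stacked into a batch directly. The following is a minimal sketch of one way to handle this, assuming each sample is a dict of tensors with the keys listed above and that nonzero quality-mask values mark usable pixels. The helpers `select_frames` and `probav_collate` are hypothetical names for illustration and are not part of torchrs.

```python
import torch


def select_frames(sample, t=9):
    """Keep the t low-res frames whose quality masks mark the most pixels as usable."""
    # Assumption: nonzero quality-mask values denote clear (usable) pixels.
    clear_fraction = (sample["qm"] > 0).float().mean(dim=(1, 2, 3))  # one score per frame
    keep = torch.topk(clear_fraction, k=t).indices.sort().values     # preserve original order
    out = dict(sample)
    out["lr"] = sample["lr"][keep]
    out["qm"] = sample["qm"][keep]
    return out


def probav_collate(batch, t=9):
    """Reduce every sample to t frames, then stack each key into a batch tensor."""
    batch = [select_frames(s, t) for s in batch]
    return {k: torch.stack([s[k] for s in batch]) for k in ("lr", "qm", "hr", "sm")}


def fake_sample(t_i):
    """Random stand-in with the shapes documented above (not real PROBA-V data)."""
    return dict(
        lr=torch.rand(t_i, 1, 128, 128),
        qm=torch.randint(0, 2, (t_i, 1, 128, 128)),
        hr=torch.rand(1, 384, 384),
        sm=torch.ones(1, 384, 384),
    )


batch = probav_collate([fake_sample(9), fake_sample(14), fake_sample(19)])
print({k: tuple(v.shape) for k, v in batch.items()})
```

A function like this could be passed as `collate_fn` to a `torch.utils.data.DataLoader`. Using t=9 always works because every set contains at least 9 frames.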
The ETCI 2021 Dataset is a flood detection segmentation dataset of SAR images taken by the ESA Sentinel-1 satellite. The dataset contains pairs of VV and VH polarization images processed by the Hybrid Pluggable Processing Pipeline (hyp3) along with corresponding binary flood and water body ground truth masks.
The dataset can be downloaded (5.6GB) using scripts/download_etci2021.sh
and instantiated below:
from torchrs.transforms import Compose, ToTensor
from torchrs.datasets import ETCI2021
transform = Compose([ToTensor()])
dataset = ETCI2021(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
vv: (3, 256, 256)
vh: (3, 256, 256)
flood_mask: (1, 256, 256)
water_mask: (1, 256, 256)
The FAIR1M dataset, proposed in "FAIR1M: A Benchmark Dataset for Fine-grained Object Recognition in High-Resolution Remote Sensing Imagery", Sun et al. is a fine-grained object recognition/detection dataset of 15,000 high resolution (0.3-0.8m) RGB images taken by the Gaofen (GF) satellites and extracted from Google Earth. The dataset contains rotated bounding boxes for objects of 5 categories (ships, vehicles, airplanes, courts, and roads) and 37 sub-categories. This dataset is a part of the ISPRS Benchmark on Object Detection in High-Resolution Satellite Images. Note that so far only a portion of the training dataset has been released for the challenge (1,732/15,000 images).
The dataset can be downloaded (8.7GB) using scripts/download_fair1m.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import FAIR1M
transform = T.Compose([T.ToTensor()])
dataset = FAIR1M(
split="train", # only 'train' for now
x = dataset[0]
x: dict(
x: (3, h, w)
y: (N,)
points: (N, 5, 2)
where N is the number of objects in the image
The AuDio Visual Aerial sceNe reCognition datasEt (ADVANCE) dataset, proposed in "Cross-Task Transfer for Geotagged Audiovisual Aerial Scene Recognition", Hu et al. is a dataset composed of 5,075 pairs of geotagged audio recordings and 512x512 RGB images extracted from FreeSound and Google Earth, respectively. The images are then labeled into 13 scene categories using OpenStreetMap.
The dataset can be downloaded (4.5GB) using scripts/download_advance.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import ADVANCE
image_transform = T.Compose([T.ToTensor()])
audio_transform = T.Compose([])
dataset = ADVANCE(
x = dataset[0]
x: dict(
image: (3, 512, 512)
audio: (1, 220500)
cls: int
['airport', 'beach', 'bridge', 'farmland', 'forest', 'grassland', 'harbour', 'lake',
'orchard', 'residential', 'sparse shrub land', 'sports land', 'train station']
The Onera Satellite Change Detection (OSCD) dataset, proposed in "Urban Change Detection for Multispectral Earth Observation Using Convolutional Neural Networks", Daudt et al. is a change detection dataset of multispectral (MS) images taken by the ESA Sentinel-2 satellite. The dataset contains 24 registered image pairs from multiple continents between 2015-2018 along with binary change masks.
The dataset can be downloaded (0.73GB) using scripts/download_oscd.sh
and instantiated below:
from torchrs.transforms import Compose, ToTensor
from torchrs.datasets import OSCD
transform = Compose([ToTensor(permute_dims=False)])
dataset = OSCD(
split="train", # or 'test'
x = dataset[0]
x: dict(
x: (2, 13, h, w)
mask: (1, h, w)
The S2Looking dataset, proposed in "S2Looking: A Satellite Side-Looking Dataset for Building Change Detection", Shen et al. is a rural building change detection dataset of 5,000 1024x1024 0.5-0.8m registered RGB image pairs of varying off-nadir angles taken by the Gaofen (GF), SuperView (SV), and BeiJing-2 (BJ-2) satellites. The dataset contains separate new and demolished building masks from regions all over the Earth with a time span of 1-3 years. This dataset was proposed along with the LEVIR-CD+ dataset and is considered difficult due to the rural locations and off-nadir angles.
The dataset can be downloaded (11GB) using scripts/download_s2looking.sh
and instantiated below:
from torchrs.transforms import Compose, ToTensor
from torchrs.datasets import S2Looking
transform = Compose([ToTensor()])
dataset = S2Looking(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
x: (2, 3, 1024, 1024)
build_mask: (1, 1024, 1024),
demolish_mask: (1, 1024, 1024)
The LEVIR-CD+ dataset, proposed in "S2Looking: A Satellite Side-Looking Dataset for Building Change Detection", Shen et al. is an urban building change detection dataset of 985 1024x1024 0.5m RGB image pairs extracted from Google Earth. The dataset contains building/land use change masks from 20 different regions of Texas between 2002-2020 with a time span of 5 years. This dataset was proposed along with the S2Looking dataset and is considered the easier version due to the urban locations and near-nadir angles.
The dataset can be downloaded (3.6GB) using scripts/download_levircd_plus.sh
and instantiated below:
from torchrs.transforms import Compose, ToTensor
from torchrs.datasets import LEVIRCDPlus
transform = Compose([ToTensor()])
dataset = LEVIRCDPlus(
split="train", # or 'test'
x = dataset[0]
x: dict(
x: (2, 3, 1024, 1024)
mask: (1, 1024, 1024)
The Sentinel-2 Multitemporal Cities Pairs (S2MTCP) dataset, proposed in "Self-supervised pre-training enhances change detection in Sentinel-2 imagery", Leenstra et al. is an urban change detection dataset of 1,520 medium resolution 10m unregistered image pairs taken by the ESA Sentinel-2 satellite. The dataset does not contain change masks and was originally used for self-supervised pretraining for other downstream change detection tasks (e.g. the OSCD dataset). The images are roughly 600x600 in shape and contain all Sentinel-2 bands of the Level 1C (L1C) product resampled to 10m.
The dataset can be downloaded (10GB/139GB compressed/uncompressed) using scripts/download_s2mtcp.sh
and instantiated below:
from torchrs.transforms import Compose, ToTensor
from torchrs.datasets import S2MTCP
transform = Compose([ToTensor()])
dataset = S2MTCP(
x = dataset[0] # (2, 14, h, w)
The RSVQA LR dataset, proposed in "RSVQA: Visual Question Answering for Remote Sensing Data", Lobry et al. is a visual question answering (VQA) dataset of 772 256x256 RGB images taken by the ESA Sentinel-2 satellite. Each image is annotated with a set of questions and their corresponding answers. Among other applications, this dataset can be used to train VQA models to perform detailed scene understanding of medium resolution remote sensing imagery.
The dataset can be downloaded (0.2GB) using scripts/download_rsvqa_lr.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import RSVQALR
transform = T.Compose([T.ToTensor()])
dataset = RSVQALR(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
x: (3, 256, 256)
questions: List[str]
answers: List[str]
types: List[str]
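Since each sample bundles one image with several questions, VQA training loops often need one example per question. A minimal sketch of that expansion follows, assuming the dict layout shown above. The helper `flatten_vqa` is a hypothetical name, and the questions, answers, and types in the toy sample are made-up placeholders rather than real annotations.

```python
def flatten_vqa(sample):
    """Expand one image-level sample into one (image, question, answer, type) tuple per question."""
    return [
        (sample["x"], question, answer, qtype)
        for question, answer, qtype in zip(
            sample["questions"], sample["answers"], sample["types"]
        )
    ]


# Toy sample: the image is a placeholder string and the annotations are invented.
sample = {
    "x": "image-tensor-placeholder",
    "questions": ["Is there a water area?", "How many buildings are there?"],
    "answers": ["yes", "0"],
    "types": ["presence", "count"],
}

pairs = flatten_vqa(sample)
for image, question, answer, qtype in pairs:
    print(qtype, "|", question, "->", answer)
```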
The RSVQAxBEN dataset, proposed in "RSVQA Meets BigEarthNet: A New, Large-Scale, Visual Question Answering Dataset for Remote Sensing", Lobry et al. is a version of the BigEarthNet dataset with visual question answering (VQA) annotations using the same method applied to generate annotations for the RSVQA LR dataset. The dataset consists of 120x120 RGB Sentinel-2 imagery annotated with a set of questions and their corresponding answers.
The dataset can be downloaded (35.4GB) using scripts/download_rsvqaxben.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import RSVQAxBEN
transform = T.Compose([T.ToTensor()])
dataset = RSVQAxBEN(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
x: (3, 120, 120)
questions: List[str]
answers: List[str]
types: List[str]
The RSICD dataset, proposed in "Exploring Models and Data for Remote Sensing Image Caption Generation", Lu et al. is an image captioning dataset with 5 captions per image for 10,921 224x224 RGB images extracted using Google Earth, Baidu Map, MapABC and Tianditu. While one of the larger remote sensing image captioning datasets, this dataset contains very repetitive language with little detail and many captions are duplicated.
The dataset can be downloaded (0.57GB) using scripts/download_rsicd.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import RSICD
transform = T.Compose([T.ToTensor()])
dataset = RSICD(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
x: (3, 224, 224)
captions: List[str]
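Given the note above about repetitive and duplicated captions, it can be useful to measure duplication before training. The following is a rough sketch, assuming captions are available as one list of strings per image as in the sample layout above. The helper `caption_duplication` is a hypothetical name, and the toy captions are invented for illustration.

```python
from collections import Counter


def caption_duplication(captions_per_image):
    """Return (total captions, unique captions, duplicated fraction)."""
    # Normalise case and whitespace so trivially different strings match.
    flat = [
        " ".join(caption.lower().split())
        for captions in captions_per_image
        for caption in captions
    ]
    counts = Counter(flat)
    total, unique = len(flat), len(counts)
    return total, unique, 1 - unique / total


# Two toy images with 5 captions each (not real RSICD captions).
toy = [
    ["many green trees are in a forest", "Many green trees are in a forest",
     "a dense forest", "a dense forest", "green trees"],
    ["a river runs through a forest", "a dense forest",
     "a river", "a river", "a river runs through a forest"],
]

total, unique, dup = caption_duplication(toy)
print(total, unique, dup)
```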
The Sydney Captions dataset, proposed in "Deep semantic understanding of high resolution remote sensing image", Qu et al. is a version of the Sydney scene classification dataset proposed in "Saliency-Guided Unsupervised Feature Learning for Scene Classification", Zhang et al. The dataset contains 613 500x500 1ft resolution RGB images of Sydney, Australia extracted using Google Earth and is annotated with 5 captions per image.
The dataset can be downloaded (0.44GB) using scripts/download_sydney_captions.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import SydneyCaptions
transform = T.Compose([T.ToTensor()])
dataset = SydneyCaptions(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
x: (3, 500, 500)
captions: List[str]
The UC Merced (UCM) Captions dataset, proposed in "Deep semantic understanding of high resolution remote sensing image", Qu et al. is a version of the UCM land use classification dataset proposed in "Bag-Of-Visual-Words and Spatial Extensions for Land-Use Classification", Yang et al. The dataset contains 2100 256x256 1ft resolution RGB images of urban locations around the U.S. extracted from the USGS National Map Urban Area Imagery collection and is annotated with 5 captions per image.
The dataset can be downloaded (0.4GB) using scripts/download_ucm_captions.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import UCMCaptions
transform = T.Compose([T.ToTensor()])
dataset = UCMCaptions(
split="train", # or 'val', 'test'
x = dataset[0]
x: dict(
x: (3, 256, 256)
captions: List[str]
The RESISC45 dataset, proposed in "Remote Sensing Image Scene Classification: Benchmark and State of the Art", Cheng et al. is a scene classification dataset of 31,500 RGB images extracted using Google Earth Engine. The dataset contains 45 scenes with 700 images per class from over 100 countries and was selected to optimize for high variability in image conditions (spatial resolution, occlusion, weather, illumination, etc.).
The dataset can be downloaded (0.47GB) using scripts/download_resisc45.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import RESISC45
transform = T.Compose([T.ToTensor()])
dataset = RESISC45(
x, y = dataset[0]
x: (3, 256, 256)
y: int
['airplane', 'airport', 'baseball_diamond', 'basketball_court', 'beach', 'bridge', 'chaparral',
'church', 'circular_farmland', 'cloud', 'commercial_area', 'dense_residential', 'desert', 'forest',
'freeway', 'golf_course', 'ground_track_field', 'harbor', 'industrial_area', 'intersection', 'island',
'lake', 'meadow', 'medium_residential', 'mobile_home_park', 'mountain', 'overpass', 'palace', 'parking_lot',
'railway', 'railway_station', 'rectangular_farmland', 'river', 'roundabout', 'runway', 'sea_ice', 'ship',
'snowberg', 'sparse_residential', 'stadium', 'storage_tank', 'tennis_court', 'terrace', 'thermal_power_station', 'wetland']
The EuroSAT dataset, proposed in "EuroSAT: A Novel Dataset and Deep Learning Benchmark for Land Use and Land Cover Classification", Helber et al. is a land cover classification dataset of 27,000 64x64 images taken by the ESA Sentinel-2 satellite. The dataset contains 10 land cover classes with 2-3k images per class from over 34 European countries. The dataset is available in the form of RGB only or all 13 Multispectral (MS) Sentinel-2 bands. This dataset is fairly easy with ~98.6% accuracy achievable with a ResNet-50.
The RGB and MS versions can be downloaded (0.13GB and 2.8GB, respectively) using scripts/download_eurosat_rgb.sh or scripts/download_eurosat_ms.sh and instantiated below:
import torchvision.transforms as T
from torchrs.transforms import ToTensor
from torchrs.datasets import EuroSATRGB, EuroSATMS
transform = T.Compose([T.ToTensor()])
dataset = EuroSATRGB(
x, y = dataset[0]
x: (3, 64, 64)
y: int
transform = T.Compose([ToTensor()])
dataset = EuroSATMS(
x, y = dataset[0]
x: (13, 64, 64)
y: int
['AnnualCrop', 'Forest', 'HerbaceousVegetation', 'Highway', 'Industrial',
'Pasture', 'PermanentCrop', 'Residential', 'River', 'SeaLake']
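For quick visual checks of the multispectral version, an RGB composite can be pulled out of the 13-band tensor. This is a sketch only: it assumes the bands are stored in the order B01, B02, B03, B04, ..., so that red, green, and blue sit at indices 3, 2, and 1. That ordering is an assumption to verify against your copy of the data, and `ms_to_rgb` is a hypothetical helper, not a torchrs function.

```python
import torch

# Assumption: band order B01, B02 (blue), B03 (green), B04 (red), ...
RGB_BANDS = (3, 2, 1)


def ms_to_rgb(x, bands=RGB_BANDS):
    """Select three bands from a (13, h, w) tensor and min-max scale them to [0, 1]."""
    rgb = x[list(bands)].float()
    lo, hi = rgb.min(), rgb.max()
    return (rgb - lo) / (hi - lo).clamp(min=1e-6)


# Random stand-in for a multispectral sample with the documented shape.
x = torch.randint(0, 4000, (13, 64, 64)).float()
rgb = ms_to_rgb(x)
print(tuple(rgb.shape))
```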
The SAT-4 & SAT-6 datasets, proposed in "DeepSat - A Learning framework for Satellite Imagery", Basu et al. are land cover classification datasets of 500k and 405k 28x28 RGBN images, respectively, sampled across the Continental United States (CONUS) and extracted from the National Agriculture Imagery Program (NAIP). The SAT-4 and SAT-6 datasets contain 4 and 6 land cover classes, respectively. These datasets are fairly easy with ~80% accuracy achievable with a 5-layer CNN.
The dataset can be downloaded (2.7GB) using scripts/download_sat.sh
and instantiated below:
import torchvision.transforms as T
from torchrs.datasets import SAT4, SAT6
transform = T.Compose([T.ToTensor()])
dataset = SAT4(
split="train" # or 'test'
x, y = dataset[0]
x: (4, 28, 28)
y: int
['barren land', 'trees', 'grassland', 'other']
dataset = SAT6(
split="train" # or 'test'
x, y = dataset[0]
x: (4, 28, 28)
y: int
['barren land', 'trees', 'grassland', 'roads', 'buildings', 'water']
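Because the images include a near-infrared band, simple vegetation indices can be computed directly from a sample. The sketch below computes NDVI, assuming the four channels are ordered red, green, blue, NIR as the RGBN name suggests. Both the channel order and the helper name `ndvi` are assumptions for illustration.

```python
import torch


def ndvi(x, eps=1e-6):
    """Normalized difference vegetation index for a (4, h, w) RGBN tensor."""
    # Assumption: channel 0 is red and channel 3 is near-infrared.
    red, nir = x[0], x[3]
    return (nir - red) / (nir + red + eps)


# Random stand-in for a SAT-4/SAT-6 sample scaled to [0, 1].
x = torch.rand(4, 28, 28)
index = ndvi(x)
print(tuple(index.shape))
```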
- Multi-Image Super Resolution - RAMS
- Change Detection - FC-EF, FC-Siam-conc, and FC-Siam-diff
- Change Detection - EarlyFusion (EF) and Siamese (Siam)
Residual Attention Multi-image Super-resolution Network (RAMS) from "Multi-Image Super Resolution of Remotely Sensed Images Using Residual Attention Deep Neural Networks", Salvetti et al. (2021)
RAMS is currently one of the top performers on the PROBA-V Super Resolution Challenge. This Multi-image Super Resolution (MISR) architecture utilizes attention based methods to extract spatial and spatiotemporal features from a set of low resolution images to form a single high resolution image. Note that the attention methods are effectively Squeeze-and-Excitation blocks from "Squeeze-and-Excitation Networks", Hu et al.
import torch
from torchrs.models import RAMS
# increase resolution by factor of 3 (e.g. 128x128 -> 384x384)
model = RAMS(
# Input should be of shape (bs, t, c, h, w), where t is the number
# of low resolution input images and c is the number of channels/bands
lr = torch.randn(1, 9, 1, 128, 128)
sr = model(lr) # (1, 1, 384, 384)
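To make the Squeeze-and-Excitation idea mentioned above concrete, here is a minimal 2D sketch of such a block. It is not the RAMS implementation in torchrs (which also attends over the temporal dimension). The class name `SqueezeExcite` and the `reduction` value are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SqueezeExcite(nn.Module):
    """Channel attention: rescale each feature map by a learned gate in (0, 1)."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                                    # squeeze: (b, c, 1, 1)
            nn.Conv2d(channels, channels // reduction, kernel_size=1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),  # restore channels
            nn.Sigmoid(),                                               # per-channel weights
        )

    def forward(self, x):
        # Broadcast the (b, c, 1, 1) gate over the spatial dimensions.
        return x * self.gate(x)


block = SqueezeExcite(channels=32)
features = torch.randn(1, 32, 128, 128)
out = block(features)
print(tuple(out.shape))
```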
Change Detection - Fully Convolutional Early Fusion (FC-EF), Siamese Concatenation (FC-Siam-conc), and Siamese Difference (FC-Siam-diff)
Fully Convolutional Early Fusion (FC-EF), Siamese Concatenation (FC-Siam-conc), and Siamese Difference (FC-Siam-diff) are change detection segmentation architectures proposed in "Fully Convolutional Siamese Networks for Change Detection", Daudt et al. The architectures are essentially modified U-Nets from "U-Net: Convolutional Networks for Biomedical Image Segmentation", Ronneberger et al. FC-EF is a U-Net which takes as input the concatenated images. FC-Siam-conc and FC-Siam-diff are U-Nets with a shared encoder which concatenate or take the difference of the skip connections, respectively. All three models have been modified to work with any number of input images t and channels c.
import torch
from torchrs.models import FCEF, FCSiamConc, FCSiamDiff
model = FCEF(
model = FCSiamConc(
model = FCSiamDiff(
x = torch.randn(1, 2, 3, 128, 128) # (b, t, c, h, w)
model(x) # (b, num_classes, h, w)
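The three fusion strategies differ mainly in where the two images are combined. The sketch below illustrates this with plain tensors and a single shared convolution standing in for the encoder. It is an illustration of the idea only, not the torchrs model code, and the 16-channel toy encoder is an arbitrary choice.

```python
import torch
import torch.nn as nn

b, t, c, h, w = 1, 2, 3, 128, 128
x = torch.randn(b, t, c, h, w)

# Early fusion (FC-EF): fold the time axis into the channel axis before the network.
early = x.reshape(b, t * c, h, w)                       # (1, 6, 128, 128)

# Siamese variants: one shared encoder is applied to each image separately.
encoder = nn.Conv2d(c, 16, kernel_size=3, padding=1)    # toy stand-in for a U-Net encoder stage
feats = [encoder(x[:, i]) for i in range(t)]            # t tensors of shape (1, 16, 128, 128)

# FC-Siam-conc: concatenate the per-image skip features along channels.
skip_conc = torch.cat(feats, dim=1)                     # (1, 32, 128, 128)

# FC-Siam-diff: use the absolute difference of the skip features instead.
skip_diff = torch.abs(feats[0] - feats[1])              # (1, 16, 128, 128)

print(tuple(early.shape), tuple(skip_conc.shape), tuple(skip_diff.shape))
```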
Early Fusion (EF) and Siamese (Siam) are change detection architectures proposed along with the OSCD (Onera Satellite Change Detection) dataset in "Urban Change Detection for Multispectral Earth Observation Using Convolutional Neural Networks", Daudt et al. The architectures are effectively CNN classifiers which are trained to classify whether the central pixel of a set (typically a pair) of input patches contains change/no change. EF takes as input the concatenated images while Siam extracts feature vectors using a shared CNN and then feeds the concatenated vectors to an MLP classifier. Both models expect patches of size Cx15x15 but have been modified to work with any number of input images t and channels c.
import torch
from torchrs.models import EarlyFusion, Siam
model = EarlyFusion(
model = Siam(
x = torch.randn(1, 2, 3, 15, 15) # (b, t, c, h, w)
model(x) # (b, num_classes), one prediction per patch
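Since these models classify the central pixel of each patch, producing a full change map means extracting one 15x15 patch per pixel. A minimal sketch of that step follows, assuming a registered image pair of shape (t, c, H, W) and reflect padding at the borders. The helper `extract_patches` is a hypothetical name and not part of torchrs.

```python
import torch
import torch.nn.functional as F


def extract_patches(pair, size=15):
    """(t, c, H, W) -> (H*W, t, c, size, size), one patch centred on each pixel."""
    t, c, H, W = pair.shape
    pad = size // 2
    # Reflect-pad so that border pixels also get a full-size patch.
    padded = F.pad(pair, (pad, pad, pad, pad), mode="reflect")
    # Slide a window over rows, then columns: (t, c, H, W, size, size).
    windows = padded.unfold(2, size, 1).unfold(3, size, 1)
    # Move the pixel grid to the front and flatten it into a batch dimension.
    return windows.permute(2, 3, 0, 1, 4, 5).reshape(H * W, t, c, size, size)


pair = torch.randn(2, 3, 32, 32)
patches = extract_patches(pair)
print(tuple(patches.shape))
```

Each row of `patches` can then be fed to the classifier in mini-batches, and the predictions reshaped back to (H, W) to form the change map.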
For training purposes, each model and dataset has been adapted into PyTorch Lightning LightningModules and LightningDataModules, respectively. The modules can be found in torchrs.train.modules and torchrs.train.datamodules. Among other things, PyTorch Lightning reduces boilerplate code, requires minimal rewriting for multi-GPU/cluster training, and supports mixed precision training, gradient accumulation, callbacks, logging metrics, etc.
To use the training features, torch-rs must be installed with the train extras:
# pypi
pip install 'torch-rs[train]'
# latest
pip install 'git+https://github.com/isaaccorley/torchrs.git#egg=torch-rs[train]'
A simple training example:
import torch
import torch.nn as nn
import pytorch_lightning as pl
from torchrs.train.modules import FCEFModule
from torchrs.train.datamodules import LEVIRCDPlusDataModule
from torchrs.transforms import Compose, ToTensor
def collate_fn(batch):
x = torch.stack([x["x"] for x in batch])
y = torch.cat([x["mask"] for x in batch])
x = x.to(torch.float32)
y = y.to(torch.long)
return x, y
transform = Compose([ToTensor()])
model = FCEFModule(channels=3, t=2, num_classes=2, lr=1E-3)
dm = LEVIRCDPlusDataModule(
callbacks = [
    pl.callbacks.ModelCheckpoint(monitor="val_loss", mode="min", verbose=True, save_top_k=1),
    pl.callbacks.EarlyStopping(monitor="val_loss", mode="min", patience=10),
]
trainer = pl.Trainer(
trainer.fit(model, datamodule=dm)
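The collate_fn in the example above can be sanity-checked without downloading any data. The sketch below repeats the same logic on random stand-in samples, assuming the LEVIR-CD+ sample layout (an "x" image pair and a "mask"), and confirms that the resulting target has the (b, h, w) integer form expected by a cross-entropy loss. The 64x64 size is chosen only to keep the check fast.

```python
import torch
import torch.nn.functional as F


def collate_fn(batch):
    """Stack image pairs to (b, t, c, h, w) and masks to (b, h, w)."""
    x = torch.stack([sample["x"] for sample in batch])
    y = torch.cat([sample["mask"] for sample in batch])  # each mask is (1, h, w)
    return x.to(torch.float32), y.to(torch.long)


# Random stand-ins with the same layout as the dataset samples.
batch = [
    dict(x=torch.rand(2, 3, 64, 64), mask=torch.randint(0, 2, (1, 64, 64)))
    for _ in range(4)
]
x, y = collate_fn(batch)

# Dummy logits of shape (b, num_classes, h, w) to confirm the target is compatible.
logits = torch.randn(4, 2, 64, 64)
loss = F.cross_entropy(logits, y)
print(tuple(x.shape), tuple(y.shape), x.dtype, y.dtype)
```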
$ pytest -ra