Project based on the original U-Net paper by Olaf Ronneberger, Philipp Fischer and Thomas Brox (2015)
- Ikonos-2 multispectral images consist of a Blue, Green, Red, and Near-Infrared channel. Ikonos-2 images come at a spatial resolution of 0.8 meters and a radiometric resolution of 11 bits.
- The initial training phase includes samples from 10 sub-areas of an image of the greater Thessaloniki region, Greece, acquired in spring. This phase aims to give initial performance evaluations and an estimate of generalization capabilities on images of different distributions (e.g. acquired in other seasons), before the dataset distribution is expanded.
- Sample areas were delineated in QGIS, and samples were collected in the same way from industrial and urban environments. Further samples were taken from irregular background areas. The extracted rasters were processed into normalized tiles, separated into positive and negative samples, and stored in HDF5 format. About 1/6 of each sub-area was kept for validation.
- Data was normalized to the [0, 1] interval prior to storage by dividing by 2**11, the 11-bit radiometric range.
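A minimal sketch of the normalization and storage step (the tile arrays here are random stand-ins for the extracted samples, and the file name is illustrative):

```python
import numpy as np
import h5py

RADIOMETRIC_RANGE = 2 ** 11  # Ikonos-2 data are quantized to 11 bits

def normalize_tile(tile: np.ndarray) -> np.ndarray:
    """Scale raw 11-bit digital numbers to the [0, 1] interval."""
    return tile.astype(np.float32) / RADIOMETRIC_RANGE

# Illustrative stand-ins for extracted 4-channel 256 x 256 tiles.
positive_tiles = [np.random.randint(0, RADIOMETRIC_RANGE, (4, 256, 256), dtype=np.uint16)]
negative_tiles = [np.random.randint(0, RADIOMETRIC_RANGE, (4, 256, 256), dtype=np.uint16)]

# Positive and negative samples are stored in separate HDF5 datasets.
with h5py.File("samples.h5", "w") as f:
    f.create_dataset("positive", data=np.stack([normalize_tile(t) for t in positive_tiles]))
    f.create_dataset("negative", data=np.stack([normalize_tile(t) for t in negative_tiles]))
```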
Training mainly followed the recommendations of Ronneberger et al. (2015), though without the additional weighting of edge pixels suggested in the paper. Additional training ideas and methods, such as class balancing, were adopted from *Deep Learning with PyTorch* by Eli Stevens, Luca Antiga and Thomas Viehmann (2020).
- Adam was used with a high momentum term (beta1), following the high-momentum recommendation of Ronneberger et al. (2015); beta2 was kept at its default value. A sketch of this setup follows the list below.
- A tile size of 256 × 256 was chosen, since it was found to produce cleaner samples and allowed for a better separation of tiles into negative (label 0) and positive (label 1) examples.
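A minimal sketch of the optimizer setup (the exact beta1 value and learning rate are assumptions, not values taken from this repository):

```python
import torch

model = torch.nn.Conv2d(4, 2, kernel_size=3, padding=1)  # stand-in for the U-Net

# High beta1 (the momentum-like term), mirroring the high momentum (0.99)
# recommended by Ronneberger et al. (2015); beta2 stays at its default 0.999.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, betas=(0.99, 0.999))
```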
Augmentation includes affine transformations (translation, rotation, scaling and shear), noise, brightness and contrast adjustment, as well as elastic deformation. Elastic deformation was implemented according to the Microsoft paper *Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis* (Simard et al., 2003). Sketches of the elastic deformation and the 4-channel contrast adjustment follow the list below.
- Translations were applied randomly, up to 20% of the tile size along both the x and y axes.
- Rotation was unrestricted, up to 360 degrees.
- Scaling was performed within 75-150% of the original scale.
- Shear was applied randomly within a 70-degree range, using the single angular parameter of torchvision.transforms.functional.affine().
- Pixel noise was drawn from a normal distribution with standard deviation 0.02.
- Atmospheric noise was applied from a 32 × 32 mask with standard deviation 0.5, upsampled to tile dimensions; this augmentation attempts to simulate the effects of haze and absorption.
- Contrast adjustment was found to be particularly valuable for this task. Contrast was randomly adjusted between 70% and 150% of the original image, using a customised torchvision method that supports 4-channel images. The images were re-normalized to [0, 1] after adjustment.
- Brightness adjustments were applied within 80-120% of the original image brightness. Excessive pixel values were clipped to [0, 1] after adjustment.
- Elastic deformations proved to be as helpful in training as claimed in the U-Net paper. For optimal results, the Gaussian kernel used in the deformations appears to have to match the size of the network's convolution kernels.
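A minimal sketch of such an elastic deformation after Simard et al. (2003); the parameter values are illustrative and the repository's actual implementation may differ:

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

def elastic_deform(tile: torch.Tensor, alpha: float = 8.0, sigma: float = 4.0,
                   kernel_size: int = 3) -> torch.Tensor:
    """Elastic deformation after Simard et al. (2003).

    tile: (C, H, W) tensor in [0, 1]. alpha scales the displacement field,
    sigma is the Gaussian std; kernel_size can be matched to the network's
    convolution kernels, as noted above.
    """
    _, h, w = tile.shape
    blur = GaussianBlur(kernel_size, sigma=sigma)
    # Random displacement fields in [-1, 1], smoothed with the Gaussian kernel
    # and scaled into normalized grid coordinates.
    dx = blur(torch.rand(1, h, w) * 2 - 1)[0] * alpha / w
    dy = blur(torch.rand(1, h, w) * 2 - 1)[0] * alpha / h
    # Identity sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs + dx, ys + dy), dim=-1).unsqueeze(0)
    return F.grid_sample(tile.unsqueeze(0), grid, align_corners=True).squeeze(0)
```

And a minimal sketch of a 4-channel contrast adjustment with re-normalization, assuming the blend-toward-the-mean formulation that torchvision uses for 3-channel images; this is not necessarily the repository's exact method:

```python
import torch

def adjust_contrast_4ch(tile: torch.Tensor, factor: float) -> torch.Tensor:
    """Blend each pixel toward the per-band mean; works for any channel count."""
    mean = tile.mean(dim=(-2, -1), keepdim=True)
    out = mean + factor * (tile - mean)
    # Re-normalize to [0, 1] after the adjustment, as described above.
    out = out - out.amin(dim=(-2, -1), keepdim=True)
    return out / out.amax(dim=(-2, -1), keepdim=True).clamp(min=1e-8)
```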
Weight Decay: L2 regularization was applied to the first two convolutional layers, due to excessive growth of individual filters. This is assumed to be occurring because of the NIR input channel, which can be exploited to explain the majority of negative samples.
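A minimal sketch of restricting L2 regularization to the first two convolutions via optimizer parameter groups (the model here is a stand-in, not the repository's architecture):

```python
import torch
from torch import nn

model = nn.Sequential(  # stand-in for the first U-Net block
    nn.Conv2d(4, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
)

first_convs = list(model[0].parameters()) + list(model[2].parameters())
rest = [p for p in model.parameters() if all(p is not q for q in first_convs)]

# weight_decay acts as L2 regularization in (non-decoupled) Adam.
optimizer = torch.optim.Adam([
    {"params": first_convs, "weight_decay": 1e-4},
    {"params": rest, "weight_decay": 0.0},
], lr=1e-4)
```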
Dropout Layers: Two dropout layers were applied in the last downsampling block, as suggested in the literature. However, despite the extent of the augmentations the model kept overfitting, so additional dropout layers with a drop rate of 30% were added at each skip connection.
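As a rough sketch of the placement (the block structure is an assumption, not the repository's exact architecture), channel-wise dropout inside a downsampling block could look like this:

```python
from torch import nn

class DownBlock(nn.Module):
    """Stand-in double-convolution block with the 30% drop rate noted above."""
    def __init__(self, in_ch: int, out_ch: int, p: float = 0.3):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p),  # drops whole feature maps, suited to conv layers
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Dropout2d(p),
        )

    def forward(self, x):
        return self.block(x)
```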
*Note: still experimenting from time to time; this might not be in line with the current model.*
In case anyone is interested in training further:
`model_training.py`:

```
usage: Model Training [-h] [--epochs EPOCHS] [--batch-size BATCH_SIZE]
                      [--num-workers NUM_WORKERS] [--lr LR] [--report]
                      [--monitor] [--l2 L2 [L2 ...]] [--reload]
                      [--init-scale INIT_SCALE] [--checkpoint CHECKPOINT]
                      [--balance-ratio BALANCE_RATIO]
                      [--report-rate REPORT_RATE]
                      [--dropouts DROPOUTS [DROPOUTS ...]]
                      [--weights WEIGHTS [WEIGHTS ...]]
                      [--check-rate CHECK_RATE]

Training

optional arguments:
  -h, --help            show this help message and exit
  --epochs EPOCHS       Number of epochs for training
  --batch-size BATCH_SIZE
                        Batch size for training
  --num-workers NUM_WORKERS
                        Number of background processes for data loading
  --lr LR               Learning rate
  --report, -r          Store losses in memory and produce a report graph --
                        Constrained by memory size. Control with REPORT_RATE
                        to minimize logs accordingly
  --monitor, -m         Observe activations and predictions of a sample
  --l2 L2 [L2 ...]      L2 regularization parameters. Sequence of length 23.
  --reload              Load checkpoint and continue training
  --init-scale INIT_SCALE, -i INIT_SCALE
                        The factor to initially multiply input channels with:
                        in_channels*INIT_SCALE = out_channels -- Controls
                        overall U-Net feature length
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to saved checkpoint
  --balance-ratio BALANCE_RATIO, -b BALANCE_RATIO
                        For positive values roughly every n-th sample is
                        negative, the rest are positive. The opposite for
                        negative values.
  --report-rate REPORT_RATE
                        Epoch frequency to log losses for reporting.
                        Default: EPOCHS // 10
  --dropouts DROPOUTS [DROPOUTS ...], -d DROPOUTS [DROPOUTS ...]
                        Sequence of length 23. Dropout probabilities for each
                        CNN.
  --weights WEIGHTS [WEIGHTS ...], -w WEIGHTS [WEIGHTS ...]
                        Class weights for loss computation. Sequence of
                        length 2
  --check-rate CHECK_RATE
                        Write checkpoint every n epochs - For
                        Monitor/Checkpoint options. Default: EPOCHS // 10
```
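For example, a training run could be started like this (all flag values are illustrative, not recommended settings):

```bash
python model_training.py --epochs 100 --batch-size 8 --lr 1e-4 -b 3 --report
```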
Testing samples were drawn from a scene neighboring the training distribution, acquired on the same day.
- Urban:
- Industrial:
- Mostly Background -- Elevated Areas:
# TODO
Cement rooftops, which form the majority of the background test sample group, appear to be underrepresented in the training set. However, having classified this area before in my thesis, I know it well, and the error distribution looks very similar to the results I had obtained using OBIA with an SVM classifier. This, together with the fact that affine transformations, image flips and other spatial augmentations do not seem to have any effect on training, leads me to believe that there is a problem with the architecture: the model is mostly working with colors rather than spatial patterns. That is not what this architecture is supposed to do, as it has been proven to work remarkably well on 1-channel images. This probably happens because the input channels are intermixed immediately (similar to regular pixel-based classification) and the subsequent features are developed based on that.
A solution to this problem could be to isolate each input channel and develop features per channel in parallel during downsampling, merging them during upsampling. However, this would probably result in duplicate work and an unnecessarily large model, since four times the feature maps would have to be produced.
A better solution would be to add an "image synthesizer" 1×1 convolution layer near the input and let the network combine the channels linearly, to its preference, into a single one-channel image before feeding it to the rest of the network.
To be addressed in version 2.
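A minimal sketch of that idea (a hypothetical wrapper, not something currently in the repository):

```python
from torch import nn

class SynthesizedUNet(nn.Module):
    """Prepend a learnable 1x1 band-mixing layer to an existing U-Net."""
    def __init__(self, unet: nn.Module):
        super().__init__()
        # 1x1 convolution: a per-pixel linear combination of the 4 input
        # bands into a single synthesized channel.
        self.synthesizer = nn.Conv2d(4, 1, kernel_size=1, bias=False)
        self.unet = unet  # assumed to accept 1-channel input

    def forward(self, x):
        return self.unet(self.synthesizer(x))
```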
# TODO
Additionally, it would be ideal to label each rooftop according to its type. That would allow for a much more elaborate error analysis, but it is not something I'm eager to do right now for an experimental personal project. Potentially to be addressed in the future.
M.Eng. Spatial Planning & Development
iosif.doundoulakis@outlook.com