An experiment in template matching using neural networks.
The model is a modified version of the original U-Net architecture. Instead of a single encoder, two encoders are used: one for the query image and another for the original image. In the original architecture, there are skip connections from the encoder to the decoder side. Here, the outputs from the corresponding encoder blocks are first multiplied (or, alternatively, added) and then passed to the decoder. The inputs to the model are the query image (where the template is placed at the center of a blank image) and the input image (where that template is being searched). Both are of the same size.
A basic architecture of the model.
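A minimal PyTorch sketch of the idea (channel sizes and block layout here are illustrative assumptions; the actual implementation is in temp_matching/model.py):

```python
import torch
import torch.nn as nn


class ConvBlock(nn.Module):
    """Two 3x3 conv + ReLU layers, as in a plain U-Net stage."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.block(x)


class DualEncoderUNet(nn.Module):
    """Toy dual-encoder U-Net: query and search images get their own encoders,
    and the per-stage encoder outputs are merged (multiplied by default, or
    added) before being used as skip connections and bottleneck features."""

    def __init__(self, merge="multiply"):
        super().__init__()
        chs = [3, 32, 64, 128]                      # illustrative channel sizes
        self.query_enc = nn.ModuleList(ConvBlock(i, o) for i, o in zip(chs[:-1], chs[1:]))
        self.search_enc = nn.ModuleList(ConvBlock(i, o) for i, o in zip(chs[:-1], chs[1:]))
        self.pool = nn.MaxPool2d(2)
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(o, i, 2, stride=2) for i, o in zip(chs[1:-1], chs[2:])
        )
        self.decs = nn.ModuleList(ConvBlock(2 * i, i) for i in chs[1:-1])
        self.head = nn.Conv2d(chs[1], 1, 1)
        self.merge = torch.mul if merge == "multiply" else torch.add

    def forward(self, query, search):
        skips, q, s = [], query, search
        for q_enc, s_enc in zip(self.query_enc, self.search_enc):
            q, s = q_enc(q), s_enc(s)
            skips.append(self.merge(q, s))          # merged "skip connection"
            q, s = self.pool(q), self.pool(s)
        x = skips[-1]                               # merged bottleneck features
        for up, dec, skip in zip(self.ups[::-1], self.decs[::-1], skips[-2::-1]):
            x = dec(torch.cat([up(x), skip], dim=1))
        return torch.sigmoid(self.head(x))          # predicted mask in [0, 1]


if __name__ == "__main__":
    q = torch.randn(1, 3, 512, 512)                 # query: template on a blank canvas
    s = torch.randn(1, 3, 512, 512)                 # search image
    print(DualEncoderUNet()(q, s).shape)            # torch.Size([1, 1, 512, 512])
```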
Crop a part of an image based on the bounding box annotation available in the COCO dataset. Then place that cropped part at the center of a blank image. The model's inputs are the original image and that blank (query) image, and the target is the mask marking where the cropped part originally was.
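A minimal sketch of this step using pycocotools (the helper name and the resize-then-crop handling are my assumptions, not the repo's actual data handler):

```python
import numpy as np
from PIL import Image
from pycocotools.coco import COCO


def make_sample(coco: COCO, img_dir: str, ann_id: int, size: int = 512):
    """Build (search_image, query_image, target_mask) from one COCO annotation."""
    ann = coco.loadAnns([ann_id])[0]
    info = coco.loadImgs([ann["image_id"]])[0]
    image = Image.open(f"{img_dir}/{info['file_name']}").convert("RGB").resize((size, size))

    # Scale the bbox from original image coordinates to the resized image.
    sx, sy = size / info["width"], size / info["height"]
    x, y, w, h = ann["bbox"]
    x0, y0, x1, y1 = int(x * sx), int(y * sy), int((x + w) * sx), int((y + h) * sy)

    # Query: the cropped template pasted at the center of a blank canvas.
    crop = image.crop((x0, y0, x1, y1))
    query = Image.new("RGB", (size, size))
    query.paste(crop, ((size - crop.width) // 2, (size - crop.height) // 2))

    # Target: binary mask marking where the crop came from in the search image.
    mask = np.zeros((size, size), dtype=np.uint8)
    mask[y0:y1, x0:x1] = 1
    return np.array(image), np.array(query), mask
```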
- Prepare a `.venv` file that contains the following:
TRAIN_DIR=assets/training_data/train2017/train2017
TRAIN_ANNOTATION_DIR=assets/training_data/annotations_trainval2017/annotations/instances_train2017.json
VAL_DIR=assets/training_data/val2017/val2017
VAL_ANNOTATION_DIR=assets/training_data/annotations_trainval2017/annotations/instances_val2017.json
This had to be done to make it compatible with HPC. The Slurm job is in scripts.
- A model is defined in temp_matching/model.py. By default, the encodings are multiplied.
- A dataset handler is in temp_matching/data_handler.py. The same config can be used for both train and validation; with a fixed seed, the split is expected to be the same every time for the same `train_size` (see the sketch after this list).
- Trainer is in temp_matching/trainer.py.
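A minimal sketch of how a fixed seed keeps the split reproducible, assuming the variables from `.venv` are loaded with python-dotenv and the COCO annotation IDs are what gets split (the actual data_handler.py may differ):

```python
import os
import random

from dotenv import load_dotenv
from pycocotools.coco import COCO

load_dotenv(".venv")                          # read TRAIN_DIR etc. from the .venv file

coco = COCO(os.environ["TRAIN_ANNOTATION_DIR"])
ann_ids = sorted(coco.getAnnIds())            # fixed order before shuffling


def split_ids(ids, train_size=0.9, seed=42):
    """Deterministic train/val split: same seed + train_size -> same split."""
    ids = list(ids)
    random.Random(seed).shuffle(ids)          # local RNG, independent of global state
    cut = int(len(ids) * train_size)
    return ids[:cut], ids[cut:]


train_ids, val_ids = split_ids(ann_ids, train_size=0.9, seed=42)
```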
live_run.py should work out of the box: it first computes the encodings of the query and search images and then predicts based on them. Please download the weight files from Google Drive.
- Encoder:
ResNet152
- Train Data Per Epoch: 10000
- Valid Data Per Epoch: 500
- Batch Size: 32
- Image HW: 512, 512
- Optimizer: Adam with Lr=0.0001
- Loss function: DiceLoss (see the training-setup sketch after this list)
- Training Curve
- Predictions at assets/2024-09-24/
- The weight file can be downloaded from Google Drive
- Some experiments I did are available in the notebooks as well.
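A minimal sketch of this training setup (the Dice loss below is a generic soft-Dice implementation and the loop is schematic; the actual trainer lives in temp_matching/trainer.py):

```python
import torch
import torch.nn as nn


class DiceLoss(nn.Module):
    """Generic soft Dice loss for binary masks (predictions already in [0, 1])."""

    def __init__(self, eps=1e-6):
        super().__init__()
        self.eps = eps

    def forward(self, pred, target):
        pred, target = pred.flatten(1), target.flatten(1).float()
        intersection = (pred * target).sum(dim=1)
        union = pred.sum(dim=1) + target.sum(dim=1)
        return 1 - ((2 * intersection + self.eps) / (union + self.eps)).mean()


def train_one_epoch(model, loader, optimizer, criterion, device="cuda"):
    """One pass over the loader: (query, search, mask) batches of 512x512 images."""
    model.train()
    for query, search, mask in loader:
        query, search, mask = query.to(device), search.to(device), mask.to(device)
        optimizer.zero_grad()
        loss = criterion(model(query, search), mask)
        loss.backward()
        optimizer.step()


# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam, lr=0.0001
# criterion = DiceLoss()
```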
Note that the masks were stored so they could be viewed later. I found RLE (Run-Length Encoding) to be perfect for that task. The script that extracts the masks and stores them as RLE is temp_matching/benchmarking.py, and the plots are generated in notebooks/test_benchmark.ipynb.
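A minimal sketch of the RLE round trip using pycocotools (the helper names are mine; benchmarking.py may store the RLE differently):

```python
import numpy as np
from pycocotools import mask as mask_utils


def mask_to_rle(mask: np.ndarray) -> dict:
    """Encode a binary HxW mask as COCO RLE (counts stored as a UTF-8 string)."""
    rle = mask_utils.encode(np.asfortranarray(mask.astype(np.uint8)))
    rle["counts"] = rle["counts"].decode("utf-8")   # make it JSON-serializable
    return rle


def rle_to_mask(rle: dict) -> np.ndarray:
    """Decode a COCO RLE dict back to a binary HxW mask."""
    rle = dict(rle, counts=rle["counts"].encode("utf-8"))
    return mask_utils.decode(rle)


mask = np.zeros((512, 512), dtype=np.uint8)
mask[100:200, 150:300] = 1
assert (rle_to_mask(mask_to_rle(mask)) == mask).all()
```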
The benchmark results show that SIFT performs far better than the template matching model we trained. Looking at the `describe()` output makes this even clearer.
|  | model_iou | sift_iou | model_time | sift_time |
|---|---|---|---|---|
| count | 21627.000000 | 21627.000000 | 21627.000000 | 21627.000000 |
| mean | 0.415356 | 0.945153 | 0.020917 | 0.088319 |
| std | 0.391032 | 0.223428 | 0.112210 | 0.030342 |
| min | 0.000000 | 0.000000 | 0.000499 | 0.010363 |
| 25% | 0.000000 | 1.000000 | 0.000537 | 0.069422 |
| 50% | 0.432000 | 1.000000 | 0.000572 | 0.083969 |
| 75% | 0.825000 | 1.000000 | 0.000607 | 0.101738 |
| max | 1.000000 | 1.000000 | 1.225988 | 0.898370 |
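For reference, the summary above can be reproduced from the benchmark results with pandas along these lines (the CSV path is an assumption):

```python
import pandas as pd

# Benchmark results with one row per test case (path is illustrative).
df = pd.read_csv("assets/benchmark_results.csv")
print(df[["model_iou", "sift_iou", "model_time", "sift_time"]].describe())
```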
Based on the IoU, SIFT clearly outperforms the template matching model. However, the model was faster than SIFT, most likely because the model was run on a GPU while SIFT was not.
Out of 21627, only 232 cases.
2057 cases. Some are as follows:
816 cases.
The results showed that template matching, with the model and training I used, is not better than the classical SIFT feature extractor. What could be the reasons?
- The training process is really sensitive to the input size, colorspace, rotation of the template and the image, template size, the function being minimized, and so on.
- The model is not trained with enough variation in the images. For example, we want the model to perform well under scale/rotation changes as well, but it does not yet. Augmentation techniques during training can be used for that (see the sketch after this list).
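A minimal sketch of such scale/rotation augmentation using albumentations (an assumed addition, not part of the current training pipeline):

```python
import numpy as np
import albumentations as A

search_image = np.random.randint(0, 256, (512, 512, 3), dtype=np.uint8)  # stand-in search image
target_mask = np.zeros((512, 512), dtype=np.uint8)                       # stand-in target mask
target_mask[100:200, 150:300] = 1

# Random rotation and scaling applied jointly to the image and its mask,
# so the target stays aligned with the augmented image.
augment = A.Compose([
    A.Rotate(limit=30, p=0.5),
    A.RandomScale(scale_limit=0.2, p=0.5),
    A.Resize(512, 512),               # back to the model's input size
])

out = augment(image=search_image, mask=target_mask)
search_image, target_mask = out["image"], out["mask"]
```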
I have trained several template matching models in other projects (in a very narrow domain) and found them to be better than SIFT only after training for weeks, and even then without much rotation/scale augmentation. In addition to that, I have also trained a model with attention layers in different places of the architecture, and the results were still not great. This suggests that careful design of the architecture is needed.
If you find this project helpful in your research or applications, please consider citing it as follows:
@misc{acharya2024template,
title={Template Matching Using Deep Learning},
author={Ramkrishna Acharya},
year={2024},
howpublished={\url{https://github.com/q-viper/template-matching}},
note={An experimental approach to template matching using dual-encoder U-Net architecture},
}
Alternatively, feel free to link to this repository.