This is a project for the CS4186 Computer Vision and Image Processing course at City University of Hong Kong.
The project is a simple instance search system that uses multiple image retrieval methods (including SIFT, LBP, CNN features, and ensemble methods) to search for objects in a database of images.
```bash
pip install -r requirements.txt
```
If you are using a GPU, make sure to install the correct version of PyTorch. You can find the installation instructions [here](https://pytorch.org/get-started/locally/), or check the [previous versions](https://pytorch.org/get-started/previous-versions/).
If you are on a server without a display, you may encounter the following error when trying to import OpenCV:
File "/usr/local/python/3.12.1/lib/python3.12/site-packages/cv2/__init__.py", line 153, in bootstrap
native_module = importlib.import_module("cv2")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/python/3.12.1/lib/python3.12/importlib/__init__.py", line 90, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ImportError: libGL.so.1: cannot open shared object file: No such file or directory
You can fix it by installing the following package:
```bash
pip install opencv-python-headless
```
```
.
├── dataset
│   ├── gallery_4186 (.jpg * 5000)
│   ├── query_img_4186 (.jpg * 20)
│   └── query_img_box_4186 (.txt * 20)
├── output
│   ├── brisk/rankList.txt
│   ├── color_histogram/rankList.txt
│   ├── ...
│   └── demo
│       ├── brisk (.png * 20)
│       ├── color_histogram (.png * 20)
│       └── ...
├── brisk.py
├── color_histogram.py
├── ...
├── main.py
├── utils.py
├── README.md
└── requirements.txt
```
`dataset/`
: Contains the dataset used for the project. The gallery set contains 5000 images and the query set contains 20 images.

`query_img_box_4186/`
: Contains the bounding boxes of the query images. Each `.txt` file corresponds to the image with the same name in `query_img_4186/` and stores the bounding-box coordinates in the format `x1 y1 w h` (a loading sketch follows this list).

`output/`
: Contains the output of the project. Each method has its own folder.

`rankList.txt`
: For each query image, all 5000 gallery images are ranked by similarity in descending order.

`demo/`
: Each method has a subfolder. For each query image, the query image and its top 10 matches are aggregated into one image for visualization.

`utils.py`
: Contains file I/O and image-processing functions. Check this file if you wish to change the dataset directory.
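For illustration, a query region can be cropped from its bounding box like this (the file names below are hypothetical; the actual paths are configured in `utils.py`):

```python
import cv2

# Hypothetical example files; real paths are configured in utils.py.
query = cv2.imread("dataset/query_img_4186/01.jpg")
with open("dataset/query_img_box_4186/01.txt") as f:
    x1, y1, w, h = (int(float(v)) for v in f.read().split())

# Crop the instance region given in x1 y1 w h format.
query_roi = query[y1:y1 + h, x1:x1 + w]
```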
- Input: RGB or HSV image
- Output: $\text{bins} \times \text{bins} \times \text{bins}$ histogram
Color histogram computes the distribution of colors in an image. Each pixel is assigned to a bin based on its RGB values.
To compare two histograms, cosine similarity is used:

$$\text{sim}(h_1, h_2) = \frac{h_1 \cdot h_2}{\lVert h_1 \rVert \, \lVert h_2 \rVert},$$

where $h_1$ and $h_2$ are the flattened histogram vectors of the two images.
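As a minimal sketch (the bin count, color space, and function names here are assumptions, not the project's exact implementation):

```python
import cv2
import numpy as np

def color_histogram(image_bgr, bins=8):
    # bins x bins x bins histogram over the three color channels
    hist = cv2.calcHist([image_bgr], [0, 1, 2], None,
                        [bins, bins, bins], [0, 256, 0, 256, 0, 256])
    hist = hist.flatten()
    return hist / (np.linalg.norm(hist) + 1e-8)  # L2-normalize

def cosine_similarity(h1, h2):
    # Histograms are already unit-normalized, so the dot product suffices.
    return float(np.dot(h1, h2))
```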
- Input: Grayscale image
- Output: 256-dimensional LBP histogram
Local Binary Pattern (LBP) is a texture descriptor that encodes the local structure of an image. It compares each pixel with its neighbors and assigns a binary value based on whether the neighbor is greater than the center pixel.
$$\text{LBP}_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p, \qquad s(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $g_c$ is the gray value of the center pixel and $g_p$ ($p = 0, \dots, P-1$) are the gray values of its neighbors sampled on a circle of radius $R$.

In practice, a radius of 8 with 24 neighbors is used.
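As a rough sketch using scikit-image, shown here with the classic 8-neighbor setting that yields exactly $2^8 = 256$ codes (the project's radius and neighbor count above would require a different bin count):

```python
import numpy as np
from skimage.feature import local_binary_pattern

def lbp_histogram(gray, P=8, R=1):
    # Each pixel gets a P-bit code comparing it to P neighbors at radius R.
    codes = local_binary_pattern(gray, P, R, method="default")
    hist, _ = np.histogram(codes.ravel(), bins=2 ** P, range=(0, 2 ** P))
    hist = hist.astype(np.float64)
    return hist / (hist.sum() + 1e-8)  # normalize to a probability distribution
```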
- Input: RGB image (preprocessed according to ResNet50)
- Output: 2048-dimensional feature vector
Convolutional Neural Networks (CNNs) are a class of deep learning models that are particularly effective for image classification and object detection tasks. In this project, we use a pre-trained CNN model, ResNet50, to extract features from the images. The features are then used to compute the similarity between the query and gallery images.
The last fully connected layer of the ResNet50 model is removed, and the globally average-pooled output of the last convolutional stage (a 2048-dimensional vector) is used as the feature for each image. The feature vectors are then compared using cosine similarity.
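A minimal sketch of this extraction with torchvision (the weights choice and the standard ImageNet preprocessing values are assumptions; the project's exact code may differ):

```python
import torch
from torchvision import models, transforms

# Keep everything up to (and including) global average pooling; drop the FC layer.
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
extractor = torch.nn.Sequential(*list(resnet.children())[:-1]).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def cnn_feature(pil_image):
    x = preprocess(pil_image).unsqueeze(0)  # (1, 3, 224, 224)
    return extractor(x).flatten()           # 2048-dimensional vector
```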
- Input: Grayscale image
- Output: 128-dimensional float descriptor
Scale-Invariant Feature Transform (SIFT) is a feature detection algorithm that extracts keypoints and descriptors from an image. The keypoints are invariant to scale and rotation, making them suitable for object recognition.
To compute the similarity between two images, the number of matched keypoints is used as a measure of similarity. The more keypoints that match, the more similar the images are.
In detail, KNN matching (k = 2) is used to find the two best matches for each query keypoint. To filter out false matches, Lowe's ratio test is applied: a match is kept only if the distance of the best match is smaller than a fixed ratio of the distance of the second-best match.
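A sketch of this matching scheme with OpenCV (the 0.75 ratio is the commonly used value, assumed here rather than taken from the project):

```python
import cv2

sift = cv2.SIFT_create()

def sift_score(gray_query, gray_gallery, ratio=0.75):
    # Similarity = number of keypoint matches surviving the ratio test.
    _, des_q = sift.detectAndCompute(gray_query, None)
    _, des_g = sift.detectAndCompute(gray_gallery, None)
    if des_q is None or des_g is None:
        return 0
    pairs = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_q, des_g, k=2)
    return sum(1 for p in pairs
               if len(p) == 2 and p[0].distance < ratio * p[1].distance)
```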
- Input: Grayscale image
- Output: 64-byte (512-bit) binary descriptor
BRISK (Binary Robust Invariant Scalable Keypoints) is a feature detection and description algorithm that is similar to SIFT but is designed to be faster and more efficient.
Similar to SIFT, KNN matching with the ratio test is used, except that Hamming distance replaces L2 distance because BRISK descriptors are binary.
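The matching step might look like the SIFT sketch above, with the distance metric swapped (again, the 0.75 ratio is an assumed value):

```python
import cv2

brisk = cv2.BRISK_create()

def brisk_score(gray_query, gray_gallery, ratio=0.75):
    _, des_q = brisk.detectAndCompute(gray_query, None)
    _, des_g = brisk.detectAndCompute(gray_gallery, None)
    if des_q is None or des_g is None:
        return 0
    # NORM_HAMMING instead of NORM_L2: BRISK descriptors are bit strings.
    pairs = cv2.BFMatcher(cv2.NORM_HAMMING).knnMatch(des_q, des_g, k=2)
    return sum(1 for p in pairs
               if len(p) == 2 and p[0].distance < ratio * p[1].distance)
```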
- Input: Grayscale image
- Output: Both SIFT and BRISK descriptors (handled separately)
The ensemble method combines the scores from SIFT and BRISK to improve the overall performance of the instance search system.
However, they give descriptors of different dimensions and score scales. To combine them, we first normalize the scores of each method to the range $[0, 1]$. Then, we compute the final score as the weighted sum of the normalized scores:

$$s = w_{\text{SIFT}}\,\hat{s}_{\text{SIFT}} + w_{\text{BRISK}}\,\hat{s}_{\text{BRISK}}, \qquad w_{\text{SIFT}} + w_{\text{BRISK}} = 1,$$

where $\hat{s}$ denotes a score normalized to $[0, 1]$.
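A sketch of this combination (min-max normalization and an equal weight of 0.5 are assumptions; the project may use different choices):

```python
import numpy as np

def ensemble_scores(sift_scores, brisk_scores, w=0.5):
    def minmax(s):
        # Rescale one method's scores over the whole gallery to [0, 1].
        s = np.asarray(s, dtype=np.float64)
        lo, hi = s.min(), s.max()
        return (s - lo) / (hi - lo) if hi > lo else np.zeros_like(s)
    # Weighted sum of the two normalized score lists.
    return w * minmax(sift_scores) + (1 - w) * minmax(brisk_scores)
```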
It can be observed that dark images are often ranked as similar to almost any query image: their features are vague, which produces spurious matches.
CLAHE (Contrast Limited Adaptive Histogram Equalization) is used to enhance the contrast of the images. It divides the image into small regions and applies histogram equalization to each region. This helps to improve the visibility of the features in the images, making it easier to match them.
In detail, for grayscale images, CLAHE can be applied directly:
```python
import cv2

# clipLimit caps contrast amplification per tile; tileGridSize sets the region grid.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
clahe_image = clahe.apply(gray)
```
For color images, the image is first converted to Lab color space, and CLAHE is applied to the L (lightness) channel only:
```python
# Convert BGR -> Lab and split; the L channel holds lightness.
l, a, b = cv2.split(cv2.cvtColor(image, cv2.COLOR_BGR2Lab))
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
l = clahe.apply(l)  # equalize only the lightness channel
clahe_image = cv2.merge((l, a, b))
clahe_image = cv2.cvtColor(clahe_image, cv2.COLOR_Lab2BGR)  # back to BGR
```