Because detection models (YOLO, Mask R-CNN, etc.) are often developed in different frameworks and benchmarked on different hardware, it is hard to compare their performance (e.g., fps) and to fine-tune them on your own datasets. This project builds all detection models with PyTorch and provides a general API for training, fine-tuning and detection in supervised and unsupervised settings.
- Pytorch-1.0
- python-3.6
- CUDA-9.0
- 2 TITAN Xp GPUs, each with 12 GB of memory
- Installation
- Supporting Models
- Loss visualization and analysis
- Fine-tuning
- Usage
- Mask-RCNN (in progress)
- RetinaNet (in progress)
- Yolov3
- M2Det (in progress)
- References
$ git clone https://github.com/jacksonly/Detection-Fine-tuning-API.git
$ cd Detection-Fine-tuning-API/
$ sudo pip3 install -r requirements.txt
$ cd weights/
$ bash ./yolov3/weights/download_weights.sh
Pretrained pytorch models (.pt):
- Two-stage Detection:
- Mask R-CNN (ICCV'17) inspired by wkentaro
- One-stage Detection:
- RetinaNet (ICCV'17) inspired by yhenon
- Yolov3 (arXiv'18) inspired by ultralytics and TencentYoutuResearch
- M2Det (AAAI'19, the latest SSD) inspired by qijiezhao
To analyze the relationship between loss and bounding boxes, we plot the bounding boxes (bbs) that have small loss as green rectangles and the bbs that have large loss as purple rectangles for pedestrian detection (dataset from WildTrack). We find that the hard bounding boxes often involve occlusion and are hard for models to detect.
Intuitively, we could use only the hard bbs as the training set to make training more efficient. But I found that the performance is poor when the number of hard bbs is small. Because detection models need (image, bbs) pairs as training data, using only a subset of all bbs makes it easy to overfit.
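A minimal sketch of how such a visualization could be produced (this is not the repository's plotting code; it assumes you already have per-box losses and pixel-coordinate boxes from the model):

```python
# Sketch only: draw bbs colored by per-box loss, assuming boxes are (x1, y1, x2, y2)
# in pixels and `losses` holds one loss value per box.
import cv2

def draw_boxes_by_loss(image_path, boxes, losses, loss_thres=1.0, out_path="loss_vis.png"):
    img = cv2.imread(image_path)
    for (x1, y1, x2, y2), loss in zip(boxes, losses):
        # green (BGR) for easy boxes, purple for hard (high-loss) boxes
        color = (0, 255, 0) if loss < loss_thres else (255, 0, 255)
        cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), color, 2)
    cv2.imwrite(out_path, img)
```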
- Class imbalance:
In fine-tuning, we must decide how many detections to use for backpropagation, because detection models usually generate too many bounding boxes (many of them noise), and many overlapping RoIs correspond to only a small number of real objects. If we use all detections as the training set, too many easy negatives will hurt the fine-tuning performance. Therefore, we need strategies to resolve the imbalance between negative and positive samples. There are three main methods: Online Hard Example Mining (OHEM), Focal Loss, and using positives only.
To get more efficient backpropagation on bounding boxes, we adopt Online Hard Example Mining (OHEM) during training. In other words, we sort all bounding boxes (positives and negatives) in a mini-batch by loss and select the B% of bounding boxes with the highest loss. Backpropagation is then performed only on the selected bounding boxes. Details can be found in the paper, and this method is often used to train two-stage detection models.
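A minimal sketch of the selection step, assuming per-box losses are computed with `reduction='none'` (the ratio `keep_ratio` plays the role of B%):

```python
# OHEM sketch (not the repo's exact code): keep only the top-B% highest-loss boxes
# in a mini-batch and backpropagate through them alone.
import torch

def ohem_select(per_box_loss: torch.Tensor, keep_ratio: float = 0.25) -> torch.Tensor:
    """per_box_loss: 1-D tensor of per-bounding-box losses (positives and negatives)."""
    num_keep = max(1, int(per_box_loss.numel() * keep_ratio))
    hard_loss, _ = torch.topk(per_box_loss, num_keep)  # highest-loss (hardest) boxes
    return hard_loss.mean()                            # only these contribute gradients

# usage: loss = ohem_select(criterion(pred, target))   # criterion with reduction='none'
#        loss.backward()
```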
...
In Yolov3, only the most suitable positive bounding boxes are chosen for backpropagation. Given a ground-truth box bb1, the model searches the image for the detection bb2 that is the most likely candidate for bb1, and the loss is computed only between bb1 and bb2. This means each ground truth has exactly one detection as its candidate, so there are never too many negatives.
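A simplified, IoU-based sketch of this one-candidate-per-ground-truth idea (the real yolov3 matching works on anchors and grid cells, so treat this only as an illustration):

```python
# For each ground-truth box, keep only the single prediction with the highest IoU
# and compute the loss against it.
import torch

def box_iou(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """a: (N, 4), b: (M, 4) boxes as (x1, y1, x2, y2); returns (N, M) IoU matrix."""
    tl = torch.max(a[:, None, :2], b[None, :, :2])
    br = torch.min(a[:, None, 2:], b[None, :, 2:])
    inter = (br - tl).clamp(min=0).prod(dim=2)
    area_a = (a[:, 2:] - a[:, :2]).prod(dim=1)
    area_b = (b[:, 2:] - b[:, :2]).prod(dim=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def match_one_per_gt(gt_boxes: torch.Tensor, pred_boxes: torch.Tensor) -> torch.Tensor:
    """Return the index of the best-matching prediction for every ground-truth box."""
    iou = box_iou(gt_boxes, pred_boxes)  # (num_gt, num_pred)
    return iou.argmax(dim=1)             # one candidate per ground truth
```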
- Supervised Training (Given labeled data):
- Fine-tuning the whole model.
- Fine-tuning the high-level feature-extractor and detection.
- Fine-tuning the detection.
- Unsupervised Training (Given unlabeled data or raw videos):
Standard Fine-tuning Scheme: Fine-tuning with detections or Easy-to-Hard.
- Fine-tuning the whole model with pseudo-bounding-boxes.
- Fine-tuning the high-level feature-extractor and detection with pseudo-bounding-boxes.
- Fine-tuning the detection with pseudo-bounding-boxes.
We support detection on both images and videos:
- Images:
python3 detect.py --cfg ./cfg/yolov3.cfg --weights ./weights/yolov3.pt --images ./data/samples
- Video: (in progress)
- Put all images in ./yolov3/data/custom/images/ and all label files in ./yolov3/data/custom/labels/. Each image's name must be the same as its corresponding label file's name. An example image and label pair would be:
./yolov3/data/custom/images/00000000.png # image
./yolov3/data/custom/labels/00000000.txt # label
- One file per image (if no objects in image, no label file is required).
- One row per object.
- Each row is class x_center y_center width height format.
- Box coordinates must be in normalized xywh format (from 0 - 1).
- Classes are zero-indexed (start from 0).
An example label file with 32 persons (all class 0):
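The label file is plain text; the rows below only illustrate the format (the coordinate values are made up, not the actual 32-person file):

```
0 0.481 0.634 0.052 0.210
0 0.256 0.602 0.047 0.198
0 0.733 0.611 0.055 0.205
```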
- Update the train.txt and val.txt in ./yolov3/data/custom/.
- Update the custom.names file in ./yolov3/data/custom/.
- Update the custom.data file in ./yolov3/data/custom/.
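As an illustration, for a single-class pedestrian dataset these two files would typically look like the sketch below (the class name and paths are examples; adjust them to your setup):

```
# ./yolov3/data/custom/custom.names -- one class name per line
person

# ./yolov3/data/custom/custom.data  -- dataset description used by train.py
classes=1
train=./data/custom/train.txt
valid=./data/custom/val.txt
names=./data/custom/custom.names
```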
We support 2 types of datasets:
- Images with bounding boxes (Supervised Training).
cd ./yolov3
###
### useful trick: using a larger img-size or a smaller learning rate can improve performance when fine-tuning.
###
# Train new model from scratch
python train.py --data ./data/custom/custom.data --cfg ./cfg/custom.cfg --img-size=800
# Fine-tune model from coco (Detection)
python train.py --data ./data/custom/custom.data --cfg ./cfg/custom.cfg --resume --class_num=1 --img-size=800
# Fine-tune model from coco (High+Detection)
python train.py --data ./data/custom/custom.data --cfg ./cfg/custom.cfg --resume --class_num=1 --transfer_id=1 --img-size=800
# Fine-tune model from coco (Low+High+Detection)
python train.py --data ./data/custom/custom.data --cfg ./cfg/custom.cfg --resume --class_num=1 --transfer_id=2 --img-size=800
- Video without bounding boxes (Learning without supervision).
- To evaluate the upper bound of self-training, I compare the performance of detections with a 0.5 confidence threshold and a 0.1 confidence threshold. Then I use the detections (threshold=0.1) as pseudo-labels so the model can teach itself.
- Extract all frames at 10 fps.
cd ./yolov3/video
## move your video into this folder and name it video.mp4.
## In my experiment, I use a video from AICity. You can download this video from https://drive.google.com/open?id=1QahmuP87oPceze_Jn2aSm9oZGaDs9ji8 (93.6MB, 800*410, 14 minutes 14 seconds).
# create a folder to save the frames
mkdir images
# create a folder to save the pseudo-labels
mkdir labels
# extract all frames at 10 fps
ffmpeg -i video.mp4 -vf "fps=10" images/%08d.png
- Use yolov3 (pretrained on COCO) to detect these frames.
# useful trick: when yolov3 detects many positives with a low threshold, use a low threshold and a larger img-size.
cd ..
python detect_video.py --cfg ./cfg/yolov3.cfg --weights ./weights/yolov3.pt --images ./video/images/ --output ./video/output/ --img-size 800 --conf-thres 0.1 --nms-thres=0.2
# (optional, use tracking to filter hard negatives) If you extract the video at a high fps (>30), you can use a template matching technique to filter some hard negatives (bbs that were detected only in frame i but not in frame i-1 or frame i+1). Details can be found in Unsupervised Hard Example Mining from Videos for Improved Object Detection (ECCV'18).
# (optional, use tracking to augment the hard positives) refer to Automatic adaptation of object detectors to new domains using self-training (CVPR 2019).
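A rough sketch of the template-matching filter mentioned above (an illustration under my own assumptions, not the paper's or this repo's implementation): a box detected in frame i is treated as a likely false positive if no similar patch can be found near the same location in frame i-1 or i+1.

```python
import cv2

def appears_in_neighbor(patch, neighbor_frame, x1, y1, x2, y2, margin=20, score_thres=0.8):
    """patch: the detected crop from frame i; search a slightly larger window in a neighbor frame."""
    h, w = neighbor_frame.shape[:2]
    sx1, sy1 = max(0, x1 - margin), max(0, y1 - margin)
    sx2, sy2 = min(w, x2 + margin), min(h, y2 + margin)
    search = neighbor_frame[sy1:sy2, sx1:sx2]
    if search.shape[0] < patch.shape[0] or search.shape[1] < patch.shape[1]:
        return False
    res = cv2.matchTemplate(search, patch, cv2.TM_CCOEFF_NORMED)
    return res.max() >= score_thres

def is_hard_negative(patch, prev_frame, next_frame, box):
    x1, y1, x2, y2 = box
    # supported by neither neighbor frame -> likely false positive, drop from pseudo-labels
    return not (appears_in_neighbor(patch, prev_frame, x1, y1, x2, y2)
                or appears_in_neighbor(patch, next_frame, x1, y1, x2, y2))
```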
- Analyze the detections.
# analyze the detections and extract the top-3 RoI classes in the video (./video/video_analysis.png)
python plot_bb.py
# update ./yolov3/video/custom_video.names and ./yolov3/video/custom_video.data as discussed in Data Preparation (Yolov3)
- Use the detections (from the 2nd step) whose confidence scores are larger than the threshold as pseudo-labels (a minimal conversion sketch follows the commands below).
- Because we have no ground-truth labels and our goal is to make the model overfit on the current video, we use the same data for the training set and the validation set.
cd video
# generate pseudo-labels
python generate_bb.py
# update ./yolov3/cfg/custom_video.cfg as discussed in Data Preparation.
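For reference, turning a detection into a pseudo-label row just means thresholding on confidence and normalizing the box to xywh; a minimal sketch follows (the real generate_bb.py may differ):

```python
# Sketch: keep boxes above the confidence threshold and emit normalized xywh label rows.
def detections_to_label_rows(detections, img_w, img_h, conf_thres=0.1):
    """detections: iterable of (x1, y1, x2, y2, conf, cls) in pixel coordinates."""
    rows = []
    for x1, y1, x2, y2, conf, cls in detections:
        if conf < conf_thres:
            continue
        x_c = (x1 + x2) / 2.0 / img_w  # normalized center x
        y_c = (y1 + y2) / 2.0 / img_h  # normalized center y
        w = (x2 - x1) / img_w          # normalized width
        h = (y2 - y1) / img_h          # normalized height
        rows.append(f"{int(cls)} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}")
    return rows  # write these lines to the frame's .txt file under ./video/labels/
```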
- Train the current model on the pseudo-labels.
# remove the old .shapes file (if you have an old val.shapes file in ./yolov3/data)
cd ../data
rm val.shapes
cd ..
# fine-tuning
python train.py --data ./video/custom_video.data --cfg ./cfg/custom_video.cfg --resume --class_num=1 --transfer_id=1 --img-size=800
In my experiment, I train yolov3 on pedestrian detection (from WildTrack). The preprocessed data can be downloaded from images and labels. You can extract them and put them in ./yolov3/.
from utils import utils; utils.plot_results()
I plot the performance and loss. I find that the classification loss converges quickly because single-class classification is simpler than the multi-class classification in COCO.
When fine-tuning on your customized dataset, you can modify the hyper-parameters in ./yolov3/train.py. I use the default hyper-parameter settings:
# Hyperparameters: train.py --evolve --epochs 2 --img-size 320, Metrics: 0.204 0.302 0.175 0.234 (square smart)
hyp = {'xy': 0.2, # xy loss gain
'wh': 0.1, # wh loss gain
'cls': 0.04, # cls loss gain
'conf': 4.5, # conf loss gain
'iou_t': 0.5, # iou target-anchor training threshold
'lr0': 0.001, # initial learning rate
'lrf': -4., # final learning rate = lr0 * (10 ** lrf)
'momentum': 0.90, # SGD momentum
'weight_decay': 0.0005} # optimizer weight decay
[1] Li Liu et al. Deep Learning for Generic Object Detection: A Survey. arXiv 2018.
[2] Kaiming He et al. Mask R-CNN. ICCV 2017.
[3] Tsung-Yi Lin et al. Focal Loss for Dense Object Detection. ICCV 2017.
[4] Joseph Redmon et al. YOLOv3: An Incremental Improvement. arXiv 2018.
[5] Qijie Zhao et al. M2Det: A Single-Shot Object Detector based on Multi-Level Feature Pyramid Network. AAAI 2019.
[6] Abhinav Shrivastava et al. Training Region-based Object Detectors with Online Hard Example Mining. CVPR 2016.
[7] Yang Zou et al. Unsupervised Domain Adaptation for Semantic Segmentation via Class-Balanced Self-Training. ECCV 2018.
[8] SouYoung Jin et al. Unsupervised Hard Example Mining from Videos for Improved Object Detection. ECCV 2018.
[9] Paul Voigtlaender et al. Large-Scale Object Discovery and Detector Adaptation from Unlabeled Video. CVPR 2018.
[10] Aruni RoyChowdhury et al. Automatic adaptation of object detectors to new domains using self-training. CVPR 2019.