Rubric Points
This project implements a software pipeline to detect vehicles in a video stream. We used Yolo v2 (homepage) and retrained it on the Udacity Annotated Driving Dataset and the PASCAL VOC dataset for vehicle detection. Experimental results show that the trained network detects vehicles with high accuracy and has the potential to run in real time.
Test on project_video:
youtube link
Original paper on arxiv.org:
Yolo is an end-to-end neural network for object detection in images. It takes an image resized to 416x416 as input and passes it through a convolutional network for feature extraction. This process subsamples the image by a factor of 32 into a 13x13 feature map. It then applies a 1x1 convolution layer with anchor boxes on the feature map to predict bounding box offsets and class confidences. This layer has #anchor * (1 + 4 + #classes) filters: at each feature map pixel, for each anchor box, it predicts 4 offsets of the anchor box, a probability of object existence, and the conditional probability of each class given object existence. Finally, it performs non-maximum suppression and thresholding on the predictions to produce a bounding box for each potential object detection.
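As a quick check of the arithmetic above, the following sketch computes the grid size and the filter count of the final 1x1 convolution. The function name `detection_head_shape` is hypothetical, and the single-class default is an assumption made for illustration (the writeup does not state how many classes the retrained network uses); the 5 anchors match the default mentioned later in this writeup.

```python
def detection_head_shape(input_size=416, stride=32, num_anchors=5, num_classes=1):
    """Return (grid_h, grid_w, filters) for the final 1x1 convolution.

    num_classes=1 is an assumed single "vehicle" class, not a value
    taken from the project's config file.
    """
    grid = input_size // stride                    # 416 / 32 = 13
    # Per anchor: 1 objectness score + 4 box offsets + one score per class.
    filters = num_anchors * (1 + 4 + num_classes)
    return grid, grid, filters


print(detection_head_shape())                # (13, 13, 30)
print(detection_head_shape(num_classes=20))  # (13, 13, 125), the 20 PASCAL VOC classes
```

If the number of classes changes, the filter count of the last convolution layer has to change with it, which is why that layer needs to be reconfigured before retraining.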
Yolo claims to be both fast and accurate. Although this task only requires detecting vehicles, Yolo is designed to detect many classes. Compared to the traditional HOG + SVM approach, Yolo uses a neural network to learn the best features instead of hand-tuning many parameters to design an "ultimate" feature vector. Most importantly, Yolo has the potential to perform real-time detection, which is crucial for a self-driving car.
"On a Titan X it processes images at 40-90 FPS and has a mAP on VOC 2007 of 78.6% and a mAP of 44.0% on COCO test-dev."
The whole Yolo detection process maps closely onto the HOG + SVM pipeline. First, it performs feature extraction with a convolutional neural network, which replaces HOG feature construction. Then it uses the last convolution layer with anchor boxes to predict objects and bounding boxes, which resembles a sliding-window search. Finally, non-maximum suppression and thresholding are applied to filter out false positives.
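The final filtering step can be sketched as follows. This is a minimal, class-agnostic illustration of thresholding followed by greedy non-maximum suppression, not darknet's actual implementation; the function names and the threshold values (0.5 and 0.4) are assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, score_thresh=0.5, iou_thresh=0.4):
    """Filter a list of (score, box) pairs.

    1. Thresholding: drop detections scoring below score_thresh.
    2. Greedy NMS: visit the rest from highest to lowest score and keep
       a box only if it does not overlap an already kept box by more
       than iou_thresh.
    """
    candidates = sorted(
        (d for d in detections if d[0] >= score_thresh),
        key=lambda d: d[0],
        reverse=True,
    )
    kept = []
    for score, box in candidates:
        if all(iou(box, kept_box) <= iou_thresh for _, kept_box in kept):
            kept.append((score, box))
    return kept


detections = [
    (0.9, (0, 0, 10, 10)),
    (0.8, (1, 1, 11, 11)),        # overlaps the first box heavily
    (0.7, (50, 50, 60, 60)),
    (0.2, (100, 100, 110, 110)),  # below the score threshold
]
print(non_max_suppression(detections))
```

Here the second box is suppressed by the first and the last one is removed by thresholding, so two detections remain.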
Though the out-of-box Yolo model can already detect vehicles, we decided to retrain the network specifically for vehicle detection. The training process includes configuring the last convolution layer and some hyperparameters, and generating labels for the Udacity dataset in the format the training procedure expects. For simplicity, we used weights pre-trained on Imagenet (downloaded from Yolo's website) for the feature extraction model, then trained the last convolutional layer for detection. Before training, we filtered out labels that are too small (smaller than (1920x0.05) x (1200x0.05)), since we don't expect to detect vehicles that are very far away. Such labels could also introduce noise into the training process, because a label smaller than (1920x0.05) x (1200x0.05) occupies less than one cell of the subsampled 13x13 feature map.
The library used to train Yolo and a small tutorial can be found on Yolo's website. The configuration file, trained weights, and scripts to create the data set labels can be found in the corresponding folders of my repository. Another small tutorial on running the model can be found at the end of this writeup.
Training for 8000 epochs took more than 10 hours on my computer with a GTX 960 graphics card, by which point the network had reached a stable, high IOU score and recall. The retrained model performs better than the out-of-box Yolo model on project_video and on the Udacity dataset. On the GTX 960, the model runs the whole pipeline at about 20 fps, which is close to real time. In other words, real-time detection looks very achievable with a more powerful GPU.
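The label filtering described above could look roughly like the sketch below. It is an illustration only and not the script from the repository: the function names are hypothetical, and it assumes a label is dropped when both its width and its height fall below 5% of the 1920x1200 frame (the writeup does not say whether "smaller" applies to both dimensions or to either one). The conversion helper uses darknet's label format of a class index followed by the normalized box center, width, and height.

```python
IMG_W, IMG_H = 1920, 1200   # Udacity frame size, from the text above
MIN_FRAC = 0.05             # minimum box size as a fraction of the frame


def keep_label(xmin, ymin, xmax, ymax, img_w=IMG_W, img_h=IMG_H, min_frac=MIN_FRAC):
    """Return False for boxes that fit inside (img_w*min_frac) x (img_h*min_frac).

    Assumption: a box is "too small" only when BOTH dimensions are below
    the minimum; replacing `and` with `or` gives a stricter filter.
    """
    too_narrow = (xmax - xmin) < img_w * min_frac   # under about 96 px
    too_short = (ymax - ymin) < img_h * min_frac    # under about 60 px
    return not (too_narrow and too_short)


def to_darknet_label(class_id, xmin, ymin, xmax, ymax, img_w=IMG_W, img_h=IMG_H):
    """Format one box as a darknet label line:
    "<class> <x_center> <y_center> <width> <height>", all normalized to [0, 1].
    """
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return "%d %.6f %.6f %.6f %.6f" % (class_id, x_center, y_center, width, height)


# Hypothetical boxes as (xmin, ymin, xmax, ymax) in pixels.
boxes = [(480, 300, 1440, 900), (100, 100, 140, 130)]
for box in boxes:
    if keep_label(*box):
        print(to_darknet_label(0, *box))
```

For comparison, one cell of the 13x13 feature map covers roughly 148 x 92 pixels of the original frame (1920/13 by 1200/13), so a 96 x 60 box is indeed smaller than a single cell.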
Test result using the out-of-box Yolo model
Test result using the retrained model
As the author claims in the paper, Yolo is a high-accuracy real-time object detection framework, and it performs very well on project_video. There are still some limitations, however. Unlike SSD, Yolo does not collect information from high-resolution feature maps. This could make the model unstable when objects need to be detected at different scales (e.g. vehicles nearby and vehicles far away). Also, if we want the model to detect vehicles that occlude one another, we need to train it with many "partial" vehicle samples. This can sometimes confuse the model, so that it detects a single vehicle as a whole while also predicting it as two half vehicles.
Another important note is that in this project we just used the default 5 prior anchor boxes, which the Yolo author obtained by running k-means clustering on the bounding boxes of the PASCAL VOC dataset. Since PASCAL VOC is a general object detection dataset, the default set of anchor boxes might not be the best for vehicle detection.
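Anchor boxes tailored to vehicles could be derived by clustering the label sizes of the training set. The sketch below follows the idea from the Yolo v2 paper of using 1 - IOU as the distance, so that large boxes do not dominate the clustering. It is a simplified illustration, not the author's code: the function names, the deterministic initialization, and the sample box sizes are all assumptions.

```python
def wh_iou(a, b):
    """IOU of two boxes given only as (width, height), aligned at one corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def kmeans_anchors(boxes, k, iters=100):
    """Cluster (width, height) pairs into k anchors using distance 1 - IOU.

    Initial centroids are picked at evenly spaced positions among the boxes
    sorted by area (an assumption made here to keep the result deterministic;
    random initialization is more common).
    """
    ordered = sorted(boxes, key=lambda b: b[0] * b[1])
    centroids = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: highest IOU is the same as smallest 1 - IOU.
        clusters = [[] for _ in range(k)]
        for box in boxes:
            best = max(range(k), key=lambda i: wh_iou(box, centroids[i]))
            clusters[best].append(box)
        # Update step: mean width and height; keep the old centroid if empty.
        updated = [
            (sum(b[0] for b in c) / len(c), sum(b[1] for b in c) / len(c))
            if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if updated == centroids:
            break
        centroids = updated
    return sorted(centroids, key=lambda c: c[0] * c[1])


# Hypothetical label sizes in pixels: three small boxes and three large ones.
sizes = [(10, 10), (12, 10), (10, 12), (100, 50), (110, 50), (100, 60)]
print(kmeans_anchors(sizes, k=2))
```

The resulting widths and heights would still need to be converted into whatever unit the anchors entry of the network configuration expects before they can replace the defaults.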
While the model successfully detects vehicles in the video, it only performs frame-by-frame detection. There is no temporal relation, or "tracking", between detections in this project, even though tracking is much more useful for a self-driving car.
While the deep neural network approach has proven to be powerful, it is worth trying the traditional HOG + SVM approach for comparison. Adding other classes such as pedestrians, traffic signs, and traffic lights to the model is also important future work.
Darknet can be found on Yolo's webpage.
Download the darknet config files from this repository and copy them into the corresponding folders under darknet.
Download the trained weights (256MB) from google drive (link), and copy them into backup/ under darknet.
Assuming you are in the darknet folder, run:
./darknet detector test cfg/SDC.data cfg/SDC.cfg backup/SDC_8000.weights <image_file>
./darknet detector demo cfg/SDC.data cfg/SDC.cfg backup/SDC_8000.weights <video_file>
Note:
- Best compiled with CUDA-8.0, CUDNN, and OpenCV-3.1; pkg-config is required.
- A compile tutorial can be found on Yolo's webpage.
- The default bounding box line width in darknet is thick; you can change it in src/image.c by setting
int width = im.h * .005;
- If your GPU runs out of memory, try setting the batch size to 1: open cfg/SDC.cfg and, under the [net] section, set
batch = 1
subdivisions = 1