- Multiple People Tracking by Lifted Multicut and Person Re-Identification
- Multi-Object Tracking With Quadruplet Convolutional Neural Networks
- Joint Graph Decomposition & Node Labeling: Problem, Algorithms, Applications
- Beyond Local Search: Tracking Objects Everywhere with Instance-Specific Proposals
- (RANK 8, SCEA) Online Multi-Object Tracking via Structural Constraint Event Aggregation
- The Solution Path Algorithm for Identity-Aware Multi-Object Tracking
- Multi-view People Tracking via Hierarchical Trajectory Composition
- Target Identity-aware Network Flow for Online Multiple Target Tracking
- Multihypothesis Trajectory Analysis for Robust Visual Tracking
- An Online Learned Elementary Grouping Model for Multi-target Tracking
- Multiple Target Tracking Based on Undirected Hierarchical Relation Hypergraph
- A Probabilistic Framework for Multitarget Tracking with Mutual Occlusions
- Multi-target Tracking with Motion Context in Tensor Power Iteration
- Learning an Image-based Motion Context for Multiple People Tracking
- Multi-target Tracking by Lagrangian Relaxation to Min-cost Network Flow
- Robust Real-Time Tracking of Multiple Objects by Volumetric Mass Densities
- Tracking Sports Players with Context-Conditioned Motion Models
- Hypergraphs for Joint Multi-view Reconstruction and Multi-object Tracking
- Detection- and Trajectory-Level Exclusion in Multiple Object Tracking
- (RANK 3, NOMT) Near-Online Multi-Target Tracking With Aggregated Local Flow Descriptor
- (RANK 5, MHT_DAM) Multiple Hypothesis Tracking Revisited
- (RANK 6, MDP) Learning to Track: Online Multi-Object Tracking by Decision Making
- Unsupervised Object Discovery and Tracking in Video Collections
- FollowMe: Efficient Online Min-Cost Flow Tracking with Bounded Memory and Computation
- Learning to Divide and Conquer for Online Multi-Target Tracking
- Latent Data Association: Bayesian Model Selection for Multi-target Tracking
- Tracking via Robust Multi-task Multi-view Joint Sparse Representation
- The Way They Move: Tracking Multiple Targets with Similar Appearance
- Higher Order Matching for Consistent Multiple Target Tracking
- Discriminative Label Propagation for Multi-object Tracking with Sporadic Appearance Features
- (Arxiv 2017) Tracking the Trackers: An Analysis of the State of the Art in Multiple Object Tracking
- (AAAI 2017) Online Multi-Target Tracking Using Recurrent Neural Networks
- (ICML 2017) Analysis and Optimization of Graph Decompositions by Lifted Multicuts
- (Arxiv 2017) NoScope: Optimizing Neural Network Queries over Video at Scale
- (Arxiv 2017) Simple Online and Realtime Tracking with a Deep Association Metric
- (ICIP 2017) Instance Flow Based Online Multiple Object Tracking
- (RANK 1, MDPNN, Arxiv 2017) Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies
- (RANK 2, TSMLCDEnew, PAMI 2017) Tracklet Association by Online Target-Specific Metric Learning and Coherent Dynamics Estimation
- (Arxiv 2016) On The Stability of Video Detection and Tracking
- (ECCV 2016 workshop) POI: Multiple Object Tracking with High Performance Detection and Appearance Feature
- (Arxiv 2016) Spatially Supervised Recurrent Convolutional Neural Networks for Visual Object Tracking
- (RANK 4, TDAM, CVIU 2016) Temporal Dynamic Appearance Modeling for Online Multi-Person Tracking
- (RANK 7, CNNTCM, CVPR 2016 Workshop) Joint Learning of Siamese CNNs and Temporally Constrained Metrics for Tracklet Association
- (RANK 9, SiameseCNN, CVPR 2016 Workshop) Learning by tracking: Siamese CNN for robust target association
- (RANK 10, TbX, Arxiv 2016) Tracking with multi-level features
Ordered based on their overall performance ranking on the MOT challenges.
- (FWT, Arxiv 2017) Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking
- (jCC, Arxiv 2016) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
- (MHT_DAM, ICCV 2015) Multiple Hypothesis Tracking Revisited
- (EDMT17, CVPRw 2017) Enhancing Detection Model for Multiple Hypothesis Tracking
- (IOU17, AVSS 2017) High-Speed Tracking-by-Detection Without Using Image Information
- (LMP, CVPR 2017) Multiple People Tracking with Lifted Multicut and Person Re-identification
- (FWT, Arxiv 2017) Improvements to Frank-Wolfe optimization for multi-detector multi-object tracking
- (NLLMPa, CVPR 2017) Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications
- (AMIR, Arxiv 2017) Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies
- (MCjoint, CoRR 2016) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
- (NOMT, ICCV 2015) Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor
- (JMC, BMTT 2016) Multi-Person Tracking by Multicuts and Deep Matching
- (STAM16, Arxiv 2017) Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism
- (MHT_DAM, ICCV 2015) Multiple Hypothesis Tracking Revisited
- (EDMT, CVPRw 2017) Enhancing Detection Model for Multiple Hypothesis Tracking
- (QuadMOT16, CVPR 2017) Multi-Object Tracking with Quadruplet Convolutional Neural Networks
- (oICF, AVSS 2016) Online multi-person tracking using Integral Channel Features
- (AMIR15, Arxiv 2017) Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies
- (JointMC, CoRR 2016) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects
- (HybridDAT, TIP 2016) A Hybrid Data Association Framework for Robust Online Multi-Object Tracking
- (AM, Arxiv 2017) Online Multi-Object Tracking Using CNN-based Single Object Tracker with Spatial-Temporal Attention Mechanism
- (TSMLCDEnew, Arxiv 2015) Tracklet Association by Online Target-Specific Metric Learning and Coherent Dynamics Estimation
- (QuadMOT, CVPR 2017) Multi-Object Tracking with Quadruplet Convolutional Neural Networks
- (NOMT, ICCV 2015) Near-Online Multi-target Tracking with Aggregated Local Flow Descriptor
- (TDAM, CVIU 2016) Temporal Dynamic Appearance Modeling for Online Multi-Person Tracking
- (MHT_DAM, ICCV 2015) Multiple Hypothesis Tracking Revisited
- (MDP, ICCV 2015) Learning to Track: Online Multi-Object Tracking by Decision Making
- (CNNTCM, CVPRw 2016) Joint Learning of Siamese CNNs and Temporally Constrained Metrics for Tracklet Association
- (SCEA, CVPR 2016) Online Multi-Object Tracking via Structural Constraint Event Aggregation
- (SiameseCNN, CVPRw 2016) Learning by tracking: Siamese CNN for robust target association
- (TBX, Arxiv 2016) Tracking with multi-level features
- (oICF, AVSS 2016) Online multi-person tracking using Integral Channel Features
- (TO, WACV 2016) Leveraging single for multi-target tracking using a novel trajectory overlap affinity measure
This repository contains references for papers and code for the Multiple Object Tracking 2017 (MOT17) project. To reduce the repository size, most documents are provided as links.
- (Arxiv17) Simple Online and Realtime Tracking with a Deep Association Metric. [pdf] [code]
- (Arxiv17) NoScope: 1000x Faster Deep Learning Queries over Video. [project] [pdf] [code]
- (ICCV17) Focal Loss for Dense Object Detection. [pdf]
- (Arxiv16) On The Stability of Video Detection and Tracking. [pdf]
- (Arxiv17) Optimizing Deep CNN-Based Queries over Video Streams at Scale. [pdf] [code]
- (CVPR13) Visual Tracking via Locality Sensitive Histograms. [project] [pdf] [code]
- (CVPR17) A Multi-cut Formulation for Joint Segmentation and Tracking of Multiple Objects. [pdf]
- (CVPR17) Joint Graph Decomposition and Node Labeling: Problem, Algorithms, Applications. [pdf]
- (CVPR17) Multiple People Tracking by Lifted Multicut and Person Re-identification. [pdf]
- (ICML17) Analysis and Optimization of Graph Decompositions by Lifted Multicuts. [pdf]
- (CVPR17) Densely Connected Convolutional Networks. [pdf] [code]
- (CVPR17) Feature Pyramid Networks for Object Detection. [pdf]
Given a video containing moving objects of a specific class (e.g., pedestrians or vehicles), the task of multiple object tracking (MOT) is to locate all the objects of interest and associate them to find their correspondence across time.
Tracking-by-detection is currently the most successful paradigm among MOT methods. The paradigm separates tracking into two stages: first, an object detector is applied to each video frame; second, a tracker associates these detections into tracks. This note surveys tracking-by-detection methods, where the input is a video together with all its detections, and the output is the tracking result.
Let's first look at a very simple baseline implementation of multiple object tracking, see what problems it produces, and try to improve it.
(Note: for clarity, we call the objects we have tracked over time, i.e., through frames 1~t, tracks, and the unassociated objects detected in the new frame t+1 detections.)
A simple and intuitive idea is to associate the detections in consecutive frames by their spatial overlap between time steps: the detections with the highest IOU (Intersection-Over-Union) probably belong to the same object. This assumption gets closer to the truth as the frame rate increases and the detector becomes more reliable.
With this basic idea, we complete the implementation by answering the following questions.
- How to find the correspondence between the previous tracks and the new detections?
  A greedy method: for each track, compute its IOU with all the detections. If the maximum IOU exceeds a threshold (e.g., 0.5), add that detection to the track and remove it from the detection set. Loop over all the tracks to find their corresponding detections.
- How to determine the initialization and the termination of a track?
  An unassociated detection is initialized as a new track, and a track without a corresponding detection is removed (i.e., terminated).
- How to filter out the false positives among the detections?
  By removing: 1. short tracks (filter out all tracks shorter than some length, e.g., 3); 2. low-scoring tracks (remove all tracks without at least one detection scoring above some value, e.g., 0.3).
- How to improve the completeness of a track?
  The key is the use of low-scoring detections. "Requiring a track to have at least one high-scoring detection ensures that the track belongs to a true object of interest while benefiting from low-scoring detections for the completeness of the track."
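The steps above can be sketched as a minimal IOU tracker. This is an illustrative sketch, not the paper's reference implementation: the helper names are made up, and the thresholds follow the example values in the text.

```python
# Minimal sketch of an IOU tracker (illustrative; names and thresholds
# are taken from the examples above, not from tuned reference code).

def iou(a, b):
    """IOU of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def track_iou(frames, sigma_iou=0.5, sigma_h=0.3, t_min=3):
    """frames: list of per-frame detection lists [(box, score), ...]."""
    active, finished = [], []
    for detections in frames:
        dets = list(detections)
        updated = []
        for track in active:
            # Greedy step: the detection with the highest IOU to the
            # track's last box is assigned to the track.
            if dets:
                best = max(dets, key=lambda d: iou(track['boxes'][-1], d[0]))
                if iou(track['boxes'][-1], best[0]) >= sigma_iou:
                    track['boxes'].append(best[0])
                    track['max_score'] = max(track['max_score'], best[1])
                    dets.remove(best)
                    updated.append(track)
                    continue
            # No matching detection: terminate the track; keep it only if it
            # is long enough and has at least one high-scoring detection.
            if len(track['boxes']) >= t_min and track['max_score'] >= sigma_h:
                finished.append(track)
        # Every unassociated detection starts a new track.
        active = updated + [{'boxes': [b], 'max_score': s} for b, s in dets]
    finished += [t for t in active
                 if len(t['boxes']) >= t_min and t['max_score'] >= sigma_h]
    return finished
```

For example, an object drifting right over four frames plus a single far-away, low-scoring detection yields one surviving track: the spurious detection starts a track that is filtered out for being too short and too low-scoring.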
The simple implementation forms our first reviewed paper with its code publicly available:
- E. Bochinski, V. Eiselein and T. Sikora, "High-speed tracking-by-detection without using image information", AVSS 2017. [pdf] [code]
Despite its simplicity, it runs very fast (100K fps) and still achieves an average rank of 7.2 on MOT17 with a MOTA score of 45.5 (with the EB detector; see here for details).
The basic tracker is efficient but fragile: an occlusion or a single missed detection terminates a track immediately, and when the objects or the camera move fast, or when the frame rate is low, the IOUs between corresponding detections can be small or even close to 0, making the costs/similarity scores unreliable. Moreover, the greedy assignment process is problematic when interactions or mutual occlusions occur among nearby objects.
Revisiting the basic tracker, we can identify several algorithmic modules: the initialization and termination processes, a pair-wise cost function (the IOU criterion) and an assignment process (a greedy method for finding the correspondences). Before we start designing a better tracking algorithm, let's list the modules we'd like to improve:
We'd like to:
- reduce the false positives (wrong detections) in the initialization process, and the false negatives (occlusions or missed detections) in the termination process (by lazy evaluation);
- improve the pair-wise cost/similarity functions (by introducing appearance and motion models);
- choose a better optimizer for the assignment problem than the greedy solver (by the Hungarian algorithm).
Now let's start the design.
- Initialization. New tracks start as Tentative; after 3 frames of associated detections, they change to Active.
- Termination. A track is terminated only after 30 consecutive frames without an associated detection.
- Appearance Model. A Siamese CNN.
- Motion Model. Kalman filters.
- Pair-wise Similarity. The Siamese CNN score, plus the IOU with the Kalman-predicted box.
- Assignment Problem Solver. The Hungarian algorithm.
The above summarizes the DeepSORT algorithm of the following paper (source code available):
- N. Wojke, A. Bewley and D. Paulus, "Simple Online and Realtime Tracking with a Deep Association Metric", Arxiv 2017. [pdf] [code]
The DeepSORT algorithm runs at approximately 40 Hz and achieves a MOTA of 61.4 with high-performance detections (see here).
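The association step of a DeepSORT-style design can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: a constant-velocity shift stands in for the full Kalman prediction, a hand-written cosine distance stands in for the learned appearance metric, and, for the tiny matrices in this sketch, exhaustive permutation search stands in for the Hungarian algorithm.

```python
# Sketch of a combined-cost assignment step (illustrative stand-ins for the
# Kalman filter, the appearance metric, and the Hungarian algorithm).
from itertools import permutations
import math

def iou(a, b):
    """IOU of two boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def cosine_dist(u, v):
    """Cosine distance between two non-zero appearance feature vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def predict(track):
    """Constant-velocity stand-in for the Kalman prediction step:
    shift the last box by the last observed displacement."""
    (x1, y1, x2, y2), (vx, vy) = track['box'], track['vel']
    return (x1 + vx, y1 + vy, x2 + vx, y2 + vy)

def assign(tracks, dets, w=0.5, max_cost=0.7):
    """Combined cost: w * appearance distance + (1 - w) * (1 - IOU with the
    predicted box). Exhaustive search over permutations replaces the
    Hungarian algorithm (assumes len(tracks) <= len(dets) for brevity);
    pairs whose cost exceeds max_cost are rejected afterwards (gating)."""
    cost = [[w * cosine_dist(t['feat'], d['feat'])
             + (1 - w) * (1.0 - iou(predict(t), d['box']))
             for d in dets] for t in tracks]
    best, best_perm = float('inf'), ()
    for perm in permutations(range(len(dets)), len(tracks)):
        total = sum(cost[i][j] for i, j in enumerate(perm))
        if total < best:
            best, best_perm = total, perm
    return [(i, j) for i, j in enumerate(best_perm) if cost[i][j] <= max_cost]
```

With two tracks whose predicted boxes and appearance features each match one of two detections, `assign` returns the matching track/detection index pairs even when the detection list is shuffled.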
So far, we have treated the initialization, termination, loss and re-discovery of objects as trivial problems solved by simple hand-crafted tricks. Let's revisit the state changes of an object from a different perspective.
There are only four possible states for an object during tracking: appearance, disappearance, tracked and lost. An object that is lost might be re-discovered in later frames, while one determined to have disappeared is considered never to come back to the scene. The lifetime of an object can therefore be modeled as transitions among its four possible states, which is exactly a Markov Decision Process (MDP).
The MDP consists of the tuple (S, A, T, R), where S is the state set, A denotes the action set, T : S × A → S is the state transition function describing the effect of each action in each state, and R : S × A → ℝ is the real-valued reward function defining the immediate reward received after executing action a in state s. Naming the "appearance" and "disappearance" states "Active" and "Inactive", the possible transitions between the four states are depicted in the figure above. Only seven transitions/actions are possible.
In an MDP, a policy π is a mapping from the state space S to the action space A, i.e., π : S → A. The remaining task is therefore to define the policies in three states (excluding the Inactive state, since an inactive object will never appear again).
- Policy in an Active State
- Policy in a Tracked State
- Policy in a Lost State
The idea originated in the following paper, with public source code available:
- Y. Xiang, A. Alahi and S. Savarese, "Learning to Track: Online Multi-Object Tracking by Decision Making", ICCV 2015. [pdf] [code]
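The four states and seven transitions above can be sketched as a small state machine. This is an illustration only: in the paper, each decision among the available actions is made by a learned policy, whereas here the caller picks the action explicitly, and the action names `a1`..`a7` are hypothetical labels.

```python
# Simplified state machine for the MDP tracker's four states and seven
# transitions (illustrative; in the paper, learned policies choose actions).
ACTIONS = {
    'a1': ('Active',  'Tracked'),   # detection confirmed as a true positive
    'a2': ('Active',  'Inactive'),  # detection rejected as a false positive
    'a3': ('Tracked', 'Tracked'),   # keep tracking the target
    'a4': ('Tracked', 'Lost'),      # target lost (e.g. occlusion)
    'a5': ('Lost',    'Lost'),      # target stays lost
    'a6': ('Lost',    'Tracked'),   # target re-identified with a detection
    'a7': ('Lost',    'Inactive'),  # target confirmed gone for good
}

class Target:
    def __init__(self):
        self.state = 'Active'  # every target starts in the Active state

    def step(self, action):
        src, dst = ACTIONS[action]
        if self.state != src:
            raise ValueError(f'action {action} is invalid in state {self.state}')
        self.state = dst
        return self.state
```

For example, a target can go Active → Tracked → Lost → Tracked (re-identified after an occlusion), while an Inactive target has no outgoing actions at all.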
How many features can we extract and utilize to associate tracks with detections, or to determine the pair-wise similarity/cost between a track and a detection?
- Appearance Features
- Motion Features
- Interaction Features
- A. Sadeghian, A. Alahi and S. Savarese, "Tracking The Untrackable: Learning To Track Multiple Cues with Long-Term Dependencies", CVPR 2017. [pdf] [project]
MultiCut. Linear programming + xxx. Node labeling and graph decomposition, etc.
Random notes.
Features used in multiple object tracking (MOT) amount to no more than appearance, motion and interaction features. Two questions need to be asked before using them in MOT:
- How to combine the features?
- How to model long term dependencies?
This paper studies the above problems.
Note that in MOT there are objects we have tracked continuously through frames 1~t, and detections in frame t+1 that we wish to assign to the tracked objects.
This paper models the appearance, motion and interaction cues independently as RNNs (the module RNNs), whose outputs are fed to yet another RNN (the target RNN) to generate similarity scores between objects and detections. With the similarity (cost) matrix obtained, the assignment problem is solved by the Hungarian algorithm.
- The module RNN solves the problem of long term dependencies.
- The target RNN solves the problem of feature selection/combination.
All the following RNNs are LSTMs.
- Appearance Model
Basic input feature: raw image content.
A CNN followed by an RNN extracts the object feature; this feature is then concatenated with the detection's CNN feature and fed to fully connected layers to generate the final k-dimensional appearance feature.
- Motion Model
Basic input feature: the velocity vector (vx, vy).
An RNN takes the velocity vectors as input to extract the H-dimensional object feature, and a fully connected layer extracts the detection feature. The two features are concatenated and fed to a fully connected layer to generate the final k-dimensional motion feature.
- Interaction Model
Basic input feature: a flattened occupancy grid. The image is divided into equal cells, and each cell in the neighborhood of an object is marked 1 if it contains another target, otherwise 0.
An RNN takes the flattened occupancy grids as input to extract the H-dimensional object feature, and a fully connected layer extracts the detection feature. The two features are concatenated and fed to a fully connected layer to generate the final k-dimensional interaction feature.
- Target Model
An RNN followed by fully connected layers that takes the concatenated 3k-dimensional features as input and outputs the similarity score between an object and a detection.
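As a concrete example of the interaction model's input described above, here is a sketch of building the flattened occupancy grid. The helper name, the cell size, and the neighborhood radius are illustrative assumptions, not values from the paper.

```python
def occupancy_grid(target, others, cell=10, radius=1):
    """Flattened occupancy grid around a target (illustrative sketch).
    The area around the target's center is divided into (2*radius+1)^2
    square cells of side `cell`; a cell is 1 if another target's center
    falls inside it, else 0. `target` and `others` are (x, y) centers."""
    tx, ty = target
    grid = []
    for gy in range(-radius, radius + 1):       # row-major flattening
        for gx in range(-radius, radius + 1):
            # cell (gx, gy) covers [tx+gx*cell, tx+(gx+1)*cell) in x
            # and [ty+gy*cell, ty+(gy+1)*cell) in y
            occupied = any(
                tx + gx * cell <= ox < tx + (gx + 1) * cell and
                ty + gy * cell <= oy < ty + (gy + 1) * cell
                for ox, oy in others)
            grid.append(1 if occupied else 0)
    return grid
```

For a target at (50, 50) with neighbors at (55, 55) and (45, 42), the 3×3 grid has ones in the center cell and the lower-left cell, and zeros elsewhere.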
- Training Process
First, each RNN, as well as the CNN, is pre-trained separately with a standard softmax classifier and cross-entropy loss, where positives are matched object/detection pairs and negatives are mismatched ones. Second, the target RNN is jointly trained end-to-end with the component RNNs.
Achieves MOTA 47.2 on MOT16 and 37.6 on 2DMOT15, running at 1 Hz.
The paper reports that the history (long-term dependencies in the LSTMs) helps, the combination helps, and each cue matters.
Each object in MOT may fall into one of four states: active, inactive, tracked, or lost.
- Active: the initial state of any target; entered whenever an object is detected.
- Tracked: the target is confirmed as a true positive from the object detector.
- Lost: the target is lost, e.g., due to occlusion or leaving the field of view.
- Inactive: the target is confirmed lost and stays inactive forever.
The paper formulates MOT as decision making in Markov Decision Processes (MDPs).
- State Space: the four states active, inactive, tracked and lost.
- Action Space: the feasible transitions from one state to another.
- Transition Function: describes the effect of each action in each state.
- Reward Function: defines the reward received after executing an action in a state.
The rest of the paper designs reward functions (sometimes with trainable models) for the seven possible transitions in the state space.
The main contribution of the paper is the proposed framework for modeling object states and state transitions.