Discriminative Video Pattern Search for
Efficient Action Detection
Actions are spatio-temporal patterns. Similar to sliding window-based object detection, action detection finds re-occurrences of such spatio-temporal patterns through pattern matching, despite cluttered and dynamic backgrounds and other action variations. We address two critical issues in pattern matching-based action detection: (1) tolerating intra-pattern variations in actions, and (2) performing action pattern search efficiently in crowded videos. First, we propose a discriminative pattern matching criterion for action classification, called naive-Bayes based mutual information maximization (NBMIM). Each action is characterized by a collection of spatio-temporal invariant features, and we match it with an action class by measuring the mutual information between them. Based on this matching criterion, action detection locates the subvolume in the volumetric video space with maximum mutual information toward a specific action class. A novel spatio-temporal branch-and-bound (STBB) search algorithm is designed to find the optimal solution efficiently. The proposed action detection method does not rely on human detection, tracking, or background subtraction. It handles action variations such as speed and style changes, as well as scale changes, and it is insensitive to dynamic, cluttered backgrounds and even partial occlusions. Cross-dataset experiments on action detection, including the KTH and CMU action datasets and the new MSR action dataset, demonstrate the effectiveness and efficiency of the proposed multi-class, multiple-instance action detection method.
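To make the detection objective concrete, the sketch below illustrates the idea of searching for the subvolume with maximum total score, assuming each voxel of the video already carries a discriminative voting score (positive evidence for the target class, negative otherwise). This is a brute-force illustration of the objective only, not the STBB algorithm, and the function and variable names are our own.

```python
import numpy as np

def max_sum_subvolume(scores):
    """Exhaustive search for the axis-aligned subvolume (t0:t1, y0:y1, x0:x1)
    with the maximum total voxel score. Brute force for illustration; the
    paper's STBB search reaches the same optimum far more efficiently."""
    T, H, W = scores.shape
    # Padded 3D integral volume: I[t, y, x] = sum of scores[:t, :y, :x].
    I = np.zeros((T + 1, H + 1, W + 1))
    I[1:, 1:, 1:] = scores.cumsum(0).cumsum(1).cumsum(2)
    best, best_box = float("-inf"), None
    for t0 in range(T):
        for t1 in range(t0 + 1, T + 1):
            for y0 in range(H):
                for y1 in range(y0 + 1, H + 1):
                    for x0 in range(W):
                        for x1 in range(x0 + 1, W + 1):
                            # Subvolume sum via inclusion-exclusion.
                            s = (I[t1, y1, x1] - I[t0, y1, x1]
                                 - I[t1, y0, x1] - I[t1, y1, x0]
                                 + I[t0, y0, x1] + I[t0, y1, x0]
                                 + I[t1, y0, x0] - I[t0, y0, x0])
                            if s > best:
                                best, best_box = s, (t0, t1, y0, y1, x0, x1)
    return best, best_box
```

Because voxel scores can be negative, the maximizing subvolume is a genuine localization: growing the box past the action accumulates negative evidence and lowers the score.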
The MSR action dataset used in the CVPR 2009 paper below is available for noncommercial research use. Here is the license agreement.
If you use this dataset, please cite the following paper:
Junsong Yuan, Zicheng Liu, and Ying Wu, "Discriminative Subvolume Search for Efficient Action Detection," IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009.
The test dataset contains 16 video sequences with 63 actions in total: 14 hand clapping, 24 hand waving, and 25 boxing, performed by 10 subjects. Each sequence contains multiple types of actions, and some sequences contain actions performed by different people. There are both indoor and outdoor scenes, and all of the video sequences are captured with cluttered and moving backgrounds. Each video has a low resolution of 320 x 240 and a frame rate of 15 frames per second; the lengths range from 32 to 76 seconds. To evaluate performance, we manually labeled a spatio-temporal bounding box for each action. The ground truth labels can be found in the groundtruth.txt file. The format of each labeled action is "X width Y height T length".
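As a concrete illustration, a minimal Python parser for this ground-truth format might look like the following. The `ActionBox` class is our own, and the assumption that each labeled action occupies one line of six whitespace-separated integers is ours; consult the actual groundtruth.txt for the exact layout.

```python
from dataclasses import dataclass

@dataclass
class ActionBox:
    """Spatio-temporal bounding box in the 'X width Y height T length' order:
    spatial origin (x, y), starting frame t, and extents in pixels/frames."""
    x: int
    width: int
    y: int
    height: int
    t: int
    length: int

def parse_groundtruth_line(line: str) -> ActionBox:
    """Parse one labeled action, assuming six whitespace-separated integers
    (an assumption about the file layout, not a documented guarantee)."""
    x, width, y, height, t, length = (int(v) for v in line.split())
    return ActionBox(x, width, y, height, t, length)
```

For example, `parse_groundtruth_line("10 50 20 60 100 45")` would describe an action starting at pixel (10, 20) and frame 100, spanning 50 x 60 pixels over 45 frames.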