Action Recognition in videos is an active research field that is fueled by an acute need, spanning several application domains. Still, existing systems fall short of the applications’ needs in real-world scenarios, where the quality of the video is less than optimal and the viewpoint is uncontrolled and often not static. In this paper, we consider the key elements of motion encoding and focus on capturing local changes in motion directions. In addition, we decouple image edges from motion edges using a suppression mechanism, and compensate for global camera motion by using an especially fitted registration scheme. Combined with a standard bag-of-words technique, our methods achieves state-of-the-art performance in the most recent and challenging benchmarks.
The Motion Interchange Patterns (MIP) representation. (a) Our encoding is based on comparing two SSD scores computed between three patches from three consecutive frames. Relative to the location of the patch in the current frame, the location of the patch in the previous (next) frame is said to be in direction i (j); The angle between directions i and j is denoted alpha. (b) Illustrating the different motion patterns captured by different i and j values. Blue arrows represent motion from a patch in position i in the previous frame; red for the motion to the patch j in the next frame. Shaded diagonal strips indicate same alpha values.
Below are some results of the MIP representation, applied to recent Action Recognition benchmarks. For additional results and more details, please see the paper.
|System||No CSML||With CSML|
|LTP [Yeffet & Wolf 09]||55.45 +- 0.6%||57.2||58.50 +- 0.7%||62.4|
|HOG [Laptev et al. 08]||58.55 +- 0.8%||61.59||60.15 +- 0.6%||64.2|
|HOF [Laptev et al. 08]||56.82 +- 0.6%||58.56||58.62 +- 1.0%||61.8|
|HNF [Laptev et al. 08]||58.67 +- 0.9%||62.16||57.20 +- 0.8%||60.5|
|MIP single channel alpha=0||58.27 +- 0.6%||61.7||61.52 +- 0.8%||66.5|
|MIP single best channel alpha=1||61.45 +- 0.8%||66.1||63.55 +- 0.8%||69.0|
|MIP w/o suppression||61.67 +- 0.9%||66.4||63.17 +- 1.1%||68.4|
|MIP w/o motion compensation||62.27 +- 0.8%||66.4||63.57 +- 1.0%||69.5|
|MIP w/o both||60.43 +- 1.0%||64.8||63.08 +- 0.9%||68.2|
|MIP on stabilized clips||59.73 +- 0.77%||62.9||62.30 +- 0.77%||66.4|
|MIP||62.23 +- 0.8%||67.5||64.62 +- 0.8%||70.4|
|HOG+HOF+HNF||60.88 +- 0.8%||65.3||63.12 +- 0.9%||68.0|
|HOG+HOF+HNF with OSSML|
[Kliper-Gross et al. 11]
|62.52 +- 0.8%||66.6||64.25 +- 0.7%||69.1|
|MIP+HOG+HOF+HNF||64.27 +- 1.0%||69.2||65.45 +- 0.8%||71.92|
Table 1. Comparison to previous results on the ASLAN benchmark. The average accuracy and standard error on the ASLAN benchmark is given for a list of methods (see text for details). All HOG, HOF, and HNF results are taken from [Kliper-Gross et al. 11, Kliper-Gross et al. 12].
|System||Original clips||Stabilized clips|
|HOG/HOF [Laptev et al. 08]||20.44%||21.96%|
|C2 [Jhuang et al. 07]||22.83%||23.18%|
|Action Bank [Sadanand & Corso 12]||26.90%||N/A|
Table 2. Comparison to previous results on the HMDB51 database. Since our method contains a motion compensation component, we tested our method on the more challenging unstabilized videos. Our method significantly outperforms the best results obtained by previous work.
|HOG/HOF [Laptev et al. 08]||47.9%||N/A|
|Action Bank [Sadanand & Corso 12]||57.9%||N/A|
Table 3. Comparison to previous results on the UCF50 database. Our method significantly outperforms all reported methods.
MIP video descriptor for action recognition - Code
If you use this code in your own work, please cite our paper.
Copyright and disclaimer
Copyright 2012, Orit Kliper-Gross, Yaron Gurovich, Tal Hassner, and Lior Wolf