Abstract
Previous work on action recognition has focused on adapting hand-designed local features, such as SIFT or HOG, from static images to the video domain. In this paper, we propose using unsupervised feature learning as a way to learn features directly from video data. More specifically, we present an extension of the Independent Subspace Analysis algorithm to learn invariant spatio-temporal features from unlabeled video data. We discovered that, despite its simplicity, this method performs surprisingly well when combined with deep learning techniques such as stacking and convolution to learn hierarchical representations. By replacing hand-designed features with our learned features, we achieve classification results superior to all previous published results on the Hollywood2, UCF, KTH and YouTube action recognition datasets. On the challenging Hollywood2 and YouTube action datasets we obtain 53.3% and 75.8% respectively, which are approximately 5% better than the current best published results. Further benefits of this method, such as the ease of training and the efficiency of training and prediction, will also be discussed. You can download our code and learned spatio-temporal features here: http://ai.stanford.edu/~wzou/.
Keywords
Affiliated Institutions
Related Publications
Learning and Transferring Mid-level Image Representations Using Convolutional Neural Networks
Convolutional neural networks (CNN) have recently shown outstanding image classification performance in the large- scale visual recognition challenge (ILSVRC2012). The success o...
Residual Dense Network for Image Super-Resolution
A very deep convolutional neural network (CNN) has recently achieved great success for image super-resolution (SR) and offered hierarchical features as well. However, most deep ...
Unsupervised Feature Learning via Non-parametric Instance Discrimination
Neural net classifiers trained on data with annotated class labels can also capture apparent visual similarity among categories without being directed to do so. We study whether...
Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition
Dynamics of human body skeletons convey significant information for human action recognition. Conventional approaches for modeling skeletons usually rely on hand-crafted parts o...
Beta Process Joint Dictionary Learning for Coupled Feature Spaces with Application to Single Image Super-Resolution
This paper addresses the problem of learning over-complete dictionaries for the coupled feature spaces, where the learned dictionaries also reflect the relationship between the ...
Publication Info
- Year
- 2011
- Type
- article
- Citations
- 982
- Access
- Closed
External Links
Social Impact
Social media, news, blog, policy document mentions
Citation Metrics
Cite This
Identifiers
- DOI
- 10.1109/cvpr.2011.5995496