

Optical Flow and Human Action

(Top) We learn human flow from synthetically generated flow fields and find that this generalizes to real videos of human movement. (Bottom) We fine-tune an optical flow algorithm to produce flow that improves action recognition. (Left columns) SPyNet. (Right columns) FlowNet. In each set, left to right: first image in the sequence, original flow, flow when trained on action recognition, and the differences in the flow, which are focused on the human action.


Publications

Perceiving Systems Article Learning Multi-Human Optical Flow Ranjan, A., Hoffmann, D. T., Tzionas, D., Tang, S., Romero, J., Black, M. J. International Journal of Computer Vision (IJCV), 128(4):873-890, April 2020 (Published)
The optical flow of humans is well known to be useful for the analysis of human action. Recent optical flow methods focus on training deep networks to approach the problem; however, their training data does not cover the domain of human motion. Therefore, we develop a dataset of multi-human optical flow and train optical flow networks on it. We use a 3D model of the human body and motion capture data to synthesize realistic flow fields in both single- and multi-person images. We then train optical flow networks to estimate human flow fields from pairs of images. We demonstrate that our trained networks are more accurate than a wide range of top methods on held-out test data and that they generalize well to real image sequences. The code, trained models and the dataset are available for research.
pdf DOI poster URL BibTeX
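The core of the dataset construction above is that, with a 3D body model and known camera, ground-truth flow comes for free: the 2D flow at a body point is the difference between its pinhole projections in consecutive frames. Here is a minimal sketch of that idea (not the authors' code; the focal length and principal point are arbitrary illustrative values, and the real pipeline rasterizes flow densely over the mesh rather than at sparse vertices):

```python
import numpy as np

def project(points, f=500.0, cx=160.0, cy=120.0):
    """Pinhole projection of Nx3 camera-space points to Nx2 pixel coordinates."""
    x = f * points[:, 0] / points[:, 2] + cx
    y = f * points[:, 1] / points[:, 2] + cy
    return np.stack([x, y], axis=1)

def vertex_flow(verts_t0, verts_t1):
    """Ground-truth 2D flow at each mesh vertex between two frames:
    the difference of the projected positions."""
    return project(verts_t1) - project(verts_t0)

# Toy example: one vertex at depth 2m moving 0.1m parallel to the image plane.
v0 = np.array([[0.0, 0.0, 2.0]])
v1 = np.array([[0.1, 0.0, 2.0]])
print(vertex_flow(v0, v1))  # [[25. 0.]] -- f * 0.1 / 2 = 25 px horizontally
```

Because the body model supplies exact correspondences between frames, this yields pixel-accurate flow labels without any manual annotation.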

Perceiving Systems Ph.D. Thesis Towards Geometric Understanding of Motion Ranjan, A. University of Tübingen, December 2019
The motion of the world is inherently dependent on the spatial structure of the world and its geometry. Classical optical flow methods therefore try to model this geometry to solve for the motion. Recent deep learning methods take a completely different approach: they predict optical flow by learning from labelled data. Although deep networks have shown state-of-the-art performance on classification problems in computer vision, they have not been as effective in solving optical flow. The key reason is that deep learning methods do not explicitly model the structure of the world in a neural network, and instead expect the network to learn about it from data. We hypothesize that it is difficult for a network to learn about motion without any constraint on the structure of the world. Therefore, we explore several approaches to explicitly model the geometry of the world and its spatial structure in deep neural networks.

The spatial structure in images can be captured by representing them at multiple scales. To represent multiple scales of images in deep neural networks, we introduce the Spatial Pyramid Network (SPyNet). Such a network can leverage global information for estimating large motions and local information for estimating small motions. We show that SPyNet significantly improves over previous optical flow networks while also being the smallest and fastest neural network for motion estimation, achieving a 97% reduction in model parameters over previous methods while being more accurate.

The spatial structure of the world extends to people and their motion. Humans have a well-defined structure, and this information is useful for estimating the optical flow of humans. To leverage it, we create a synthetic dataset for human optical flow using a statistical human body model and motion capture sequences. We use this dataset to train deep networks and see a significant improvement in their ability to estimate human optical flow.

The structure and geometry of the world affect the motion. Therefore, learning about the structure of the scene together with the motion can benefit both problems. To facilitate this, we introduce Competitive Collaboration, in which several neural networks are constrained by geometry and can jointly learn about structure and motion in the scene without any labels. We show that jointly learning single-view depth prediction, camera motion, optical flow and motion segmentation using Competitive Collaboration achieves state-of-the-art results among unsupervised approaches.

Our findings support our hypothesis that explicit constraints on the structure and geometry of the world lead to better methods for motion estimation.
PhD Thesis BibTeX
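The coarse-to-fine pyramid idea behind SPyNet can be sketched in a few lines: each level only predicts a small residual on top of the upsampled estimate from the coarser level, which is what keeps the per-level networks tiny. This is a schematic sketch, not the actual architecture; `residual_fn` is a hypothetical stand-in for the per-level CNN, which in the real method also takes the image pair warped by the current flow:

```python
import numpy as np

def upsample2x(flow):
    """Nearest-neighbour upsampling; flow vectors are doubled along with resolution."""
    return 2.0 * np.repeat(np.repeat(flow, 2, axis=0), 2, axis=1)

def pyramid_flow(residual_fn, num_levels, base_shape):
    """Coarse-to-fine estimation in the spirit of SPyNet: start from zero flow
    at the coarsest level and add a predicted residual at each finer level."""
    h, w = base_shape
    flow = np.zeros((h, w, 2))                  # coarsest-level initialization
    for level in range(num_levels):
        if level > 0:
            flow = upsample2x(flow)             # initialize from the coarser level
        flow = flow + residual_fn(level, flow)  # small per-level correction
    return flow

# Hypothetical residual predictor standing in for the per-level CNN:
res = lambda level, flow: np.full(flow.shape, 0.5)
out = pyramid_flow(res, num_levels=3, base_shape=(2, 2))
print(out.shape)  # (8, 8, 2)
```

Because each level handles only the residual motion left over from the coarser scale, large displacements are resolved cheaply at low resolution while fine levels need only small corrections.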

Perceiving Systems Conference Paper Learning Human Optical Flow Ranjan, A., Romero, J., Black, M. J. In 29th British Machine Vision Conference, September 2018
The optical flow of humans is well known to be useful for the analysis of human action. Given this, we devise an optical flow algorithm specifically for human motion and show that it is superior to generic flow methods. Designing a method by hand is impractical, so we develop a new training database of image sequences with ground truth optical flow. For this we use a 3D model of the human body and motion capture data to synthesize realistic flow fields. We then train a convolutional neural network to estimate human flow fields from pairs of images. Since many applications in human motion analysis depend on speed, and we anticipate mobile applications, we base our method on SPyNet with several modifications. We demonstrate that our trained network is more accurate than a wide range of top methods on held-out test data and that it generalizes well to real image sequences. When combined with a person detector/tracker, the approach provides a full solution to the problem of 2D human flow estimation. Both the code and the dataset are available for research.
video code pdf URL BibTeX
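Accuracy comparisons like the one above are conventionally reported as average endpoint error (EPE): the mean Euclidean distance between predicted and ground-truth flow vectors over all pixels. A minimal sketch of the metric (the function name is ours, not from the paper's code):

```python
import numpy as np

def average_epe(flow_pred, flow_gt):
    """Average endpoint error between two H x W x 2 flow fields:
    mean per-pixel Euclidean distance of the flow vectors."""
    return np.linalg.norm(flow_pred - flow_gt, axis=-1).mean()

pred = np.zeros((4, 4, 2))
gt = np.full((4, 4, 2), 1.0)   # every pixel off by the vector (1, 1)
print(average_epe(pred, gt))   # sqrt(2) ~ 1.4142
```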

Perceiving Systems Conference Paper Learning from Synthetic Humans Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., Schmid, C. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 4627-4635, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017
Estimating human pose, shape, and motion from images and videos is a fundamental challenge with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time-consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.
arXiv project data BibTeX