
Inferring Actions

Top: Temporal Action Localization (TAL). The bilinear pooling algorithm recognizes and localizes all actions in an RGB video (yellow background). The hierarchical clustering algorithm uses both RGB and 3D joint positions for TAL (blue box). Below: The Action Recognition model in BABEL predicts the action in a mocap sequence.

Publications

Perceiving Systems Conference Paper BABEL: Bodies, Action and Behavior with English Labels Punnakkal, A. R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, M. A., Black, M. J. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 722-731, IEEE, Piscataway, NJ, June 2021
Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43.5 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels, which describe the overall action in the sequence, and frame labels, which describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, and motion synthesis. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code are made available and supported for academic research purposes at https://babel.is.tue.mpg.de/.
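To make the two-level annotation scheme concrete, the following Python sketch shows one way such labels could be represented for temporal action localization. The class and field names (FrameLabel, SequenceLabels, actions_at) are illustrative assumptions, not BABEL's actual schema or API; see https://babel.is.tue.mpg.de/ for the real data format.

```python
# Minimal sketch of BABEL-style two-level labels; names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameLabel:
    action: str      # e.g. "walk" or "wave"
    start_s: float   # start of the span within the mocap sequence, in seconds
    end_s: float     # end of the span; spans from different labels may overlap

@dataclass
class SequenceLabels:
    seq_id: str                  # identifier of the mocap sequence (hypothetical field)
    sequence_actions: List[str]  # sequence-level labels describing the overall action
    frame_labels: List[FrameLabel] = field(default_factory=list)

    def actions_at(self, t: float) -> List[str]:
        # All actions active at time t; overlapping actions are allowed.
        return [fl.action for fl in self.frame_labels if fl.start_s <= t <= fl.end_s]

# Usage: "walk" and "wave" overlap between seconds 2 and 3.
seq = SequenceLabels(
    seq_id="example_sequence",
    sequence_actions=["walk", "wave"],
    frame_labels=[FrameLabel("walk", 0.0, 4.0), FrameLabel("wave", 2.0, 3.0)],
)
print(seq.actions_at(2.5))  # ['walk', 'wave']
```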

Perceiving Systems Empirical Inference Conference Paper Local Temporal Bilinear Pooling for Fine-grained Action Parsing Zhang, Y., Tang, S., Muandet, K., Jarvers, C., Neumann, H. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 12005-12015, June 2019
Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, and surgical robotics, which require subtle and precise operations over long time periods. In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to other work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimensional representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform extensive experiments to quantitatively analyze our model and show superior performance over other state-of-the-art work on various datasets.
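As a rough sketch of local temporal second-order pooling (not the authors' implementation, which learns the bilinear form and derives exact low-dimensional representations), the following PyTorch module averages outer-product statistics over a sliding temporal window after a learnable 1x1 projection. The module name, window size, and normalization choices are assumptions for illustration.

```python
# Illustrative local temporal bilinear pooling over frame-wise features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalTemporalBilinearPooling(nn.Module):
    """Sketch of local temporal second-order pooling; not the paper's code."""
    def __init__(self, in_channels: int, reduced_dim: int = 16, window: int = 15):
        super().__init__()
        self.window = window
        # Learnable 1x1 projection that reduces the channel dimension before
        # the outer product, keeping the pooled statistics low-dimensional.
        self.proj = nn.Conv1d(in_channels, reduced_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. features from an encoder layer
        z = self.proj(x)                                       # (B, D, T)
        pad = self.window // 2
        z = F.pad(z, (pad, pad))                               # pad the temporal axis
        z = z.unfold(dimension=2, size=self.window, step=1)    # (B, D, T, W)
        # Average outer product over each local temporal window -> (B, D, D, T)
        outer = torch.einsum('bdtw,betw->bdet', z, z) / self.window
        b, d, _, t = outer.shape
        out = outer.reshape(b, d * d, t)                       # flatten the bilinear matrix
        # Signed square root + l2 normalization, common for bilinear features
        out = torch.sign(out) * torch.sqrt(out.abs() + 1e-12)
        return F.normalize(out, dim=1)

# Usage: 64-channel features over 100 frames -> 256-dim bilinear features per frame.
feats = torch.randn(2, 64, 100)
pool = LocalTemporalBilinearPooling(in_channels=64, reduced_dim=16, window=15)
print(pool(feats).shape)  # torch.Size([2, 256, 100])
```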

Perceiving Systems Article Temporal Human Action Segmentation via Dynamic Clustering Zhang, Y., Sun, H., Tang, S., Neumann, H. arXiv preprint arXiv:1803.05790, 2018
We present an effective dynamic clustering algorithm for the task of temporal human action segmentation, which has broad applications in robotics, motion analysis, and patient monitoring. Our proposed algorithm is unsupervised, fast, generic enough to process various types of features, and applicable in both online and offline settings. We perform extensive experiments on processing data streams, and show that our algorithm achieves state-of-the-art results in both online and offline settings.
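As a toy illustration of the general idea of online dynamic clustering for temporal segmentation (not the paper's algorithm), the sketch below opens a new segment whenever an incoming frame's feature drifts too far from the running centroid of the current segment. The function name and the distance threshold are assumptions.

```python
# Toy online temporal segmentation by thresholded centroid drift.
import numpy as np

def online_temporal_segmentation(frames, threshold=1.0):
    """Assign a segment id to each incoming frame feature, one pass, online."""
    labels, centroid, count, seg_id = [], None, 0, 0
    for f in frames:
        f = np.asarray(f, dtype=float)
        if centroid is None:
            centroid, count = f.copy(), 1          # first frame opens segment 0
        elif np.linalg.norm(f - centroid) > threshold:
            seg_id += 1                             # feature jump: open a new segment
            centroid, count = f.copy(), 1
        else:
            count += 1                              # update the running centroid
            centroid += (f - centroid) / count
        labels.append(seg_id)
    return labels

# Example: two clearly separated feature regimes yield two segments.
feats = [[0.0, 0.0]] * 5 + [[5.0, 5.0]] * 5
print(online_temporal_segmentation(feats, threshold=2.0))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```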