
Inferring Actions

Top: Temporal Action Localization (TAL). The bilinear pooling algorithm recognizes and localizes all actions in an RGB video (yellow background). The hierarchical clustering algorithm uses both RGB and 3D joint positions for TAL (blue box). Below: The Action Recognition model in BABEL predicts the action in a mocap sequence.

Publications

Perceiving Systems Conference Paper BABEL: Bodies, Action and Behavior with English Labels Punnakkal, A. R., Chandrasekaran, A., Athanasiou, N., Quiros-Ramirez, M. A., Black, M. J. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 722-731, IEEE, Piscataway, NJ, June 2021
Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43.5 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels, which describe the overall action in the sequence, and frame labels, which describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, and motion synthesis. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code are made available and supported for academic research purposes at https://babel.is.tue.mpg.de/.
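To make the two-level annotation scheme concrete, the following Python sketch shows one way such labels could be represented for temporal action localization. The class and field names (FrameLabel, SequenceLabels, actions_at) are illustrative assumptions, not BABEL's actual schema or API; see https://babel.is.tue.mpg.de/ for the real data format.

```python
# Minimal sketch of BABEL-style two-level labels; names are illustrative only.
from dataclasses import dataclass, field
from typing import List

@dataclass
class FrameLabel:
    action: str      # e.g. "walk" or "wave"
    start_s: float   # start of the span within the mocap sequence, in seconds
    end_s: float     # end of the span; spans from different labels may overlap

@dataclass
class SequenceLabels:
    seq_id: str                  # identifier of the mocap sequence (hypothetical field)
    sequence_actions: List[str]  # sequence-level labels describing the overall action
    frame_labels: List[FrameLabel] = field(default_factory=list)

    def actions_at(self, t: float) -> List[str]:
        # All actions active at time t; overlapping actions are allowed.
        return [fl.action for fl in self.frame_labels if fl.start_s <= t <= fl.end_s]

# Usage: "walk" and "wave" overlap between seconds 2 and 3.
seq = SequenceLabels(
    seq_id="example_sequence",
    sequence_actions=["walk", "wave"],
    frame_labels=[FrameLabel("walk", 0.0, 4.0), FrameLabel("wave", 2.0, 3.0)],
)
print(seq.actions_at(2.5))  # ['walk', 'wave']
```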

Perceiving Systems Empirical Inference Conference Paper Local Temporal Bilinear Pooling for Fine-grained Action Parsing Zhang, Y., Tang, S., Muandet, K., Jarvers, C., Neumann, H. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 12005-12015, June 2019
Fine-grained temporal action parsing is important in many applications, such as daily activity understanding, human motion analysis, and surgical robotics, which require subtle and precise operations over long time periods. In this paper we propose a novel bilinear pooling operation, which is used in intermediate layers of a temporal convolutional encoder-decoder net. In contrast to other work, our proposed bilinear pooling is learnable and hence can capture more complex local statistics than the conventional counterpart. In addition, we introduce exact lower-dimensional representations of our bilinear forms, so that the dimensionality is reduced with neither information loss nor extra computation. We perform extensive experiments to quantitatively analyze our model and show superior performance over other state-of-the-art work on various datasets.
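As a rough sketch of local temporal second-order pooling (not the authors' implementation, which learns the bilinear form and derives exact low-dimensional representations), the following PyTorch module averages outer-product statistics over a sliding temporal window after a learnable 1x1 projection. The module name, window size, and normalization choices are assumptions for illustration.

```python
# Illustrative local temporal bilinear pooling over frame-wise features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalTemporalBilinearPooling(nn.Module):
    """Sketch of local temporal second-order pooling; not the paper's code."""
    def __init__(self, in_channels: int, reduced_dim: int = 16, window: int = 15):
        super().__init__()
        self.window = window
        # Learnable 1x1 projection that reduces the channel dimension before
        # the outer product, keeping the pooled statistics low-dimensional.
        self.proj = nn.Conv1d(in_channels, reduced_dim, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time), e.g. features from an encoder layer
        z = self.proj(x)                                       # (B, D, T)
        pad = self.window // 2
        z = F.pad(z, (pad, pad))                               # pad the temporal axis
        z = z.unfold(dimension=2, size=self.window, step=1)    # (B, D, T, W)
        # Average outer product over each local temporal window -> (B, D, D, T)
        outer = torch.einsum('bdtw,betw->bdet', z, z) / self.window
        b, d, _, t = outer.shape
        out = outer.reshape(b, d * d, t)                       # flatten the bilinear matrix
        # Signed square root + l2 normalization, common for bilinear features
        out = torch.sign(out) * torch.sqrt(out.abs() + 1e-12)
        return F.normalize(out, dim=1)

# Usage: 64-channel features over 100 frames -> 256-dim bilinear features per frame.
feats = torch.randn(2, 64, 100)
pool = LocalTemporalBilinearPooling(in_channels=64, reduced_dim=16, window=15)
print(pool(feats).shape)  # torch.Size([2, 256, 100])
```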

Perceiving Systems Article Temporal Human Action Segmentation via Dynamic Clustering Zhang, Y., Sun, H., Tang, S., Neumann, H. arXiv preprint arXiv:1803.05790, 2018
We present an effective dynamic clustering algorithm for the task of temporal human action segmentation, which has broad applications in robotics, motion analysis, and patient monitoring. Our proposed algorithm is unsupervised, fast, generic enough to process various types of features, and applicable in both online and offline settings. We perform extensive experiments on processing data streams, and show that our algorithm achieves state-of-the-art results in both online and offline settings.
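As a toy illustration of the general idea of online dynamic clustering for temporal segmentation (not the paper's algorithm), the sketch below opens a new segment whenever an incoming frame's feature drifts too far from the running centroid of the current segment. The function name and the distance threshold are assumptions.

```python
# Toy online temporal segmentation by thresholded centroid drift.
import numpy as np

def online_temporal_segmentation(frames, threshold=1.0):
    """Assign a segment id to each incoming frame feature, one pass, online."""
    labels, centroid, count, seg_id = [], None, 0, 0
    for f in frames:
        f = np.asarray(f, dtype=float)
        if centroid is None:
            centroid, count = f.copy(), 1          # first frame opens segment 0
        elif np.linalg.norm(f - centroid) > threshold:
            seg_id += 1                             # feature jump: open a new segment
            centroid, count = f.copy(), 1
        else:
            count += 1                              # update the running centroid
            centroid += (f - centroid) / count
        labels.append(seg_id)
    return labels

# Example: two clearly separated feature regimes yield two segments.
feats = [[0.0, 0.0]] * 5 + [[5.0, 5.0]] * 5
print(online_temporal_segmentation(feats, threshold=2.0))  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```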