Publications

Perceiving Systems Conference Paper Adversarial Likelihood Estimation With One-Way Flows Ben-Dov, O., Gupta, P. S., Abrevaya, V., Black, M. J., Ghosh, P. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3779-3788, January 2024 (Published)

Abstract ›

Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) Wasserstein GAN performs a biased estimate of the partition function, and we propose instead to use an unbiased estimator; and 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require a tractable inverse function. Our experimental results show that our method converges faster, produces comparable sample quality to GANs with similar architecture, successfully avoids over-fitting to commonly used datasets and produces smooth low-dimensional latent representations of the training data.

pdf arXiv BibTeX

Perceiving Systems Conference Paper Human Hair Reconstruction with Strand-Aligned 3D Gaussians Zakharov, E., Sklyarova, V., Black, M. J., Nam, G., Thies, J., Hilliges, O. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, January 2024 (Published)

Abstract ›

We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. In contrast to recent approaches that leverage unstructured Gaussians to model human avatars, our method reconstructs the hair using 3D polylines, or strands. This fundamental difference allows the use of the resulting hairstyles out-of-the-box in modern computer graphics engines for editing, rendering, and simulation. Our 3D lifting method relies on unstructured Gaussians to generate multi-view ground truth data to supervise the fitting of hair strands. The hairstyle itself is represented in the form of the so-called strand-aligned 3D Gaussians. This representation allows us to combine strand-based hair priors, which are essential for realistic modeling of the inner structure of hairstyles, with the differentiable rendering capabilities of 3D Gaussian Splatting. Our method, named Gaussian Haircut, is evaluated on synthetic and real scenes and demonstrates state-of-the-art performance in the task of strand-based hair reconstruction.

pdf project code video arXiv DOI URL BibTeX

Perceiving Systems Article HMP: Hand Motion Priors for Pose and Shape Estimation from Video Duran, E., Kocabas, M., Choutas, V., Fan, Z., Black, M. J. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2024 (Published)

Abstract ›

Understanding how humans interact with the world necessitates accurate 3D hand pose estimation, a task complicated by the hand’s high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos have useful cues to address aforementioned issues. However, existing video-based 3D hand datasets are insufficient for training feedforward models to generalize to in-the-wild scenarios. On the other hand, we have access to large human motion capture datasets which also include hand motions, e.g. AMASS. Therefore, we develop a generative motion prior specific for hands, trained on the AMASS dataset which features diverse and high-quality hand motions. This motion prior is then employed for video-based 3D hand motion estimation following a latent optimization approach. Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios. It produces stable, temporally consistent results that surpass conventional single-frame methods. We demonstrate our method’s efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets, with special emphasis on an occlusion-focused subset of HO3D.

webpage pdf code BibTeX

Perceiving Systems Ph.D. Thesis Natural Language Control for 3D Human Motion Synthesis Petrovich, M. LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, 2024 (Published)

Abstract ›

3D human motions are at the core of many applications in the film industry, healthcare, augmented reality, virtual reality and video games. However, these applications often rely on expensive and time-consuming motion capture data. The goal of this thesis is to explore generative models as an alternative route to obtain 3D human motions. More specifically, our aim is to allow a natural language interface as a means to control the generation process. To this end, we develop a series of models that synthesize realistic and diverse motions following the semantic inputs. In our first contribution, described in Chapter 3, we address the challenge of generating human motion sequences conditioned on specific action categories. We introduce ACTOR, a conditional variational autoencoder (VAE) that learns an action-aware latent representation for human motions. We show significant gains over existing methods thanks to our new Transformer-based VAE formulation, encoding and decoding SMPL pose sequences through a single motion-level embedding. In our second contribution, described in Chapter 4, we go beyond categorical actions, and dive into the task of synthesizing diverse 3D human motions from textual descriptions allowing a larger vocabulary and potentially more fine-grained control. Our work stands out from previous research by not deterministically generating a single motion sequence, but by synthesizing multiple, varied sequences from a given text. We propose TEMOS, building on our VAE-based ACTOR architecture, but this time integrating a pretrained text encoder to handle large-vocabulary natural language inputs. In our third contribution, described in Chapter 5, we address the adjacent task of text-to-3D human motion retrieval, where the goal is to search in a motion collection by querying via text. We introduce a simple yet effective approach, named TMR, building on our earlier model TEMOS, by integrating a contrastive loss to enhance the structure of the cross-modal latent space. Our findings emphasize the importance of retaining the motion generation loss in conjunction with contrastive training for improved results. We establish a new evaluation benchmark and conduct analyses on several protocols. In our fourth contribution, described in Chapter 6, we introduce a new problem termed as “multi-track timeline control” for text-driven 3D human motion synthesis. Instead of a single textual prompt, users can organize multiple prompts in temporal intervals that may overlap. We introduce STMC, a test-time denoising method that can be integrated with any pre-trained motion diffusion model. Our evaluations demonstrate that our method generates motions that closely match the semantic and temporal aspects of the input timelines. In summary, our contributions in this thesis are as follows: (i) we develop a generative variational autoencoder, ACTOR, for action-conditioned generation of human motion sequences, (ii) we introduce TEMOS, a text-conditioned generative model that synthesizes diverse human motions from textual descriptions, (iii) we present TMR, a new approach for text-to-3D human motion retrieval, (iv) we propose STMC, a method for timeline control in text-driven motion synthesis, enabling the generation of detailed and complex motions.

pdf YouTube Thesis BibTeX

Perceiving Systems Article InterCap: Joint Markerless 3D Tracking of Humans and Objects in Interaction from Multi-view RGB-D Images Huang, Y., Taheri, O., Black, M. J., Tzionas, D. International Journal of Computer Vision (IJCV), 132(7):2551-2566, 2024 (Published)

Abstract ›

Humans constantly interact with objects to accomplish tasks. To understand such interactions, computers need to reconstruct these in 3D from images of whole bodies manipulating objects, e.g., for grasping, moving and using the latter. This involves key challenges, such as occlusion between the body and objects, motion blur, depth ambiguities, and the low image resolution of hands and graspable object parts. To make the problem tractable, the community has followed a divide-and-conquer approach, focusing either only on interacting hands, ignoring the body, or on interacting bodies, ignoring the hands. However, these are only parts of the problem. On the contrary, recent work focuses on the whole problem. The GRAB dataset addresses whole-body interaction with dexterous hands but captures motion via markers and lacks video, while the BEHAVE dataset captures video of body-object interaction but lacks hand detail. We address the limitations of prior work with InterCap, a novel method that reconstructs interacting whole-bodies and objects from multi-view RGB-D data, using the parametric whole-body SMPL-X model and known object meshes. To tackle the above challenges, InterCap uses two key observations: (i) Contact between the body and object can be used to improve the pose estimation of both. (ii) Consumer-level Azure Kinect cameras let us set up a simple and flexible multi-view RGB-D system for reducing occlusions, with spatially calibrated and temporally synchronized cameras. With our InterCap method we capture the InterCap dataset, which contains 10 subjects (5 males and 5 females) interacting with 10 daily objects of various sizes and affordances, including contact with the hands or feet. To this end, we introduce a new data-driven hand motion prior, as well as explore simple ways for automatic contact detection based on 2D and 3D cues. In total, InterCap has 223 RGB-D videos, resulting in 67,357 multi-view frames, each containing 6 RGB-D images, paired with pseudo ground-truth 3D body and object meshes. Our InterCap method and dataset fill an important gap in the literature and support many research directions. Data and code are available at https://intercap.is.tue.mpg.de.

Paper DOI URL BibTeX

Perceiving Systems Conference Paper 3D Neural Edge Reconstruction Lil, L., Peng, S., Yu, Z., Liu, S., Pautrat, R., Yin, X., Pollefeys, M. In Proceedings 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 21219-21229, 10.1109/CVPR52733.2024.02005, 2024 (Published) DOI URL BibTeX

Perceiving Systems Article Accelerated Video Annotation Driven by Deep Detector and Tracker Price, E., Ahmad, A. INTELLIGENT AUTONOMOUS SYSTEMS 18, 2:141–153, IAS, 2024 (Published)

Abstract ›

Annotating object ground truth in videos is vital for several downstream tasks in robot perception and machine learning, such as for evaluating the performance of an object tracker or training an image-based object detector. The accuracy of the annotated instances of the moving objects on every image frame in a video is crucially important. Achieving that through manual annotations is not only very time consuming and labor intensive, but is also prone to high error rate. State-of-the-art annotation methods depend on manually initializing the object bounding boxes only in the first frame and then use classical tracking methods, e.g., adaboost, or kernelized correlation filters, to keep track of those bounding boxes. These can quickly drift, thereby requiring tedious manual supervision. In this paper, we propose a new annotation method which leverages a combination of a learning-based detector (SSD) and a learning-based tracker (RE). Through this, we significantly reduce annotation drifts, and, consequently, the required manual supervision. We validate our approach through annotation experiments using our proposed annotation method and existing baselines on a set of drone video frames. Source code and detailed information on how to run the annotation program can be found at https://github.com/robot-perception-group/smarter-labelme

project DOI URL BibTeX

Perceiving Systems Article Editor’s Note: Special Issue on Computer Vision Approach for Animal Tracking and Modeling Park, H. S., Rhodin, H., Kanazawa, A., Neverova, N., Nobuhara, S., Black, M., Zuffi, S. Journal of Computer Vision, 132, 2024 (Published) DOI URL BibTeX

Perceiving Systems Conference Paper Emotional Speech-Driven Animation with Content-Emotion Disentanglement Daněček, R., Chhatre, K., Tripathi, S., Wen, Y., Black, M. J., Bolkart, T. In SIGGRAPH Asia 2023 Conference Papers, Association for Computing Machinery , New York, NY, SIGGRAPH Asia, December 2023 (Published)

Abstract ›

To be widely adopted, 3D facial avatars must be animated easily, realistically, and directly from speech signals. While the best recent methods generate 3D animations that are synchronized with the input audio, they largely ignore the impact of emotions on facial expressions. Realistic facial animation requires lip-sync together with the natural expression of emotion. To that end, we propose EMOTE (Expressive Model Optimized for Talking with Emotion), which generates 3D talking-head avatars that maintain lip-sync from speech while enabling explicit control over the expression of emotion. To achieve this, we supervise EMOTE with decoupled losses for speech (i.e., lip-sync) and emotion. These losses are based on two key observations: (1) deformations of the face due to speech are spatially localized around the mouth and have high temporal frequency, whereas (2) facial expressions may deform the whole face and occur over longer intervals. Thus, we train EMOTE with a per-frame lip-reading loss to preserve the speech-dependent content, while supervising emotion at the sequence level. Furthermore, we employ a content-emotion exchange mechanism in order to supervise different emotions on the same audio, while maintaining the lip motion synchronized with the speech. To employ deep perceptual losses without getting undesirable artifacts, we devise a motion prior in the form of a temporal VAE. Due to the absence of high-quality aligned emotional 3D face datasets with speech, EMOTE is trained with 3D pseudo-ground-truth extracted from an emotional video dataset (i.e., MEAD). Extensive qualitative and perceptual evaluations demonstrate that EMOTE produces speech-driven facial animations with better lip-sync than state-of-the-art methods trained on the same data, while offering additional, high-quality emotional control.

arXiv DOI URL BibTeX

Empirical Inference Perceiving Systems Conference Paper Controlling Text-to-Image Diffusion by Orthogonal Finetuning Qiu*, Z., Liu*, W., Feng, H., Xue, Y., Feng, Y., Liu, Z., Zhang, D., Weller, A., Schölkopf, B. In Advances in Neural Information Processing Systems 36 (NeurIPS 2023), 36:79320-79362, (Editors: A. Oh and T. Neumann and A. Globerson and K. Saenko and M. Hardt and S. Levine), Curran Associates, Inc., 37th Annual Conference on Neural Information Processing Systems , December 2023, *equal contribution (Published)

Abstract ›

Large text-to-image diffusion models have impressive capabilities in generating photorealistic images from text prompts. How to effectively guide or control these powerful models to perform different downstream tasks becomes an important open problem. To tackle this challenge, we introduce a principled finetuning method -- Orthogonal Finetuning (OFT), for adapting text-to-image diffusion models to downstream tasks. Unlike existing methods, OFT can provably preserve hyperspherical energy which characterizes the pairwise neuron relationship on the unit hypersphere. We find that this property is crucial for preserving the semantic generation ability of text-to-image diffusion models. To improve finetuning stability, we further propose Constrained Orthogonal Finetuning (COFT) which imposes an additional radius constraint to the hypersphere. Specifically, we consider two important finetuning text-to-image tasks: subject-driven generation where the goal is to generate subject-specific images given a few images of a subject and a text prompt, and controllable generation where the goal is to enable the model to take in additional control signals. We empirically show that our OFT framework outperforms existing methods in generation quality and convergence speed.

Home Code URL BibTeX

Perceiving Systems Article From Skin to Skeleton: Towards Biomechanically Accurate 3D Digital Humans Keller, M., Werling, K., Shin, S., Delp, S., Pujades, S., Liu, C. K., Black, M. J. ACM Transactions on Graphics (TOG), ACM Transactions on Graphics (TOG), 42(6):253:1-253:15, ACM New York, NY, USA, December 2023 (Published)

Abstract ›

Great progress has been made in estimating 3D human pose and shape from images and video by training neural networks to directly regress the parameters of parametric human models like SMPL. However, existing body models have simplified kinematic structures that do not correspond to the true joint locations and articulations in the human skeletal system, limiting their potential use in biomechanics. On the other hand, methods for estimating biomechanically accurate skeletal motion typically rely on complex motion capture systems and expensive optimization methods. What is needed is a parametric 3D human model with a biomechanically accurate skeletal structure that can be easily posed. To that end, we develop SKEL, which re-rigs the SMPL body model with a biomechanics skeleton. To enable this, we need training data of skeletons inside SMPL meshes in diverse poses. We build such a dataset by optimizing biomechanically accurate skeletons inside SMPL meshes from AMASS sequences. We then learn a regressor from SMPL mesh vertices to the optimized joint locations and bone rotations. Finally, we re-parametrize the SMPL mesh with the new kinematic parameters. The resulting SKEL model is animatable like SMPL but with fewer, and biomechanically-realistic, degrees of freedom. We show that SKEL has more biomechanically accurate joint locations than SMPL, and the bones fit inside the body surface better than previous methods. By fitting SKEL to SMPL meshes we are able to “upgrade" existing human pose and shape datasets to include biomechanical parameters. SKEL provides a new tool to enable biomechanics in the wild, while also providing vision and graphics researchers with a better constrained

Project Page Paper DOI URL BibTeX

Perceiving Systems Article FLARE: Fast learning of Animatable and Relightable Mesh Avatars Bharadwaj, S., Zheng, Y., Hilliges, O., Black, M. J., Fernandez Abrevaya, V. ACM Transactions on Graphics (TOG), ACM Transactions on Graphics (TOG), 42(6):204:1-204:15, ACM New York, NY, USA, December 2023 (Published)

Abstract ›

Our goal is to efficiently learn personalized animatable 3D head avatars from videos that are geometrically accurate, realistic, relightable, and compatible with current rendering systems. While 3D meshes enable efficient processing and are highly portable, they lack realism in terms of shape and appearance. Neural representations, on the other hand, are realistic but lack compatibility and are slow to train and render. Our key insight is that it is possible to efficiently learn high-fidelity 3D mesh representations via differentiable rendering by exploiting highly-optimized methods from traditional computer graphics and approximating some of the components with neural networks. To that end, we introduce FLARE, a technique that enables the creation of animatable and relightable mesh avatars from a single monocular video. First, we learn a canonical geometry using a mesh representation, enabling efficient differentiable rasterization and straightforward animation via learned blendshapes and linear blend skinning weights. Second, we follow physically-based rendering and factor observed colors into intrinsic albedo, roughness, and a neural representation of the illumination, allowing the learned avatars to be relit in novel scenes. Since our input videos are captured on a single device with a narrow field of view, modeling the surrounding environment light is non-trivial. Based on the split-sum approximation for modeling specular reflections, we address this by approximating the pre-filtered environment map with a multi-layer perceptron (MLP) modulated by the surface roughness, eliminating the need to explicitly model the light. We demonstrate that our mesh-based avatar formulation, combined with learned deformation, material, and lighting MLPs, produces avatars with high-quality geometry and appearance, while also being efficient to train and render compared to existing approaches.

Paper Project Page Code DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Neural Shape Modeling of 3D Clothed Humans Ma, Q. October 2023 (Published)

Abstract ›

Parametric models for 3D human bodies play a crucial role in the synthesis and analysis of humans in visual computing. While current models effectively capture body pose and shape variations, a significant aspect has been overlooked – clothing. Existing 3D human models mostly produce a minimally-clothed body geometry, limiting their ability to represent the complexity of dressed people in real-world data sources. The challenge lies in the unique characteristics of garments, which make modeling clothed humans particularly difficult. Clothing exhibits diverse topologies, and as the body moves, it introduces wrinkles at various spatial scales. Moreover, pose-dependent clothing deformations are non-rigid and non-linear, exceeding the capabilities of classical body models constructed with fixed-topology surface meshes and linear approximations of pose-aware shape deformations. This thesis addresses these challenges by innovating in two key areas: the 3D shape representation and deformation modeling techniques. We demonstrate that, the seemingly old-fashioned shape representation, point clouds – when equipped with deep learning and neural fields – can be a powerful tool for modeling clothed characters. Specifically, the thesis begins by introducing a large-scale dataset of dynamic 3D humans in various clothing, which serves as a foundation for training the models presented in this work. The first model we present is CAPE: a neural generative model for 3D clothed human meshes. Here, a clothed body is straightforwardly obtained by applying per-vetex offsets to a pre-defined, unclothed body template mesh. Sampling from the CAPE model generates plausibly-looking digital humans wearing common garments, but the fixed-topology mesh representation limits its applicability to more complex garment types. To address this limitation, we present a series of point-based clothed human models: SCALE, PoP and SkiRT. The SCALE model represents a clothed human using a collection of points organized into local patches. The patches can freely move and deform to represent garments of diverse topologies, unlocking the generalization to more challenging outfits such as dresses and jackets. Unlike traditional approaches based on physics simulations, SCALE learns pose-dependent cloth deformations from data with minimal manual intervention. To further improve the geometric quality, the PoP model eliminates the concept of patches and instead learns a continuous neural deformation field from the body surface. Densely querying this field results in a highresolution point cloud of a dressed human, showcasing intricate clothing wrinkles. PoP can generalize across multiple subjects and outfits, and can even bring a single, static scan into animation. Finally, we tackle a long-standing challenge in learning-based digital human modeling: loose garments, in particular skirts and dresses. Building upon PoP, the SkiRT pipeline further learns a shape “template” and neural field of linear-blend-skinning weights for clothed bodies, improving the models’ robustness for loose garments of varied topology. Our point-based human models are “interplicit”: the output point clouds capture surfaces explicitly at discrete points but implicitly in between. The explicit points are fast, topologically flexible, and are compatible with existing graphics tools, while the implicit neural deformation field contributes to high-quality geometry. This thesis primarily demonstrates these advantages in the context of clothed human shape modeling; future work can apply our representation and techniques to general 3D deformable shapes and neural rendering.

download Thesis DOI BibTeX

Perceiving Systems Conference Paper Optimizing the 3D Plate Shape for Proximal Humerus Fractures Keller, M., Krall, M., Smith, J., Clement, H., Kerner, A. M., Gradischar, A., Schäfer, Ü., Black, M. J., Weinberg, A., Pujades, S. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 487-496, Springer, Cham, MICCAI, October 2023 (Published)

Abstract ›

To treat bone fractures, implant manufacturers produce 2D anatomically contoured plates. Unfortunately, existing plates only fit a limited segment of the population and/or require manual bending during surgery. Patient-specific implants would provide major benefits such as reducing surgery time and improving treatment outcomes but they are still rare in clinical practice. In this work, we propose a patient-specific design for the long helical 2D PHILOS (Proximal Humeral Internal Locking System) plate, used to treat humerus shaft fractures. Our method automatically creates a custom plate from a CT scan of a patient's bone. We start by designing an optimal plate on a template bone and, with an anatomy-aware registration method, we transfer this optimal design to any bone. In addition, for an arbitrary bone, our method assesses if a given plate is fit for surgery by automatically positioning it on the bone. We use this process to generate a compact set of plate shapes capable of fitting the bones within a given population. This plate set can be pre-printed in advance and readily available, removing the fabrication time between the fracture occurrence and the surgery. Extensive experiments on ex-vivo arms and 3D-printed bones show that the generated plate shapes (personalized and plate-set) faithfully match the individual bone anatomy and are suitable for clinical practice.

Project page Code Paper Poster DOI URL BibTeX

Perceiving Systems Conference Paper Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance Feng, H., Kulits, P., Liu, S., Black, M. J., Fernandez Abrevaya, V. In Proc. International Conference on Computer Vision (ICCV), International Conference on Computer Vision, October 2023 (Published)

Abstract ›

We address the problem of fitting a parametric human body model (SMPL) to point cloud data. Optimization based methods require careful initialization and are prone to becoming trapped in local optima. Learning-based methods address this but do not generalize well when the input pose is far from those seen during training. For rigid point clouds, remarkable generalization has been achieved by leveraging SE(3)-equivariant networks, but these methods do not work on articulated objects. In this work we extend this idea to human bodies and propose ArtEq, a novel part-based SE(3)-equivariant neural architecture for SMPL model estimation from point clouds. Specifically, we learn a part detection network by leveraging local SO(3) invariance, and regress shape and pose using articulated SE(3) shape-invariant and pose-equivariant networks, all trained end-to-end. Our novel pose regression module leverages the permutation-equivariant property of self-attention layers to preserve rotational equivariance. Experimental results show that ArtEq generalizes to poses not seen during training, outperforming state-of-the-art methods by ~44%in terms of body reconstruction accuracy, without requiring an optimization refinement step. Furthermore, ArtEq is three orders of magnitude faster during inference than prior work and has 97.3% fewer parameters. The code and model are available for research purposes at https://arteq.is.tue.mpg.de.

arxiv project URL BibTeX

Empirical Inference Perceiving Systems Conference Paper One-shot Implicit Animatable Avatars with Model-based Priors Huang, Y., Yi, H., Liu, W., Wang, H., Wu, B., Wang, W., Lin, B., Zhang, D., Cai, D. In Proc. International Conference on Computer Vision (ICCV), 8940-8951, International Conference on Computer Vision, October 2023, *equal contribution (Published)

Abstract ›

Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can easily reconstruct the body geometry and infer the full-body clothing from a single image, we leverage two priors in ELICIT: 3D geometry prior and visual semantic prior. Specifically, ELICIT introduces the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with the CLIP-based pre-trained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. In order to further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar.Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT has outperformed current state-of-the-art avatar creation methods when only a single image is available. Code will be public for reseach purpose at https://github.com/huangyangyi/ELICIT

arXiv code project DOI BibTeX

Perceiving Systems Empirical Inference Conference Paper Pairwise Similarity Learning is SimPLE Wen, Y., Liu, W., Feng, Y., Raj, B., Singh, R., Weller, A., Black, M. J., Schölkopf, B. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), International Conference on Computer Vision, October 2023 (Published)

Abstract ›

In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples with the same label) than to negative pairs (i.e., a pair of samples with different label). We start by identifying a key desideratum for PSL, and then discuss how existing methods can achieve this desideratum. We then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin and yet is able to generalize well in open-set recognition. We apply the proposed method to three challenging PSL tasks: open-set face recognition, image retrieval and speaker verification. Comprehensive experimental results on large-scale benchmarks show that our method performs significantly better than current state-of-the-art methods.

URL BibTeX

Perceiving Systems Conference Paper AG3D: Learning to Generate 3D Avatars from 2D Image Collections Dong, Z., Chen, X., Yang, J., Black, M. J., Hilliges, O., Geiger, A. In Proc. International Conference on Computer Vision (ICCV), 14916-14927, International Conference on Computer Vision (ICCV), October 2023 (Published)

Abstract ›

While progress in 2D generative models of human appearance has been rapid, many applications require 3D avatars that can be animated and rendered. Unfortunately, most existing methods for learning generative models of 3D humans with diverse shape and appearance require 3D training data, which is limited and expensive to acquire. The key to progress is hence to learn generative models of 3D avatars from abundant unstructured 2D image collections. However, learning realistic and complete 3D appearance and geometry in this under-constrained setting remains challenging, especially in the presence of loose clothing such as dresses. In this paper, we propose a new adversarial generative model of realistic 3D people from 2D images. Our method captures shape and deformation of the body and loose clothing by adopting a holistic 3D generator and integrating an efficient and flexible articulation module. To improve realism, we train our model using multiple discriminators while also integrating geometric cues in the form of predicted 2D normal maps. We experimentally find that our method outperforms previous 3D- and articulation-aware methods in terms of geometry and appearance. We validate the effectiveness of our model and the importance of each component via systematic ablation studies.

project pdf code video DOI URL BibTeX

Perceiving Systems Conference Paper D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field Yang, X., Luo, Y., Xiu, Y., Wang, W., Xu, H., Fan, Z. In Proc. International Conference on Computer Vision (ICCV), 9122-9132, International Conference on Computer Vision, October 2023 (Published)

Abstract ›

Realistic virtual humans play a crucial role in numerous industries, such as metaverse, intelligent healthcare, and self-driving simulation. But creating them on a large scale with high levels of realism remains a challenge. The utilization of deep implicit function sparks a new era of image-based 3D clothed human reconstruction, enabling pixel-aligned shape recovery with fine details. Subsequently, the vast majority of works locate the surface by regressing the deterministic implicit value for each point. However, should all points be treated equally regardless of their proximity to the surface? In this paper, we propose replacing the implicit value with an adaptive uncertainty distribution, to differentiate between points based on their distance to the surface. This simple "value to distribution" transition yields significant improvements on nearly all the baselines. Furthermore, qualitative results demonstrate that the models trained using our uncertainty distribution loss, can capture more intricate wrinkles, and realistic limbs.

Code Homepage URL BibTeX

Perceiving Systems Software Workshop Conference Paper DECO: Dense Estimation of 3D Human-Scene Contact in the Wild Tripathi, S., Chatterjee, A., Passy, J., Yi, H., Tzionas, D., Black, M. J. In Proc. International Conference on Computer Vision (ICCV), 8001-8013, International Conference on Computer Vision, October 2023 (Published)

Abstract ›

Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de/login.php.

Project Video Poster Code Data DOI URL BibTeX

Perceiving Systems Conference Paper SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation Athanasiou, N., Petrovich, M., Black, M. J., Varol, G. In Proc. International Conference on Computer Vision (ICCV), 9984-9995, International Conference on Computer Vision, October 2023 (Published)

Abstract ›

Our goal is to synthesize 3D human motions given textual inputs describing multiple simultaneous actions, for example ‘waving hand’ while ‘walking’ at the same time. We refer to generating such simultaneous movements as performing ‘spatial compositions’. In contrast to ‘temporal compositions’ that seek to transition from one action to another in a sequence, spatial compositing requires understanding which body parts are involved with which action. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as “what parts of the body are moving when someone is doing the action <action name>?”. Given this action-part mapping, we automatically create new training data by artificially combining body parts from multiple text-motion pairs together. We extend previous work on text-to-motions synthesis to train on spatial compositions, and introduce SINC (“SImultaneous actioN Compositions for 3D human motions”). We experimentally validate that our additional GPT-guided data helps to better learn compositionality compared to training only on existing real data of simultaneous actions, which is limited in quantity.

website code paper-arxiv video BibTeX

Perceiving Systems Conference Paper TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis Petrovich, M., Black, M. J., Varol, G. In Proc. International Conference on Computer Vision (ICCV), 9488-9497, International Conference on Computer Vision, October 2023 (Published)

Abstract ›

In this paper, we present TMR, a simple yet effective approach for text to 3D human motion retrieval. While previous work has only treated retrieval as a proxy evaluation metric, we tackle it as a standalone task. Our method extends the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a contrastive loss to better structure the cross-modal latent space. We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance. We introduce a benchmark for evaluation and provide an in-depth analysis by reporting results on several protocols. Our extensive experiments on the KIT-ML and HumanML3D datasets show that TMR outperforms the prior work by a significant margin, for example reducing the median rank from 54 to 19. Finally, we showcase the potential of our approach on moment retrieval. Our code and models are publicly available.

website code paper-arxiv video URL BibTeX

Perceiving Systems Conference Paper Synthetic Data-Based Detection of Zebras in Drone Imagery Bonetto, E., Ahmad, A. 2023 European Conference on Mobile Robots (ECMR), 1-8, IEEE, ECMR, September 2023 (Published)

Abstract ›

Nowadays, there is a wide availability of datasets that enable the training of common object detectors or human detectors. These come in the form of labelled real-world images and require either a significant amount of human effort, with a high probability of errors such as missing labels, or very constrained scenarios, e.g. VICON systems. On the other hand, uncommon scenarios, like aerial views, animals, like wild zebras, or difficult-to-obtain information, such as human shapes, are hardly available. To overcome this, synthetic data generation with realistic rendering technologies has recently gained traction and advanced research areas such as target tracking and human pose estimation. However, subjects such as wild animals are still usually not well represented in such datasets. In this work, we first show that a pre-trained YOLO detector can not identify zebras in real images recorded from aerial viewpoints. To solve this, we present an approach for training an animal detector using only synthetic data. We start by generating a novel synthetic zebra dataset using GRADE, a state-of-the-art framework for data generation. The dataset includes RGB, depth, skeletal joint locations, pose, shape and instance segmentations for each subject. We use this to train a YOLO detector from scratch. Through extensive evaluations of our model with real-world data from i) limited datasets available on the internet and ii) a new one collected and manually labelled by us, we show that we can detect zebras by using only synthetic data during training. The code, results, trained models, and both the generated and training data are provided as open-source at https://eliabntt.github.io/grade-rr.

Generation code pdf DOI URL BibTeX

Haptic Intelligence Perceiving Systems Article Learning to Estimate Palpation Forces in Robotic Surgery From Visual-Inertial Data Lee, Y., Mat Husin, H., Forte, M., Lee, S., Kuchenbecker, K. J. IEEE Transactions on Medical Robotics and Bionics, 5(3):496-506, August 2023, Young-Eun Lee and Haliza Mat Husin contributed equally to this publication (Published)

Abstract ›

Surgeons cannot directly touch the patient's tissue in robot-assisted minimally invasive procedures. Instead, they must palpate using instruments inserted into the body through trocars. This way of operating largely prevents surgeons from using haptic cues to localize visually undetectable structures such as tumors and blood vessels, motivating research on direct and indirect force sensing. We propose an indirect force-sensing method that combines monocular images of the operating field with measurements from IMUs attached externally to the instrument shafts. Our method is thus suitable for various robotic surgery systems as well as laparoscopic surgery. We collected a new dataset using a da Vinci Si robot, a force sensor, and four different phantom tissue samples. The dataset includes 230 one-minute-long recordings of repeated bimanual palpation tasks performed by four lay operators. We evaluated several network architectures and investigated the role of the network inputs. Using the DenseNet vision model and including inertial data best-predicted palpation forces (lowest average root-mean-square error and highest average coefficient of determination). Ablation studies revealed that video frames carry significantly more information than inertial signals. Finally, we demonstrated the model's ability to generalize to unseen tissue and predict shear contact forces.

DOI BibTeX

Perceiving Systems Conference Paper Synthesizing Physical Character-scene Interactions Hassan, M., Guo, Y., Wang, T., Black, M. J., Fidler, S., Bin Peng, X. In ACM SIGGRAPH 2023 Conference Proceedings, SIGGRAPH, August 2023 (Published)

Abstract ›

Movement is how people interact with and affect their environment. For realistic virtual character animation, it is necessary to realistically synthesize such interactions between virtual characters and their surroundings. Despite recent progress in character animation using machine learning, most systems focus on controlling an agent's movements in fairly simple and homogeneous environments, with limited interactions with other objects. Furthermore, many previous approaches that synthesize human-scene interaction require significant manual labeling of the training data. In contrast, we present a system that uses adversarial imitation learning and reinforcement learning to train physically-simulated characters that perform scene interaction tasks in a natural and life-like manner. Our method is able to learn natural scene interaction behaviors from large unstructured motion datasets, without manual annotation of the motion data. These scene interactions are learned using an adversarial discriminator that evaluates the realism of a motion within the context of a scene. The key novelty involves conditioning both the discriminator and the policy networks on scene context. We demonstrate the effectiveness of our approach through three challenging scene interaction tasks: carrying, sitting, and lying down, which require coordination of a character's movements in relation to objects in the environment. Our policies learn to seamlessly transition between different behaviors like idling, walking, and sitting. Using an efficient approach to randomize the training objects and their placements during training enables our method to generalize beyond the objects and scenarios in the training dataset, producing natural character-scene interactions despite wide variation in object shape and placement. The approach takes physics-based character motion generation a step closer to broad applicability.

video arXiv ACM paper pdf BibTeX

Perceiving Systems Article BARC: Breed-Augmented Regression Using Classification for 3D Dog Reconstruction from Images Rueegg, N., Zuffi, S., Schindler, K., Black, M. J. Int. J. of Comp. Vis. (IJCV), 131(8):1964–1979, August 2023 (Published)

Abstract ›

The goal of this work is to reconstruct 3D dogs from monocular images. We take a model-based approach, where we estimate the shape and pose parameters of a 3D articulated shape model for dogs. We consider dogs as they constitute a challenging problem, given they are highly articulated and come in a variety of shapes and appearances. Recent work has considered a similar task using the multi-animal SMAL model, with additional limb scale parameters, obtaining reconstructions that are limited in terms of realism. Like previous work, we observe that the original SMAL model is not expressive enough to represent dogs of many different breeds. Moreover, we make the hypothesis that the supervision signal used to train the network, that is 2D keypoints and silhouettes, is not sufficient to learn a regressor that can distinguish between the large variety of dog breeds. We therefore go beyond previous work in two important ways. First, we modify the SMAL shape space to be more appropriate for representing dog shape. Second, we formulate novel losses that exploit information about dog breeds. In particular, we exploit the fact that dogs of the same breed have similar body shapes. We formulate a novel breed similarity loss, consisting of two parts: One term is a triplet loss, that encourages the shape of dogs from the same breed to be more similar than dogs of different breeds. The second one is a breed classification loss. With our approach we obtain 3D dogs that, compared to previous work, are quantitatively better in terms of 2D reconstruction, and significantly better according to subjective and quantitative 3D evaluations. Our work shows that a-priori side information about similarity of shape and appearance, as provided by breed labels, can help to compensate for the lack of 3D training data. This concept may be applicable to other animal species or groups of species. We call our method BARC (Breed-Augmented Regression using Classification). Our code is publicly available for research purposes at https://barc.is.tue.mpg.de/.

On-line DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Learning Clothed 3D Human Models with Articulated Neural Implicit Representations Chen, X. July 2023 (Published)

Abstract ›

3D digital humans are important for a range of applications including movie and game production, virtual and augmented reality, and human-computer interaction. However, existing industrial solutions for creating 3D digital humans rely on expensive scanning devices and intensive manual labor, preventing their broader application. To address these challenges, the research community focuses on learning 3D parametric human models from data, aiming to automatically generate realistic digital humans based on input parameters that specify pose and shape attributes. Although recent advancements have enabled the generation of faithful 3D human bodies, modeling realistic humans that include additional features such as clothing, hair, and accessories remains an open research challenge. The goal of this thesis is to develop 3D parametric human models that can generate realistic digital humans including not only human bodies but also additional features, in particular clothing. The central challenge lies in the fundamental problem of how to represent non-rigid, articulated, and topology-varying shapes. Explicit geometric representations like polygon meshes lack the flexibility needed to model varying topology between clothing and human bodies, and across different clothing styles. On the other hand, implicit representations, such as signed distance functions, are topologically flexible but do not have a robust articulation algorithm yet. To tackle this problem, we first introduce a principled algorithm that models articulation for implicit representations, in particular the recently emerging neural implicit representations which have shown impressive modeling fidelity. Our algorithm, SNARF, generalizes linear blend skinning for polygon meshes to implicit representations and can faithfully articulate implicit shapes to any pose. SNARF is fully differentiable, which enables learning skinning weights and shapes jointly from posed observations. By leveraging this algorithm, we can learn single-subject clothed human models with realistic shapes and natural deformations from 3D scans. We further improve SNARF’s efficiency with several implementation and algorithmic optimizations, including using a more compact representation of the skinning weights, factoring out redundant computations, and custom CUDA kernel implementations. Collectively, these adaptations result in a speedup of 150 times while preserving accuracy, thereby enabling the efficient learning of 3D animatable humans. Next, we go beyond single-subject modeling and tackle the more challenging task of generative modeling clothed 3D humans. By integrating our articulation module with deep generative models, we have developed a generative model capable of creating novel 3D humans with various clothing styles and identities, as well as geometric details such as wrinkles. Lastly, to eliminate the reliance on expensive 3D scans and to facilitate texture learning, we introduce a system that integrates our differentiable articulation module with differentiable volume rendering in an end-to-end manner, enabling the reconstruction of animatable 3D humans directly from 2D monocular videos. The contributions of this thesis significantly advance the realistic generation and reconstruction of clothed 3D humans and provide new tools for modeling non-rigid, articulated, and topology-varying shapes. We hope that this work will contribute to the development of 3D human modeling and pave the way for new applications in the future.

download DOI BibTeX

Perceiving Systems Conference Paper Detecting Human-Object Contact in Images Chen, Y., Kumar Dwivedi, S., Black, M. J., Tzionas, D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 17100-17110, CVPR, June 2023 (Published)

Abstract ›

Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts for images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons for the 2D image areas where contact takes place. We also annotate the involved body part of the human body. We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability.

Project Page Paper Code DOI URL BibTeX

Perceiving Systems Conference Paper Generating Holistic 3D Human Motion from Speech Yi, H., Liang, H., Liu, Y., Cao, Q., Wen, Y., Bolkart, T., Tao, D., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 469-480, CVPR, June 2023 (Published)

Abstract ›

This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross-conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. Our novel dataset and code are released for research purposes at https://talkshow.is.tue.mpg.de.

project SHOW code TalkSHOW code arXiv paper BibTeX

Perceiving Systems Conference Paper High-Fidelity Clothed Avatar Reconstruction from a Single Image Liao, T., Zhang, X., Xiu, Y., Yi, H., Liu, X., Qi, G., Zhang, Y., Wang, X., Zhu, X., Lei, Z. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8662-8672, CVPR, June 2023 (Published)

Abstract ›

This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the advantages of the high accuracy of optimization-based methods and the efficiency of learning-based methods, we propose a coarse-to-fine way to realize a high-fidelity clothed avatar reconstruction (CAR) from a single image. At the first stage, we use an implicit model to learn the general shape in the canonical space of a person in a learning-based way, and at the second stage, we refine the surface detail by estimating the non-rigid deformation in the posed space in an optimization way. A hyper-network is utilized to generate a good initialization so that the convergence of the optimization process is greatly accelerated. Extensive experiments on various datasets show that the proposed CAR successfully produces high-fidelity avatars for arbitrarily clothed humans in real scenes.

Code Paper Homepage Youtube URL BibTeX

Perceiving Systems Conference Paper Instant Multi-View Head Capture through Learnable Registration Bolkart, T., Li, T., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 768-779, CVPR, June 2023 (Published)

Abstract ›

Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow, and commonly address the problem in two separate steps; multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans’ surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de.

project video paper sup. mat. poster BibTeX

Neural Capture and Synthesis Perceiving Systems Conference Paper Instant Volumetric Head Avatars Zielonka, W., Bolkart, T., Thies, J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 4574-4584, CVPR, June 2023 (Published)

Abstract ›

We present Instant Volumetric Head Avatars (INSTA),a novel approach for reconstructing photo-realistic digital avatars instantaneously. INSTA models a dynamic neural radiance field based on neural graphics primitives embedded around a parametric face model. Our pipeline is trained on a single monocular RGB portrait video that observes the subject under different expressions and views. While state-of-the-art methods take up to several days to train an avatar, our method can reconstruct a digital avatar in less than 10 minutes on modern GPU hardware, which is orders of magnitude faster than previous solutions. In addition, it allows for the interactive rendering of novel poses and expressions. By leveraging the geometry prior of the underlying parametric face model, we demonstrate that INSTA extrapolates to unseen poses. In quantitative and qualitative studies on various subjects, INSTA outperforms state-of-the-art methods regarding rendering quality and training time.

pdf project video code face tracker code dataset DOI URL BibTeX

Perceiving Systems Conference Paper Learning from Synthetic Data Generated with GRADE Bonetto, E., Xu, C., Ahmad, A. In ICRA 2023 Pretraining for Robotics (PT4R) Workshop , ICRA 2023 Pretraining for Robotics (PT4R) Workshop, June 2023 (Published)

Abstract ›

Recently, synthetic data generation and realistic rendering has advanced tasks like target tracking and human pose estimation. Simulations for most robotics applications are obtained in (semi)static environments, with specific sensors and low visual fidelity. To solve this, we present a fully customizable framework for generating realistic animated dynamic environments (GRADE) for robotics research, first introduced in~\cite{GRADE}. GRADE supports full simulation control, ROS integration, realistic physics, while being in an engine that produces high visual fidelity images and ground truth data. We use GRADE to generate a dataset focused on indoor dynamic scenes with people and flying objects. Using this, we evaluate the performance of YOLO and Mask R-CNN on the tasks of segmenting and detecting people. Our results provide evidence that using data generated with GRADE can improve the model performance when used for a pre-training step. We also show that, even training using only synthetic data, can generalize well to real-world images in the same application domain such as the ones from the TUM-RGBD dataset. The code, results, trained models, and the generated data are provided as open-source at https://eliabntt.github.io/grade-rr.

Code Data and network models pdf URL BibTeX

Haptic Intelligence Perceiving Systems Conference Paper Reconstructing Signing Avatars from Video Using Linguistic Priors Forte, M., Kulits, P., Huang, C., Choutas, V., Tzionas, D., Kuchenbecker, K. J., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 12791-12801, Vancouver, Canada, CVPR, June 2023 (Published)

Abstract ›

Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at sgnify.is.tue.mpg.de.

pdf arXiv project code DOI URL BibTeX

Perceiving Systems Conference Paper Simulation of Dynamic Environments for SLAM Bonetto, E., Xu, C., Ahmad, A. In ICRA 2023 Workshop on the Active Methods in Autonomous Navigation, ICRA, June 2023 (Published)

Abstract ›

Simulation engines are widely adopted in robotics. However, they lack either full simulation control, ROS integration, realistic physics, or photorealism. Recently, synthetic data generation and realistic rendering has advanced tasks like target tracking and human pose estimation. However, when focusing on vision applications, there is usually a lack of information like sensor measurements or time continuity. On the other hand, simulations for most robotics tasks are performed in (semi)static environments, with specific sensors and low visual fidelity. To solve this, we introduced in our previous work a fully customizable framework for generating realistic animated dynamic environments (GRADE) [1]. We use GRADE to generate an indoor dynamic environment dataset and then compare multiple SLAM algorithms on different sequences. By doing that, we show how current research over-relies on known benchmarks, failing to generalize. Our tests with refined YOLO and Mask R-CNN models provide further evidence that additional research in dynamic SLAM is necessary. The code, results, and generated data are provided as open-source at https://eliabntt.github.io/grade-rr .

Code Evaluation code Data pdf URL BibTeX

Perceiving Systems Article Virtual Reality Exposure to a Healthy Weight Body Is a Promising Adjunct Treatment for Anorexia Nervosa Behrens, S. C., Tesch, J., Sun, P. J., Starke, S., Black, M. J., Schneider, H., Pruccoli, J., Zipfel, S., Giel, K. E. Psychotherapy Psychosomatics, 92(3):170-179, June 2023 (Published)

Abstract ›

ntroduction/Objective: Treatment results of anorexia nervosa (AN) are modest, with fear of weight gain being a strong predictor of treatment outcome and relapse. Here, we present a virtual reality (VR) setup for exposure to healthy weight and evaluate its potential as an adjunct treatment for AN. Methods: In two studies, we investigate VR experience and clinical effects of VR exposure to higher weight in 20 women with high weight concern or shape concern and in 20 women with AN. Results: In study 1, 90\% of participants (18/20) reported symptoms of high arousal but verbalized low to medium levels of fear. Study 2 demonstrated that VR exposure to healthy weight induced high arousal in patients with AN and yielded a trend that four sessions of exposure improved fear of weight gain. Explorative analyses revealed three clusters of individual reactions to exposure, which need further exploration. Conclusions: VR exposure is a well-accepted and powerful tool for evoking fear of weight gain in patients with AN. We observed a statistical trend that repeated virtual exposure to healthy weight improved fear of weight gain with large effect sizes. Further studies are needed to determine the mechanisms and differential effects.

DOI URL BibTeX

Perceiving Systems Conference Paper 3D Human Pose Estimation via Intuitive Physics Tripathi, S., Müller, L., Huang, C. P., Taheri, O., Black, M. J., Tzionas, D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR) , 4713-4725, CVPR, June 2023 (Published)

Abstract ›

The estimation of 3D human body shape and pose from images has advanced rapidly. While the results are often well aligned with image features in the camera view, the 3D pose is often physically implausible; bodies lean, float, or penetrate the floor. This is because most methods ignore the fact that bodies are typically supported by the scene. To address this, some methods exploit physics engines to enforce physical plausibility. Such methods, however, are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. To account for this, we take a different approach that exploits novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Specifically, we infer biomechanically relevant features such as the pressure heatmap of the body on the floor, the Center of Pressure (CoP) from the heatmap, and the SMPL body’s Center of Mass (CoM) projected on the floor. With these, we develop IPMAN, to estimate a 3D body from a color image in a “stable” configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, and can be integrated into any SMPL-based optimization or regression method; we show examples of both. To evaluate our method, we present MoYo, a dataset with synchronized multi-view color images and 3D bodies with complex poses, body-floor contact, and ground-truth CoM and pressure. Evaluation on MoYo, RICH and Human3.6M show that our IP terms produce more plausible results than the state of the art; they improve accuracy for static poses, while not hurting dynamic ones.

Project Page Moyo Dataset DOI URL BibTeX

Perceiving Systems Conference Paper ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation Fan, Z., Taheri, O., Tzionas, D., Kocabas, M., Kaufmann, M., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 12943-12954, CVPR, June 2023 (Published)

Abstract ›

Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines ArcticNet and InterField, respectively and evaluate them qualitatively and quantitatively on ARCTIC.

Project Page Code Paper arXiv Video DOI URL BibTeX

Perceiving Systems Conference Paper BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion Black, M. J., Patel, P., Tesch, J., Yang, J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8726-8737, CVPR, June 2023 (Published)

Abstract ›

We show, for the first time, that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape (HPS) estimation from real images. Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing. Achieving sufficient realism is non-trivial and we show how to do this for full bodies in motion. Specifically, our BEDLAM dataset contains monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It includes a diversity of body shapes, motions, skin tones, hair, and clothing. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation. We render varying numbers of people in realistic scenes with varied lighting and camera motions. We then train various HPS regressors using BEDLAM and achieve state-of-the-art accuracy on real-image benchmarks despite training with synthetic data. We use BEDLAM to gain insights into what model design choices are important for accuracy. With good synthetic training data, we find that a basic method like HMR approaches the accuracy of the current SOTA method (CLIFF). BEDLAM is useful for a variety of tasks and all images, ground truth bodies, 3D clothing, support code, and more are available for research purposes. Additionally, we provide detailed information about our synthetic data generation pipeline, enabling others to generate their own datasets. See the project page: https://bedlam.is.tue.mpg.de/.

pdf project CVF code DOI URL BibTeX

Perceiving Systems Conference Paper BITE: Beyond Priors for Improved Three-D Dog Pose Estimation Rüegg, N., Tripathi, S., Schindler, K., Black, M. J., Zuffi, S. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8867-8876, CVPR, June 2023 (Published)

Abstract ›

We address the problem of inferring the 3D shape and pose of dogs from images. Given the lack of 3D training data, this problem is challenging, and the best methods lag behind those designed to estimate human shape and pose. To make progress, we attack the problem from multiple sides at once. First, we need a good 3D shape prior, like those available for humans. To that end, we learn a dog-specific 3D parametric model, called D-SMAL. Second, existing methods focus on dogs in standing poses because when they sit or lie down, their legs are self occluded and their bodies deform. Without access to a good pose prior or 3D data, we need an alternative approach. To that end, we exploit contact with the ground as a form of side information. We consider an existing large dataset of dog images and label any 3D contact of the dog with the ground. We exploit body-ground contact in estimating dog pose and find that it significantly improves results. Third, we develop a novel neural network architecture to infer and exploit this contact information. Fourth, to make progress, we have to be able to measure it. Current evaluation metrics are based on 2D features like keypoints and silhouettes, which do not directly correlate with 3D errors. To address this, we create a synthetic dataset containing rendered images of scanned 3D dogs. With these advances, our method recovers significantly better dog shape and pose than the state of the art, and we evaluate this improvement in 3D. Our code, model and test dataset are publicly available for research purposes at https://bite.is.tue.mpg.de.

pdf supp project DOI URL BibTeX

Perceiving Systems Conference Paper ECON: Explicit Clothed humans Optimized via Normal integration Xiu, Y., Yang, J., Cao, X., Tzionas, D., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 512-523, CVPR, June 2023 (Published)

Abstract ›

The combination of artist-curated scans, and deep implicit functions (IF), is enabling the creation of detailed, clothed, 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry but produce disembodied limbs or degenerate shapes for unseen poses or clothes. To increase robustness for these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit and explicit methods. To this end, we make two key observations:(1) current networks are better at inferring detailed 2D maps than full-3D surfaces, and (2) a parametric model can be seen as a “canvas” for stitching together detailed surface patches. ECON infers high-fidelity 3D humans even in loose clothes and challenging poses, while having realistic faces and fingers. This goes beyond previous methods. Quantitative, evaluation of the CAPE and Renderpeople datasets shows that ECON is more accurate than the state of the art. Perceptual studies also show that ECON’s perceived realism is better by a large margin.

Page Paper Demo Code Video Colab DOI URL BibTeX

Perceiving Systems Conference Paper HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics Grigorev, A., Thomaszewski, B., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 16965-16974, CVPR, June 2023 (Published)

Abstract ›

We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable real-time prediction of realistic clothing dynamics. Whereas existing methods based on linear blend skinning must be trained for specific garments, our method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Our method furthermore handles changes in topology (e.g., garments with buttons or zippers) and material properties at inference time. As one key contribution, we propose a hierarchical message-passing scheme that efficiently propagates stiff stretching modes while preserving local detail. We empirically show that our method outperforms strong baselines quantitatively and that its results are perceived as more realistic than state-of-the-art methods.

arXiv project pdf supp URL BibTeX

Perceiving Systems Neural Capture and Synthesis Conference Paper MIME: Human-Aware 3D Scene Generation Yi, H., Huang, C. P., Tripathi, S., Hering, L., Thies, J., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 12965-12976, CVPR, June 2023 (Published)

Abstract ›

Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement in a “scanner” of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research at https://mime.is.tue.mpg.de.

project arXiv paper URL BibTeX

Perceiving Systems Conference Paper PointAvatar: Deformable Point-Based Head Avatars From Videos Zheng, Y., Yifan, W., Wetzstein, G., Black, M. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 21057-21067, CVPR, June 2023 (Published)

Abstract ›

The ability to create realistic animatable and relightable head avatars from casual video sequences would open up wide ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render. Furthermore, existing approaches entangle lighting and albedo, limiting the ability to re-render the avatar in new environments. In contrast, we propose PointAvatar, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading. We demonstrate that PointAvatar bridges the gap between existing mesh- and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation and rendering efficiency. We show that our method is able to generate animatable 3D avatars using monocular videos from multiple sources including hand-held smartphones, laptop webcams and internet videos, achieving state-of-the-art quality in challenging cases where previous methods fail, e.g., thin hair strands, while being significantly more efficient in training than competing methods.

pdf project code video DOI URL BibTeX

Perceiving Systems Conference Paper SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments Dai, Y., Lin, Y., Lin, X., Wen, C., Xu, L., Yi, H., Shen, S., Ma, Y., Wang, C. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 682-692, CVF, CVPR, June 2023 (Published)

Abstract ›

We present SLOPER4D, a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation (GHPE) with human-scene interaction in the wild. Employing a head-mounted device integrated with a LiDAR and camera, we record 12 human subjects’ activities over 10 diverse urban scenes from an egocentric view. Frame-wise annotations for 2D key points, 3D pose parameters, and global translations are provided, together with reconstructed scene point clouds. To obtain accurate 3D ground truth in such large dynamic scenes, we propose a joint optimization method to fit local SMPL meshes to the scene and fine-tune the camera calibration during dynamic motions frame by frame, resulting in plausible and scene-natural 3D human poses. Eventually, SLOPER4D consists of 15 sequences of human motions, each of which has a trajectory length of more than 200 meters (up to 1,300 meters) and covers an area of more than 200 square meters (up to 30,000 square meters), including more than 100K LiDAR frames, 300k video frames, and 500K IMU-based motion frames. With SLOPER4D, we provide a detailed and thorough analysis of two critical tasks, including camera-based 3D HPE and LiDAR-based 3D HPE in urban environments, and benchmark a new task, GHPE. The in-depth analysis demonstrates SLOPER4D poses significant challenges to existing methods and produces great research opportunities. The dataset and code are released https://github.com/climbingdaily/SLOPER4D.

project dataset codebase paper arXiv BibTeX

Perceiving Systems Conference Paper TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments Sun, Y., Bao, Q., Liu, W., Mei, T., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 8856-8866, CVPR, June 2023 (Published)

Abstract ›

Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera, and world, coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end, and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.

pdf supp code video URL BibTeX

Perceiving Systems Empirical Inference Conference Paper MeshDiffusion: Score-based Generative 3D Mesh Modeling Liu, Z., Feng, Y., Black, M. J., Nowrouzezahrai, D., Paull, L., Liu, W. The Eleventh International Conference on Learning Representations (ICLR), ICLR, May 2023 (Published)

Abstract ›

We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Previous scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes with deformable tetrahedral grids, and then train a diffusion model on this direct parametrization. We demonstrate the effectiveness of our model on multiple generative tasks.

Home Code URL BibTeX

Perceiving Systems Ph.D. Thesis Reining in the Deep Generative Models Ghosh, P. University of Tübingen, May 2023 (Published)

Abstract ›

This thesis studies controllability of generative models (specifically VAEs and GANs) applied primarily to images. We improve 1. generation quality, by removing the arbitrary prior assumptions, 2. classification by suitably choosing the latent space distribution, and 3. inference performance by optimizing the generative and inference objective simultaneously. Variational autoencoders (VAEs) are an incredibly useful tool as they can be used as a backbone for a variety of machine learning tasks e.g., semi-supervised learning, representation learning, unsupervised learning, etc. However, the generated samples are overly smooth and this limits their practical usage tremendously. There are two leading hypotheses to explain this: 1. bad likelihood model and 2. overly simplistic prior. We investigate these by designing a deterministic yet samplable autoencoder named Regularized Autoencoders (RAE). This redesign helps us enforce arbitrary priors over the latent distribution of a VAE addressing hypothesis (1) above. This leads us to conclude that a poor likelihood model is the predominant factor that makes VAEs blurry. Furthermore, we show that combining generative (e.g., VAE objective) and discriminative objectives (e.g., classification objective) improve performance of both. Specifically, We use a special case of an RAE to build a classifier that offers robustness against adversarial attack. Conditional generative models have the potential to revolutionize the animation industry, among others. However, to do so, the two key requirements are, 1. they must be of high quality (i.e., generate high-resolution images) and 2. must follow their conditioning (i.e., generate images that have the properties specified by the condition). We exploit pixel-localized correlation between the conditioning variable and generated image to ensure strong association between the two and thereby gain precise control over the generated content. We further show that closing the generation-inference loop (training them together) in latent variable models benefits both the generation and the inference component. This opens up the possibility to train an inference and a generative model simultaneously in one unified framework, in the fully or semi supervised setting. With the proposed approach, one can build a robust classifier by introducing the marginal likelihood of a data point, removing arbitrary assumptions about the prior distribution, mitigating posterior-prior distribution mismatch and completing the generation inference loop. In this thesis, we study real-life implications of each of the themes using various image classification and generation frameworks.

download pdf DOI BibTeX

Perceiving Systems Article Fast-SNARF: A Fast Deformer for Articulated Neural Fields Chen, X., Jiang, T., Song, J., Rietmann, M., Geiger, A., Black, M. J., Hilliges, O. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 1-15, April 2023 (Published)

Abstract ›

Neural fields have revolutionized the area of 3D reconstruction and novel view synthesis of rigid scenes. A key challenge in making such methods applicable to articulated objects, such as the human body, is to model the deformation of 3D locations between the rest pose (a canonical space) and the deformed space. We propose a new articulation module for neural fields, Fast-SNARF, which finds accurate correspondences between canonical space and posed space via iterative root finding. Fast-SNARF is a drop-in replacement in functionality to our previous work, SNARF, while significantly improving its computational efficiency. We contribute several algorithmic and implementation improvements over SNARF, yielding a speed-up of 150× . These improvements include voxel-based correspondence search, pre-computing the linear blend skinning function, and an efficient software implementation with CUDA kernels. Fast-SNARF enables efficient and simultaneous optimization of shape and skinning weights given deformed observations without correspondences (e.g. 3D meshes). Because learning of deformation maps is a crucial component in many 3D human avatar methods and since Fast-SNARF provides a computationally efficient solution, we believe that this work represents a significant step towards the practical creation of 3D virtual humans.

pdf publisher site code DOI URL BibTeX

Perceiving Systems Patent Method and systems for labelling motion-captured points J., B. M., Ghorbani, N. (US Patent App.~17/949,087), April 2023 (Published)

Abstract ›

Computer-implemented methods are provided for labelling motion-captured points that correspond to markers on an object. The methods include obtaining the motion-captured points, processing a representation of the motion-captured points in a trained self-attention unit to obtain label scores for the motion-captured points, and assigning labels based on the label scores.

BibTeX

Publications

Filter by