
Putting People into Scenes

Clockwise from upper left: PSI, PLACE, POSA, and SAMP.

Publications

Conference Paper: Stochastic Scene-Aware Motion Prediction. Hassan, M., Ceylan, D., Villegas, R., Saito, J., Yang, J., Zhou, Y., Black, M. In Proc. International Conference on Computer Vision (ICCV), 11354-11364, IEEE, Piscataway, NJ, October 2021 (Published)
A long-standing goal in computer vision is to capture, model, and realistically synthesize human behavior. Specifically, by learning from data, our goal is to enable virtual humans to navigate within cluttered indoor scenes and naturally interact with objects. Such embodied behavior has applications in virtual reality, computer games, and robotics, while synthesized behavior can be used as training data. The problem is challenging because real human motion is diverse and adapts to the scene. For example, a person can sit or lie on a sofa in many places and with varying styles. We must model this diversity to synthesize virtual humans that realistically perform human-scene interactions. We present a novel data-driven, stochastic motion synthesis method that models different styles of performing a given action with a target object. Our Scene-Aware Motion Prediction method (SAMP) generalizes to target objects of various geometries while enabling the character to navigate in cluttered scenes. To train SAMP, we collected mocap data covering various sitting, lying down, walking, and running styles. We demonstrate SAMP on complex indoor scenes, where it outperforms existing solutions.
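To make the stochastic, goal-conditioned generation concrete, here is a minimal sketch of an autoregressive conditional VAE of the kind SAMP builds on: the next pose is decoded from a random latent code plus the current pose and an encoding of the target object, so different latent samples yield different styles of the same action. All names, dimensions, and architectural choices below are illustrative assumptions, not the released model.

```python
# Illustrative sketch only: names, dimensions, and architecture are
# assumptions, not SAMP's released implementation.
import torch
import torch.nn as nn

class NextPoseCVAE(nn.Module):
    """Conditional VAE that samples the next-frame pose given the current
    pose and an encoding of the target object's geometry."""
    def __init__(self, pose_dim=330, obj_dim=256, z_dim=64, hidden=512):
        super().__init__()
        self.z_dim = z_dim
        cond_dim = pose_dim + obj_dim
        self.enc = nn.Sequential(
            nn.Linear(pose_dim + cond_dim, hidden), nn.ELU(),
            nn.Linear(hidden, 2 * z_dim))                  # -> (mu, logvar)
        self.dec = nn.Sequential(
            nn.Linear(z_dim + cond_dim, hidden), nn.ELU(),
            nn.Linear(hidden, pose_dim))                   # -> next pose

    def forward(self, next_pose, cur_pose, obj_code):
        cond = torch.cat([cur_pose, obj_code], dim=-1)
        mu, logvar = self.enc(torch.cat([next_pose, cond], dim=-1)).chunk(2, -1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize
        return self.dec(torch.cat([z, cond], dim=-1)), mu, logvar

    @torch.no_grad()
    def sample(self, cur_pose, obj_code):
        # At test time, different z samples yield different styles of
        # approaching and using the same object.
        z = torch.randn(cur_pose.shape[0], self.z_dim, device=cur_pose.device)
        return self.dec(torch.cat([z, cur_pose, obj_code], dim=-1))
```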

Conference Paper: Populating 3D Scenes by Learning Human-Scene Interaction. Hassan, M., Ghosh, P., Tesch, J., Tzionas, D., Black, M. J. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2021), 14703-14713, IEEE, Piscataway, NJ, June 2021 (Published)
Humans live within a 3D space and constantly interact with it to perform tasks. Such interactions involve physical contact between surfaces that is semantically meaningful. Our goal is to learn how humans interact with scenes and leverage this to enable virtual characters to do the same. To that end, we introduce a novel Human-Scene Interaction (HSI) model that encodes proximal relationships, called POSA for “Pose with prOximitieS and contActs”. The representation of interaction is body-centric, which enables it to generalize to new scenes. Specifically, POSA augments the SMPL-X parametric human body model such that, for every mesh vertex, it encodes (a) the contact probability with the scene surface and (b) the corresponding semantic scene label. We learn POSA with a VAE conditioned on the SMPL-X vertices, and train on the PROX dataset, which contains SMPL-X meshes of people interacting with 3D scenes, and the corresponding scene semantics from the PROX-E dataset. We demonstrate the value of POSA with two applications. First, we automatically place 3D scans of people in scenes. We use a SMPL-X model fit to the scan as a proxy and then find its most likely placement in 3D. POSA provides an effective representation to search for “affordances” in the scene that match the likely contact relationships for that pose. We perform a perceptual study that shows significant improvement over the state of the art on this task. Second, we show that POSA’s learned representation of body-scene interaction supports monocular human pose estimation that is consistent with a 3D scene, improving on the state of the art. Our model and code are available for research purposes at https://posa.is.tue.mpg.de.
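As a rough illustration of how a per-vertex contact map can drive placement search, the sketch below scores a candidate rigid placement of a body in a scene: vertices the model predicts to be in contact should lie near the scene surface. The interface is a hypothetical simplification; POSA's actual objective also uses the per-vertex semantic labels and a penetration term.

```python
# Hypothetical simplification of contact-driven placement scoring; POSA's
# real objective also includes semantic and penetration terms.
import torch

def placement_score(body_verts, contact_prob, scene_points):
    """body_verts:   (V, 3) posed SMPL-X vertices under a candidate placement
    contact_prob: (V,)   per-vertex contact probability from the feature model
    scene_points: (S, 3) points sampled on the scene mesh"""
    # Distance from each body vertex to its nearest scene point; vertices
    # predicted to be in contact should have near-zero distance.
    d = torch.cdist(body_verts, scene_points).min(dim=1).values  # (V,)
    return (contact_prob * d).sum()  # lower is better

# Usage: evaluate the score over a grid of candidate placements and keep the
# minimum, e.g. placement_score(verts @ R.T + t, contact_prob, scene_points).
```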

Conference Paper: PLACE: Proximity Learning of Articulation and Contact in 3D Environments. Zhang, S., Zhang, Y., Ma, Q., Black, M. J., Tang, S. In 2020 International Conference on 3D Vision (3DV 2020), 1:642-651, IEEE, Piscataway, NJ, November 2020 (Published)
High-fidelity digital 3D environments have been proposed in recent years; however, it remains extremely challenging to automatically equip such environments with realistic human bodies. Existing work utilizes images, depth, or semantic maps to represent the scene, and parametric human models to represent 3D bodies. While straightforward, these approaches often generate human-scene interactions that lack naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. To synthesize realistic human-scene interactions, it is essential to effectively represent the physical contact and proximity between the body and the world. To that end, we propose a novel interaction generation method, named PLACE (Proximity Learning of Articulation and Contact in 3D Environments), which explicitly models the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the minimum distances from the basis points to the human body surface. The generated proximal relationships indicate which regions of the scene are in contact with the person. Furthermore, based on such synthesized proximity, we are able to effectively obtain expressive 3D human bodies that naturally interact with the 3D scene. Our perceptual study shows that PLACE significantly improves over the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. The code and model are available for research at https://sanweiliti.github.io/PLACE/PLACE.html.
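The basis-point idea at the heart of PLACE can be sketched in a few lines: fix a set of basis points once, then describe any surface by the distance from each basis point to its nearest surface point. A conditional VAE conditioned on the scene's encoding generates the analogous basis-to-body distances, from which a body is recovered. Names below are illustrative, not identifiers from the released code.

```python
# Illustrative sketch of basis-point-set (BPS) encoding; names are
# assumptions, not identifiers from the released PLACE code.
import torch

def bps_encode(basis, surface_points):
    """basis: (B, 3) fixed basis points; surface_points: (N, 3) mesh samples.
    Returns the minimum distance from each basis point to the surface."""
    return torch.cdist(basis, surface_points).min(dim=1).values  # (B,)

basis = torch.randn(1024, 3)                  # sampled once, then frozen
scene_feat = bps_encode(basis, torch.rand(50_000, 3))  # scene descriptor
# A cVAE conditioned on scene_feat then generates body-to-basis distances,
# from which body vertices are regressed and contact-refined.
```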

Conference Paper: Generating 3D People in Scenes without People. Zhang, Y., Hassan, M., Neumann, H., Black, M. J., Tang, S. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2020), 6193-6203, IEEE, Piscataway, NJ, June 2020 (Published)
We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer, as solving it requires that (1) the generated human bodies be semantically plausible within the 3D environment (e.g., people sitting on the sofa or cooking near the stove), and (2) the generated human-scene interactions be physically feasible, such that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations, then further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with the 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications, e.g., generating training data for human pose estimation, video games, and VR/AR. Our project page with data and code is at https://vlg.inf.ethz.ch/projects/PSI/.
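The second, geometry-aware stage of such a pipeline can be sketched as a small optimization: starting from a body sampled by the cVAE, descend on a penetration penalty (via a scene signed-distance function) plus a contact term. The interface below is a hypothetical simplification of the scene constraints described in the paper; decode_verts and scene_sdf are assumed callables.

```python
# Hypothetical simplification of scene-constrained refinement; the paper's
# actual losses and weights differ.
import torch

def refine(body_params, decode_verts, scene_sdf, steps=100, lr=1e-2):
    """body_params: SMPL-X parameter tensor sampled from the cVAE
    decode_verts: params -> (V, 3) body surface vertices
    scene_sdf:    (V, 3) points -> signed distances (negative inside scene)"""
    params = body_params.detach().clone().requires_grad_(True)
    opt = torch.optim.Adam([params], lr=lr)
    for _ in range(steps):
        sdf = scene_sdf(decode_verts(params))
        loss_pen = torch.relu(-sdf).sum()   # push vertices out of the scene
        loss_con = sdf.abs().min()          # pull the body onto some surface
        loss = loss_pen + 0.5 * loss_con    # illustrative weighting
        opt.zero_grad()
        loss.backward()
        opt.step()
    return params.detach()
```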

Conference Paper: Resolving 3D Human Pose Ambiguities with 3D Scene Constraints. Hassan, M., Choutas, V., Tzionas, D., Black, M. J. In International Conference on Computer Vision (ICCV), 2282-2292, October 2019 (Published)
To understand and analyze human behavior, we need to capture humans moving in, and interacting with, the world. Most existing methods perform 3D human pose estimation without explicitly considering the scene. We observe, however, that the world constrains the body and vice versa. To motivate this, we show that current 3D human pose estimation methods produce results that are not consistent with the 3D scene. Our key contribution is to exploit static 3D scene structure to better estimate human pose from monocular images. The method enforces Proximal Relationships with Object eXclusion and is called PROX. To test this, we collect a new dataset composed of 12 different 3D scenes and RGB sequences of 20 subjects moving in and interacting with the scenes. We represent human pose using the 3D human body model SMPL-X and extend SMPLify-X to estimate body pose using scene constraints. We make use of the 3D scene information by formulating two main constraints. The interpenetration constraint penalizes intersection between the body model and the surrounding 3D scene. The contact constraint encourages specific parts of the body to be in contact with scene surfaces if they are close enough in distance and orientation. For quantitative evaluation, we capture a separate dataset with 180 RGB frames in which the ground-truth body pose is estimated using a motion-capture system. We show quantitatively that introducing scene constraints significantly reduces 3D joint error and vertex error. Our code and data are available for research at https://prox.is.tue.mpg.de.
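The two constraints can be written as simple penalty terms added to the SMPLify-X objective. The sketch below is a simplified version with illustrative thresholds; the real contact term also checks surface orientation, which is omitted here.

```python
# Simplified sketches of the two PROX-style scene terms; thresholds and
# weights are illustrative, and the paper's contact term additionally
# requires compatible surface orientation.
import torch

def interpenetration_loss(body_verts, scene_sdf):
    # Penalize body vertices with negative signed distance (inside geometry).
    return torch.relu(-scene_sdf(body_verts)).pow(2).sum()

def contact_loss(contact_verts, scene_points, thresh=0.05):
    """contact_verts: (C, 3) designated contact vertices (soles, palms, ...)
    scene_points: (S, 3) points on the scene surface; thresh in meters."""
    d = torch.cdist(contact_verts, scene_points).min(dim=1).values
    # Only vertices already near a surface are pulled into contact.
    return torch.where(d < thresh, d, torch.zeros_like(d)).sum()
```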