Understanding Human-Scene Interaction through Perception and Generation
Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce, and captured motion datasets often lack scene information.

This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) the depth-ordering constraint: humans moving through a scene occlude, and are occluded by, objects, thereby defining the relative depth ordering of those objects; (2) the collision constraint: humans move through free space and do not interpenetrate objects; (3) the interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by synthesizing scenes from input human motion (scenes from humans), and generating human motion from scenes (humans from scenes).

Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scene from an RGB video. This optimization-based approach leverages the three aforementioned constraints to enhance the consistency and plausibility of the reconstructed scene layout and to refine the initial 3D human pose and shape estimates.

Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies the collision and interaction constraints, and employs an auto-regressive transformer architecture that places objects in the scene conditioned on the existing human motion. The training data is enriched by populating the 3D-FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method produces furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments.

Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions while adhering to the collision and interaction constraints. It uses a text-controlled, scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements.

In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.
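To make the collision and interaction constraints concrete, the following is a minimal sketch of how they can be expressed as penalty terms, assuming a precomputed voxel grid of signed distances to the scene surface and a set of annotated body contact vertices. All names here (`collision_loss`, `contact_loss`, `scene_sdf`, and so on) are hypothetical illustrations, not the thesis's actual implementation.

```python
import numpy as np

def collision_loss(body_vertices, scene_sdf, grid_origin, voxel_size):
    """Collision constraint: penalize body vertices inside scene geometry.

    body_vertices: (N, 3) posed human mesh vertices in world coordinates.
    scene_sdf:     (X, Y, Z) signed-distance grid over the scene
                   (negative values = inside an object).
    """
    # Map world coordinates to voxel indices (nearest neighbor here;
    # a real system would use trilinear interpolation).
    idx = np.round((body_vertices - grid_origin) / voxel_size).astype(int)
    idx = np.clip(idx, 0, np.array(scene_sdf.shape) - 1)
    d = scene_sdf[idx[:, 0], idx[:, 1], idx[:, 2]]
    # Only penetrating vertices (negative signed distance) contribute.
    return float(np.sum(np.minimum(d, 0.0) ** 2))

def contact_loss(contact_vertices, object_surface_points):
    """Interaction constraint: annotated contact vertices should lie on
    an object surface (contact surfaces coincide in space)."""
    # Distance from each contact vertex to its nearest surface point.
    diff = contact_vertices[:, None, :] - object_surface_points[None, :, :]
    nearest = np.min(np.linalg.norm(diff, axis=-1), axis=1)
    return float(np.sum(nearest ** 2))
```

In an optimization pipeline such as MOVER's, terms of this kind would plausibly be weighted, combined with image-reprojection and depth-ordering terms, and minimized jointly over human pose and scene layout.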
@phdthesis{UnderstandingHumanSceneInteractionthroughPerceptionandGeneration,
  title = {Understanding Human-Scene Interaction through Perception and Generation},
  abstract = {Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce, and captured motion datasets often lack scene information. This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) the depth-ordering constraint: humans moving through a scene occlude, and are occluded by, objects, thereby defining the relative depth ordering of those objects; (2) the collision constraint: humans move through free space and do not interpenetrate objects; (3) the interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by synthesizing scenes from input human motion (scenes from humans), and generating human motion from scenes (humans from scenes). Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scene from an RGB video. This optimization-based approach leverages the three aforementioned constraints to enhance the consistency and plausibility of the reconstructed scene layout and to refine the initial 3D human pose and shape estimates. Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies the collision and interaction constraints, and employs an auto-regressive transformer architecture that places objects in the scene conditioned on the existing human motion. The training data is enriched by populating the 3D-FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method produces furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments. Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions while adhering to the collision and interaction constraints. It uses a text-controlled, scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements. In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.},
  month = apr,
  year = {2025},
  slug = {understandinghumansceneinteractionthroughperceptionandgeneration},
  author = {Yi, Hongwei},
  month_numeric = {4}
}