Perceiving Systems Ph.D. Thesis 2025

Understanding Human-Scene Interaction through Perception and Generation


Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce, and captured motion datasets often lack scene information.

This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) the depth ordering constraint: humans that move in a scene are occluded by or occlude objects, thereby defining the relative depth ordering of the objects; (2) the collision constraint: humans move through free space and do not interpenetrate objects; (3) the interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by generating scenes from input human motions (scenes from humans), and generating human motion from scenes (humans from scenes).

Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scene from an RGB video. This optimization-based approach leverages the three aforementioned constraints to enhance the consistency and plausibility of reconstructed scene layouts and to refine the initial 3D human pose and shape estimates.

Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies the collision and interaction constraints and employs an auto-regressive transformer architecture that places objects in the scene based on existing human motion. The training data is enriched by populating the 3D-FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method produces furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments.

Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions while adhering to the collision and interaction constraints. It uses a text-controlled, scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements.

In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.
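To make the collision and interaction constraints concrete, the sketch below shows one common way such constraints can be phrased as penalty terms in a scene-aware optimization (as in optimization-based approaches like MOVER). This is an illustrative sketch, not code from the thesis: the scene_sdf callable, tensor names, and loss forms are assumptions for exposition only.

# Illustrative sketch (not the thesis implementation) of collision and
# contact penalties. Assumes a hypothetical `scene_sdf(points)` that returns
# the signed distance of each 3D point to the nearest scene surface
# (negative inside objects).
import torch

def collision_loss(human_vertices, scene_sdf):
    """Penalize human vertices that penetrate scene geometry (collision constraint)."""
    d = scene_sdf(human_vertices)            # (N,) signed distances
    penetration = torch.clamp(-d, min=0.0)   # positive only where the body is inside an object
    return (penetration ** 2).mean()

def contact_loss(contact_vertices, scene_sdf):
    """Pull annotated contact regions (e.g. soles when standing) onto scene surfaces
    (interaction constraint: contact surfaces occupy the same place in space)."""
    d = scene_sdf(contact_vertices)          # distance of contact vertices to the scene
    return (d ** 2).mean()                   # zero when the contact surfaces coincide

In an optimization loop, these terms would typically be weighted and summed with image-based reprojection terms; the weights and the choice of contact regions are design decisions not specified here.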

Author(s): Yi, Hongwei
Year: 2025
Month: April
Day: 07
Bibtex Type: Ph.D. Thesis (phdthesis)
State: Published

BibTeX

@phdthesis{UnderstandingHumanSceneInteractionthroughPerceptionandGeneration,
  title = {Understanding Human-Scene Interaction through Perception and Generation},
  abstract = {Humans are in constant contact with the world as they move through it and interact with
  it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and
  manipulation of three-dimensional (3D) environments, which is crucial for applications such as
  gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated
  by moving humans is a challenging and labor-intensive task. Existing human-scene interaction
  datasets are scarce, and captured motion datasets often lack scene information.
  This thesis addresses these challenges by leveraging three specific types of HSI constraints:
  (1) the depth ordering constraint: humans that move in a scene are occluded by or occlude objects,
  thereby defining the relative depth ordering of the objects; (2) the collision constraint: humans
  move through free space and do not interpenetrate objects; (3) the interaction constraint: when
  humans and objects are in contact, the contact surfaces occupy the same place in space. Building
  on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB
  video, generating HSI by generating scenes from input human motions (scenes from humans), and
  generating human motion from scenes (humans from scenes).
  Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scene
  from an RGB video. This optimization-based approach leverages the three aforementioned constraints
  to enhance the consistency and plausibility of reconstructed scene layouts and to refine the
  initial 3D human pose and shape estimates.
  Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and
  interactive 3D environments. This method applies the collision and interaction constraints and
  employs an auto-regressive transformer architecture that places objects in the scene based on
  existing human motion. The training data is enriched by populating the 3D-FRONT scene dataset with
  3D humans. By treating human movement as a “scanner” of the environment, this method produces
  furniture layouts that reflect true human activities, increasing the diversity and authenticity of
  the environments.
  Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text
  descriptions while adhering to the collision and interaction constraints. It uses a
  text-controlled, scene-aware motion generation framework based on denoising diffusion models.
  Annotated navigation and interaction motions are embedded within scenes to support the model’s
  training, allowing the generation of diverse and realistic human-scene interactions tailored to
  specific settings and object arrangements.
  In conclusion, these methodologies significantly advance our understanding and synthesis of
  human-scene interactions, offering realistic modeling of 3D environments.},
  month = apr,
  year = {2025},
  slug = {understandinghumansceneinteractionthroughperceptionandgeneration},
  author = {Yi, Hongwei},
  month_numeric = {4}
}