HOLD -- inferring 3D hand and object shape from video | Perceiving Systems – MPI-IS

HOLD – inferring 3D hand and object shape from video

HOLD reconstructs detailed 3D geometry of novel objects and the hands interacting with them from video. It is agnostic to the object category and trains a compositional, articulated implicit model at runtime to disentangle 3D hand and object shape.
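To make the idea of a compositional implicit model concrete, here is a minimal sketch, not HOLD's implementation. It assumes two hand-written signed distance functions stand in for the learned hand and object fields: a capsule for a single hand bone and a sphere for the object. All function names, shapes, and coordinates are hypothetical. The scene is the union of the two fields, and the component that is closest to a query point gives a hand/object label, which is the sense in which the two shapes stay disentangled.

```python
import math

def sphere_sdf(p, center, radius):
    """Signed distance from point p to a sphere (stand-in for the object field)."""
    return math.dist(p, center) - radius

def capsule_sdf(p, a, b, radius):
    """Signed distance from p to a capsule with axis a-b (stand-in for one hand bone)."""
    ab = [bi - ai for ai, bi in zip(a, b)]
    ap = [pi - ai for ai, pi in zip(a, p)]
    # Project p onto the segment a-b and clamp to the segment's ends.
    t = sum(x * y for x, y in zip(ap, ab)) / sum(c * c for c in ab)
    t = max(0.0, min(1.0, t))
    closest = [ai + t * c for ai, c in zip(a, ab)]
    return math.dist(p, closest) - radius

def composed_sdf(p):
    """Union of the two fields; also reports which component is closest to p."""
    # Hypothetical placement: unit sphere at the origin, capsule along the x-axis.
    d_obj = sphere_sdf(p, (0.0, 0.0, 0.0), 1.0)
    d_hand = capsule_sdf(p, (2.0, 0.0, 0.0), (4.0, 0.0, 0.0), 0.5)
    if d_hand < d_obj:
        return d_hand, "hand"
    return d_obj, "object"

if __name__ == "__main__":
    for point in [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 5.0, 0.0)]:
        print(point, composed_sdf(point))
```

In a learned setting the two analytic functions would be replaced by networks fitted to the video, and the hand field would be driven by an articulated pose. The composition step shown here is only one simple way to combine two fields.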

Publications

Conference Paper (Perceiving Systems): Fan, Z., Parelli, M., Kadoglou, M. E., Kocabas, M., Chen, X., Black, M. J., Hilliges, O. HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), pp. 494-504, Piscataway, NJ, September 2024 (Published).

Abstract

Since humans interact with diverse objects every day, the holistic 3D capture of these interactions is important to understand and model human behaviour. However, most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data, restricting their ability to scale and generalize to more unconstrained interaction settings. To this end, we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hand and object from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on 3D hand-object annotations while outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover, we qualitatively show its robustness in reconstructing from in-the-wild videos.
