Hands are our primary interface for acting on the world. From everyday tasks like preparing food to skilled procedures like surgery, human activity is shaped by rich and varied hand interactions. These include not only the manipulation of external objects but also coordinated actions between both hands. For physical AI systems to learn from human behavior, assist in physical tasks, or collaborate safely in shared environments, they must perceive and understand hands in action: how we use them to interact with each other and with the objects around us. A key component of this understanding is the ability to reconstruct human hand motion and hand-object interactions in 3D from RGB images or videos. However, existing methods focus largely on estimating the pose of a single hand, often in isolation. They struggle with scenarios involving two closely interacting hands or interactions with objects, particularly when those objects are articulated or previously unseen. Reconstructing 3D hands in action poses significant challenges, including severe occlusions, appearance ambiguities, and the need to reason about both hand and object geometry in dynamic configurations. As a result, current systems fall short in complex real-world environments. This dissertation addresses these challenges by introducing methods and data for reconstructing hands in action from monocular RGB inputs.

We begin by tackling the problem of interacting hand pose estimation. We present DIGIT, a method that leverages a part-aware semantic prior to disambiguate closely interacting hands. By explicitly modeling hand part interactions and encoding the semantics of finger parts, DIGIT robustly recovers accurate hand poses, outperforming prior baselines and taking a step toward a more complete understanding of 3D hands in action.

Since hands frequently manipulate objects, jointly reconstructing both is crucial. Existing methods for hand-object reconstruction are limited to rigid objects and cannot handle tools with articulation, such as scissors or laptops, which severely restricts their ability to model the full range of everyday manipulations. We present the first method that jointly reconstructs two hands and an articulated object from a single RGB image, enabling unified reasoning across both rigid and articulated object interactions. To support this, we introduce ARCTIC, a large-scale motion capture dataset of humans performing dexterous bimanual manipulation with articulated tools. ARCTIC includes both articulated and fixed (rigid) configurations, along with accurate 3D annotations of hand poses and object motions. Leveraging this dataset, our method jointly infers object articulation states and hand poses, advancing the state of hand-object understanding in complex object manipulation settings.

Finally, we address generalization to in-the-wild object interactions. Prior approaches either rely on synthetic data with limited realism or require object models at test time. We introduce HOLD, a self-supervised method that learns to reconstruct 3D hand-object interactions from monocular RGB videos, without paired 3D annotations or known object models. HOLD learns via an appearance- and motion-consistent objective across views and time, enabling strong generalization to unseen objects in interaction. Experiments demonstrate HOLD's ability to generalize to in-the-wild monocular settings, outperforming fully supervised baselines trained on synthetic or lab-captured datasets.
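To make the appearance- and motion-consistency idea above more concrete, the following is a minimal sketch of what such a self-supervised objective can look like. It assumes a differentiable renderer and per-frame hand and object parameters; all names (render_scene, hand_params, object_params) and the loss weighting are illustrative assumptions, not HOLD's actual implementation.

# Illustrative sketch of an appearance- and motion-consistency objective,
# in the spirit of the self-supervised training described above.
# All names and weights are hypothetical, not HOLD's real API.
import torch

def consistency_loss(frames, hand_params, object_params, render_scene):
    """frames: (T, H, W, 3) RGB video; *_params: per-frame pose tensors."""
    # Appearance term: the rendered hand-object scene should reproduce each frame.
    rendered = render_scene(hand_params, object_params)  # (T, H, W, 3)
    appearance = (rendered - frames).abs().mean()

    # Motion term: poses should vary smoothly between neighbouring frames.
    motion = ((hand_params[1:] - hand_params[:-1]) ** 2).mean() \
           + ((object_params[1:] - object_params[:-1]) ** 2).mean()

    return appearance + 0.1 * motion  # weighting chosen arbitrarily for illustration

The appearance term ties the reconstruction to the observed pixels, while the motion term discourages jittery, temporally inconsistent hand and object poses.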
Together, DIGIT, ARCTIC, and HOLD advance the 3D understanding of hands in action, covering both hand-hand and hand-object interactions. These contributions improve robustness in interacting hand pose estimation, introduce a dataset for bimanual manipulation with rigid and articulated tools, and include the first single-image method for jointly reconstructing hands and articulated objects, learned directly from this dataset. In addition, HOLD enables hand-object reconstruction in the wild without requiring object templates. These developments move toward more scalable physical AI systems capable of interpreting and imitating human manipulation, with applications in teleoperation, human-robot collaboration, and embodied learning from demonstration.
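As a concrete illustration of the object articulation states mentioned in the abstract, the sketch below applies a predicted articulation angle to a two-part object (for example, a laptop lid rotating about its hinge) before a rigid global transform. The one-degree-of-freedom hinge, the axis-angle convention, and all names are assumptions for illustration, not the thesis's actual object model.

# Illustrative sketch: posing a two-part articulated object from a predicted
# articulation angle plus a global rigid transform. Hypothetical interface.
import torch

def pose_articulated_object(base_verts, top_verts, hinge_origin, hinge_axis,
                            articulation_angle, global_rot, global_trans):
    """base_verts, top_verts: (N, 3) / (M, 3); articulation_angle: scalar tensor."""
    # Rodrigues' formula: rotation of the articulated part about the hinge axis.
    axis = hinge_axis / hinge_axis.norm()
    K = torch.tensor([[0., -axis[2], axis[1]],
                      [axis[2], 0., -axis[0]],
                      [-axis[1], axis[0], 0.]])
    R_art = torch.eye(3) + torch.sin(articulation_angle) * K \
            + (1 - torch.cos(articulation_angle)) * (K @ K)

    # Rotate the articulated part about the hinge, keep the base part fixed.
    top_posed = (top_verts - hinge_origin) @ R_art.T + hinge_origin
    verts = torch.cat([base_verts, top_posed], dim=0)

    # Apply the rigid global object pose shared by both parts.
    return verts @ global_rot.T + global_trans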
@phdthesis{fan2025thesis,
  title         = {Learning Hands in Action},
  author        = {Fan, Zicong},
  degree_type   = {PhD},
  month         = dec,
  month_numeric = {12},
  year          = {2025}
}