Publications

Perceiving Systems Conference Paper Predicting 4D Hand Trajectory from Monocular Videos Ye, Y., Feng, Y., Taheri, O., Feng, H., Black, M. J., Tulsiani, S. In Int. Conf. on 3D Vision (3DV), March 2026 (Accepted)
We present HAPTIC, an approach that infers coherent 4D hand trajectories from monocular videos. Current video-based hand pose reconstruction methods primarily focus on improving frame-wise 3D pose using adjacent frames rather than studying consistent 4D hand trajectories in space. Despite the additional temporal cues, they generally underperform compared to image-based methods due to the scarcity of annotated video data. To address these issues, we repurpose a state-of-the-art image-based transformer to take in multiple frames and directly predict a coherent trajectory. We introduce two types of lightweight attention layers: cross-view self-attention to fuse temporal information, and global cross-attention to bring in larger spatial context. Our method infers 4D hand trajectories similar to the ground truth while maintaining strong 2D reprojection alignment. We apply the method to both egocentric and allocentric videos. It significantly outperforms existing methods in global trajectory accuracy while being comparable to the state-of-the-art in single-image pose estimation.
project arXiv code BibTeX

Perceiving Systems Conference Paper Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis Danecek, R., Schmitt, C., Polikovsky, S., Black, M. J. In Int. Conf. on 3D Vision (3DV), March 2026 (Accepted)
In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations.
project arXiv BibTeX

Perceiving Systems Conference Paper NeuralFur: Animal Fur Reconstruction from Multi-view Images Skliarova, V., Kabadayi, B., Yiannakidis, A., Becherini, G., Black, M. J., Thies, J. In Int. Conf. on 3D Vision (3DV), March 2026 (Accepted)
Reconstructing realistic animal fur geometry from images is a challenging task due to the fine-scale details, self-occlusion, and view-dependent appearance of fur. Unlike in human hairstyle reconstruction, there are no datasets that could be leveraged to learn a fur prior for different animals. In this work, we present the first multi-view-based method for high-fidelity 3D fur modeling of animals using a strand-based representation, leveraging the general knowledge of a vision language model. Given calibrated multi-view RGB images, we first reconstruct a coarse surface geometry using traditional multi-view stereo techniques. We then use a visual question answering (VQA) system to retrieve information about the realistic length structure of the fur for each part of the body. We use this knowledge to construct the animal’s furless geometry and grow strands atop it. The fur reconstruction is supervised with both geometric and photometric losses computed from multi-view images. To mitigate orientation ambiguities stemming from the Gabor filters that are applied to the input images, we additionally utilize the VQA to guide the strands' growth direction and their relation to the gravity vector, which we incorporate as a loss. With this new schema of using a VQA model to guide 3D reconstruction from multi-view inputs, we show generalization across a variety of animals with different fur types.
project arXiv code BibTeX

Perceiving Systems Article Textile suit for anywhere full-body motion capture Sun, H., Feng, Y., Kao, P., Black, M. J., Kramer-Bottiglio, R. Science Advances, 12(10):1-15, March 2026 (Published)
Wearable technology has shown notable promise for tracking human motion, offering valuable insights for fields ranging from biomechanics to healthcare. Traditional motion capture systems, however, are often bulky and disruptive, making them impractical for daily use. Advances in textile-based sensing offer a promising alternative, enabling seamless integration of air- and sweat-permeable sensors into everyday clothing. Here, a sensorized textile suit designed for unobtrusive full-body motion capture is presented. The suit is capable of accurately tracking complex movements without interfering with routine activities. This wearable, using an individually customized network of fabric-based sensors, autonomously identifies and monitors movement angles and patterns, providing insights into physical range, activity frequency, and exertion levels. Language models are shown to translate the motion data into descriptive language, enhancing its potential for real-world applications. This sensorized textile suit and the corresponding algorithms represent a step forward in accessible, continuous movement monitoring in the form of everyday clothing, opening avenues for studying human behavior and health in natural environments. YSuit is a fully textile suit for accurate, customizable, washable, and comfortable anywhere full-body motion capture.
pdf publisher site data DOI BibTeX

Perceiving Systems Ph.D. Thesis From Perception to Actions: Autonomous Exploration, Synthetic Data, and Dynamic Worlds Bonetto, E. January 2026 (Published)
This thesis explores innovative methods and frameworks to enhance intelligent systems' visual perception capabilities. Vision is the primary means by which many animals perceive, understand, learn, reason about, and interact with the world to achieve their goals. Unlike animals, intelligent systems must acquire these capabilities by processing raw visual data captured by cameras using computer vision and deep learning. First, we consider a crucial aspect of visual perception in intelligent systems: understanding the structure and layout of the environment. To enable applications such as object interaction or extended reality in previously unseen spaces, these systems are often required to estimate their own motion. When operating in novel environments, they must also construct a map of the space. Together, these requirements form the essence of the Simultaneous Localization and Mapping (SLAM) problem. However, pre-mapping environments can be impractical, costly, and unscalable in scenarios like disaster response or home automation. This makes it essential to develop robots capable of autonomously exploring and mapping unknown areas, a process known as Active SLAM. Active SLAM typically involves a multi-step process in which the robot acts on the available information to decide the next best actions. The goal is to autonomously and efficiently explore environments without using prior information. Despite an extensive history, Active SLAM methods have focused only on short- or long-term objectives, without considering the process as a whole or adapting to ever-changing states. Addressing these gaps, we introduce iRotate to capitalize on continuous information-gain prediction. Distinct from prevailing approaches, iRotate constantly (pre)optimizes camera viewpoints, acting on i) long-term, ii) short-term, and iii) real-time objectives.
By doing this, iRotate significantly reduces energy consumption and localization errors, thus diminishing the exploration effort, a substantial leap in efficiency and effectiveness. iRotate, like many other SLAM approaches, relies on the assumption of operating in a static environment. Dynamic components in the scene significantly impact SLAM performance in the localization, place recognition, and optimization steps, hindering the widespread adoption of autonomous robots. This stems from the difficulty of collecting diverse ground truth information in the real world and the long-standing limitations of simulation tools. Testing directly in the real world is costly and risky without prior simulation validation. Datasets, in turn, are inherently static and non-interactive, making them unsuitable for developing autonomous approaches. Moreover, existing simulation tools often lack the visual realism and flexibility needed to create and control fully customized experiments that bridge the gap between simulation and the real world. This thesis addresses the challenges of obtaining ground truth data and simulating dynamic environments by introducing the GRADE framework. Through a photorealistic rendering engine, we enable online and offline testing of robotic systems and the generation of richly annotated synthetic ground truth data. By ensuring flexibility and repeatability, we allow the extension of previous experiments through variations, for example, in scene content or sensor settings. Synthetic data can first be used to address several challenges in the context of Deep Learning (DL) approaches, e.g., mismatched data distributions between applications, the costs and limits of data collection procedures, and errors caused by incorrect or inconsistent labeling in training datasets. However, the gap between the real and simulated worlds often limits the direct use of synthetic data, making style transfer, adaptation techniques, or real-world information necessary.
Here, we leverage the photorealism obtainable with GRADE to generate synthetic data and overcome these issues. First, since humans are significant sources of dynamic behavior in environments and the target of many applications, we focus on their detection and segmentation. We train models on real, synthetic, and mixed datasets, and show that using only synthetic data can lead to state-of-the-art performance in indoor scenarios. Then, we leverage GRADE to benchmark several Dynamic Visual SLAM methods. These often rely on semantic segmentation and optical flow techniques to identify moving objects and exclude their visual features from the pose estimation and optimization processes. Our evaluations show that they tend to reject too many features, leading to failures in accurately and fully tracking camera trajectories. Surprisingly, we observed low tracking rates not only on simulated sequences but also on real-world datasets. Moreover, we show that the performance of the segmentation and detection models used is not always positively correlated with that of the Dynamic Visual SLAM methods. These failures are mainly due to incorrect estimations, crowded scenes, and a failure to consider the different motion states that objects can have. Addressing this, we introduce DynaPix, a Dynamic Visual SLAM method that estimates per-pixel motion probabilities and incorporates them into enhanced pose estimation and optimization processes within the SLAM backend, resulting in longer tracking times and lower trajectory errors. Finally, we use GRADE to address the challenge of limited and inaccurate annotations of wild zebras, particularly for their detection and pose estimation when observed by unmanned aerial vehicles. Leveraging the flexibility of GRADE, we introduce ZebraPose, the first full top-down synthetic-to-real detection and 2D pose estimation method.
Unlike previous approaches, ZebraPose demonstrates that both tasks can be performed using only synthetic data, eliminating the need for costly data collection campaigns, time-consuming annotation procedures, or syn-to-real transfer techniques. Ultimately, this thesis demonstrates how combining perception with action can overcome critical limitations in robotics and environmental perception, thereby advancing the deployment of intelligent and autonomous systems for real-world applications. Through innovations like iRotate, GRADE, and ZebraPose, it paves the way for more robust, flexible, and efficient intelligent systems capable of navigating dynamic environments.
Thesis: From Perception to Actions BibTeX

Perceiving Systems Ph.D. Thesis Physics-Informed Modeling of Dynamic Humans and Their Interactions Shashank, T. January 2026 (Published)
Building convincing digital humans is central to the vision of shared virtual worlds for AR, VR, and telepresence. Yet, despite rapid progress in 3D vision, today’s virtual humans often fall into a physical “uncanny valley”: bodies float above or penetrate objects, motions ignore balance and biomechanics, and human-object interactions miss the rich contact patterns that make behavior look real. Enforcing physics through simulation is possible, but remains too slow, restrictive, and brittle for real-world, in-the-wild settings. This thesis argues that physical realism does not require full simulation. Instead, it can emerge from the same principles humans rely on every day: intuitive physics and contact. Inspired by insights from biomechanics and cognitive science, I present a unified framework that embeds these ideas directly into learning-based 3D human modeling. In this thesis, I present a suite of methods that bridge the gap between 3D human reconstruction and physical plausibility. I first introduce IPMAN, which incorporates differentiable biomechanical cues, such as center of mass and center of pressure, to produce stable, balanced, and grounded static poses. I then extend this framework to dynamic motion with HUMOS, a shape-conditioned motion generation model that accounts for how individual physiology influences movement, without requiring paired training data. Moving beyond locomotion, I address complex human-object interactions with DECO, a 3D contact detector that estimates dense, vertex-level contact across the full body surface. Finally, I present PICO, which establishes contact correspondences between the human body and arbitrary objects to recover full 3D interactions from single images. Together, these contributions bring physics-aware human modeling closer to practical deployment. The result is a step toward digital humans that not only look right, but move and interact with the world in ways that feel intuitively real.
Thesis BibTeX

Perceiving Systems Ph.D. Thesis Aerial Robot Formations for Dynamic Environment Perception Price, E. University of Tübingen, Tübingen, Germany, December 2025 (Published)
Perceiving moving subjects, like humans and animals, outside an enclosed and controlled lab environment is inherently challenging, since subjects can move outside the view and range of static, extrinsically calibrated cameras and sensors. Previous state-of-the-art methods for such perception in outdoor scenarios use markers or sensors on the subject, which are both intrusive and unscalable for animal subjects. To address this problem, we introduce robotic flying cameras that autonomously follow the subjects. To enable functions such as monitoring, behaviour analysis, or motion capture, a single point of view is often insufficient due to self-occlusion, lack of depth perception, and missing coverage from all sides. Therefore, we propose a team of such robotic cameras that fly in formation to provide continuous coverage from multiple viewpoints. The position of the subject must be determined in real time using markerless, remote sensing methods. To solve this, we combine a convolutional-neural-network-based detector with a novel cooperative Bayesian fusion method that tracks the detected subject from multiple robots. The robots must then plan and control their own flight paths and orientations relative to the subject to achieve and maintain continuous coverage from multiple viewpoints. We address this with a model-predictive-control-based method that predicts and plans the motion of every robot in the formation around the subject. A preliminary demonstrator is implemented with multi-rotor drones. However, drones are noisy and potentially unsafe for the observed subjects. To address this, we introduce non-holonomic lighter-than-air autonomous airships (blimps) as the robotic camera platform. This type of robot requires dynamically constrained orbiting formations to achieve omnidirectional visual coverage of a moving subject in the presence of wind. Therefore, we introduce a novel model-predictive formation controller for a team of airships.
We demonstrate and evaluate our complete system in field experiments involving both humans and wild animals as subjects. The collected data enables both human outdoor motion capture and animal behaviour analysis. Additionally, we propose our method for autonomous long-term wildlife monitoring. This dissertation covers the design and evaluation of aerial robots suited to this task, including computer vision and sensing, data annotation and network training, sensor fusion, planning, control, simulation, and modelling.
Thesis DOI BibTeX

Perceiving Systems Conference Paper BEDLAM2.0: Synthetic humans and cameras in motion Tesch, J., Becherini, G., Achar, P., Yiannakidis, A., Kocabas, M., Patel, P., Black, M. J. In Advances in Neural Information Processing Systems (NeurIPS), Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, December 2025 (Published)
Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases the diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors, and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
Project Paper Video URL BibTeX

Perceiving Systems Conference Paper HairFree: Compositional 2D Head Prior for Text-Driven 360° Bald Texture Synthesis Ostrek, M., Black, M., Thies, J. In Advances in Neural Information Processing Systems (NeurIPS), Advances in Neural Information Processing Systems (NeurIPS), December 2025 (Published)
Synthesizing high-quality 3D head textures is crucial for gaming, virtual reality, and digital humans. Achieving seamless 360° textures typically requires expensive multi-view datasets with precise tracking. However, traditional methods struggle without back-view data or precise geometry, especially for human heads, where even minor inconsistencies disrupt realism. We introduce HairFree, an unsupervised texturing framework guided by textual descriptions and 2D diffusion priors, producing high-consistency 360° bald head textures—including non-human skin with fine details—without any texture, back-view, bald, non-human, or synthetic training data. We fine-tune a diffusion prior on a dataset of mostly frontal faces, conditioned on predicted 3D head geometry and face parsing. During inference, HairFree uses precise skin masks and 3D FLAME geometry as input conditioning, ensuring high 3D consistency and alignment. We synthesize the full 360° texture by first generating a frontal RGB image aligned to the 3D FLAME pose and mapping it to UV space. As the virtual camera moves, we inpaint and merge missing regions. A built-in semantic prior enables precise region separation—particularly for isolating and removing hair—allowing seamless integration with various assets like customizable 3D hair, eyeglasses, jewelry, etc. We evaluate HairFree quantitatively and qualitatively, demonstrating its superiority over state-of-the-art 3D head avatar generation methods. https://hairfree.is.tue.mpg.de/
pdf project poster BibTeX

Perceiving Systems Conference Paper GenLit: Reformulating Single Image Relighting as Video Generation Bharadwaj, S., Feng, H., Becherini, G., Abrevaya, V. F., Black, M. J. In SIGGRAPH Asia Conference Papers ’25, Association for Computing Machinery, SIGGRAPH Asia, December 2025 (To be published)
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible -- one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing.
Project Page Paper DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Learning Hands in Action Fan, Z. December 2025 (Published)
Hands are our primary interface for acting on the world. From everyday tasks like preparing food to skilled procedures like surgery, human activity is shaped by rich and varied hand interactions. These include not only manipulation of external objects but also coordinated actions between both hands. For physical AI systems to learn from human behavior, assist in physical tasks, or collaborate safely in shared environments, they must perceive and understand hands in action, how we use them to interact with each other and with the objects around us. A key component of this understanding is the ability to reconstruct human hand motion and hand-object interactions in 3D from RGB images or videos. However, existing methods focus largely on estimating the pose of a single hand, often in isolation. They struggle with scenarios involving two hands in close interaction or interactions with objects, particularly when those objects are articulated or previously unseen. This is because reconstructing 3D hands in action poses significant challenges, such as severe occlusions, appearance ambiguities, and the need to reason about both hand and object geometry in dynamic configurations. As a result, current systems fall short in complex real-world environments. This dissertation addresses these challenges by introducing methods and data for reconstructing hands in action from monocular RGB inputs. We begin by tackling the problem of interacting hand pose estimation. We present DIGIT, a method that leverages a part-aware semantic prior to disambiguate closely interacting hands. By explicitly modeling hand part interactions and encoding the semantics of finger parts, DIGIT robustly recovers accurate hand poses, outperforming prior baselines and providing a step toward more complete understanding of 3D hands in action. Since hands frequently manipulate objects, jointly reconstructing both is crucial.
Existing methods for hand-object reconstruction are limited to rigid objects and cannot handle tools with articulation, such as scissors or laptops. This severely restricts their ability to model the full range of everyday manipulations. We present the first method that jointly reconstructs two hands and an articulated object from a single RGB image, enabling unified reasoning across both rigid and articulated object interactions. To support this, we introduce ARCTIC, a large-scale motion capture dataset of humans performing dexterous bimanual manipulation with articulated tools. ARCTIC includes both articulated and fixed (rigid) configurations, along with accurate 3D annotations of hand poses and object motions. Leveraging this dataset, our method jointly infers object articulation states and hand poses, advancing the state of hand-object understanding in complex object manipulation settings. Finally, we address generalization to in-the-wild object interactions. Prior approaches either rely on synthetic data with limited realism or require object models at test time. We introduce HOLD, a self-supervised method that learns to reconstruct 3D hand-object interactions from monocular RGB videos, without paired 3D annotations or known object models. HOLD learns via an appearance- and motion-consistent objective across views and time, enabling strong generalization to unseen objects in interaction. Experiments demonstrate HOLD's ability to generalize to in-the-wild monocular settings, outperforming fully-supervised baselines trained on synthetic or lab-captured datasets. Together, DIGIT, ARCTIC, and HOLD advance the 3D understanding of hands in action, covering both hand-hand and hand-object interactions.
These contributions improve robustness in interacting hand pose estimation, introduce a dataset for bimanual manipulation with rigid and articulated tools, and include the first single-image method for jointly reconstructing hands and articulated objects learned directly from this dataset. In addition, HOLD removes the need for object templates by enabling hand-object reconstruction in the wild. These developments move toward more scalable physical AI systems capable of interpreting and imitating human manipulation, with applications in teleoperation, human-robot collaboration, and embodied learning from demonstration.
PDF BibTeX

Haptic Intelligence Perceiving Systems Ph.D. Thesis An Interdisciplinary Approach to Human Pose Estimation: Application to Sign Language Forte, M. University of Tübingen, Tübingen, Germany, November 2025, Department of Computer Science (Published)
Accessibility legislation mandates equal access to information for Deaf communities. While videos of human interpreters provide optimal accessibility, they are costly and impractical for frequently updated content. AI-driven signing avatars offer a promising alternative, but their development is limited by the lack of high-quality 3D motion-capture data at scale. Vision-based motion-capture methods are scalable but struggle with the rapid hand movements, self-occlusion, and self-touch that characterize sign language. To address these limitations, this dissertation develops two complementary solutions. SGNify improves hand pose estimation by incorporating universal linguistic rules that apply to all sign languages as computational priors. Proficient signers recognize the reconstructed signs as accurately as those in the original videos, but depth ambiguities along the camera axis can still produce incorrect reconstructions for signs involving self-touch. To overcome this remaining limitation, BioTUCH integrates electrical bioimpedance sensing between the wrists of the person being captured. Systematic measurements show that skin-to-skin contact produces distinctive bioimpedance reductions at high frequencies (240 kHz to 4.1 MHz), enabling reliable contact detection. BioTUCH uses the timing of these self-touch events to refine arm poses, producing physically plausible arm configurations and significantly reducing reconstruction error. Together, these contributions support the scalable collection of high-quality 3D sign language motion data, facilitating progress toward AI-driven signing avatars.
BibTeX

Haptic Intelligence Perceiving Systems Conference Paper Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing Forte, M., Athanasiou, N., Ballardini, G., Bartels, U., Kuchenbecker, K. J., Black, M. J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5071-5080, Honolulu, USA, October 2025, Nikos Athanasiou and Giulia Ballardini contributed equally to this publication (Published) pdf URL BibTeX

Perceiving Systems Conference Paper Generative Zoo Niewiadomski, T., Yiannakidis, A., Cuevas-Velasquez, H., Sanyal, S., Black, M. J., Zuffi, S., Kulits, P. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, International Conference on Computer Vision, ICCV, October 2025 (Published)
The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters. To demonstrate the scalability of our approach, we introduce GenZoo, a synthetic dataset containing one million images of distinct subjects. We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world multi-species 3D animal pose and shape estimation benchmark, despite being trained solely on synthetic data. We will release our dataset and generation pipeline to support future research.

Perceiving Systems Conference Paper ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness Li, B., Feng, H., Cai, Z., Black, M. J., Xiu, Y. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by (67.2% ~ 89.8%) in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH, regardless of challenging poses, unseen shapes, loose clothing, and non-rigid dynamics.
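As a rough illustration of the tightness encoding described above, each cloth point can be displaced inward along a predicted unit direction by a predicted magnitude to obtain a point near the underlying body surface. The "predictions" here are random stand-in values, not the ETCH network.

```python
import numpy as np

rng = np.random.default_rng(0)

cloth_points = rng.normal(size=(1000, 3))   # sampled cloth-surface points
directions = rng.normal(size=(1000, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)  # unit displacement directions
magnitudes = rng.uniform(0.0, 0.05, size=(1000, 1))  # per-point tightness (illustrative metres)

# Tightness as displacement vectors: cloth point + magnitude * direction
# lands on the (approximate) inner body surface.
body_points = cloth_points + magnitudes * directions
```

Downstream, points like `body_points` would feed a marker regressor; here the sketch stops at the displacement step itself.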

Perceiving Systems Conference Paper Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars Sklyarova, V., Zakharov, E., Prinzler, M., Becherini, G., Black, M. J., Thies, J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, USA, October 2025 (Accepted)
We present a novel approach for 3D hair reconstruction from single photographs based on a global hair prior combined with local optimization. Capturing strand-based hair geometry from single photographs is challenging due to the variety and geometric complexity of hairstyles and the lack of ground truth training data. Classical reconstruction methods like multi-view stereo only reconstruct the visible hair strands, missing the inner structure of hairstyles and hampering realistic hair simulation. To address this, existing methods leverage hairstyle priors trained on synthetic data. Such data, however, is limited in both quantity and quality since it requires manual work from skilled artists to model the 3D hairstyles and create near-photorealistic renderings. To address this, we propose a novel approach that uses both real and synthetic data to learn an effective hairstyle prior. Specifically, we train a transformer-based prior model on synthetic data to obtain knowledge of the internal hairstyle geometry and introduce real data in the learning process to model the outer structure. This training scheme is able to model the visible hair strands depicted in an input image, while preserving the general 3D structure of hairstyles. We exploit this prior in a Gaussian-splatting-based reconstruction method that creates hairstyles from one or more images. Qualitative and quantitative comparisons with existing reconstruction pipelines demonstrate the effectiveness and superior performance of our method for capturing detailed hair orientation, overall silhouette, and backside consistency. For additional results and code, please refer to https://im2haircut.is.tue.mpg.de.

Perceiving Systems Conference Paper MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips Wang, S., He, H., Parelli, M., Gebhardt, C., Fan, Z., Song, J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, largescale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align hand to object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.

Perceiving Systems Conference Paper MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction Dong, Z., Duan, L., Song, J., Black, M. J., Geiger, A. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a 3D model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.

Perceiving Systems Conference Paper PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning Zhang, Y., Feng, Y., Cseke, A., Saini, N., Bajandas, N., Heron, N., Black, M. J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although human motion generation has been extensively studied, many existing methods lack the responsiveness and realism of real human movements. Inspired by recent advances in foundation models, we propose PRIMAL, which is learned with a two-stage paradigm. In the pretraining stage, the model learns body movements from a large number of sub-second motion segments, providing a generative foundation from which more complex motions are built. This training is fully unsupervised without annotations. Given a single-frame initial state during inference, the pretrained model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In the adaptation phase, we employ a novel ControlNet-like adaptor to fine-tune the base model efficiently, adapting it to new tasks such as few-shot personalized action generation and spatial target reaching. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural.
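The ControlNet-like adaptation idea mentioned above can be illustrated in miniature: keep a pretrained base mapping frozen and add a small, zero-initialized residual branch driven by a new control signal, so fine-tuning starts exactly at the base model's behavior. All shapes and names below are illustrative, not the PRIMAL architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
W_base = rng.normal(size=(8, 8))   # frozen pretrained weights
W_ctrl = np.zeros((8, 4))          # zero-initialized adaptor weights

def model(x, c):
    # Base output plus a control-driven residual correction.
    return W_base @ x + W_ctrl @ c

x, c = rng.normal(size=8), rng.normal(size=4)
# Before any adaptation training, the control branch contributes nothing,
# so the adapted model reproduces the pretrained model exactly.
out = model(x, c)
```

Zero-initializing the adaptor is what makes this fine-tuning scheme safe: the pretrained behavior is preserved at the start of adaptation.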

Perceiving Systems Conference Paper SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image Antić, D., Paschalidis, G., Tripathi, S., Gevers, T., Dwivedi, S. K., Tzionas, D. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Recovering 3D object pose and shape from a single image is a challenging and ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and the lack of 3D ground truth for natural images. Existing deep-network methods are trained on synthetic datasets to predict 3D shapes, so they often struggle generalizing to real-world images. Moreover, they lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. This has three key innovations: First, it uses a learned category-specific and morphable signed-distance-function (mSDF) model, and fits this to an image by iteratively refining both 3D pose and shape. The mSDF robustifies inference by constraining the search on the manifold of valid shapes, while allowing for arbitrary shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches the image, by exploiting foundational models for efficient look-up into 3D shape databases. Third, SDFit initializes pose by establishing rich 2D-3D correspondences between the image and the mSDF through foundational features. We evaluate SDFit on three image datasets, i.e., Pix3D, Pascal3D+, and COMIC. SDFit performs on par with SotA feed-forward networks for unoccluded images and common poses, but is uniquely robust to occlusions and uncommon poses. Moreover, it requires no retraining for unseen images. Thus, SDFit contributes new insights for generalizing in the wild.
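A render-and-compare loop of the kind described above iteratively refines parameters by comparing a rendering against image evidence. The renderer and loss below are trivial placeholders (an identity map and a quadratic residual), meant only to show the feedback structure, not the SDFit system.

```python
import numpy as np

rng = np.random.default_rng(0)
target = np.array([1.0, -0.5, 0.25])  # stand-in "image evidence"
params = np.zeros(3)                   # pose/shape parameters being refined

def render(p):
    # Identity stand-in for rendering the morphable SDF at parameters p.
    return p

lr = 0.5
for _ in range(50):
    residual = render(params) - target  # compare rendering to evidence
    params -= lr * residual             # gradient step on a quadratic loss
```

With this quadratic loss the residual shrinks geometrically, so `params` converges to the evidence; in the real setting the renderer is differentiable and the loss mixes silhouette, feature, and geometric terms.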

Perceiving Systems Conference Paper St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World Feng, H., Zhang, J., Wang, Q., Ye, Y., Yu, P., Black, M., Darrell, T., Kanazawa, A. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground-truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
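The chaining idea above can be made concrete with a toy: suppose a model (stubbed here) maps each frame t to a pointmap in the reference/world frame, where pixel (u, v) stores the 3D location of the scene point that sat at (u, v) in frame 0. Reading the same pixel across all pointmaps then yields a 3D track. The "model" below is a plane translating along x; nothing here is the St4RTrack network.

```python
import numpy as np

H, W, T = 4, 4, 5

def predict_pointmap(t):
    # Stand-in for the network: a plane drifting 0.1 units along x per frame.
    xs, ys = np.meshgrid(np.arange(W), np.arange(H))
    return np.stack([xs + 0.1 * t, ys, np.zeros_like(xs)], axis=-1).astype(float)

pointmaps = [predict_pointmap(t) for t in range(T)]

# Long-range correspondence by chaining: index the same reference pixel
# (u, v) in every pointmap to get its trajectory through world space.
u, v = 1, 2
track = np.array([pm[v, u] for pm in pointmaps])  # shape (T, 3)
```

Because every pointmap is expressed in one shared frame, no explicit frame-to-frame matching step is needed to obtain the track.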

Perceiving Systems Conference Paper Reconstructing Animals and the Wild Kulits, P., Black, M. J., Zuffi, S. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here, we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes containing trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting environmental context. This limits their usefulness for analysis tasks, as animals exist inherently within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct natural scenes from single images. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one million images and thousands of assets. Our approach, having been trained solely on synthetic data, generalizes to the task of reconstructing animals and their environments in real-world images. We will release our dataset and code to encourage future research.

Perceiving Systems Conference Paper DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models Rosu, R. A., Wu, K., Feng, Y., Zheng, Y., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
We address the task of reconstructing 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transformer model that reconstructs accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. These codes are directly decoded to 3D strands without post-processing techniques. Representing individual strands, instead of guide strands, enables the transformer to model the detailed spatial structure of complex hairstyles. With this, DiffLocks can reconstruct highly curled hair, like afro hairstyles, from a single image for the first time. Qualitative and quantitative results demonstrate that DiffLocks outperforms existing state-of-the-art approaches. Data and code are available for research.

Perceiving Systems Conference Paper InterDyn: Controllable Interactive Dynamics with Video Diffusion Models Akkerman, R., Feng, H., Black, M. J., Tzionas, D., Abrevaya, V. F. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but focus on generating a single future state, which neglects the continuous dynamics resulting from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics "simulators", having learned interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines.

Perceiving Systems Conference Paper PICO: Reconstructing 3D People In Contact with Objects Cseke, A., Tripathi, S., Dwivedi, S. K., Lakshmipathy, A. S., Chatterjee, A., Black, M. J., Tzionas, D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild.

Perceiving Systems Conference Paper ChatGarment: Garment Estimation, Generation and Editing via Large Language Models Bian, S., Xu, C., Xiu, Y., Grigorev, A., Liu, Z., Lu, C., Black, M. J., Feng, Y. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garment sewing patterns from images or text descriptions. Unlike previous methods that often lack robustness and interactive editing capabilities, ChatGarment finetunes a VLM to produce GarmentCode, a JSON-based, language-friendly format for 2D sewing patterns, enabling both estimating and editing from images and text instructions. To optimize performance, we refine GarmentCode by expanding its support for more diverse garment types and simplifying its structure, making it more efficient for VLM finetuning. Additionally, we develop an automated data construction pipeline to generate a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs, empowering ChatGarment with strong generalization across various garment types. Extensive evaluations demonstrate ChatGarment’s ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications.
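Since GarmentCode is described above as a JSON-based, language-friendly pattern format, a toy round-trip makes the design choice tangible: a structured dict serializes to a string a VLM can emit and parse. The field names below are invented for illustration and do not reproduce the real GarmentCode schema.

```python
import json

# Hypothetical sewing-pattern record; fields are illustrative only.
pattern = {
    "garment_type": "t-shirt",
    "panels": [
        {"name": "front", "corners": [[0, 0], [50, 0], [50, 70], [0, 70]]},
        {"name": "back",  "corners": [[0, 0], [50, 0], [50, 70], [0, 70]]},
    ],
    "stitches": [["front", "back", "side_left"], ["front", "back", "side_right"]],
}

# Round-trip through JSON text, the form a language model would produce/consume.
encoded = json.dumps(pattern)
decoded = json.loads(encoded)
```

Representing patterns as text is exactly what makes a fine-tuned VLM able to both estimate and edit them via ordinary token generation.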

Empirical Inference Perceiving Systems Conference Paper ChatHuman: Chatting about 3D Humans with Tools Lin, J., Feng, Y., Liu, W., Black, M. J. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8150-8161, June 2025 (Published)
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that integrates the capabilities of specialized methods into a unified framework. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Our approach overcomes significant hurdles in adapting LLMs to 3D human tasks, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovations of ChatHuman include leveraging academic publications to instruct the LLM on tool usage, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating between and integrating tool results by transforming specialized 3D outputs into comprehensible formats. Experiments demonstrate that ChatHuman surpasses existing models in both tool selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.

Perceiving Systems Conference Paper InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Dwivedi, S. K., Antić, D., Tripathi, S., Taheri, O., Schmid, C., Black, M. J., Tzionas, D. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22605-22615, June 2025 (Published)
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus, we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image.

Perceiving Systems Conference Paper PromptHMR: Promptable Human Mesh Recovery Wang, Y., Sun, Y., Patel, P., Daniilidis, K., Black, M. J., Kocabas, M. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.
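A promptable interface of the kind described above accepts a full image plus optional spatial and semantic prompts. The sketch below is a hypothetical API shape for such a system; the names, fields, and stub outputs are illustrative and are not the PromptHMR implementation.

```python
from dataclasses import dataclass, field

@dataclass
class Prompts:
    boxes: list = field(default_factory=list)  # spatial prompts: [x0, y0, x1, y1] per person
    masks: list = field(default_factory=list)  # spatial prompts: binary masks
    text: list = field(default_factory=list)   # semantic prompts, e.g. "a tall man"

def estimate_hps(image, prompts: Prompts):
    # Stub: return one (empty) parameter dict per spatial prompt. A real model
    # would regress pose/shape conditioned on the image and all prompts.
    return [{"pose": None, "shape": None, "box": b} for b in prompts.boxes]

out = estimate_hps(None, Prompts(boxes=[[10, 10, 50, 120]], text=["a tall man"]))
```

The point of the interface is that each prompt modality is optional, so the same model serves crowded-scene detection, description-guided shape estimation, and interaction modeling.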

Perceiving Systems Ph.D. Thesis Estimating Human and Camera Motion From RGB Data Kocabas, M. April 2025 (Published)
This thesis presents a unified framework for markerless 3D human motion analysis from monocular videos, addressing three interrelated challenges that have limited the fidelity of existing approaches: (i) achieving temporally consistent and physically plausible human motion estimation, (ii) accurately modeling perspective camera effects in unconstrained settings, and (iii) disentangling human motion from camera motion in dynamic scenes. Our contributions are realized through three complementary methods. First, we introduce VIBE (Video Inference for Body Pose and Shape Estimation), a novel video pose and shape estimation framework. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose VIBE, which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. Second, we propose SPEC (Seeing People in the wild with Estimated Cameras), the first in-the-wild 3D human pose and shape (HPS) method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. Due to the lack of camera parameter information for in-the-wild images, existing 3D HPS estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation.
These assumptions often do not hold, and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we first train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the camera calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground-truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Third, we develop PACE (Person And Camera Estimation), a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the entangling of human and camera motions in the video. Existing works assume the camera is static and focus on solving the human motion in camera space. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use Simultaneous Localization and Mapping (SLAM) as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features.
This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions. Extensive experiments on standard benchmarks and new datasets we introduced demonstrate that our integrated approach substantially outperforms prior methods in terms of temporal consistency, reconstruction accuracy, and global motion estimation. While these results represent a significant advance in markerless human motion analysis, further work is needed to extend these techniques to multi-person scenarios, severe occlusions, and real-time applications. Overall, this thesis lays a strong foundation for more robust and accurate human motion analysis in unconstrained environments, with promising applications in robotics, augmented reality, sports analysis, and beyond.
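The weak-perspective criticism in the SPEC discussion above can be made concrete with a toy pinhole projection (all values illustrative): two points at the same lateral offset but different depths project to very different pixel locations, whereas a weak-perspective model would scale both by a single shared depth.

```python
import numpy as np

f = 500.0                          # focal length in pixels (illustrative)
X_near = np.array([0.3, 0.0, 2.0])  # point at depth 2 m
X_far  = np.array([0.3, 0.0, 4.0])  # same lateral offset, depth 4 m

def project(X, f):
    # Pinhole perspective projection onto the image plane.
    return f * X[:2] / X[2]

u_near = project(X_near, f)  # -> x-coordinate 75.0 px
u_far  = project(X_far, f)   # -> x-coordinate 37.5 px
```

Doubling the depth halves the projected offset; a weak-perspective model, which assumes one constant depth for the whole body, cannot represent this difference, which is why estimating the true camera matters.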

Perceiving Systems Ph.D. Thesis Understanding Human-Scene Interaction through Perception and Generation Yi, H. April 2025 (Published)
Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for various applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce, and captured motion datasets often lack scene information. This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) depth ordering constraint: humans that move in a scene are occluded by or occlude objects, thus defining the relative depth ordering of the objects, (2) collision constraint: humans move through free space and do not interpenetrate objects, (3) interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by generating scenes from input human motions (scenes from humans), and generating human motion from scenes (humans from scenes). Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scene from an RGB video. This optimization-based approach leverages the three aforementioned constraints to enhance the consistency and plausibility of reconstructed scene layouts and to refine the initial 3D human pose and shape estimations. Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies collision and interaction constraints, and employs an auto-regressive transformer architecture that integrates objects into the scene based on existing human motion.
The training data is enriched by populating the 3D-FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method results in furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments. Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions, adhering to the collision and interaction constraints. It utilizes a text-controlled scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing for the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements. In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.
Thesis BibTeX

Empirical Inference Perceiving Systems Conference Paper Can Large Language Models Understand Symbolic Graphics Programs? Qiu, Z., Liu, W., Feng, H., Liu, Z., Xiao, T. Z., Collins, K. M., Tenenbaum, J. B., Weller, A., Black, M. J., Schölkopf, B. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM’s ability to answer semantic questions about images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to “imagine” and reason about how the corresponding graphics content would look given only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability – Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves an LLM’s understanding of symbolic programs, but also improves general reasoning ability on various other benchmarks.
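A benchmark item of the kind described might pair a small vector-graphics program with a semantic question about the image it would render. The following is a purely illustrative example we constructed; the actual benchmark's items, formats, and schema differ:

```python
def make_item():
    """Build one hypothetical (program, question, answer) benchmark item."""
    # A tiny SVG "program": a circle for a face plus a curve for a mouth.
    # In SVG, y grows downward, so a control point below the endpoints
    # makes the curve bow downward and read as a smile.
    program = (
        '<svg viewBox="0 0 100 100">'
        '<circle cx="50" cy="40" r="25" fill="none" stroke="black"/>'
        '<path d="M 38 48 Q 50 60 62 48" stroke="black" fill="none"/>'
        '</svg>'
    )
    return {
        "program": program,
        "question": "What facial expression does the rendered drawing show?",
        "choices": ["smiling", "frowning", "neutral", "surprised"],
        "answer": "smiling",  # answering requires "imagining" the render
    }
```

The point of the task is that the model sees only the program text, never pixels, so it must reason from coordinates and path commands to image-level semantics.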
arXiv Paper BibTeX

Empirical Inference Perceiving Systems Conference Paper Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets Liu, Z., Xiao, T. Z., Liu, W., Bengio, Y., Zhang, D. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as ∇-GFlowNet), the first GFlowNet method that leverages the rich signal in reward gradients, together with an objective called ∇-DB plus its variant residual ∇-DB designed for prior-preserving diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.
arXiv BibTeX

Haptic Intelligence Perceiving Systems Article Wrist-to-Wrist Bioimpedance Can Reliably Detect Discrete Self-Touch Forte, M., Vardar, Y., Javot, B., Kuchenbecker, K. J. IEEE Transactions on Instrumentation and Measurement, 74(4006511):1-11, April 2025 (Published)
Self-touch is crucial in human communication, psychology, and disease transmission, yet existing methods for detecting self-touch are often invasive or limited in scope. This study systematically investigates the feasibility of using non-invasive electrical bioimpedance for detecting discrete self-touch poses across individuals. While previous research has focused on classifying defined self-touch poses, our work explores how various poses cause bioimpedance changes, providing insights into the underlying physiological mechanisms. We thus created a dataset of 27 genuine self-touch poses, including skin-to-skin contact between the hands and face and skin-to-clothing contact between the hands and chest, alongside six adversarial mid-air gestures. We then measured the wrist-to-wrist bioimpedance of 30 adults (15 female, 15 male) across these poses, with each measurement preceded by a no-touch pose serving as a baseline. Statistical analysis of the measurements showed that skin-to-skin contacts cause significant changes in bioimpedance magnitude between 237.8 kHz and 4.1 MHz, while adversarial gestures do not; skin-to-clothing contacts cause less-significant changes due to the influence and variability of the clothing material. Furthermore, our analysis highlights the sensitivity of bioimpedance to the body parts involved, the skin contact area, and the individual's characteristics. Our contributions are two-fold: (1) we demonstrate that bioimpedance offers a practical, non-invasive solution for detecting self-touch poses involving skin-to-skin contact, and (2) researchers can leverage insights from our study to determine whether a pose can be detected without extensive testing.
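One practical consequence of the reported frequency band is a simple detection rule: compare a pose's impedance magnitude to the preceding no-touch baseline inside that band. The threshold value and function below are our own illustrative assumptions; the paper reports statistical findings, not this exact classifier:

```python
import numpy as np

def detect_skin_contact(freqs_hz, baseline_mag, pose_mag, rel_threshold=0.05):
    """Flag a self-touch pose if the mean relative change in wrist-to-wrist
    impedance magnitude |Z|, versus the no-touch baseline, exceeds a threshold
    inside the band the study found significant (237.8 kHz to 4.1 MHz)."""
    band = (freqs_hz >= 237.8e3) & (freqs_hz <= 4.1e6)
    rel_change = np.abs(pose_mag[band] - baseline_mag[band]) / baseline_mag[band]
    return bool(rel_change.mean() > rel_threshold)
```

Restricting the comparison to the significant band is what keeps adversarial mid-air gestures, which do not alter the current path through the body, from triggering false positives.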
DOI BibTeX

Perceiving Systems Ph.D. Thesis Democratizing 3D Human Digitization Xiu, Y. March 2025 (Published)
Richard Feynman once said, “What I cannot create, I do not understand.” Similarly, making virtual humans more realistic helps us better grasp human nature. Simulating lifelike avatars has scientific value (such as in biomechanics) and practical applications (like the Metaverse). However, creating them affordably at scale with high quality remains challenging. Reconstructing complex poses, varied clothing, and unseen areas from casual photos under real-world conditions is still difficult. We address this through a series of works—ICON, ECON, TeCH, PuzzleAvatar—bridging pixel-based reconstruction with text-guided generation to reframe reconstruction as conditional generation. This allows us to turn everyday photos, like personal albums featuring random poses, diverse clothing, tricky angles, and arbitrary cropping, into 3D avatars. The process converts unstructured data into structured output without unnecessary complexity. With these techniques, we can efficiently scale up the creation of digital humans using readily available imagery.
Thesis BibTeX

Perceiving Systems Conference Paper Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photo-Realistic Appearance from Multi-View Video Rong, B., Grigorev, A., Wang, W., Black, M. J., Thomaszewski, B., Tsalicoglou, C., Hilliges, O. In International Conference on 3D Vision (3DV), March 2025 (Published)
We introduce Gaussian Garments, a novel approach for reconstructing realistic-looking, simulation-ready garment assets from multi-view videos. Our method represents garments with a combination of a 3D mesh and a Gaussian texture that encodes both the color and high-frequency surface details. This representation enables accurate registration of garment geometries to multi-view videos and helps disentangle albedo textures from lighting effects. Furthermore, we demonstrate how a pre-trained Graph Neural Network (GNN) can be fine-tuned to replicate the real behavior of each garment. The reconstructed Gaussian Garments can be automatically combined into multi-garment outfits and animated with the fine-tuned GNN.
arXiv project video URL BibTeX

Perceiving Systems Conference Paper CameraHMR: Aligning People with Perspective Patel, P., Black, M. J. In International Conference on 3D Vision (3DV), March 2025 (Published)
We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.
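The step from a predicted field of view to usable camera intrinsics is simple trigonometry. The sketch below is the generic pinhole relation, not HumanFoV's actual code, and it assumes a vertical field of view with the principal point at the image center:

```python
import math

def intrinsics_from_vfov(vfov_deg, img_w, img_h):
    """Build a pinhole intrinsics matrix K from a vertical field-of-view
    estimate: f = (H/2) / tan(vfov/2), principal point at the image center."""
    f = (img_h / 2.0) / math.tan(math.radians(vfov_deg) / 2.0)
    return [[f, 0.0, img_w / 2.0],
            [0.0, f, img_h / 2.0],
            [0.0, 0.0, 1.0]]
```

For example, a 90° vertical field of view on a 1000-pixel-tall image gives a focal length of 500 px, rather than the one-size-fits-all default intrinsics that the pseudo-ground-truth pipelines criticized above assume.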
arXiv project BibTeX

Perceiving Systems Conference Paper CHOIR: A Versatile and Differentiable Hand-Object Interaction Representation Morales, T., Taheri, O., Lacey, G. In Winter Conference on Applications of Computer Vision (WACV), February 2025 (Published)
Synthesizing accurate hand-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile, and fully differentiable field for HOI modelling. CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR, we design JointDiffusion, a diffusion model that learns a grasp distribution conditioned on noisy hand-object interactions or only object geometries, for both refinement and synthesis applications. We demonstrate JointDiffusion’s improvements over the SOTA in both applications: it increases the contact F1 score by 5% for refinement and decreases the sim. displacement by 46% for synthesis. Our experiments show that JointDiffusion with CHOIR yields superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks.
GitHub Paper URL BibTeX

Perceiving Systems Conference Paper OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics Gozlan, Y., Falisse, A., Uhlrich, S., Gatti, A., Black, M., Chaudhari, A. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2025 (Published)
Pose estimation has promised to impact healthcare by enabling more practical methods to quantify nuances of human movement and biomechanics. However, despite the inherent connection between pose estimation and biomechanics, these disciplines have largely remained disparate. For example, most current pose estimation benchmarks use metrics such as Mean Per Joint Position Error, Percentage of Correct Keypoints, or mean Average Precision to assess performance, without quantifying kinematic and physiological correctness, key aspects for biomechanics. To alleviate this challenge, we develop OpenCapBench to offer an easy-to-use unified benchmark to assess common tasks in human pose estimation, evaluated under physiological constraints. OpenCapBench computes consistent kinematic metrics through joint angles provided by open-source musculoskeletal modeling software (OpenSim). Through OpenCapBench, we demonstrate that current pose estimation models use keypoints that are too sparse for accurate biomechanics analysis. To mitigate this challenge, we introduce SynthPose, a new approach that enables finetuning of pre-trained 2D human pose models to predict an arbitrarily denser set of keypoints for accurate kinematic analysis through the use of synthetic data. Finetuning prior models on such synthetic data reduces joint angle errors twofold. Moreover, OpenCapBench allows users to benchmark their own models on our clinically relevant cohort. Overall, OpenCapBench bridges the computer vision and biomechanics communities, aiming to drive simultaneous advances in both areas.
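The kinematic metrics in question ultimately reduce to angles between body segments. A generic version might look as follows; note this is our simplified illustration, whereas OpenCapBench itself obtains angles through OpenSim's musculoskeletal model rather than raw keypoint geometry:

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle at keypoint b between segments b->a and b->c, in degrees."""
    u = np.asarray(a, float) - np.asarray(b, float)
    v = np.asarray(c, float) - np.asarray(b, float)
    cosang = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cosang, -1.0, 1.0))))

def mean_angle_error(pred, gt):
    """Mean absolute joint-angle error: the kind of physiologically grounded
    metric the benchmark advocates over pure keypoint-position errors."""
    return float(np.mean(np.abs(np.asarray(pred, float) - np.asarray(gt, float))))
```

A model can score well on Mean Per Joint Position Error while still producing implausible joint angles, which is exactly the gap this kind of metric exposes.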
arXiv code/data URL BibTeX

Perceiving Systems Thesis Dynamic 3D Synthesis: From Video-Based Animatable Head Avatars to Text-Guided 4D Content Creation Zheng, Y. 2025 (Published)
The synthesis of 4D content—dynamic 3D content that evolves over time—has become increasingly important across a wide range of applications, including virtual communication, gaming, AR/VR, and digital content creation. Despite recent advances, generating realistic 4D content from accessible inputs remains a significant challenge. Existing approaches often rely on dense multi-camera capture systems, which are costly and impractical for everyday use, or yield results with limited geometric and visual fidelity. This thesis investigates two subtasks in 4D content creation: (1) the reconstruction of high-fidelity, animatable head avatars from accessible inputs such as monocular RGB videos, and (2) the generation of dynamic 4D scenes from text prompts and optionally sparse visual input, such as reference images. These two directions are unified by a common goal—enabling controllable and high-quality 4D content creation from minimal visual supervision. The first part of this thesis presents IMavatar, a morphable implicit surface representation for reconstructing personalized head avatars from monocular videos. Implicit surfaces provide topological flexibility and can recover detailed 3D geometry directly from RGB images, making them well-suited for head avatar reconstruction. However, modeling expression- and pose-dependent deformations in an interpretable and generalizable way remains a major challenge when working with implicit representations. Inspired by 3D morphable models, IMavatar models deformation by learning expression blendshapes and skinning weight fields in a canonical space, enabling structured and generalizable control over novel expressions and poses. To enable end-to-end optimization from monocular videos, we propose a novel analytical gradient formulation that supports joint training of the geometry and deformation directly from RGB supervision.
By combining the geometric fidelity of neural implicit fields with the controllability of morphable models, IMavatar achieves high-quality 4D reconstructions and strong generalization to unseen expressions and head poses. The second part of this thesis presents PointAvatar, a deformable point-based representation for animatable 3D head avatars. While implicit representations are effective at learning detailed geometry from image observations, they are inherently difficult to animate and computationally expensive to render. To address these limitations, this work explores point clouds as the underlying geometric representation for head avatars, offering the efficiency of explicit representations while avoiding the fixed-topology constraints of meshes. PointAvatar uses a canonical point cloud combined with learned blendshape and skinning weight fields, and further disentangles intrinsic albedo from view-dependent shading to support relighting under novel illumination. To improve training stability and reconstruction quality, we adopt a coarse-to-fine strategy that gradually increases point cloud resolution during learning. This enables the model to effectively capture accurate geometry and high-quality texture from monocular RGB videos, including challenging cases such as eyeglasses and complex hairstyles. Compared to IMavatar, PointAvatar achieves an 8× speed-up during training and a 100× speed-up during inference rendering, while maintaining high visual and geometric quality. In the final part, this thesis explores Dream-in-4D, a diffusion-guided framework for generating creative 4D content from natural language. The focus is on synthesizing imaginative 4D scenes from minimal visual input—either a single image or no visual input at all. To this end, the method leverages prior knowledge from pre-trained image and video diffusion models to optimize a 4D representation. Dream-in-4D follows a two-stage pipeline. 
In the first stage, a static 3D model is optimized as a neural radiance field using guidance from both image and 3D-aware diffusion models, resulting in high-quality, view-consistent assets. In the second stage, a time-dependent, multi-resolution deformation field is introduced to represent motion and is optimized using video diffusion guidance, equipping the static 3D asset with detailed and plausible motion driven by text prompts. The resulting system supports text-to-4D, image-to-4D, and personalized 4D generation within a unified framework, enabling intuitive and flexible dynamic scene synthesis from highly accessible inputs. Together, these methods address two essential aspects of 4D content creation: the reconstruction of animatable head avatars from monocular videos, and the generation of dynamic, imaginative 4D scenes from text and image prompts. We hope these contributions advance the field toward more accessible, controllable, and high-quality 4D content creation—enabling a broad range of applications across research, industry, and creative practice.
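The deformation recipe shared by IMavatar and PointAvatar (canonical geometry plus learned expression blendshapes, then linear blend skinning) can be sketched numerically. Shapes and names below are illustrative; in the actual methods these blendshape and skinning-weight fields are learned by neural networks rather than stored as dense arrays:

```python
import numpy as np

def deform(canonical, blendshapes, expr, skin_w, bone_T):
    """Canonical-to-posed deformation sketch.
    canonical:   (N, 3) canonical points
    blendshapes: (E, N, 3) per-expression offset fields
    expr:        (E,) expression coefficients
    skin_w:      (N, B) per-point skinning weights over B bones
    bone_T:      (B, 4, 4) rigid bone transforms
    """
    # 1) add expression offsets in canonical space
    pts = canonical + np.einsum('e,enk->nk', expr, blendshapes)
    # 2) linear blend skinning: blend bone transforms per point, then apply
    T = np.einsum('nb,bij->nij', skin_w, bone_T)                  # (N, 4, 4)
    homo = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)  # (N, 4)
    return np.einsum('nij,nj->ni', T, homo)[:, :3]
```

With zero expression coefficients and identity bone transforms the deformation is the identity, which is the sanity check both papers' canonical-space formulations rely on.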
DOI URL BibTeX

Perceiving Systems Conference Paper Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions Prinzler, M., Zakharov, E., Sklyarova, V., Kabadayi, B., Thies, J. In International Conference on 3D Vision (3DV), 2025 (Published)
We introduce Joker, a new method for the conditional synthesis of 3D human heads with extreme expressions. Given a single reference image of a person, we synthesize a volumetric human head with the reference’s identity and a new expression. We offer control over the expression via a 3D morphable model (3DMM) and textual inputs. This multi-modal conditioning signal is essential since 3DMMs alone fail to define subtle emotional changes and extreme expressions, including those involving the mouth cavity and tongue articulation. Our method is built upon a 2D diffusion-based prior that generalizes well to out-of-domain samples, such as sculptures, heavy makeup, and paintings, while achieving high levels of expressiveness. To improve view consistency, we propose a new 3D distillation technique that converts predictions of our 2D prior into a neural radiance field (NeRF). Both the 2D prior and our distillation technique produce state-of-the-art results, which are confirmed by our extensive evaluations. Also, to the best of our knowledge, our method is the first to achieve view-consistent extreme tongue articulation.
project page arxiv BibTeX

Perceiving Systems Book Chapter ElephantBook: Participatory Human–AI Elephant Population Monitoring Kulits, P., Wall, J., Beery, S. In Collaborative Intelligence: How Humans and AI Are Transforming Our World, 173-196, 7, (Editors: Lane, Mira and Sethumadhavan, Arathi), The MIT Press, Cambridge, Massachusetts, December 2024 (Published) URL BibTeX

Perceiving Systems Conference Paper MotionFix: Text-Driven 3D Human Motion Editing Athanasiou, N., Cseke, A., Diomataris, M., Black, M. J., Varol, G. In SIGGRAPH Asia 2024 Conference Proceedings, ACM, SIGGRAPH Asia, December 2024 (Published)
The focus of this paper is 3D motion editing. Given a 3D human motion and a textual description of the desired modification, our goal is to generate an edited motion as described by the text. The challenges include the lack of training data and the design of a model that faithfully edits the source motion. In this paper, we address both these challenges. We build a methodology to semi-automatically collect a dataset of triplets in the form of (i) a source motion, (ii) a target motion, and (iii) an edit text, and create the new dataset. Having access to such data allows us to train a conditional diffusion model that takes both the source motion and the edit text as input. We further build various baselines trained only on text-motion pair datasets and show superior performance of our model trained on triplets. We introduce new retrieval-based metrics for motion editing and establish a new benchmark on the evaluation set. Our results are encouraging, paving the way for further research on fine-grained motion generation. Code and models will be made publicly available.
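The collected training unit is simply a triplet. One illustrative container is sketched below; the field names are ours, not the paper's schema:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MotionEditTriplet:
    """One (source, target, edit text) training example as described above."""
    source_motion: np.ndarray  # (T, D) source pose sequence
    target_motion: np.ndarray  # (T, D) edited pose sequence
    edit_text: str             # e.g. "raise the left arm higher"

    def condition(self):
        """What the conditional diffusion model is conditioned on: the source
        motion plus the edit text; the target motion is the training signal."""
        return self.source_motion, self.edit_text
```

Training on triplets, rather than plain text-motion pairs, is what lets the model learn edits relative to a given source instead of generating a motion from scratch.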
Code (GitHub) Website Data Exploration ArXiv URL BibTeX

Perceiving Systems Article PuzzleAvatar: Assembling 3D Avatars from Personal Albums Xiu, Y., Liu, Z., Tzionas, D., Black, M. J. ACM Transactions on Graphics, 43(6):1-15, ACM, December 2024 (Published)
Generating personalized 3D avatars is crucial for AR/VR. However, recent text-to-3D methods that generate avatars for celebrities or fictional characters struggle with everyday people. Methods for faithful reconstruction typically require full-body images in controlled settings. What if a user could just upload their personal "OOTD" (Outfit Of The Day) photo collection and get a faithful avatar in return? The challenge is that such casual photo collections contain diverse poses, challenging viewpoints, cropped views, and occlusion (albeit with a consistent outfit, accessories and hairstyle). We address this novel "Album2Human" task by developing PuzzleAvatar, a novel model that generates a faithful 3D avatar (in a canonical pose) from a personal OOTD album, while bypassing the challenging estimation of body and camera pose. To this end, we fine-tune a foundational vision-language model (VLM) on such photos, encoding the appearance, identity, garments, hairstyles, and accessories of a person into (separate) learned tokens and instilling these cues into the VLM. In effect, we exploit the learned tokens as "puzzle pieces" from which we assemble a faithful, personalized 3D avatar. Importantly, we can customize avatars by simply interchanging tokens. As a benchmark for this new task, we collect a new dataset, called PuzzleIOI, with 41 subjects in a total of nearly 1K OOTD configurations, in challenging partial photos with paired ground-truth 3D bodies. Evaluation shows that PuzzleAvatar not only has high reconstruction accuracy, outperforming TeCH and MVDreamBooth, but also a unique scalability to album photos, and strong robustness. Our code and data are publicly available for research purposes.
DOI URL BibTeX

Perceiving Systems Conference Paper SPARK: Self-supervised Personalized Real-time Monocular Face Capture Baert, K., Bharadwaj, S., Castan, F., Maujean, B., Christie, M., Abrevaya, V., Boukhayma, A. In SIGGRAPH Asia 2024 Conference Proceedings, SIGGRAPH Asia, December 2024 (Published)
Feedforward monocular face capture methods seek to reconstruct posed faces from a single image of a person. Current state-of-the-art approaches have the ability to regress parametric 3D face models in real-time across a wide range of identities, lighting conditions, and poses by leveraging large image datasets of human faces. These methods, however, suffer from clear limitations in that the underlying parametric face model only provides a coarse estimation of the face shape, thereby limiting their practical applicability in tasks that require precise 3D reconstruction (aging, face swapping, digital make-up, ...). In this paper, we propose a method for high-precision 3D face capture that takes advantage of a collection of unconstrained videos of a subject as prior information. Our proposal builds on a two-stage approach. We start with the reconstruction of a detailed 3D face avatar of the person, capturing both precise geometry and appearance from a collection of videos. We then use the encoder from a pre-trained monocular face reconstruction method, substituting its decoder with our personalized model, and proceed with transfer learning on the video collection. Using our pre-estimated image formation model, we obtain a more precise self-supervision objective, enabling improved expression and pose alignment. This results in a trained encoder capable of efficiently regressing pose and expression parameters in real-time from previously unseen images, which, combined with our personalized geometry model, yields more accurate and high-fidelity mesh inference. Through extensive qualitative and quantitative evaluation, we showcase the superiority of our final model as compared to state-of-the-art baselines, and demonstrate its generalization ability to unseen poses, expressions, and lighting.
DOI URL BibTeX

Perceiving Systems Article StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal Ye, C., Qiu, L., Gu, X., Zuo, Q., Wu, Y., Dong, Z., Bo, L., Xiu, Y., Han, X. ACM Transactions on Graphics, 43(6):1-18, ACM, December 2024 (Published)
This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field which has recently been revolutionized by repurposing diffusion priors. However, previous attempts still struggle with stochastic inference, conflicting with the deterministic nature of the Image2Normal task, and a costly ensembling step, which slows down the estimation process. Our method, StableNormal, mitigates the stochasticity of the diffusion process by reducing inference variance, thus producing "Stable-and-Sharp" normal estimates without any additional ensembling process. StableNormal works robustly under challenging imaging conditions, such as extreme lighting, blurring, and low quality. It is also robust against transparent and reflective surfaces, as well as cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy, which starts with a one-step normal estimator (YOSO) to derive an initial normal guess that is relatively coarse but reliable, followed by a semantic-guided refinement process (SG-DRN) that refines the normals to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance on standard datasets such as DIODE-indoor, iBims, ScannetV2, and NYUv2, and also in various downstream tasks, such as surface reconstruction and normal enhancement. These results show that StableNormal retains both the "stability" and "sharpness" required for accurate normal estimation. StableNormal represents a first attempt to repurpose diffusion priors for deterministic estimation. To democratize this, code and models have been made publicly available.
DOI BibTeX

Perceiving Systems Ph.D. Thesis Beyond the Surface: Statistical Approaches to Internal Anatomy Prediction Keller, M. University of Tübingen, November 2024 (Published)
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. But to observe a subject’s anatomy, expensive medical devices (MRI or CT) are required and creating a digital model is often time-consuming and involves manual effort. Instead, we can leverage the fact that the shape of the body surface is correlated with the internal anatomy; indeed, the external body shape is related to the bone lengths, the angle of skeletal articulation, and the thickness of various soft tissues. In this thesis, we leverage the correlation between body shape and anatomy and aim to infer the internal anatomy solely from the external appearance. Learning this correlation requires paired observations of people’s body shape, and their internal anatomy, which raises three challenges. First, building such datasets requires specific capture modalities. Second, these data must be annotated, i.e. the body shape and anatomical structures must be identified and segmented, which is often a tedious manual task requiring expertise. Third, to learn a model able to capture the correlation between body shape and internal anatomy, the data of people with various shapes and poses has to be put into correspondence. In this thesis, we cover three works that focus on learning this correlation. We show that we can infer the skeleton geometry, the bone location inside the body, and the soft tissue location solely from the external body shape. First, in the OSSO project, we leverage 2D medical scans to construct a paired dataset of 3D body shapes and corresponding 3D skeleton shapes. This dataset allows us to learn the correlation between body and skeleton shapes, enabling the inference of a custom skeleton based on an individual’s body. However, since this learning process is based on static views of subjects in specific poses, we cannot evaluate the accuracy of skeleton inference in different poses. 
To predict the bone orientations within the body in various poses, we need dynamic data. To track bones inside a body in motion, we can draw on methods from biomechanics. In the second work, therefore, instead of medical imaging, we use a biomechanical skeletal model together with simulation to build a paired dataset of bodies in motion and their corresponding skeletons. From this dataset we learn SKEL, a body shape and skeleton model that predicts the locations of the anatomical bones for any body shape and in any pose. After addressing the skeletal structure, we broaden our focus to the different layers of soft tissue. In the third work, HIT, we leverage segmented medical data to learn to predict the distribution of adipose tissue (fat) and lean tissue (muscle, organs, etc.) inside the body.

Perceiving Systems Ph.D. Thesis Aerial Markerless Motion Capture Saini, N. November 2024 (Published)
Human motion capture (mocap) is important for several applications, such as healthcare, sports, and animation. Existing markerless mocap methods employ multiple static, calibrated RGB cameras to infer the subject’s pose. These methods are not suitable for outdoor and unstructured scenarios: they need an extra calibration step before the mocap session and cannot dynamically adapt the viewpoint for the best mocap performance. A mocap setup consisting of multiple unmanned aerial vehicles with onboard cameras is ideal for such situations. However, estimating the subject’s motion together with the camera motions is an under-constrained problem. In this thesis, we explore multiple approaches in which we split this problem into multiple stages: we obtain prior knowledge or rough estimates of the subject’s or the cameras’ motion in the initial stages and exploit them in the final stages. In our work AirCap-Pose-Estimator, we use extra sensors (an IMU and a GPS receiver) on the multiple moving cameras to obtain approximate camera poses. We use these estimates to jointly optimize the camera poses, the 3D body pose, and the subject’s shape to robustly fit the 2D keypoints of the subject. We show that the camera pose estimates from the sensors alone are not accurate enough, and that our joint optimization formulation improves the accuracy of the camera poses while estimating the subject’s poses. Placing extra sensors on the cameras is not always feasible. In our work AirPose, we therefore introduce a distributed neural network that runs on board, estimating the subject’s motion and calibrating the cameras relative to the subject. We train our network on realistic human scans with ground truth and further fine-tune it using a small amount of real-world data. Finally, we propose a bundle-adjustment method (AirPose+) that utilizes the initial estimates from our network to recover high-quality motions of the subject and the cameras.
Finally, we consider a generic setup consisting of multiple static and moving cameras. We propose a method that estimates the poses of the cameras and the human relative to the ground plane using only 2D human keypoints. We learn a human motion prior from a large amount of human mocap data and use it in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the 2D keypoints. We show that, in addition to aerial cameras, our method works for smartphone cameras and standard RGB ground cameras. This thesis advances the field of markerless mocap, which is currently limited to multiple static, calibrated RGB cameras. Our methods allow the user to employ moving RGB cameras and skip the extrinsic calibration. In the future, we will explore the use of a single moving camera without even needing the camera intrinsics.
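The joint camera-and-body fitting described in this abstract can be illustrated with a deliberately simplified sketch: a damped Gauss-Newton solver that refines a camera translation and a handful of 3D points so that their pinhole projections match observed 2D keypoints. Everything here (the focal length, the synthetic data, the parameterization) is hypothetical and far simpler than the SMPL-based, multi-stage optimization used in the thesis:

```python
import numpy as np

FOCAL = 1000.0  # hypothetical focal length in pixels

def project(points, cam_t):
    """Pinhole projection of Nx3 points after applying a camera translation."""
    p = points + cam_t
    return FOCAL * p[:, :2] / p[:, 2:3]

def gauss_newton(residual_fn, x0, steps=20, damping=1e-3):
    """Damped Gauss-Newton with a finite-difference Jacobian."""
    x = x0.astype(float)
    eps = 1e-6
    for _ in range(steps):
        r = residual_fn(x)
        J = np.zeros((r.size, x.size))
        for i in range(x.size):
            d = np.zeros_like(x)
            d[i] = eps
            J[:, i] = (residual_fn(x + d) - r) / eps
        # Damping keeps the normal equations solvable despite gauge freedom
        # (shifting all points and the camera together leaves residuals unchanged).
        step = np.linalg.solve(J.T @ J + damping * np.eye(x.size), -J.T @ r)
        x = x + step
    return x

# Toy problem: recover a camera translation and four 3D points from
# their observed 2D projections.
rng = np.random.default_rng(0)
true_pts = rng.uniform([-1, -1, 4], [1, 1, 6], size=(4, 3))
true_cam_t = np.array([0.1, -0.2, 0.05])
keypoints_2d = project(true_pts, true_cam_t)

def residuals(x):
    cam_t, pts = x[:3], x[3:].reshape(4, 3)
    return (project(pts, cam_t) - keypoints_2d).ravel()

# Initialize with a zero camera translation and noisy 3D points.
x0 = np.concatenate([np.zeros(3),
                     (true_pts + rng.normal(scale=0.05, size=(4, 3))).ravel()])
x_fit = gauss_newton(residuals, x0)
reproj_rmse = np.sqrt(np.mean(residuals(x_fit) ** 2))
```

The actual methods optimize full body pose and shape parameters with learned priors and robust losses; this sketch only shows the core idea of minimizing 2D reprojection error over camera and 3D structure jointly.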

Perceiving Systems Ph.D. Thesis Leveraging Unpaired Data for the Creation of Controllable Digital Humans Sanyal, S. Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, November 2024 (Published)
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels, such as identity, type of clothing, and appearance of clothing. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods, RingNet, SPICE, and SCULPT, each tackle a different aspect of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images, such as identity consistency, can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses.
Finally, SCULPT (Sanyal et al. [2024]) generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.

Perceiving Systems Conference Paper On predicting 3D bone locations inside the human body Dakri, A., Arora, V., Challier, L., Keller, M., Black, M. J., Pujades, S. In Medical Image Computing and Computer Assisted Intervention - MICCAI 2024, 336-346, Springer, Cham, 27th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2024), October 2024 (Published)
Knowing the precise location of the bones inside the human body is key in several medical tasks, such as patient placement inside an imaging device or surgical navigation inside a patient. Our goal is to predict the bone locations using only an external 3D body surface observation. Existing approaches either validate their predictions on 2D data (X-rays) or with pseudo-ground truth computed from motion capture using biomechanical models. Thus, methods either suffer from a 3D-2D projection ambiguity or lack validation on clinical imaging data. In this work, we start with a dataset of segmented skin and long bones obtained from 3D full-body MRI images, which we refine into individual bone segmentations. To learn the skin-to-bone correlations, one needs to register the paired data. Few anatomical models allow registering a skeleton and the skin simultaneously. One such model, SKEL, has a skin and skeleton that are jointly rigged with the same pose parameters. However, it lacks the flexibility to adjust the bone locations inside its skin. To address this, we extend SKEL into SKEL-J, allowing its bones to fit the segmented bones while its skin fits the segmented skin. These precise fits allow us to train SKEL-J to more accurately infer the anatomical joint locations from the skin surface. Our qualitative and quantitative results show that our bone location predictions are more accurate than those of all existing approaches. To foster future research, we make the individual bone segmentations, the fitted SKEL-J models, and the new inference methods available for research purposes.