Publications

Perceiving Systems Ph.D. Thesis Aerial Robot Formations for Dynamic Environment Perception Price, E. University of Tübingen, Tübingen, Germany, December 2025 (Published)
Perceiving moving subjects, like humans and animals, outside an enclosed and controlled lab environment is inherently challenging, since subjects can move outside the view and range of static, extrinsically calibrated cameras and sensors. Previous state-of-the-art methods for such perception in outdoor scenarios use markers or sensors on the subject, which are both intrusive and unscalable for animal subjects. To address this problem, we introduce robotic flying cameras that autonomously follow the subjects. To enable functions such as monitoring, behaviour analysis or motion capture, a single point of view is often insufficient due to self-occlusion, lack of depth perception, and incomplete coverage from all sides. Therefore, we propose a team of such robotic cameras that fly in formation to provide continuous coverage from multiple viewpoints. The position of the subject must be determined in real time using markerless, remote sensing methods. To solve this, we combine a convolutional-neural-network-based detector with a novel cooperative Bayesian fusion method to track the detected subject from multiple robots. The robots must then plan and control their own flight paths and orientations relative to the subject to achieve and maintain continuous coverage from multiple viewpoints. We address this with a model-predictive-control-based method that predicts and plans the motion of every robot in the formation around the subject. A preliminary demonstrator is implemented with multi-rotor drones. However, drones are noisy and potentially unsafe for the observed subjects. To address this, we introduce non-holonomic lighter-than-air autonomous airships (blimps) as the robotic camera platform. This type of robot requires dynamically constrained orbiting formations to achieve omnidirectional visual coverage of a moving subject in the presence of wind. Therefore, we introduce a novel model-predictive formation controller for a team of airships.
We demonstrate and evaluate our complete system in field experiments involving both human and wild animals as subjects. The collected data enables both human outdoor motion capture and animal behaviour analysis. Additionally, we propose our method for autonomous long-term wildlife monitoring. This dissertation covers the design and evaluation of aerial robots suitable to this task, including computer vision/sensing, data annotation and network training, sensor fusion, planning, control, simulation, and modelling.
Thesis DOI BibTeX
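The cooperative Bayesian fusion step described in the thesis can be illustrated, in heavily simplified form, as inverse-covariance fusion of independent Gaussian detections of the subject from several robots. The function name and the reduction to a single time step are ours; the actual tracker also handles temporal filtering, outliers, and inter-robot communication.

```python
import numpy as np

def fuse_gaussian_estimates(means, covariances):
    """Fuse independent Gaussian position estimates from several robots
    by multiplying their densities (inverse-covariance weighting).

    means: list of (3,) arrays; covariances: list of (3, 3) arrays.
    Returns the fused mean and covariance.
    """
    info = sum(np.linalg.inv(c) for c in covariances)  # total information matrix
    fused_cov = np.linalg.inv(info)
    fused_mean = fused_cov @ sum(
        np.linalg.inv(c) @ m for m, c in zip(means, covariances)
    )
    return fused_mean, fused_cov
```

With two equally confident detections, the fused estimate lies midway between them and its covariance shrinks, which is the intuition behind combining views from multiple robots.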

Perceiving Systems Conference Paper BEDLAM2.0: Synthetic humans and cameras in motion Tesch, J., Becherini, G., Achar, P., Yiannakidis, A., Kocabas, M., Patel, P., Black, M. J. In Advances in Neural Information Processing Systems (NeurIPS), Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, December 2025 (Published)
Inferring 3D human motion from video remains a challenging problem with many applications. While traditional methods estimate the human in image coordinates, many applications require human motion to be estimated in world coordinates. This is particularly challenging when there is both human and camera motion. Progress on this topic has been limited by the lack of rich video data with ground truth human and camera movement. We address this with BEDLAM2.0, a new dataset that goes beyond the popular BEDLAM dataset in important ways. In addition to introducing more diverse and realistic cameras and camera motions, BEDLAM2.0 increases diversity and realism of body shape, motions, clothing, hair, and 3D environments. Additionally, it adds shoes, which were missing in BEDLAM. BEDLAM has become a key resource for training 3D human pose and motion regressors today and we show that BEDLAM2.0 is significantly better, particularly for training methods that estimate humans in world coordinates. We compare state-of-the-art methods trained on BEDLAM and BEDLAM2.0, and find that BEDLAM2.0 significantly improves accuracy over BEDLAM. For research purposes, we provide the rendered videos, ground truth body parameters, and camera motions. We also provide the 3D assets to which we have rights and links to those from third parties.
Project Paper Video URL BibTeX

Perceiving Systems Conference Paper HairFree: Compositional 2D Head Prior for Text-Driven 360° Bald Texture Synthesis Ostrek, M., Black, M., Thies, J. In Advances in Neural Information Processing Systems (NeurIPS), Advances in Neural Information Processing Systems (NeurIPS), December 2025 (Published)
Synthesizing high-quality 3D head textures is crucial for gaming, virtual reality, and digital humans. Achieving seamless 360° textures typically requires expensive multi-view datasets with precise tracking. However, traditional methods struggle without back-view data or precise geometry, especially for human heads, where even minor inconsistencies disrupt realism. We introduce HairFree, an unsupervised texturing framework guided by textual descriptions and 2D diffusion priors, producing high-consistency 360° bald head textures—including non-human skin with fine details—without any texture, back-view, bald, non-human, or synthetic training data. We fine-tune a diffusion prior on a dataset of mostly frontal faces, conditioned on predicted 3D head geometry and face parsing. During inference, HairFree uses precise skin masks and 3D FLAME geometry as input conditioning, ensuring high 3D consistency and alignment. We synthesize the full 360° texture by first generating a frontal RGB image aligned to the 3D FLAME pose and mapping it to UV space. As the virtual camera moves, we inpaint and merge missing regions. A built-in semantic prior enables precise region separation—particularly for isolating and removing hair—allowing seamless integration with various assets like customizable 3D hair, eyeglasses, jewelry, etc. We evaluate HairFree quantitatively and qualitatively, demonstrating its superiority over state-of-the-art 3D head avatar generation methods. https://hairfree.is.tue.mpg.de/
pdf project poster BibTeX

Perceiving Systems Conference Paper GenLit: Reformulating Single Image Relighting as Video Generation Bharadwaj, S., Feng, H., Becherini, G., Abrevaya, V. F., Black, M. J. In SIGGRAPH Asia Conference Papers ’25, Association for Computing Machinery, SIGGRAPH Asia, December 2025 (To be published)
Manipulating the illumination of a 3D scene within a single image represents a fundamental challenge in computer vision and graphics. This problem has traditionally been addressed using inverse rendering techniques, which involve explicit 3D asset reconstruction and costly ray-tracing simulations. Meanwhile, recent advancements in visual foundation models suggest that a new paradigm could soon be possible -- one that replaces explicit physical models with networks that are trained on large amounts of image and video data. In this paper, we exploit the implicit scene understanding of a video diffusion model, particularly Stable Video Diffusion, to relight a single image. We introduce GenLit, a framework that distills the ability of a graphics engine to perform light manipulation into a video-generation model, enabling users to directly insert and manipulate a point light in the 3D world within a given image and generate results directly as a video sequence. We find that a model fine-tuned on only a small synthetic dataset generalizes to real-world scenes, enabling single-image relighting with plausible and convincing shadows and inter-reflections. Our results highlight the ability of video foundation models to capture rich information about lighting, material, and shape, and our findings indicate that such models, with minimal training, can be used to perform relighting without explicit asset reconstruction or ray-tracing.
Project Page Paper DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Learning Hands in Action Fan, Z. December 2025 (Published)
Hands are our primary interface for acting on the world. From everyday tasks like preparing food to skilled procedures like surgery, human activity is shaped by rich and varied hand interactions. These include not only manipulation of external objects but also coordinated actions between both hands. For physical AI systems to learn from human behavior, assist in physical tasks, or collaborate safely in shared environments, they must perceive and understand hands in action, how we use them to interact with each other and with the objects around us. A key component of this understanding is the ability to reconstruct human hand motion and hand-object interactions in 3D from RGB images or videos. However, existing methods focus largely on estimating the pose of a single hand, often in isolation. They struggle with scenarios involving two hands in strong interactions or the interactions with objects, particularly when those objects are articulated or previously unseen. This is because reconstructing 3D hands in action poses significant challenges, such as severe occlusions, appearance ambiguities, and the need to reason about both hand and object geometry in dynamic configurations. As a result, current systems fall short in complex real-world environments. This dissertation addresses these challenges by introducing methods and data for reconstructing hands in action from monocular RGB inputs. We begin by tackling the problem of interacting hand pose estimation. We present DIGIT, a method that leverages a part-aware semantic prior to disambiguate closely interacting hands. By explicitly modeling hand part interactions and encoding the semantics of finger parts, DIGIT robustly recovers accurate hand poses, outperforms prior baselines and provides a step forward for more complete 3D hands in action understanding. Since hands frequently manipulate objects, jointly reconstructing both is crucial. 
Existing methods for hand-object reconstruction are limited to rigid objects and cannot handle tools with articulation, such as scissors or laptops. This severely restricts their ability to model the full range of everyday manipulations. We present the first method that jointly reconstructs two hands and an articulated object from a single RGB image, enabling unified reasoning across both rigid and articulated object interactions. To support this, we introduce ARCTIC, a large-scale motion capture dataset of humans performing dexterous bimanual manipulation with articulated tools. ARCTIC includes both articulated and fixed (rigid) configurations, along with accurate 3D annotations of hand poses and object motions. Leveraging this dataset, our method jointly infers object articulation states, and hand poses, advancing the state of hand-object understanding in complex object manipulation settings. Finally, we address generalization to in-the-wild object interactions. Prior approaches either rely on synthetic data with limited realism or require object models at test time. We introduce HOLD, a self-supervised method that learns to reconstruct 3D hand-object interactions from monocular RGB videos, without paired 3D annotations or known object models. HOLD learns via an appearance- and motion-consistent objective across views and time, enabling strong generalization to unseen objects in interaction. Experiments demonstrate HOLD's ability to generalize to in-the-wild monocular settings, outperforming fully-supervised baselines trained on synthetic or lab-captured datasets. Together, DIGIT, ARCTIC, and HOLD advance the 3D understanding of hands in action, covering both hand-hand and hand-object interactions. 
These contributions improve robustness in interacting hand pose estimation, introduce a dataset for bimanual manipulation with rigid and articulated tools, and include the first single-image method for jointly reconstructing hands and articulated objects learned directly from this dataset. In addition, HOLD removes the need for object templates by enabling hand-object reconstruction in the wild. These developments move toward more scalable physical AI systems capable of interpreting and imitating human manipulation, with applications in teleoperation, human-robot collaboration, and embodied learning from demonstration.
PDF BibTeX

Haptic Intelligence Perceiving Systems Ph.D. Thesis An Interdisciplinary Approach to Human Pose Estimation: Application to Sign Language Forte, M. University of Tübingen, Tübingen, Germany, November 2025, Department of Computer Science (Published)
Accessibility legislation mandates equal access to information for Deaf communities. While videos of human interpreters provide optimal accessibility, they are costly and impractical for frequently updated content. AI-driven signing avatars offer a promising alternative, but their development is limited by the lack of high-quality 3D motion-capture data at scale. Vision-based motion-capture methods are scalable but struggle with the rapid hand movements, self-occlusion, and self-touch that characterize sign language. To address these limitations, this dissertation develops two complementary solutions. SGNify improves hand pose estimation by incorporating universal linguistic rules that apply to all sign languages as computational priors. Proficient signers recognize the reconstructed signs as accurately as those in the original videos, but depth ambiguities along the camera axis can still produce incorrect reconstructions for signs involving self-touch. To overcome this remaining limitation, BioTUCH integrates electrical bioimpedance sensing between the wrists of the person being captured. Systematic measurements show that skin-to-skin contact produces distinctive bioimpedance reductions at high frequencies (240 kHz to 4.1 MHz), enabling reliable contact detection. BioTUCH uses the timing of these self-touch events to refine arm poses, producing physically plausible arm configurations and significantly reducing reconstruction error. Together, these contributions support the scalable collection of high-quality 3D sign language motion data, facilitating progress toward AI-driven signing avatars.
BibTeX
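The self-touch detection principle behind BioTUCH, reduced to a toy form: flag frames where the high-frequency bioimpedance magnitude drops below a fraction of a no-contact baseline, then group flagged frames into contact events. The threshold value and function names are illustrative assumptions, not the published implementation.

```python
import numpy as np

def detect_self_touch(impedance, baseline, drop_ratio=0.8):
    """Flag frames where the measured high-frequency bioimpedance falls
    below a fraction of the no-contact baseline, indicating skin-to-skin
    contact. impedance: (T,) magnitudes at one frequency; baseline:
    scalar no-contact value; drop_ratio: hypothetical threshold.
    """
    return impedance < drop_ratio * baseline

def contact_intervals(contact):
    """Convert a boolean contact mask into (start, end) frame intervals."""
    edges = np.diff(contact.astype(int))
    starts = np.flatnonzero(edges == 1) + 1
    ends = np.flatnonzero(edges == -1) + 1
    if contact[0]:
        starts = np.r_[0, starts]
    if contact[-1]:
        ends = np.r_[ends, len(contact)]
    return list(zip(starts, ends))
```

The timing of these intervals is what the pose refinement consumes: each interval asserts that the two wrists must be in physical contact during those frames.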

Haptic Intelligence Perceiving Systems Conference Paper Contact-Aware Refinement of Human Pose Pseudo-Ground Truth via Bioimpedance Sensing Forte, M., Athanasiou, N., Ballardini, G., Bartels, U., Kuchenbecker, K. J., Black, M. J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 5071-5080, Honolulu, USA, October 2025, Nikos Athanasiou and Giulia Ballardini contributed equally to this publication (Published) pdf URL BibTeX

Perceiving Systems Conference Paper Generative Zoo Niewiadomski, T., Yiannakidis, A., Cuevas-Velasquez, H., Sanyal, S., Black, M. J., Zuffi, S., Kulits, P. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, HI, International Conference on Computer Vision, ICCV, October 2025 (Published)
The model-based estimation of 3D animal pose and shape from images enables computational modeling of animal behavior. Training models for this purpose requires large amounts of labeled image data with precise pose and shape annotations. However, capturing such data requires the use of multi-view or marker-based motion-capture systems, which are impractical to adapt to wild animals in situ and impossible to scale across a comprehensive set of animal species. Some have attempted to address the challenge of procuring training data by pseudo-labeling individual real-world images through manual 2D annotation, followed by 3D-parameter optimization to those labels. While this approach may produce silhouette-aligned samples, the obtained pose and shape parameters are often implausible due to the ill-posed nature of the monocular fitting problem. Sidestepping real-world ambiguity, others have designed complex synthetic-data-generation pipelines leveraging video-game engines and collections of artist-designed 3D assets. Such engines yield perfect ground-truth annotations but are often lacking in visual realism and require considerable manual effort to adapt to new species or environments. Motivated by these shortcomings, we propose an alternative approach to synthetic-data generation: rendering with a conditional image-generation model. We introduce a pipeline that samples a diverse set of poses and shapes for a variety of mammalian quadrupeds and generates realistic images with corresponding ground-truth pose and shape parameters. To demonstrate the scalability of our approach, we introduce GenZoo, a synthetic dataset containing one million images of distinct subjects. We train a 3D pose and shape regressor on GenZoo, which achieves state-of-the-art performance on a real-world multi-species 3D animal pose and shape estimation benchmark, despite being trained solely on synthetic data. We will release our dataset and generation pipeline to support future research.
project page code demo pdf BibTeX

Perceiving Systems Conference Paper ETCH: Generalizing Body Fitting to Clothed Humans via Equivariant Tightness Li, B., Feng, H., Cai, Z., Black, M. J., Xiu, Y. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Fitting a body to a 3D clothed human point cloud is a common yet challenging task. Traditional optimization-based approaches use multi-stage pipelines that are sensitive to pose initialization, while recent learning-based methods often struggle with generalization across diverse poses and garment types. We propose Equivariant Tightness Fitting for Clothed Humans, or ETCH, a novel pipeline that estimates cloth-to-body surface mapping through locally approximate SE(3) equivariance, encoding tightness as displacement vectors from the cloth surface to the underlying body. Following this mapping, pose-invariant body features regress sparse body markers, simplifying clothed human fitting into an inner-body marker fitting task. Extensive experiments on CAPE and 4D-Dress show that ETCH significantly outperforms state-of-the-art methods -- both tightness-agnostic and tightness-aware -- in body fitting accuracy on loose clothing (16.7% ~ 69.5%) and shape accuracy (average 49.9%). Our equivariant tightness design can even reduce directional errors by 67.2% ~ 89.8% in one-shot (or out-of-distribution) settings (~ 1% data). Qualitative results demonstrate strong generalization of ETCH across challenging poses, unseen shapes, loose clothing, and non-rigid dynamics.
project arXiv code video BibTeX
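The core ETCH mapping can be sketched minimally: each cloth point plus its predicted tightness displacement lands on the underlying body surface, and sparse markers are regressed from the mapped points. The linear marker regression below is a stand-in for the paper's learned, pose-invariant regressor, and all names are ours.

```python
import numpy as np

def cloth_to_body(cloth_points, tightness_vectors):
    """Map cloth-surface points to the underlying body by applying the
    per-point tightness displacement vectors (cloth -> body direction)."""
    return cloth_points + tightness_vectors

def regress_markers(body_points, weights):
    """Regress sparse body markers as convex combinations of the mapped
    points. weights: (M, N) rows summing to one (hypothetical learned
    weights); body_points: (N, 3)."""
    return weights @ body_points
```

Once markers are obtained this way, body fitting reduces to the much better-conditioned problem of fitting a body model to inner-body markers.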

Perceiving Systems Conference Paper Im2Haircut: Single-view Strand-based Hair Reconstruction for Human Avatars Sklyarova, V., Zakharov, E., Prinzler, M., Becherini, G., Black, M., Thies, J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Honolulu, USA, October 2025 (Accepted)
We present a novel approach for 3D hair reconstruction from single photographs based on a global hair prior combined with local optimization. Capturing strand-based hair geometry from single photographs is challenging due to the variety and geometric complexity of hairstyles and the lack of ground truth training data. Classical reconstruction methods like multi-view stereo only reconstruct the visible hair strands, missing the inner structure of hairstyles and hampering realistic hair simulation. To address this, existing methods leverage hairstyle priors trained on synthetic data. Such data, however, is limited in both quantity and quality, since it requires manual work from skilled artists to model the 3D hairstyles and create near-photorealistic renderings. We therefore propose a novel approach that uses both real and synthetic data to learn an effective hairstyle prior. Specifically, we train a transformer-based prior model on synthetic data to obtain knowledge of the internal hairstyle geometry and introduce real data in the learning process to model the outer structure. This training scheme is able to model the visible hair strands depicted in an input image, while preserving the general 3D structure of hairstyles. We exploit this prior to create a Gaussian-splatting-based reconstruction method that creates hairstyles from one or more images. Qualitative and quantitative comparisons with existing reconstruction pipelines demonstrate the effectiveness and superior performance of our method for capturing detailed hair orientation, overall silhouette, and backside consistency. For additional results and code, please refer to https://im2haircut.is.tue.mpg.de.
arXiv project code BibTeX

Perceiving Systems Conference Paper MagicHOI: Leveraging 3D Priors for Accurate Hand-object Reconstruction from Short Monocular Video Clips Wang, S., He, H., Parelli, M., Gebhardt, C., Fan, Z., Song, J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), ICCV, October 2025 (Published)
Most RGB-based hand-object reconstruction methods rely on object templates, while template-free methods typically assume full object visibility. This assumption often breaks in real-world settings, where fixed camera viewpoints and static grips leave parts of the object unobserved, resulting in implausible reconstructions. To overcome this, we present MagicHOI, a method for reconstructing hands and objects from short monocular interaction videos, even under limited viewpoint variation. Our key insight is that, despite the scarcity of paired 3D hand-object data, large-scale novel view synthesis diffusion models offer rich object supervision. This supervision serves as a prior to regularize unseen object regions during hand interactions. Leveraging this insight, we integrate a novel view synthesis model into our hand-object reconstruction framework. We further align the hand to the object by incorporating visible contact constraints. Our results demonstrate that MagicHOI significantly outperforms existing state-of-the-art hand-object reconstruction methods. We also show that novel view synthesis diffusion priors effectively regularize unseen object regions, enhancing 3D hand-object reconstruction.
Project Video Code URL BibTeX

Perceiving Systems Conference Paper MoGA: 3D Generative Avatar Prior for Monocular Gaussian Avatar Reconstruction Dong, Z., Duan, L., Song, J., Black, M. J., Geiger, A. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
We present MoGA, a novel method to reconstruct high-fidelity 3D Gaussian avatars from a single-view image. The main challenge lies in inferring unseen appearance and geometric details while ensuring 3D consistency and realism. Most previous methods rely on 2D diffusion models to synthesize unseen views; however, these generated views are sparse and inconsistent, resulting in unrealistic 3D artifacts and blurred appearance. To address these limitations, we leverage a generative avatar model that can generate diverse 3D avatars by sampling deformed Gaussians from a learned prior distribution. Due to the limited amount of 3D training data, such a model alone cannot capture all image details of unseen identities. Consequently, we integrate it as a prior, ensuring 3D consistency by projecting input images into its latent space and enforcing additional 3D appearance and geometric constraints. Our novel approach formulates Gaussian avatar creation as a model inversion process by fitting the generative avatar to synthetic views from 2D diffusion models. The generative avatar provides a meaningful initialization for model fitting, enforces 3D regularization, and helps in refining pose estimation. Experiments show that our method surpasses state-of-the-art techniques and generalizes well to real-world scenarios. Our Gaussian avatars are also inherently animatable.
pdf project code video BibTeX

Perceiving Systems Conference Paper PRIMAL: Physically Reactive and Interactive Motor Model for Avatar Learning Zhang, Y., Feng, Y., Cseke, A., Saini, N., Bajandas, N., Heron, N., Black, M. J. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
We formulate the motor system of an interactive avatar as a generative motion model that can drive the body to move through 3D space in a perpetual, realistic, controllable, and responsive manner. Although human motion generation has been extensively studied, many existing methods lack the responsiveness and realism of real human movements. Inspired by recent advances in foundation models, we propose PRIMAL, which is learned with a two-stage paradigm. In the pretraining stage, the model learns body movements from a large number of sub-second motion segments, providing a generative foundation from which more complex motions are built. This training is fully unsupervised without annotations. Given a single-frame initial state during inference, the pretrained model not only generates unbounded, realistic, and controllable motion, but also enables the avatar to be responsive to induced impulses in real time. In the adaptation phase, we employ a novel ControlNet-like adaptor to fine-tune the base model efficiently, adapting it to new tasks such as few-shot personalized action generation and spatial target reaching. Evaluations show that our proposed method outperforms state-of-the-art baselines. We leverage the model to create a real-time character animation system in Unreal Engine that feels highly responsive and natural.
pdf project code video BibTeX

Perceiving Systems Conference Paper SDFit: 3D Object Pose and Shape by Fitting a Morphable SDF to a Single Image Antić, D., Paschalidis, G., Tripathi, S., Gevers, T., Dwivedi, S. K., Tzionas, D. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Recovering 3D object pose and shape from a single image is a challenging and ill-posed problem. This is due to strong (self-)occlusions, depth ambiguities, the vast intra- and inter-class shape variance, and the lack of 3D ground truth for natural images. Existing deep-network methods are trained on synthetic datasets to predict 3D shapes, so they often struggle to generalize to real-world images. Moreover, they lack an explicit feedback loop for refining noisy estimates, and primarily focus on geometry without directly considering pixel alignment. To tackle these limitations, we develop a novel render-and-compare optimization framework, called SDFit. This has three key innovations: First, it uses a learned category-specific and morphable signed-distance-function (mSDF) model, and fits this to an image by iteratively refining both 3D pose and shape. The mSDF robustifies inference by constraining the search on the manifold of valid shapes, while allowing for arbitrary shape topologies. Second, SDFit retrieves an initial 3D shape that likely matches the image, by exploiting foundational models for efficient look-up into 3D shape databases. Third, SDFit initializes pose by establishing rich 2D-3D correspondences between the image and the mSDF through foundational features. We evaluate SDFit on three image datasets, i.e., Pix3D, Pascal3D+, and COMIC. SDFit performs on par with SotA feed-forward networks for unoccluded images and common poses, but is uniquely robust to occlusions and uncommon poses. Moreover, it requires no retraining for unseen images. Thus, SDFit contributes new insights for generalizing in the wild.
Project arXiv Code Video Poster BibTeX
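The iterative refine-pose-and-shape idea can be illustrated with a toy fit of a sphere SDF to observed surface points, where the sphere center plays the role of pose and the radius plays the role of a one-dimensional shape code. This is only a sketch of the optimization pattern; SDFit itself fits a learned morphable SDF against image evidence rather than 3D points.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance of points to a sphere (toy 'morphable SDF')."""
    return np.linalg.norm(points - center, axis=1) - radius

def fit_sdf(points, center, radius, lr=0.1, steps=200):
    """Gradient descent on the mean squared SDF residual, iteratively
    refining pose (center) and shape (radius) so the observed points
    lie on the zero level set."""
    for _ in range(steps):
        d = sphere_sdf(points, center, radius)               # residuals
        dirs = points - center
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
        center += lr * np.mean(d[:, None] * dirs, axis=0)    # pose update
        radius += lr * np.mean(d)                            # shape update
    return center, radius
```

The same loop structure (evaluate residual, refine pose, refine shape code) underlies render-and-compare fitting, with the residual replaced by image-space discrepancies.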

Perceiving Systems Conference Paper St4RTrack: Simultaneous 4D Reconstruction and Tracking in the World Feng, H., Zhang, J., Wang, Q., Ye, Y., Yu, P., Black, M., Darrell, T., Kanazawa, A. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2025 (Published)
Dynamic 3D reconstruction and point tracking in videos are typically treated as separate tasks, despite their deep connection. We propose St4RTrack, a feed-forward framework that simultaneously reconstructs and tracks dynamic video content in a world coordinate frame from RGB inputs. This is achieved by predicting two appropriately defined pointmaps for a pair of frames captured at different moments. Specifically, we predict both pointmaps at the same moment, in the same world, capturing both static and dynamic scene geometry while maintaining 3D correspondences. Chaining these predictions through the video sequence with respect to a reference frame naturally computes long-range correspondences, effectively combining 3D reconstruction with 3D tracking. Unlike prior methods that rely heavily on 4D ground truth supervision, we employ a novel adaptation scheme based on a reprojection loss. We establish a new extensive benchmark for world-frame reconstruction and tracking, demonstrating the effectiveness and efficiency of our unified, data-driven framework.
pdf arXiv project code demo video BibTeX
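A reprojection loss of the kind used for adaptation can be written, in its simplest camera-frame form, as the pixel error between projected pointmap entries and observed 2D tracks. The intrinsics, shapes, and function name below are illustrative, not the paper's exact formulation.

```python
import numpy as np

def reprojection_loss(points3d, tracks2d, K):
    """Mean pixel error between 3D pointmap entries projected with
    intrinsics K and observed 2D track positions.

    points3d: (N, 3) predicted points in the camera frame (z > 0);
    tracks2d: (N, 2) observed pixel tracks; K: (3, 3) intrinsics.
    """
    proj = (K @ points3d.T).T          # homogeneous image coordinates
    proj = proj[:, :2] / proj[:, 2:3]  # perspective divide
    return np.mean(np.linalg.norm(proj - tracks2d, axis=1))
```

Minimizing this quantity supervises the predicted geometry with 2D tracks alone, which is what lets the method sidestep 4D ground truth.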

Perceiving Systems Conference Paper Reconstructing Animals and the Wild Kulits, P., Black, M. J., Zuffi, S. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
The idea of 3D reconstruction as scene understanding is foundational in computer vision. Reconstructing 3D scenes from 2D visual observations requires strong priors to disambiguate structure. Much work has been focused on the anthropocentric, which, characterized by smooth surfaces, coherent normals, and regular edges, allows for the integration of strong geometric inductive biases. Here, we consider a more challenging problem where such assumptions do not hold: the reconstruction of natural scenes containing trees, bushes, boulders, and animals. While numerous works have attempted to tackle the problem of reconstructing animals in the wild, they have focused solely on the animal, neglecting environmental context. This limits their usefulness for analysis tasks, as animals exist inherently within the 3D world, and information is lost when environmental factors are disregarded. We propose a method to reconstruct natural scenes from single images. We base our approach on recent advances leveraging the strong world priors ingrained in Large Language Models and train an autoregressive model to decode a CLIP embedding into a structured compositional scene representation, encompassing both animals and the wild (RAW). To enable this, we propose a synthetic dataset comprising one million images and thousands of assets. Our approach, having been trained solely on synthetic data, generalizes to the task of reconstructing animals and their environments in real-world images. We will release our dataset and code to encourage future research.
project arXiv code BibTeX

Perceiving Systems Conference Paper DiffLocks: Generating 3D Hair from a Single Image using Diffusion Models Rosu, R. A., Wu, K., Feng, Y., Zheng, Y., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
We address the task of reconstructing 3D hair geometry from a single image, which is challenging due to the diversity of hairstyles and the lack of paired image-to-3D hair data. Previous methods are primarily trained on synthetic data and cope with the limited amount of such data by using low-dimensional intermediate representations, such as guide strands and scalp-level embeddings, that require post-processing to decode, upsample, and add realism. These approaches fail to reconstruct detailed hair, struggle with curly hair, or are limited to handling only a few hairstyles. To overcome these limitations, we propose DiffLocks, a novel framework that enables detailed reconstruction of a wide variety of hairstyles directly from a single image. First, we address the lack of 3D hair data by automating the creation of the largest synthetic hair dataset to date, containing 40K hairstyles. Second, we leverage the synthetic hair dataset to learn an image-conditioned diffusion-transformer model that reconstructs accurate 3D strands from a single frontal image. By using a pretrained image backbone, our method generalizes to in-the-wild images despite being trained only on synthetic data. Our diffusion model predicts a scalp texture map in which any point in the map contains the latent code for an individual hair strand. These codes are directly decoded to 3D strands without post-processing techniques. Representing individual strands, instead of guide strands, enables the transformer to model the detailed spatial structure of complex hairstyles. With this, DiffLocks can reconstruct highly curled hair, like afro hairstyles, from a single image for the first time. Qualitative and quantitative results demonstrate that DiffLocks outperforms existing state-of-the-art approaches. Data and code are available for research.
project paper code dataset BibTeX
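The abstract's core data structure, a scalp texture map whose texels store per-strand latent codes, can be sketched as follows. The latent map, the toy linear decoder, and all dimensions here are hypothetical stand-ins, not the DiffLocks architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical scalp texture map: each texel stores a latent strand code.
H, W, D = 64, 64, 8                      # texture resolution, latent dim
latent_map = rng.normal(size=(H, W, D))

# Toy linear "decoder": latent code -> 16 per-segment 3D offsets.
S = 16
decoder = rng.normal(size=(D, S * 3)) * 0.01

def decode_strand(u, v):
    code = latent_map[v, u]               # look up the texel's latent code
    offsets = (code @ decoder).reshape(S, 3)
    return np.cumsum(offsets, axis=0)     # accumulate offsets along strand

strand = decode_strand(10, 20)
print(strand.shape)  # (16, 3)
```

The point of the representation is that every strand decodes independently from its texel, so no guide-strand upsampling or post-processing is needed.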

Perceiving Systems Conference Paper InterDyn: Controllable Interactive Dynamics with Video Diffusion Models Akkerman, R., Feng, H., Black, M. J., Tzionas, D., Abrevaya, V. F. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Predicting the dynamics of interacting objects is essential for both humans and intelligent systems. However, existing approaches are limited to simplified, toy settings and lack generalizability to complex, real-world environments. Recent advances in generative models have enabled the prediction of state transitions based on interventions, but they focus on generating a single future state, neglecting the continuous dynamics that result from the interaction. To address this gap, we propose InterDyn, a novel framework that generates videos of interactive dynamics given an initial frame and a control signal encoding the motion of a driving object or actor. Our key insight is that large video generation models can act as both neural renderers and implicit physics ``simulators'', having learned interactive dynamics from large-scale video data. To effectively harness this capability, we introduce an interactive control mechanism that conditions the video generation process on the motion of the driving entity. Qualitative results demonstrate that InterDyn generates plausible, temporally consistent videos of complex object interactions while generalizing to unseen objects. Quantitative evaluations show that InterDyn outperforms baselines that focus on static state transitions. This work highlights the potential of leveraging video generative models as implicit physics engines.
project arXiv BibTeX

Perceiving Systems Conference Paper PICO: Reconstructing 3D People In Contact with Objects Cseke, A., Tripathi, S., Dwivedi, S. K., Lakshmipathy, A. S., Chatterjee, A., Black, M. J., Tzionas, D. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Recovering 3D Human-Object Interaction (HOI) from single color images is challenging due to depth ambiguities, occlusions, and the huge variation in object shape and appearance. Thus, past work requires controlled settings such as known object shapes and contacts, and tackles only limited object classes. Instead, we need methods that generalize to natural images and novel object classes. We tackle this in two main ways: (1) We collect PICO-db, a new dataset of natural images uniquely paired with dense 3D contact on both body and object meshes. To this end, we use images from the recent DAMON dataset that are paired with contacts, but these contacts are only annotated on a canonical 3D body. In contrast, we seek contact labels on both the body and the object. To infer these given an image, we retrieve an appropriate 3D object mesh from a database by leveraging vision foundation models. Then, we project DAMON's body contact patches onto the object via a novel method needing only 2 clicks per patch. This minimal human input establishes rich contact correspondences between bodies and objects. (2) We exploit our new dataset of contact correspondences in a novel render-and-compare fitting method, called PICO-fit, to recover 3D body and object meshes in interaction. PICO-fit infers contact for the SMPL-X body, retrieves a likely 3D object mesh and contact from PICO-db for that object, and uses the contact to iteratively fit the 3D body and object meshes to image evidence via optimization. Uniquely, PICO-fit works well for many object categories that no existing method can tackle. This is crucial to enable HOI understanding to scale in the wild.
project arXiv video code dataset BibTeX
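The fitting stage described above iteratively optimizes body and object meshes so that annotated contact correspondences meet. The toy sketch below fits only a rigid object translation by gradient descent on paired contact points; it is a stand-in for PICO-fit's full render-and-compare optimization, with all names hypothetical.

```python
import numpy as np

def fit_object_translation(body_pts, obj_pts, steps=200, lr=0.1):
    """Translate the object so its contact points meet the corresponding
    body contact points, by gradient descent on the mean squared error."""
    t = np.zeros(3)
    for _ in range(steps):
        residual = (obj_pts + t) - body_pts   # per-correspondence error
        t -= lr * 2 * residual.mean(axis=0)   # gradient of the MSE w.r.t. t
    return t

body = np.array([[0.0, 1.0, 0.0], [0.1, 1.1, 0.0]])
obj = body + np.array([0.5, -0.2, 0.3])       # object offset from contact
t = fit_object_translation(body, obj)
print(np.allclose(obj + t, body, atol=1e-3))  # True
```

In the real method the free variables are body pose/shape and the full object pose, and the objective also includes image evidence, but the contact-correspondence term works on this principle.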

Perceiving Systems Conference Paper ChatGarment: Garment Estimation, Generation and Editing via Large Language Models Bian, S., Xu, C., Xiu, Y., Grigorev, A., Liu, Z., Lu, C., Black, M. J., Feng, Y. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
We introduce ChatGarment, a novel approach that leverages large vision-language models (VLMs) to automate the estimation, generation, and editing of 3D garment sewing patterns from images or text descriptions. Unlike previous methods that often lack robustness and interactive editing capabilities, ChatGarment finetunes a VLM to produce GarmentCode, a JSON-based, language-friendly format for 2D sewing patterns, enabling both estimation and editing from images and text instructions. To optimize performance, we refine GarmentCode by expanding its support for more diverse garment types and simplifying its structure, making it more efficient for VLM finetuning. Additionally, we develop an automated data construction pipeline to generate a large-scale dataset of image-to-sewing-pattern and text-to-sewing-pattern pairs, empowering ChatGarment with strong generalization across various garment types. Extensive evaluations demonstrate ChatGarment’s ability to accurately reconstruct, generate, and edit garments from multimodal inputs, highlighting its potential to revolutionize workflows in fashion and gaming applications.
project arXiv video code data BibTeX
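To make the idea of a JSON-based, language-friendly sewing-pattern format concrete, here is an illustrative fragment. This is not the actual GarmentCode schema; all keys and values are invented for the sketch.

```python
import json

# Illustrative only -- a made-up, JSON-based sewing-pattern description
# in the spirit of what a VLM could emit and a human could edit.
garment = {
    "garment_type": "t-shirt",
    "panels": {
        "front": {"width_cm": 50, "length_cm": 65},
        "back":  {"width_cm": 50, "length_cm": 67},
    },
    "stitches": [
        ["front.side_left", "back.side_left"],
        ["front.side_right", "back.side_right"],
    ],
}
text = json.dumps(garment, indent=2)      # what the model would output
print(json.loads(text)["garment_type"])   # t-shirt
```

A text-token format like this is what lets a finetuned VLM treat pattern estimation and pattern editing as ordinary language generation.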

Empirical Inference Perceiving Systems Conference Paper ChatHuman: Chatting about 3D Humans with Tools Lin, J., Feng, Y., Liu, W., Black, M. J. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 8150-8161, June 2025 (Published)
Numerous methods have been proposed to detect, estimate, and analyze properties of people in images, including 3D pose, shape, contact, human-object interaction, and emotion. While widely applicable in vision and other areas, such methods require expert knowledge to select, use, and interpret the results. To address this, we introduce ChatHuman, a language-driven system that integrates the capabilities of specialized methods into a unified framework. ChatHuman functions as an assistant proficient in utilizing, analyzing, and interacting with tools specific to 3D human tasks, adeptly discussing and resolving related challenges. Built on a Large Language Model (LLM) framework, ChatHuman is trained to autonomously select, apply, and interpret a diverse set of tools in response to user inputs. Our approach overcomes significant hurdles in adapting LLMs to 3D human tasks, including the need for domain-specific knowledge and the ability to interpret complex 3D outputs. The innovations of ChatHuman include leveraging academic publications to instruct the LLM on tool usage, employing a retrieval-augmented generation model to create in-context learning examples for managing new tools, and effectively discriminating between and integrating tool results by transforming specialized 3D outputs into comprehensible formats. Experiments demonstrate that ChatHuman surpasses existing models in both tool selection accuracy and overall performance across various 3D human tasks, and it supports interactive chatting with users. ChatHuman represents a significant step toward consolidating diverse analytical methods into a unified, robust system for 3D human tasks.
project pdf Paper DOI BibTeX

Perceiving Systems Conference Paper InteractVLM: 3D Interaction Reasoning from 2D Foundational Models Dwivedi, S. K., Antić, D., Tripathi, S., Taheri, O., Schmid, C., Black, M. J., Tzionas, D. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 22605-22615, June 2025 (Published)
We introduce InteractVLM, a novel method to estimate 3D contact points on human bodies and objects from single in-the-wild images, enabling accurate human-object joint reconstruction in 3D. This is challenging due to occlusions, depth ambiguities, and widely varying object shapes. Existing methods rely on 3D contact annotations collected via expensive motion-capture systems or tedious manual labeling, limiting scalability and generalization. To overcome this, InteractVLM harnesses the broad visual knowledge of large Vision-Language Models (VLMs), fine-tuned with limited 3D contact data. However, directly applying these models is non-trivial, as they reason only in 2D, while human-object contact is inherently 3D. Thus, we introduce a novel Render-Localize-Lift module that: (1) embeds 3D body and object surfaces in 2D space via multi-view rendering, (2) trains a novel multi-view localization model (MV-Loc) to infer contacts in 2D, and (3) lifts these to 3D. Additionally, we propose a new task called Semantic Human Contact estimation, where human contact predictions are conditioned explicitly on object semantics, enabling richer interaction modeling. InteractVLM outperforms existing work on contact estimation and also facilitates 3D reconstruction from an in-the-wild image.
Project Paper Code Video BibTeX

Perceiving Systems Conference Paper PromptHMR: Promptable Human Mesh Recovery Wang, Y., Sun, Y., Patel, P., Daniilidis, K., Black, M. J., Kocabas, M. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), June 2025 (Published)
Human pose and shape (HPS) estimation presents challenges in diverse scenarios such as crowded scenes, person-person interactions, and single-view reconstruction. Existing approaches lack mechanisms to incorporate auxiliary "side information" that could enhance reconstruction accuracy in such challenging scenarios. Furthermore, the most accurate methods rely on cropped person detections and cannot exploit scene context while methods that process the whole image often fail to detect people and are less accurate than methods that use crops. While recent language-based methods explore HPS reasoning through large language or vision-language models, their metric accuracy is well below the state of the art. In contrast, we present PromptHMR, a transformer-based promptable method that reformulates HPS estimation through spatial and semantic prompts. Our method processes full images to maintain scene context and accepts multiple input modalities: spatial prompts like bounding boxes and masks, and semantic prompts like language descriptions or interaction labels. PromptHMR demonstrates robust performance across challenging scenarios: estimating people from bounding boxes as small as faces in crowded scenes, improving body shape estimation through language descriptions, modeling person-person interactions, and producing temporally coherent motions in videos. Experiments on benchmarks show that PromptHMR achieves state-of-the-art performance while offering flexible prompt-based control over the HPS estimation process.
arXiv project video BibTeX

Perceiving Systems Ph.D. Thesis Estimating Human and Camera Motion From RGB Data Kocabas, M. April 2025 (Published)
This thesis presents a unified framework for markerless 3D human motion analysis from monocular videos, addressing three interrelated challenges that have limited the fidelity of existing approaches: (i) achieving temporally consistent and physically plausible human motion estimation, (ii) accurately modeling perspective camera effects in unconstrained settings, and (iii) disentangling human motion from camera motion in dynamic scenes. Our contributions are realized through three complementary methods. First, we introduce VIBE (Video Inference for Body Pose and Shape Estimation), a novel video pose and shape estimation framework. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose VIBE, which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. Second, we propose SPEC (Seeing People in the wild with Estimated Cameras), the first in-the-wild 3D human pose and shape (HPS) method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. Due to the lack of camera parameter information for in-the-wild images, existing 3D HPS estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation.
These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the camera calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Third, we develop PACE (Person And Camera Estimation), a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the entangling of human and camera motions in the video. Existing works assume the camera is static and focus on solving for the human motion in camera space. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use Simultaneous Localization and Mapping (SLAM) as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features.
This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions. Extensive experiments on standard benchmarks and new datasets we introduced demonstrate that our integrated approach substantially outperforms prior methods in terms of temporal consistency, reconstruction accuracy, and global motion estimation. While these results represent a significant advance in markerless human motion analysis, further work is needed to extend these techniques to multi-person scenarios, severe occlusions, and real-time applications. Overall, this thesis lays a strong foundation for more robust and accurate human motion analysis in unconstrained environments, with promising applications in robotics, augmented reality, sports analysis, and beyond.
Thesis PDF BibTeX
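The SPEC step of going from an estimated field of view to perspective camera intrinsics follows from the standard pinhole relation f = (W / 2) / tan(fov / 2). A minimal sketch, with the helper name and image size being illustrative assumptions:

```python
import math

def intrinsics_from_fov(fov_deg, width, height):
    """Pinhole intrinsics K from a horizontal field of view.

    Once a network predicts the field of view for an in-the-wild image,
    the focal length in pixels follows directly; the principal point is
    assumed to be the image center here.
    """
    f = (width / 2.0) / math.tan(math.radians(fov_deg) / 2.0)
    return [[f, 0.0, width / 2.0],
            [0.0, f, height / 2.0],
            [0.0, 0.0, 1.0]]

K = intrinsics_from_fov(90.0, 1280, 720)
print(round(K[0][0], 1))  # 640.0  (tan(45 deg) = 1)
```

This is why a weak-perspective or fixed-focal-length assumption distorts body shape: the true f varies widely with the field of view.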

Perceiving Systems Ph.D. Thesis Understanding Human-Scene Interaction through Perception and Generation Yi, H. April 2025 (Published)
Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for various applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce, and captured motion datasets often lack scene information. This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) depth ordering constraint: humans that move in a scene are occluded by or occlude objects, thus defining the relative depth ordering of the objects; (2) collision constraint: humans move through free space and do not interpenetrate objects; (3) interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by generating scenes from input human motions (scenes from humans), and generating human motion from scenes (humans from scenes). Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scene from an RGB video. This optimization-based approach leverages the three aforementioned constraints to enhance the consistency and plausibility of reconstructed scene layouts and to refine the initial 3D human pose and shape estimations. Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies collision and interaction constraints, and employs an auto-regressive transformer architecture that integrates objects into the scene based on existing human motion.
The training data is enriched by populating the 3D FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method results in furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments. Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions, adhering to the collision and interaction constraints. It utilizes a text-controlled, scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing for the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements. In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.
thesis BibTeX
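The collision constraint above, humans do not interpenetrate objects, is commonly enforced as a penalty on body vertices whose signed distance to the scene is negative. A minimal sketch, where the vertex array and SDF callable are hypothetical stand-ins for the body mesh and scene representation a real system would use:

```python
import numpy as np

def collision_penalty(verts, sdf):
    """Penalize body vertices that fall inside scene geometry.

    verts: (N, 3) body vertices.
    sdf: callable mapping (N, 3) points to signed distances
         (negative inside an object).
    """
    d = sdf(verts)
    return np.sum(np.clip(-d, 0.0, None) ** 2)  # only penetration counts

# Toy scene: a unit sphere at the origin (inside = negative distance).
sphere_sdf = lambda p: np.linalg.norm(p, axis=1) - 1.0
outside = np.array([[0.0, 0.0, 2.0]])
inside = np.array([[0.0, 0.0, 0.5]])
print(collision_penalty(outside, sphere_sdf))      # 0.0
print(collision_penalty(inside, sphere_sdf) > 0)   # True
```

The interaction constraint works in the opposite direction: for vertices labeled as in contact, the same distance is driven toward zero rather than merely kept non-negative.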

Empirical Inference Perceiving Systems Conference Paper Can Large Language Models Understand Symbolic Graphics Programs? Qiu, Z., Liu, W., Feng, H., Liu, Z., Xiao, T. Z., Collins, K. M., Tenenbaum, J. B., Weller, A., Black, M. J., Schölkopf, B. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM’s ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to “imagine” and reason about how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image-level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about the visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability – Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves an LLM’s understanding of symbolic programs, but it also improves general reasoning ability on various other benchmarks.
arXiv Paper BibTeX

Empirical Inference Perceiving Systems Conference Paper Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets Liu, Z., Xiao, T. Z., Liu, W., Bengio, Y., Zhang, D. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample in proportion to the unnormalized density defined by a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as ∇-GFlowNet), the first GFlowNet method that leverages the rich signal in reward gradients, together with an objective called ∇-DB plus its variant residual ∇-DB designed for prior-preserving diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.
arXiv BibTeX

Haptic Intelligence Perceiving Systems Article Wrist-to-Wrist Bioimpedance Can Reliably Detect Discrete Self-Touch Forte, M., Vardar, Y., Javot, B., Kuchenbecker, K. J. IEEE Transactions on Instrumentation and Measurement, 74(4006511):1-11, April 2025 (Published)
Self-touch is crucial in human communication, psychology, and disease transmission, yet existing methods for detecting self-touch are often invasive or limited in scope. This study systematically investigates the feasibility of using non-invasive electrical bioimpedance for detecting discrete self-touch poses across individuals. While previous research has focused on classifying defined self-touch poses, our work explores how various poses cause bioimpedance changes, providing insights into the underlying physiological mechanisms. We thus created a dataset of 27 genuine self-touch poses, including skin-to-skin contact between the hands and face and skin-to-clothing contact between the hands and chest, alongside six adversarial mid-air gestures. We then measured the wrist-to-wrist bioimpedance of 30 adults (15 female, 15 male) across these poses, with each measurement preceded by a no-touch pose serving as a baseline. Statistical analysis of the measurements showed that skin-to-skin contacts cause significant changes in bioimpedance magnitude between 237.8 kHz and 4.1 MHz, while adversarial gestures do not; skin-to-clothing contacts cause less-significant changes due to the influence and variability of the clothing material. Furthermore, our analysis highlights the sensitivity of bioimpedance to the body parts involved, the skin contact area, and the individual's characteristics. Our contributions are two-fold: (1) we demonstrate that bioimpedance offers a practical, non-invasive solution for detecting self-touch poses involving skin-to-skin contact, (2) researchers can leverage insights from our study to determine whether a pose can be detected without extensive testing.
DOI BibTeX
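The detection principle the article describes, comparing the wrist-to-wrist impedance magnitude against a preceding no-touch baseline within the informative frequency band, can be sketched as a simple threshold test. The threshold, the direction of change, and all names are illustrative assumptions, not values from the paper.

```python
import numpy as np

def self_touch_detected(freqs_hz, mag_touch, mag_baseline, rel_thresh=0.02):
    """Flag skin-to-skin contact if the impedance magnitude drops
    noticeably versus the no-touch baseline inside the band the study
    found informative (~237.8 kHz to 4.1 MHz)."""
    band = (freqs_hz >= 237.8e3) & (freqs_hz <= 4.1e6)
    rel_change = (mag_baseline[band] - mag_touch[band]) / mag_baseline[band]
    return bool(np.mean(rel_change) > rel_thresh)

freqs = np.logspace(4, 7, 200)          # 10 kHz .. 10 MHz sweep
baseline = np.full_like(freqs, 500.0)   # ohms, toy values
touch = baseline * 0.95                 # uniform 5% magnitude drop
print(self_touch_detected(freqs, touch, baseline))      # True
print(self_touch_detected(freqs, baseline, baseline))   # False
```

The per-pose baseline is what makes this robust across individuals: only the relative change matters, not each person's absolute impedance.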

Perceiving Systems Ph.D. Thesis Democratizing 3D Human Digitization Xiu, Y. March 2025 (Published)
Richard Feynman once said, “What I cannot create, I do not understand.” Similarly, making virtual humans more realistic helps us better grasp human nature. Simulating lifelike avatars has scientific value (such as in biomechanics) and practical applications (like the Metaverse). However, creating them affordably at scale with high quality remains challenging. Reconstructing complex poses, varied clothing, and unseen areas from casual photos under real-world conditions is still difficult. We address this through a series of works—ICON, ECON, TeCH, PuzzleAvatar—bridging pixel-based reconstruction with text-guided generation to reframe reconstruction as conditional generation. This allows us to turn everyday photos, like personal albums featuring random poses, diverse clothing, tricky angles, and arbitrary cropping, into 3D avatars. The process converts unstructured data into structured output without unnecessary complexity. With these techniques, we can efficiently scale up the creation of digital humans using readily available imagery.
Thesis BibTeX

Perceiving Systems Conference Paper Gaussian Garments: Reconstructing Simulation-Ready Clothing with Photo-Realistic Appearance from Multi-View Video Rong, B., Grigorev, A., Wang, W., Black, M. J., Thomaszewski, B., Tsalicoglou, C., Hilliges, O. In International Conference on 3D Vision (3DV), March 2025 (Published)
We introduce Gaussian Garments, a novel approach for reconstructing realistic-looking, simulation-ready garment assets from multi-view videos. Our method represents garments with a combination of a 3D mesh and a Gaussian texture that encodes both the color and high-frequency surface details. This representation enables accurate registration of garment geometries to multi-view videos and helps disentangle albedo textures from lighting effects. Furthermore, we demonstrate how a pre-trained Graph Neural Network (GNN) can be fine-tuned to replicate the real behavior of each garment. The reconstructed Gaussian Garments can be automatically combined into multi-garment outfits and animated with the fine-tuned GNN.
arXiv project video URL BibTeX

Perceiving Systems Conference Paper CameraHMR: Aligning People with Perspective Patel, P., Black, M. J. In International Conference on 3D Vision (3DV), March 2025 (Published)
We address the challenge of accurate 3D human pose and shape estimation from monocular images. The key to accuracy and robustness lies in high-quality training data. Existing training datasets containing real images with pseudo ground truth (pGT) use SMPLify to fit SMPL to sparse 2D joint locations, assuming a simplified camera with default intrinsics. We make two contributions that improve pGT accuracy. First, to estimate camera intrinsics, we develop a field-of-view prediction model (HumanFoV) trained on a dataset of images containing people. We use the estimated intrinsics to enhance the 4D-Humans dataset by incorporating a full perspective camera model during SMPLify fitting. Second, 2D joints provide limited constraints on 3D body shape, resulting in average-looking bodies. To address this, we use the BEDLAM dataset to train a dense surface keypoint detector. We apply this detector to the 4D-Humans dataset and modify SMPLify to fit the detected keypoints, resulting in significantly more realistic body shapes. Finally, we upgrade the HMR2.0 architecture to include the estimated camera parameters. We iterate model training and SMPLify fitting initialized with the previously trained model. This leads to more accurate pGT and a new model, CameraHMR, with state-of-the-art accuracy. Code and pGT are available for research purposes.
arXiv project BibTeX

Perceiving Systems Conference Paper CHOIR: A Versatile and Differentiable Hand-Object Interaction Representation Morales, T., Taheri, O., Lacey, G. In Winter Conference on Applications of Computer Vision (WACV), February 2025 (Published)
Synthesizing accurate hand-object interactions (HOI) is critical for applications in Computer Vision, Augmented Reality (AR), and Mixed Reality (MR). Despite recent advances, the accuracy of reconstructed or generated HOI leaves room for refinement. Some techniques have improved the accuracy of dense correspondences by shifting focus from generating explicit contacts to using rich HOI fields. Still, they lack full differentiability or continuity and are tailored to specific tasks. In contrast, we present a Coarse Hand-Object Interaction Representation (CHOIR), a novel, versatile and fully differentiable field for HOI modelling. CHOIR leverages discrete unsigned distances for continuous shape and pose encoding, alongside multivariate Gaussian distributions to represent dense contact maps with few parameters. To demonstrate the versatility of CHOIR we design JointDiffusion, a diffusion model to learn a grasp distribution conditioned on noisy hand-object interactions or only object geometries, for both refinement and synthesis applications. We demonstrate JointDiffusion’s improvements over the SOTA in both applications: it increases the contact F1 score by 5% for refinement and decreases the sim. displacement by 46% for synthesis. Our experiments show that JointDiffusion with CHOIR yields superior contact accuracy and physical realism compared to SOTA methods designed for specific tasks.
GitHub Paper URL BibTeX
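The discrete unsigned distances at the core of CHOIR can be illustrated with a brute-force nearest-neighbour computation. This is a toy sketch under our own naming and an assumed contact threshold, not the paper's implementation:

```python
import numpy as np

def unsigned_distances(query_pts: np.ndarray, surface_pts: np.ndarray) -> np.ndarray:
    """Unsigned distance from each query point to its nearest surface sample."""
    # (N, M) pairwise Euclidean distances, reduced over the surface axis
    d = np.linalg.norm(query_pts[:, None, :] - surface_pts[None, :, :], axis=-1)
    return d.min(axis=1)

def contact_map(hand_pts, object_pts, threshold=0.005):
    """Binary contact labels: hand points within `threshold` metres of the object."""
    return unsigned_distances(hand_pts, object_pts) < threshold
```

Unlike a sparse set of explicit contact points, such a distance field is defined everywhere around the object, which is what makes it usable as a dense, low-parameter conditioning signal.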

Perceiving Systems Conference Paper OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics Gozlan, Y., Falisse, A., Uhlrich, S., Gatti, A., Black, M., Chaudhari, A. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2025 (Published)
Pose estimation has promised to impact healthcare by enabling more practical methods to quantify nuances of human movement and biomechanics. However, despite the inherent connection between pose estimation and biomechanics, these disciplines have largely remained disparate. For example, most current pose estimation benchmarks use metrics such as Mean Per Joint Position Error, Percentage of Correct Keypoints, or mean Average Precision to assess performance, without quantifying kinematic and physiological correctness, key aspects for biomechanics. To alleviate this challenge, we develop OpenCapBench to offer an easy-to-use unified benchmark to assess common tasks in human pose estimation, evaluated under physiological constraints. OpenCapBench computes consistent kinematic metrics through joint angles provided by an open-source musculoskeletal modeling software (OpenSim). Through OpenCapBench, we demonstrate that current pose estimation models use keypoints that are too sparse for accurate biomechanics analysis. To mitigate this challenge, we introduce SynthPose, a new approach that finetunes pre-trained 2D human pose models on synthetic data to predict an arbitrarily denser set of keypoints for accurate kinematic analysis. Finetuning prior models on such synthetic data leads to a twofold reduction in joint angle errors. Moreover, OpenCapBench allows users to benchmark their own developed models on our clinically relevant cohort. Overall, OpenCapBench bridges the computer vision and biomechanics communities, aiming to drive simultaneous advances in both areas.
arXiv code/data URL BibTeX
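The gap between position metrics and kinematic correctness is easy to state concretely: MPJPE measures where joints are, while biomechanics cares about the angles between segments. A toy comparison of the two kinds of metric (the names are ours, not OpenCapBench's API):

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean Per Joint Position Error: mean Euclidean distance over joints."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def joint_angle(parent, joint, child) -> float:
    """Angle at `joint` (degrees) between the segments to `parent` and `child`."""
    u = np.asarray(parent, float) - np.asarray(joint, float)
    v = np.asarray(child, float) - np.asarray(joint, float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))
```

Two estimates with similar MPJPE can still imply quite different elbow or knee angles, which is why the benchmark evaluates through joint angles from OpenSim's musculoskeletal model rather than keypoint positions alone.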

Perceiving Systems Thesis Dynamic 3D Synthesis: From Video-Based Animatable Head Avatars to Text-Guided 4D Content Creation Zheng, Y. 2025 (Published)
The synthesis of 4D content (dynamic 3D content that evolves over time) has become increasingly important across a wide range of applications, including virtual communication, gaming, AR/VR, and digital content creation. Despite recent advances, generating realistic 4D content from accessible inputs remains a significant challenge. Existing approaches often rely on dense multi-camera capture systems, which are costly and impractical for everyday use, or yield results with limited geometric and visual fidelity. This thesis investigates two subtasks in 4D content creation: (1) the reconstruction of high-fidelity, animatable head avatars from accessible inputs such as monocular RGB videos, and (2) the generation of dynamic 4D scenes from text prompts and optionally sparse visual input, such as reference images. These two directions are unified by a common goal: enabling controllable and high-quality 4D content creation from minimal visual supervision. The first part of this thesis presents IMavatar, a morphable implicit surface representation for reconstructing personalized head avatars from monocular videos. Implicit surfaces provide topological flexibility and can recover detailed 3D geometry directly from RGB images, making them well suited for head avatar reconstruction. However, modeling expression- and pose-dependent deformations in an interpretable and generalizable way remains a major challenge when working with implicit representations. Inspired by 3D morphable models, IMavatar models deformation by learning expression blendshapes and skinning weight fields in a canonical space, enabling structured and generalizable control over novel expressions and poses. To enable end-to-end optimization from monocular videos, we propose a novel analytical gradient formulation that supports joint training of the geometry and deformation directly from RGB supervision.
By combining the geometric fidelity of neural implicit fields with the controllability of morphable models, IMavatar achieves high-quality 4D reconstructions and strong generalization to unseen expressions and head poses. The second part of this thesis presents PointAvatar, a deformable point-based representation for animatable 3D head avatars. While implicit representations are effective at learning detailed geometry from image observations, they are inherently difficult to animate and computationally expensive to render. To address these limitations, this work explores point clouds as the underlying geometric representation for head avatars, offering the efficiency of explicit representations while avoiding the fixed-topology constraints of meshes. PointAvatar uses a canonical point cloud combined with learned blendshape and skinning weight fields, and further disentangles intrinsic albedo from view-dependent shading to support relighting under novel illumination. To improve training stability and reconstruction quality, we adopt a coarse-to-fine strategy that gradually increases point cloud resolution during learning. This enables the model to effectively capture accurate geometry and high-quality texture from monocular RGB videos, including challenging cases such as eyeglasses and complex hairstyles. Compared to IMavatar, PointAvatar achieves an 8× speed-up during training and a 100× speed-up during inference rendering, while maintaining high visual and geometric quality. In the final part, this thesis explores Dream-in-4D, a diffusion-guided framework for generating creative 4D content from natural language. The focus is on synthesizing imaginative 4D scenes from minimal visual input—either a single image or no visual input at all. To this end, the method leverages prior knowledge from pre-trained image and video diffusion models to optimize a 4D representation. Dream-in-4D follows a two-stage pipeline. 
In the first stage, a static 3D model is optimized as a neural radiance field using guidance from both image and 3D-aware diffusion models, resulting in high-quality, view-consistent assets. In the second stage, a time-dependent, multi-resolution deformation field is introduced to represent motion and is optimized using video diffusion guidance, equipping the static 3D asset with detailed and plausible motion driven by text prompts. The resulting system supports text-to-4D, image-to-4D, and personalized 4D generation within a unified framework, enabling intuitive and flexible dynamic scene synthesis from highly accessible inputs. Together, these methods address two essential aspects of 4D content creation: the reconstruction of animatable head avatars from monocular videos, and the generation of dynamic, imaginative 4D scenes from text and image prompts. We hope these contributions advance the field toward more accessible, controllable, and high-quality 4D content creation—enabling a broad range of applications across research, industry, and creative practice.
DOI URL BibTeX
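Both IMavatar and PointAvatar drive a canonical representation with learned blendshape and skinning weight fields. The deformation model underneath, linear blend skinning, can be sketched generically as follows (an illustration of the standard technique, not code from the thesis):

```python
import numpy as np

def linear_blend_skinning(points, weights, transforms):
    """Deform canonical points by a weighted blend of per-bone rigid transforms.

    points:     (N, 3) canonical positions
    weights:    (N, B) skinning weights; each row sums to 1
    transforms: (B, 4, 4) homogeneous per-bone transforms
    """
    homo = np.concatenate([points, np.ones((len(points), 1))], axis=1)  # (N, 4)
    # per-point blended transform, then apply it to the homogeneous point
    blended = np.einsum('nb,bij->nij', weights, transforms)             # (N, 4, 4)
    deformed = np.einsum('nij,nj->ni', blended, homo)                   # (N, 4)
    return deformed[:, :3]
```

In the avatar models described above, the weights are not fixed to a template mesh but predicted as a field over canonical space, which is what allows generalization to unseen poses and expressions.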

Perceiving Systems Conference Paper Joker: Conditional 3D Head Synthesis with Extreme Facial Expressions Prinzler, M., Zakharov, E., Sklyarova, V., Kabadayi, B., Thies, J. In International Conference on 3D Vision (3DV), 2025 (Published)
We introduce Joker, a new method for the conditional synthesis of 3D human heads with extreme expressions. Given a single reference image of a person, we synthesize a volumetric human head with the reference’s identity and a new expression. We offer control over the expression via a 3D morphable model (3DMM) and textual inputs. This multi-modal conditioning signal is essential since 3DMMs alone fail to define subtle emotional changes and extreme expressions, including those involving the mouth cavity and tongue articulation. Our method is built upon a 2D diffusion-based prior that generalizes well to out-of-domain samples, such as sculptures, heavy makeup, and paintings, while achieving high levels of expressiveness. To improve view consistency, we propose a new 3D distillation technique that converts predictions of our 2D prior into a neural radiance field (NeRF). Both the 2D prior and our distillation technique produce state-of-the-art results, which are confirmed by our extensive evaluations. Also, to the best of our knowledge, our method is the first to achieve view-consistent extreme tongue articulation.
project page arxiv BibTeX