Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Haptic Intelligence Ph.D. Thesis Haptify: A Measurement-Based System for Quantifying the Quality of Haptic Interfaces Fazlollahi, F. University of Tübingen, Tübingen, Germany, March 2026, Department of Computer Science (Published)
Grounded force-feedback (GFF) devices, exoskeletons, and other haptic robots modulate human movement through carefully engineered mechanical, electrical, and computational designs. Given their significant societal potential and often high cost, it is essential to fairly and efficiently assess the quality of these intimate cyber-physical interfaces. However, existing device specifications and low-level performance metrics often fail to capture the nuanced qualities that expert users perceive during hands-on experimentation. To address this gap, this thesis introduces Haptify, a comprehensive benchmarking system that can thoroughly, fairly, and noninvasively evaluate GFF haptic devices. Haptify integrates multiple sensing modalities - a seven-camera optical motion-capture system, a custom-built 60-cm-square force plate, and an instrumented end-effector that can be adapted to different devices - to record the interaction between the human hand, the device, and the ground during both passive and active experiments. With this setup, users hold the device end-effector and move it through a series of carefully designed tasks while Haptify measures kinematic and kinetic responses. From this process, we establish six key ways to assess GFF device performance: workspace shape, global free-space forces, global free-space vibrations, local dynamic forces and torques, frictionless surface rendering, and stiffness rendering. These benchmarks enable systematic evaluation and comparison across devices. We first apply Haptify to benchmark two GFF devices produced by 3D Systems: the widely used Touch and the more expensive Touch X. Results reveal that the Touch X offers a slightly smaller workspace than the Touch, but it produces smaller and more predictable free-space forces, reduced vibrations, more consistent dynamic forces and torques, and higher-quality rendering of both frictionless surfaces and stiff virtual objects. To further validate and extend our approach, we conducted a user study with sixteen expert hapticians who used Haptify to evaluate four commercial GFF devices: Novint Falcon, Force Dimension Omega.3, Touch, and Touch X. Experts tested the devices in unpowered mode and across five representative virtual benchmark environments, providing extensive quantitative ratings and qualitative feedback. We distilled recurring themes from their input and analyzed correlations between expert opinions and sensor-based measurements. Our findings show that expert judgments of fundamental haptic quality indicators align closely with the metrics derived from Haptify. Moreover, a device's performance both unpowered and in active benchmarks can be used to predict its suitability for more complex applications, such as teleoperated surgery. By linking expert assessments with external measurement data, this thesis establishes a combined qualitative-quantitative framework for benchmarking haptic robots. This approach not only enables fair comparison across diverse devices but also establishes a direct connection between objective measurements and the subjective expertise of experienced hapticians. In doing so, it lays the foundation for more rigorous, transparent, and application-relevant evaluation of haptic technologies.
BibTeX
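To give a concrete flavor of the kind of summary statistics a benchmark like Haptify reports, here is a minimal sketch that reduces a recorded free-space force trace to mean, RMS, and peak magnitudes. The function name, array shapes, and the synthetic recording are illustrative assumptions, not taken from the thesis.

```python
import numpy as np

def free_space_force_summary(forces_n):
    """Summarize forces recorded while a device should render zero force.

    forces_n: (T, 3) array of measured forces in newtons (hypothetical
    recording; Haptify's actual pipeline and metrics are more detailed).
    """
    magnitudes = np.linalg.norm(forces_n, axis=1)
    return {
        "mean_force_N": float(magnitudes.mean()),
        "rms_force_N": float(np.sqrt(np.mean(magnitudes ** 2))),
        "peak_force_N": float(magnitudes.max()),
    }

# Synthetic stand-in for a free-space recording (about 50 mN of noise).
rng = np.random.default_rng(0)
fake_recording = 0.05 * rng.standard_normal((1000, 3))
print(free_space_force_summary(fake_recording))
```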

Haptic Intelligence Dynamic Locomotion Ph.D. Thesis The Human Leg Catapult: Biological Mechanisms for Walking Gait Replicated in the EcoWalker Robot Kiss, B. University of Stuttgart, Stuttgart, Germany, March 2026, Faculty of Civil and Environmental Engineering (Published)
Humanoid robots and assistive devices have yet to match the efficiency and adaptability of able-bodied human walking in challenging environments. To bridge this performance gap, my projects explored the underlying mechanisms of human locomotion, focusing on the ankle push-off. Ankle push-off has a prominent role in walking due to its high-power output at the end of the stance phase, and due to the impact of its timing on the adaptability to diverse environments. The human leg catapult analogy provides a framework for the projects to understand and replicate the complex biological mechanisms that govern human walking gait. As a platform for the replication, the human-like bipedal EcoWalker robot was developed from version 1 to 3 in three consecutive projects, with iterative design and control updates tailored to each project's goals. Our findings provide insights into the separate roles of mono- and biarticular muscle-tendon units in the human leg catapult, while we also show functional details of the human leg catapult release mechanism through five distinct release processes on the EcoWalker robot. Utilizing the robot in the projects ensures that our findings are relevant to practical applications, allowing humanoid robot and assistive device developers to build on our insights, potentially reducing the performance gap in efficiency and adaptability between able-bodied human walking and artificial walking.
BibTeX

Haptic Intelligence Ph.D. Thesis Modeling, Fabricating, and Evaluating Synergistic Soft‑Rigid Actuators Gertler, I. University of Stuttgart, Stuttgart, Germany, February 2026, Faculty of Engineering Design, Production Engineering and Automotive Engineering (Published)
Soft actuators offer lightweight, compliant, and safe alternatives to traditional mechanisms, but they often incur complicated actuation schemes, bulky support systems, and limited functionality when made solely from soft materials. Soft‑rigid designs that integrate rigid elements into primarily soft bodies are common, yet the potential of those rigid parts to shape actuation behavior without compromising the overall softness remains underexplored, and fabrication practices often lack reproducibility. This thesis presents two case studies of synergistic hybrid actuation systems that utilize the complementary roles of soft and rigid components to dictate temporal and spectral behavior in response to simple input commands. Between the soft and hard components, one is typically active, while the other is passive. The first case study implements a soft-active/rigid-passive approach for the medical robotics application of endoluminal locomotion. A thin hyperelastic balloon encased in an inextensible sleeve is coupled with a thicker, non-encased balloon on a single fluid supply to serve as front and rear anchors, respectively. Geometry and material selection reshape the pressure-stretch response so the rear anchor inflates and deflates before the front anchor, enabling asymmetric sequencing useful for peristaltic locomotion inside a lumen. Numerical simulation and experiments validate the characteristic curves of dip-molded balloons and alternating anchoring in rigid tubes. The approach can be extended to generate actuation patterns for sequential haptic feedback and other robotic applications. The second case study applies a soft-passive/rigid-active strategy in the domain of fingertip haptic actuation. A dip‑molded silicone sheath with embedded miniature magnets, excited by a single air‑core coil, produces localized, rich vibrotactile feedback. Simulations, mechanical measurements, and user experiments with a single-magnet design show consistent frequency‑dependent behavior and strong perceptual salience. In follow-on work, various dual‑magnet arrangements were also simulated, fabricated, and thoroughly evaluated. Classification tests indicate that frequency content is more important for perception than magnet orientation, while a realism‑rating experiment supports the feasibility of audio-driven simple commands for realistic haptic feedback. The device is demonstrated on the fingertip in virtual reality and could be adapted for other body locations for navigation, rehabilitation, or related applications. Together, these studies provide design rules, a simulation-fabrication-validation workflow, and reproducible fabrication practices for soft-rigid hybrid actuators that realize desired mechanical outputs from minimal actuation commands. The methods and findings generalize to other soft actuators and have potential applications in domains such as medical devices, wearable technologies, and soft sensing.
BibTeX
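The sequential-anchoring idea relies on the characteristic non-monotonic pressure-stretch curve of hyperelastic balloons. The sketch below uses the textbook thin-walled neo-Hookean spherical membrane relation, not the dip-molded balloon model developed in the thesis, and invented parameters, to show how two balloons on one fluid supply can have different peak pressures and therefore inflate in sequence.

```python
import numpy as np

def neo_hookean_balloon_pressure(stretch, shear_modulus_pa, thickness_m, radius_m):
    """Inflation pressure of a thin spherical neo-Hookean balloon.

    Textbook membrane result: P = 2*mu*t0/r0 * (stretch**-1 - stretch**-7).
    The curve rises to a peak and then falls, so on a shared supply the
    balloon with the lower peak pressure tends to inflate first.
    (Illustrative model only; parameters below are invented.)
    """
    return 2.0 * shear_modulus_pa * thickness_m / radius_m * (
        stretch ** -1.0 - stretch ** -7.0
    )

stretches = np.linspace(1.01, 4.0, 400)
# Two hypothetical balloons: a thin rear anchor and a thicker front anchor.
rear = neo_hookean_balloon_pressure(stretches, 100e3, 0.3e-3, 5e-3)
front = neo_hookean_balloon_pressure(stretches, 100e3, 0.6e-3, 5e-3)
print(f"rear anchor peak pressure:  {rear.max() / 1e3:.1f} kPa")
print(f"front anchor peak pressure: {front.max() / 1e3:.1f} kPa")
```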

Perceiving Systems Ph.D. Thesis From Perception to Actions: Autonomous Exploration, Synthetic Data, and Dynamic Worlds Bonetto, E. January 2026 (Published)
This thesis explores innovative methods and frameworks to enhance intelligent systems' visual perception capabilities. Vision is the primary means by which many animals perceive, understand, learn, reason about, and interact with the world to achieve their goals. Unlike animals, intelligent systems must acquire these capabilities by processing raw visual data captured by cameras using computer vision and deep learning. First, we consider a crucial aspect of visual perception in intelligent systems: understanding the structure and layout of the environment. To enable applications such as object interaction or extended reality in previously unseen spaces, these systems are often required to estimate their own motion. When operating in novel environments, they must also construct a map of the space. Together, we have the essence of the Simultaneous Localization and Mapping (SLAM) problem. However, pre-mapping environments can be impractical, costly, and unscalable in scenarios like disaster response or home automation. This makes it essential to develop robots capable of autonomously exploring and mapping unknown areas, a process known as Active SLAM. Active SLAM typically involves a multi-step process in which the robot acts on the available information to decide the next best actions. The goal is to autonomously and efficiently explore environments without using prior information. Despite an extensive history, Active SLAM methods have focused only on short- or long-term objectives, without considering the totality of the process or adapting to the ever-changing states. Addressing these gaps, we introduce iRotate to capitalize on continuous information-gain prediction. Distinct from prevailing approaches, iRotate constantly (pre)optimizes camera viewpoints acting on i) long-term, ii) short-term, and iii) real-time objectives. By doing this, iRotate significantly reduces energy consumption and localization errors, thus diminishing the exploration effort - a substantial leap in efficiency and effectiveness. iRotate, like many other SLAM approaches, leverages the assumption of operating in a static environment. Dynamic components in the scene significantly impact SLAM performance in the localization, place recognition, and optimization steps, hindering the widespread adoption of autonomous robots. This stems from the difficulties of collecting diverse ground truth information in the real world and the long-standing limitations of simulation tools. Testing directly in the real world is costly and risky without prior simulation validation. Datasets instead are inherently static and non-interactive, making them useless for developing autonomous approaches. Finally, existing simulation tools often lack the visual realism and flexibility to create and control fully customized experiments to bridge the gap between simulation and the real world. This thesis addresses the challenges of obtaining ground truth data and simulating dynamic environments by introducing the GRADE framework. Through a photorealistic rendering engine, we enable online and offline testing of robotic systems and the generation of richly annotated synthetic ground truth data. By ensuring flexibility and repeatability, we allow the extension of previous experiments through variations, for example, in scene content or sensor settings. Synthetic data can first be used to address several challenges in the context of Deep Learning (DL) approaches, e.g.
mismatched data distribution between applications, costs and limits of data collection procedures, and errors caused by incorrect or inconsistent labeling in training datasets. However, the gap between the real and simulated worlds often limits the direct use of synthetic data, making style transfer, adaptation techniques, or real-world information necessary. Here, we leverage the photorealism obtainable with GRADE to generate synthetic data and overcome these issues. First, since humans are significant sources of dynamic behavior in environments and the target of many applications, we focus on their detection and segmentation. We train models on real, synthetic, and mixed datasets, and show that using only synthetic data can lead to state-of-the-art performance in indoor scenarios. Then, we leverage GRADE to benchmark several Dynamic Visual SLAM methods. These often rely on semantic segmentation and optical flow techniques to identify moving objects and exclude their visual features from the pose estimation and optimization processes. Our evaluations show how they tend to reject too many features, leading to failures in accurately and fully tracking camera trajectories. Surprisingly, we observed low tracking rates not only on simulated sequences but also in real-world datasets. Moreover, we show that the performance of the segmentation and detection models used is not always positively correlated with that of the Dynamic Visual SLAM methods. These failures are mainly due to incorrect estimations, crowded scenes, and not considering the different motion states that objects can have. Addressing this, we introduce DynaPix. This Dynamic Visual SLAM method estimates per-pixel motion probabilities and incorporates them into enhanced pose estimation and optimization processes within the SLAM backend, resulting in longer tracking times and lower trajectory errors. Finally, we use GRADE to address the challenge of limited and inaccurate annotations of wild zebras, particularly for their detection and pose estimation when observed by unmanned aerial vehicles. Leveraging the flexibility of GRADE, we introduce ZebraPose - the first full top-down synthetic-to-real detection and 2D pose estimation method. Unlike previous approaches, ZebraPose demonstrates that both tasks can be performed using only synthetic data, eliminating the need for costly data collection campaigns, time-consuming annotation procedures, or synthetic-to-real transfer techniques. Ultimately, this thesis demonstrates how combining perception with action can overcome critical limitations in robotics and environmental perception, thereby advancing the deployment of intelligent and autonomous systems for real-world applications. Through innovations like iRotate, GRADE, and ZebraPose, it paves the way for more robust, flexible, and efficient intelligent systems capable of navigating dynamic environments.
Thesis: From Perception to Actions BibTeX
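As a toy illustration of the per-pixel motion-probability idea behind DynaPix, the sketch below down-weights feature residuals by an estimated probability of belonging to a moving object when accumulating a pose-estimation cost. The weighting scheme, numbers, and residuals are placeholders invented for this example, not the thesis formulation.

```python
import numpy as np

def weighted_reprojection_cost(residuals_px, motion_probability):
    """Toy motion-probability weighting for a SLAM front end.

    residuals_px:       (N,) reprojection errors of tracked features (pixels)
    motion_probability: (N,) estimated probability that a feature lies on a
                        moving object (0 = static, 1 = dynamic)

    Likely-dynamic features are down-weighted instead of discarded outright.
    """
    weights = 1.0 - np.asarray(motion_probability, dtype=float)
    residuals = np.asarray(residuals_px, dtype=float)
    return float(np.sum(weights * residuals ** 2) / np.sum(weights))

rng = np.random.default_rng(1)
residuals = np.abs(rng.normal(1.0, 0.5, size=200))   # mostly static background
residuals[:20] += 8.0                                 # features on a walking person
motion_prob = np.zeros(200)
motion_prob[:20] = 0.9
print("unweighted cost:", float(np.mean(residuals ** 2)))
print("weighted cost:  ", weighted_reprojection_cost(residuals, motion_prob))
```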

Perceiving Systems Ph.D. Thesis Physics-Informed Modeling of Dynamic Humans and Their Interactions Shashank, T. January 2026 (Published)
Building convincing digital humans is central to the vision of shared virtual worlds for AR, VR, and telepresence. Yet, despite rapid progress in 3D vision, today’s virtual humans often fall into a physical “uncanny valley”—bodies float above or penetrate objects, motions ignore balance and biomechanics, and human-object interactions miss the rich contact patterns that make behavior look real. Enforcing physics through simulation is possible, but remains too slow, restrictive, and brittle for real-world, in-the-wild settings. This thesis argues that physical realism does not require full simulation. Instead, it can emerge from the same principles humans rely on every day: intuitive physics and contact. Inspired by insights from biomechanics and cognitive science, I present a unified framework that embeds these ideas directly into learning-based 3D human modeling. In this thesis, I present a suite of methods that bridge the gap between 3D human reconstruction and physical plausibility. I first introduce IPMAN, which incorporates differentiable biomechanical cues, such as center of mass and center of pressure, to produce stable, balanced, and grounded static poses. I then extend this framework to dynamic motion with HUMOS, a shape-conditioned motion generation model that accounts for how individual physiology influences movement, without requiring paired training data. Moving beyond locomotion, I address complex human-object interactions with DECO, a 3D contact detector that estimates dense, vertex-level contact across the full body surface. Finally, I present PICO, which establishes contact correspondences between the human body and arbitrary objects to recover full 3D interactions from single images. Together, these contributions bring physics-aware human modeling closer to practical deployment. The result is a step toward digital humans that not only look right, but move and interact with the world in ways that feel intuitively real.
Thesis BibTeX
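One of the biomechanical cues mentioned above, the center of mass relative to the base of support, can be checked with a few lines of geometry. The sketch below is an illustrative stand-in (hypothetical body points, masses, and foot-contact polygon), not the differentiable stability terms used in IPMAN.

```python
import numpy as np

def center_of_mass(points, masses):
    """Mass-weighted centroid of body points (stand-in for a body mesh)."""
    return (masses[:, None] * points).sum(axis=0) / masses.sum()

def com_inside_support(com_xy, support_polygon_xy):
    """True if the ground-projected center of mass lies inside a convex
    support polygon given in counter-clockwise order (e.g. foot contacts)."""
    poly = np.asarray(support_polygon_xy, dtype=float)
    for i in range(len(poly)):
        a, b = poly[i], poly[(i + 1) % len(poly)]
        edge, to_point = b - a, com_xy - a
        if edge[0] * to_point[1] - edge[1] * to_point[0] < 0.0:
            return False
    return True

# Hypothetical numbers: a body leaning slightly forward over two feet.
body_points = np.array([[0.00, 0.00, 1.0], [0.05, 0.02, 0.5], [0.02, 0.00, 0.1]])
masses = np.array([40.0, 25.0, 10.0])
com = center_of_mass(body_points, masses)
feet = [(-0.10, -0.15), (0.15, -0.15), (0.15, 0.15), (-0.10, 0.15)]  # CCW order
print("COM:", com.round(3), "stable:", com_inside_support(com[:2], feet))
```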

Perceiving Systems Ph.D. Thesis Aerial Robot Formations for Dynamic Environment Perception Price, E. University of Tübingen, Tübingen, Germany, December 2025 (Published)
Perceiving moving subjects, like humans and animals, outside an enclosed and controlled environment in a lab is inherently challenging, since subjects could move outside the view and range of cameras and sensors that are static and extrinsically calibrated. Previous state-of-the-art methods for such perception in outdoor scenarios use markers or sensors on the subject, which are both intrusive and unscalable for animal subjects. To address this problem, we introduce robotic flying cameras that autonomously follow the subjects. To enable functions such as monitoring, behaviour analysis or motion capture, a single point of view is often insufficient due to self-occlusion, lack of depth perception and coverage from all sides. Therefore, we propose a team of such robotic cameras that fly in formation to provide continuous coverage from multiple view-points. The position of the subject must be determined using markerless, remote sensing methods in real time. To solve this, we combine a convolutional neural network-based detector to detect the subject with a novel cooperative Bayesian fusion method to track the detected subject from multiple robots. The robots then need to plan and control their own flight path and orientation relative to the subject to achieve and maintain continuous coverage from multiple view-points. We address this with a model-predictive-control-based method to predict and plan the motion of every robot in the formation around the subject. A preliminary demonstrator is implemented with multi-rotor drones. However, drones are noisy and potentially unsafe for the observed subjects. To address this, we introduce non-holonomic lighter-than-air autonomous airships (blimps) as the robotic camera platform. This type of robot requires dynamically constrained orbiting formations to achieve omnidirectional visual coverage of a moving subject in the presence of wind. Therefore, we introduce a novel model-predictive formation controller for a team of airships. We demonstrate and evaluate our complete system in field experiments involving both humans and wild animals as subjects. The collected data enables both human outdoor motion capture and animal behaviour analysis. Additionally, we propose our method for autonomous long-term wildlife monitoring. This dissertation covers the design and evaluation of aerial robots suitable to this task, including computer vision/sensing, data annotation and network training, sensor fusion, planning, control, simulation, and modelling.
Thesis DOI BibTeX
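The cooperative fusion step can be pictured as combining several uncertain position estimates of the same subject. The sketch below shows standard precision-weighted Gaussian fusion with made-up drone measurements; the thesis method is a cooperative Bayesian filter, so treat this only as the underlying intuition.

```python
import numpy as np

def fuse_gaussian_detections(means, covariances):
    """Fuse independent Gaussian position estimates of one subject.

    means:       list of (3,) subject positions estimated by different robots
    covariances: list of (3, 3) measurement covariances

    Standard information-form fusion: precision-weighted averaging.
    """
    information = np.zeros((3, 3))
    information_vector = np.zeros(3)
    for mean, cov in zip(means, covariances):
        precision = np.linalg.inv(cov)
        information += precision
        information_vector += precision @ mean
    fused_cov = np.linalg.inv(information)
    return fused_cov @ information_vector, fused_cov

# Two hypothetical drones observing the same person with different certainty.
m1, c1 = np.array([2.0, 1.0, 0.0]), np.diag([0.5, 0.5, 1.0])
m2, c2 = np.array([2.3, 0.8, 0.1]), np.diag([0.1, 0.1, 0.3])
fused_mean, fused_cov = fuse_gaussian_detections([m1, m2], [c1, c2])
print("fused position:", fused_mean.round(3))
```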

Perceiving Systems Ph.D. Thesis Learning Hands in Action Fan, Z. December 2025 (Published)
Hands are our primary interface for acting on the world. From everyday tasks like preparing food to skilled procedures like surgery, human activity is shaped by rich and varied hand interactions. These include not only manipulation of external objects but also coordinated actions between both hands. For physical AI systems to learn from human behavior, assist in physical tasks, or collaborate safely in shared environments, they must perceive and understand hands in action, how we use them to interact with each other and with the objects around us. A key component of this understanding is the ability to reconstruct human hand motion and hand-object interactions in 3D from RGB images or videos. However, existing methods focus largely on estimating the pose of a single hand, often in isolation. They struggle with scenarios involving two hands in strong interactions or the interactions with objects, particularly when those objects are articulated or previously unseen. This is because reconstructing 3D hands in action poses significant challenges, such as severe occlusions, appearance ambiguities, and the need to reason about both hand and object geometry in dynamic configurations. As a result, current systems fall short in complex real-world environments. This dissertation addresses these challenges by introducing methods and data for reconstructing hands in action from monocular RGB inputs. We begin by tackling the problem of interacting hand pose estimation. We present DIGIT, a method that leverages a part-aware semantic prior to disambiguate closely interacting hands. By explicitly modeling hand part interactions and encoding the semantics of finger parts, DIGIT robustly recovers accurate hand poses, outperforms prior baselines and provides a step forward for more complete 3D hands in action understanding. Since hands frequently manipulate objects, jointly reconstructing both is crucial. Existing methods for hand-object reconstruction are limited to rigid objects and cannot handle tools with articulation, such as scissors or laptops. This severely restricts their ability to model the full range of everyday manipulations. We present the first method that jointly reconstructs two hands and an articulated object from a single RGB image, enabling unified reasoning across both rigid and articulated object interactions. To support this, we introduce ARCTIC, a large-scale motion capture dataset of humans performing dexterous bimanual manipulation with articulated tools. ARCTIC includes both articulated and fixed (rigid) configurations, along with accurate 3D annotations of hand poses and object motions. Leveraging this dataset, our method jointly infers object articulation states, and hand poses, advancing the state of hand-object understanding in complex object manipulation settings. Finally, we address generalization to in-the-wild object interactions. Prior approaches either rely on synthetic data with limited realism or require object models at test time. We introduce HOLD, a self-supervised method that learns to reconstruct 3D hand-object interactions from monocular RGB videos, without paired 3D annotations or known object models. HOLD learns via an appearance- and motion-consistent objective across views and time, enabling strong generalization to unseen objects in interaction. Experiments demonstrate HOLD's ability to generalize to in-the-wild monocular settings, outperforming fully-supervised baselines trained on synthetic or lab-captured datasets. 
Together, DIGIT, ARCTIC, and HOLD advance the 3D understanding of hands in action, covering both hand-hand and hand-object interactions. These contributions improve the robustness in interacting hand pose estimation, introduce a dataset for bimanual manipulation with rigid and articulated tools, and include the first single-image method for jointly reconstructing hands and articulated objects learned directly from this dataset. In addition, HOLD removes the need for object templates by enabling hand-object reconstruction in the wild. These developments move toward more scalable physical AI systems capable of interpreting and imitating human manipulation, with applications in teleoperation, human-robot collaboration, and embodied learning from demonstration.
PDF BibTeX

Haptic Intelligence Perceiving Systems Ph.D. Thesis An Interdisciplinary Approach to Human Pose Estimation: Application to Sign Language Forte, M. University of Tübingen, Tübingen, Germany, November 2025, Department of Computer Science (Published)
Accessibility legislation mandates equal access to information for Deaf communities. While videos of human interpreters provide optimal accessibility, they are costly and impractical for frequently updated content. AI-driven signing avatars offer a promising alternative, but their development is limited by the lack of high-quality 3D motion-capture data at scale. Vision-based motion-capture methods are scalable but struggle with the rapid hand movements, self-occlusion, and self-touch that characterize sign language. To address these limitations, this dissertation develops two complementary solutions. SGNify improves hand pose estimation by incorporating universal linguistic rules that apply to all sign languages as computational priors. Proficient signers recognize the reconstructed signs as accurately as those in the original videos, but depth ambiguities along the camera axis can still produce incorrect reconstructions for signs involving self-touch. To overcome this remaining limitation, BioTUCH integrates electrical bioimpedance sensing between the wrists of the person being captured. Systematic measurements show that skin-to-skin contact produces distinctive bioimpedance reductions at high frequencies (240 kHz to 4.1 MHz), enabling reliable contact detection. BioTUCH uses the timing of these self-touch events to refine arm poses, producing physically plausible arm configurations and significantly reducing reconstruction error. Together, these contributions support the scalable collection of high-quality 3D sign language motion data, facilitating progress toward AI-driven signing avatars.
BibTeX
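A minimal way to picture the contact-detection step is thresholding a bioimpedance trace against a contact-free baseline, since the abstract reports distinctive impedance reductions during skin-to-skin contact. All magnitudes, the threshold, and the synthetic trace below are invented for illustration; the thesis uses calibrated multi-frequency measurements.

```python
import numpy as np

def detect_self_touch(impedance_ohm, drop_fraction=0.10, baseline_samples=100):
    """Flag samples whose bioimpedance drops well below a contact-free baseline.

    The baseline is taken from the first `baseline_samples` samples, assumed
    to be contact-free; a sample is flagged when it falls more than
    `drop_fraction` below that baseline. All values here are illustrative.
    """
    impedance = np.asarray(impedance_ohm, dtype=float)
    baseline = impedance[:baseline_samples].mean()
    return impedance < (1.0 - drop_fraction) * baseline

# Synthetic trace: steady impedance with a dip while the hands touch.
trace = np.full(600, 400.0)
trace[250:350] = 340.0            # hypothetical self-touch interval
contact = detect_self_touch(trace)
first = int(contact.argmax())
last = int(len(contact) - contact[::-1].argmax() - 1)
print(f"contact detected from sample {first} to {last}")
```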

Empirical Inference Ph.D. Thesis Probabilistic Machine Learning for Real-Time Gravitational-Wave Inference Dax, M. Eberhard Karls Universität Tübingen, July 2025, (MPI IS + ELLIS Institute Tübingen) (Published) BibTeX

Haptic Intelligence Ph.D. Thesis Towards Robust and Flexible Robot State and Motion Estimation through Optimization and Learning Nubert, J. ETH Zurich, Zurich, Switzerland, June 2025, Department of Mechanical and Process Engineering (Published) BibTeX

Empirical Inference Ph.D. Thesis Scalable Gaussian Processes: Advances in Iterative Methods and Pathwise Conditioning Lin, J. University of Cambridge, UK, May 2025, (Cambridge-Tübingen-Fellowship-Program) (Published) BibTeX

Empirical Inference Ph.D. Thesis The Geometry of Learning Via Loss Landscape Curvature Singh, S. P. ETH Zurich, Switzerland, May 2025, CLS Fellowship Program (Published) BibTeX

Perceiving Systems Ph.D. Thesis Estimating Human and Camera Motion From RGB Data Kocabas, M. April 2025 (Published)
This thesis presents a unified framework for markerless 3D human motion analysis from monocular videos, addressing three interrelated challenges that have limited the fidelity of existing approaches: (i) achieving temporally consistent and physically plausible human motion estimation, (ii) accurately modeling perspective camera effects in unconstrained settings, and (iii) disentangling human motion from camera motion in dynamic scenes. Our contributions are realized through three complementary methods. First, we introduce VIBE (Video Inference for Body Pose and Shape Estimation), a novel video pose and shape estimation framework. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose VIBE, which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. Second, we propose SPEC (Seeing People in the wild with Estimated Cameras), the first in-the-wild 3D human pose and shape (HPS) method that estimates the perspective camera from a single image and employs this to reconstruct 3D human bodies more accurately. Due to the lack of camera parameter information for in-the-wild images, existing 3D HPS estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold, and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. SPEC first trains a neural network to estimate the field of view, camera pitch, and roll given an input image, employing novel losses that improve the camera calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Third, we develop PACE (Person And Camera Estimation), a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the entangling of human and camera motions in the video. Existing works assume the camera is static and focus on solving for the human motion in camera space. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features.
Unlike existing methods that use Simultaneous Localization and Mapping (SLAM) as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on synthetic and real-world datasets, including standard benchmarks and the new datasets we introduce, demonstrate that our integrated approach substantially outperforms prior methods in recovering both human and camera motions, as well as in temporal consistency, reconstruction accuracy, and global motion estimation. While these results represent a significant advance in markerless human motion analysis, further work is needed to extend these techniques to multi-person scenarios, severe occlusions, and real-time applications. Overall, this thesis lays a strong foundation for more robust and accurate human motion analysis in unconstrained environments, with promising applications in robotics, augmented reality, sports analysis, and beyond.
Thesis PDF BibTeX
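The SPEC part of the abstract hinges on replacing the weak-perspective assumption with an estimated perspective camera. The sketch below shows, with invented numbers, how a field-of-view estimate converts to a focal length and how full perspective and weak-perspective projections of the same body points diverge; it illustrates the geometry only, not SPEC's networks or losses.

```python
import numpy as np

def focal_from_fov(fov_deg, image_size_px):
    """Focal length in pixels from a field of view: f = (H/2) / tan(fov/2)."""
    return 0.5 * image_size_px / np.tan(np.radians(fov_deg) / 2.0)

def perspective_project(points_cam, focal_px, principal_point):
    """Full perspective projection of camera-frame 3D points."""
    pts = np.asarray(points_cam, dtype=float)
    return focal_px * pts[:, :2] / pts[:, 2:3] + principal_point

def weak_perspective_project(points_cam, focal_px, principal_point):
    """Weak perspective: every point shares the body's mean depth."""
    pts = np.asarray(points_cam, dtype=float)
    return focal_px * pts[:, :2] / pts[:, 2].mean() + principal_point

# Hypothetical shoulder and foot points at different depths, 55-degree FoV.
focal = focal_from_fov(55.0, 1080)
pp = np.array([540.0, 540.0])
body = np.array([[0.0, -0.4, 2.5], [0.1, 0.8, 3.1]])
print("perspective:     ", perspective_project(body, focal, pp).round(1))
print("weak perspective:", weak_perspective_project(body, focal, pp).round(1))
```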

Perceiving Systems Ph.D. Thesis Understanding Human-Scene Interaction through Perception and Generation Yi, H. April 2025 (Published)
Humans are in constant contact with the world as they move through it and interact with it. Understanding Human-Scene Interactions (HSIs) is key to enhancing our perception and manipulation of three-dimensional (3D) environments, which is crucial for various applications such as gaming, architecture, and synthetic data creation. However, creating realistic 3D scenes populated by moving humans is a challenging and labor-intensive task. Existing human-scene interaction datasets are scarce and captured motion datasets often lack scene information. This thesis addresses these challenges by leveraging three specific types of HSI constraints: (1) depth ordering constraint: humans that move in a scene are occluded or occlude objects, thus defining the relative depth ordering of the objects, (2) collision constraint: humans move through free space and do not interpenetrate objects, (3) interaction constraint: when humans and objects are in contact, the contact surfaces occupy the same place in space. Building on these constraints, we propose three distinct methodologies: capturing HSI from a monocular RGB video, generating HSI by generating scenes from input human motions (scenes from humans), and generating human motion from scenes (humans from scenes). Firstly, we introduce MOVER, which jointly reconstructs 3D human motion and the interactive scenes from an RGB video. This optimization-based approach leverages these three aforementioned constraints to enhance the consistency and plausibility of reconstructed scene layouts and to refine the initial 3D human pose and shape estimations. Secondly, we present MIME, which takes 3D humans and a floor map as input to create realistic and interactive 3D environments. This method applies collision and interaction constraints, and employs an auto-regressive transformer architecture that integrates objects into the scene based on existing human motion. The training data is enriched by populating the 3D FRONT scene dataset with 3D humans. By treating human movement as a “scanner” of the environment, this method results in furniture layouts that reflect true human activities, increasing the diversity and authenticity of the environments. Lastly, we introduce TeSMo, which generates 3D human motion from given 3D scenes and text descriptions, adhering to the collision and interaction constraints. It utilizes a text-controlled scene-aware motion generation framework based on denoising diffusion models. Annotated navigation and interaction motions are embedded within scenes to support the model’s training, allowing for the generation of diverse and realistic human-scene interactions tailored to specific settings and object arrangements. In conclusion, these methodologies significantly advance our understanding and synthesis of human-scene interactions, offering realistic modeling of 3D environments.
thesis BibTeX
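The collision constraint described above can be illustrated as a penetration penalty: human surface points with negative signed distance to an object are penalized. The sphere SDF, points, and values below are toy stand-ins, not the MOVER objective.

```python
import numpy as np

def sphere_sdf(points, center, radius):
    """Signed distance to a sphere: negative inside, positive outside."""
    return np.linalg.norm(points - center, axis=1) - radius

def collision_penalty(human_points, sdf, **sdf_kwargs):
    """Sum of squared penetration depths of human points inside an object."""
    penetration = np.minimum(sdf(human_points, **sdf_kwargs), 0.0)
    return float(np.sum(penetration ** 2))

# Hypothetical human surface points near an object modeled as a sphere.
points = np.array([[0.00, 0.00, 0.00],   # inside the object, penalized
                   [0.30, 0.00, 0.00],   # just outside, no penalty
                   [1.00, 1.00, 1.00]])  # far away, no penalty
print(collision_penalty(points, sphere_sdf, center=np.zeros(3), radius=0.25))
```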

Perceiving Systems Ph.D. Thesis Democratizing 3D Human Digitization Xiu, Y. March 2025 (Published)
Richard Feynman once said, “What I cannot create, I do not understand.” Similarly, making virtual humans more realistic helps us better grasp human nature. Simulating lifelike avatars has scientific value (such as in biomechanics) and practical applications (like the Metaverse). However, creating them affordably at scale with high quality remains challenging. Reconstructing complex poses, varied clothing, and unseen areas from casual photos under real-world conditions is still difficult. We address this through a series of works—ICON, ECON, TeCH, PuzzleAvatar—bridging pixel-based reconstruction with text-guided generation to reframe reconstruction as conditional generation. This allows us to turn everyday photos, like personal albums featuring random poses, diverse clothing, tricky angles, and arbitrary cropping, into 3D avatars. The process converts unstructured data into structured output without unnecessary complexity. With these techniques, we can efficiently scale up the creation of digital humans using readily available imagery.
Thesis BibTeX

Empirical Inference Ph.D. Thesis Learning to Generalize Across Distribution Shifts Träuble, F. J. University of Tübingen, Germany, March 2025, (IMPRS-PhD-Fellowship-Program and ELLIS-PhD-Fellowship-Program) (Published) BibTeX

Empirical Inference Ph.D. Thesis Predictions, Policies, Rewards: Models of Decision-Making from Observational Data Pace, A. ETH Zurich, Switzerland, February 2025, ETH AI Center-Fellowship-Program (Published) BibTeX

Haptic Intelligence Ph.D. Thesis Capturing and Recognizing Multimodal Surface Interactions as Embedded High-Dimensional Distributions Khojasteh, B. University of Stuttgart, Stuttgart, Germany, December 2024, Faculty of Engineering Design, Production Engineering and Automotive Engineering (Published)
Exploring a surface with a handheld tool generates complex contact signals that uniquely encode the surface's properties, a needle hidden in a haystack of data. Humans naturally integrate visual, auditory, and haptic sensory data during these interactions to accurately assess and recognize surfaces. However, enabling artificial systems to perceive and recognize surfaces with human-like proficiency remains a significant challenge. The complexity and dimensionality of multimodal sensor data, particularly in the intricate and dynamic modality of touch, hinder effective sensing and processing. Successfully overcoming these challenges will open up new possibilities in applications such as quality control, material documentation, and robotics. This dissertation addresses these issues at the levels of both the sensing hardware and the processing algorithms by introducing an automated similarity framework for multimodal surface recognition, developing a haptic-auditory test bed for acquiring high-quality surface data, and exploring optimal sensing configurations to improve recognition performance and robustness.
BibTeX
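As a rough illustration of multimodal surface recognition, the sketch below reduces an acceleration and a sound recording to a small feature vector and classifies a query surface by its nearest reference. The features, signals, and labels are synthetic placeholders; the thesis embeds much richer high-dimensional distributions.

```python
import numpy as np

def surface_features(acceleration, sound, sample_rate_hz):
    """Tiny multimodal feature vector for one tool-surface interaction.

    Only RMS energy and spectral centroid (in kHz) of each signal are used;
    the thesis works with far richer high-dimensional distributions.
    """
    def rms(x):
        return np.sqrt(np.mean(x ** 2))

    def centroid_khz(x):
        spectrum = np.abs(np.fft.rfft(x))
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate_hz)
        return np.sum(freqs * spectrum) / np.sum(spectrum) / 1000.0

    return np.array([rms(acceleration), centroid_khz(acceleration),
                     rms(sound), centroid_khz(sound)])

def nearest_surface(query, reference_features, labels):
    """Recognize a surface as its nearest neighbor in feature space."""
    distances = np.linalg.norm(reference_features - query, axis=1)
    return labels[int(np.argmin(distances))]

# Synthetic recordings standing in for two reference surfaces and a query.
rng = np.random.default_rng(4)
fs = 10_000
smooth = surface_features(0.1 * rng.standard_normal(fs), 0.05 * rng.standard_normal(fs), fs)
rough = surface_features(0.8 * rng.standard_normal(fs), 0.40 * rng.standard_normal(fs), fs)
query = surface_features(0.7 * rng.standard_normal(fs), 0.35 * rng.standard_normal(fs), fs)
print(nearest_surface(query, np.stack([smooth, rough]), ["smooth", "rough"]))
```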

Empirical Inference Ph.D. Thesis Causality for Natural Language Processing Jin, Z. University of Tübingen, Germany, December 2024, (ELLIS PhD student program) (Published) URL BibTeX

Haptic Intelligence Ph.D. Thesis Precision Haptics in Gait Retraining for Knee Osteoarthritis Rokhmanova, N. Carnegie Mellon University, Pittsburgh, USA, December 2024, Department of Mechanical Engineering (Published)
Gait retraining, or teaching patients to walk in ways that reduce joint loading, shows promise as a conservative intervention for knee osteoarthritis. However, its use in clinical settings remains limited by challenges in prescribing optimal gait patterns and delivering precise, real-time biofeedback. This thesis presents four interconnected studies that aim to address these barriers to clinical adoption: First, a regression model was developed to predict patient-specific biomechanical responses to a gait modification using only simple clinical measures, reducing the need for instrumented gait analysis. Second, we identified how inertial sensor accuracy fundamentally impacts motor learning outcomes during gait retraining, demonstrating the importance of reliable kinematic tracking. Third, we designed and validated an open-source wearable haptic platform called ARIADNE, which delivers precise vibrotactile motion guidance and enables rigorous comparison of feedback strategies for gait retraining. This platform's integrated sensing revealed how anatomical placement and tissue properties influence vibration transmission and perception. Finally, a gait retraining study demonstrated that vibrotactile feedback significantly improves both learning and retention of therapeutic gait patterns compared to verbal instruction alone, highlighting the critical role of precise biofeedback systems in rehabilitation. These contributions help advance the field's understanding of the sensorimotor principles underlying gait retraining while providing practical tools to support future clinical implementation.
BibTeX

Perceiving Systems Ph.D. Thesis Beyond the Surface: Statistical Approaches to Internal Anatomy Prediction Keller, M. University of Tübingen, November 2024 (Published)
The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. But to observe a subject’s anatomy, expensive medical devices (MRI or CT) are required and creating a digital model is often time-consuming and involves manual effort. Instead, we can leverage the fact that the shape of the body surface is correlated with the internal anatomy; indeed, the external body shape is related to the bone lengths, the angle of skeletal articulation, and the thickness of various soft tissues. In this thesis, we leverage the correlation between body shape and anatomy and aim to infer the internal anatomy solely from the external appearance. Learning this correlation requires paired observations of people’s body shape, and their internal anatomy, which raises three challenges. First, building such datasets requires specific capture modalities. Second, these data must be annotated, i.e. the body shape and anatomical structures must be identified and segmented, which is often a tedious manual task requiring expertise. Third, to learn a model able to capture the correlation between body shape and internal anatomy, the data of people with various shapes and poses has to be put into correspondence. In this thesis, we cover three works that focus on learning this correlation. We show that we can infer the skeleton geometry, the bone location inside the body, and the soft tissue location solely from the external body shape. First, in the OSSO project, we leverage 2D medical scans to construct a paired dataset of 3D body shapes and corresponding 3D skeleton shapes. This dataset allows us to learn the correlation between body and skeleton shapes, enabling the inference of a custom skeleton based on an individual’s body. However, since this learning process is based on static views of subjects in specific poses, we cannot evaluate the accuracy of skeleton inference in different poses. To predict the bone orientation within the body in various poses, we need dynamic data. To track bones inside the body in motion, we can leverage methods from the biomechanics field. So in the second work, instead of medical imaging, we use a biomechanical skeletal model along with simulation to build a paired dataset of bodies in motion and their corresponding skeletons. In this work, we build such a dataset and learn SKEL, a body shape and skeleton model that includes the locations of anatomical bones from any body shape and in any pose. After dealing with the skeletal structure, we broaden our focus to include different layers of soft tissues. In the third work, HIT, we leverage segmented medical data to learn to predict the distribution of adipose tissues (fat) and lean tissues (muscle, organs, etc.) inside the body.
pdf URL BibTeX

Perceiving Systems Ph.D. Thesis Aerial Markerless Motion Capture Saini, N. November 2024 (Published)
Human motion capture (mocap) is important for several applications such as healthcare, sports, animation etc. Existing markerless mocap methods employ multiple static and calibrated RGB cameras to infer the subject’s pose. These methods are not suitable for outdoor and unstructured scenarios. They need an extra calibration step before the mocap session and cannot dynamically adapt the viewpoint for the best mocap performance. A mocap setup consisting of multiple unmanned aerial vehicles with onboard cameras is ideal for such situations. However, estimating the subject’s motion together with the camera motions is an under-constrained problem. In this thesis, we explore multiple approaches where we split this problem into multiple stages. We obtain the prior knowledge or rough estimates of the subject’s or the cameras’ motion in the initial stages and exploit them in the final stages. In our work AirCap-Pose-Estimator, we use extra sensors (an IMU and a GPS receiver) on the multiple moving cameras to obtain the approximate camera poses. We use these estimates to jointly optimize the camera poses, the 3D body pose and the subject’s shape to robustly fit the 2D keypoints of the subject. We show that the camera pose estimates using just the sensors are not accurate enough, and our joint optimization formulation improves the accuracy of the camera poses while estimating the subject’s poses. Placing extra sensors on the cameras is not always feasible. That is why, in our work AirPose, we introduce a distributed neural network that runs on board, estimating the subject’s motion and calibrating the cameras relative to the subject. We utilize realistic human scans with ground truth to train our network. We further fine-tune it using a small amount of real-world data. Finally, we propose a bundle-adjustment method (AirPose+), which utilizes the initial estimates from our network to recover high-quality motions of the subject and the cameras. Finally, we consider a generic setup consisting of multiple static and moving cameras. We propose a method that estimates the poses of the cameras and the human relative to the ground plane using only 2D human keypoints. We learn a human motion prior using a large amount of human mocap data and use it in a novel multi-stage optimization approach to fit the SMPL human body model and the camera poses to the 2D keypoints. We show that in addition to the aerial cameras, our method works for smartphone cameras and standard RGB ground cameras. This thesis advances the field of markerless mocap which is currently limited to multiple static calibrated RGB cameras. Our methods allow the user to use moving RGB cameras and skip the extrinsic calibration. In the future, we will explore the usage of a single moving camera without even needing camera intrinsics.
thesis BibTeX
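The joint optimization described above fits body and camera parameters so that projected 3D joints match detected 2D keypoints. The sketch below shows only that reprojection residual with a toy camera and two joints; the full method optimizes over camera poses, body pose, and shape with additional sensor and prior terms.

```python
import numpy as np

def reprojection_residual(joints_3d_world, keypoints_2d, R, t, focal_px, pp):
    """Per-joint reprojection error used as a data term.

    joints_3d_world: (J, 3) body joints in world coordinates
    keypoints_2d:    (J, 2) detected 2D keypoints in pixels
    R, t:            camera rotation (3, 3) and translation (3,)
    """
    cam = joints_3d_world @ R.T + t                       # world -> camera frame
    projected = focal_px * cam[:, :2] / cam[:, 2:3] + pp  # pinhole projection
    return np.linalg.norm(projected - keypoints_2d, axis=1)

# Toy example: an identity camera 3 m from two joints, keypoints slightly noisy.
joints = np.array([[0.0, 0.0, 3.0], [0.2, -0.5, 3.0]])
focal, pp = 1000.0, np.array([640.0, 360.0])
clean = focal * joints[:, :2] / joints[:, 2:3] + pp
noisy = clean + np.array([[2.0, -1.0], [0.5, 3.0]])
print(reprojection_residual(joints, noisy, np.eye(3), np.zeros(3), focal, pp))
```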

Haptic Intelligence Ph.D. Thesis Data-Driven Needle Puncture Detection for the Delivery of Urgent Medical Care in Space L’Orsa, R. University of Calgary, Calgary, Canada, November 2024, Department of Electrical and Computer Engineering (Published)
Needle thoracostomy (NT) is a surgical procedure that treats one of the most preventable causes of trauma-related death: dangerous accumulations of air between the chest wall and the lungs. However, needle-tip overshoot of the target space can result in the inadvertent puncture of critical structures like the heart. This type of complication is fatal without urgent surgical care, which is not available in resource-poor environments like space. Since NT is done blind, operators rely on tool sensations to identify when the needle has reached its target. Needle instrumentation could enable puncture notifications to help operators limit tool-tip overshoot, but such a solution requires reliable puncture detection from manual (i.e., variable-velocity) needle insertion data streams. Data-driven puncture-detection (DDPD) algorithms are appropriate for this application, but their performance has historically been unacceptably low for use in safety-critical applications. This work contributes towards the development of an intelligent device for manual NT assistance by proposing two novel DDPD algorithms. Three data sets are collected that provide needle forces and displacements acquired during insertions into ex vivo porcine tissue analogs for the human chest, and factors affecting DDPD algorithm performance are analyzed in these data. Puncture event features are examined for each sensor, and the suitability of both accelerometer measurements and diffuse reflectance measurements are evaluated within the context of NT. Finally, DDPD ensembles are proposed that yield a 5.1-fold improvement in precision as compared to the traditional force-only DDPD approach. These results lay a foundation for improving the urgent delivery of percutaneous procedures in space and other resource-poor settings.
BibTeX
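A data-driven puncture detector ultimately has to spot the abrupt force release that follows a membrane puncture. The sketch below thresholds the derivative of a synthetic insertion-force trace; the threshold, signal, and sampling are invented, and the thesis ensembles are considerably more sophisticated.

```python
import numpy as np

def detect_puncture(force_n, time_s, drop_threshold_n_per_s=-25.0):
    """Flag candidate punctures as sharp drops in insertion force.

    A puncture through a tissue layer typically shows a force peak followed
    by a sudden release; this sketch simply thresholds the force derivative.
    (Threshold and signal below are illustrative, not from the thesis.)
    """
    dforce = np.gradient(np.asarray(force_n, dtype=float),
                         np.asarray(time_s, dtype=float))
    return np.flatnonzero(dforce < drop_threshold_n_per_s)

# Synthetic insertion: force ramps up, then releases abruptly at the puncture.
t = np.linspace(0.0, 2.0, 2000)
force = np.where(t < 1.2, 4.0 * t, 4.8 * np.exp(-30.0 * (t - 1.2)))
events = detect_puncture(force, t)
print("first candidate puncture at t =", round(float(t[events[0]]), 3), "s")
```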

Perceiving Systems Ph.D. Thesis Leveraging Unpaired Data for the Creation of Controllable Digital Humans Sanyal, S. Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, November 2024 (Published)
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels—such as identity, type of clothing, appearance of clothing, etc. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods—RingNet, SPICE, and SCULPT—each tackles different aspects of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images—such as identity consistency—can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses. Finally, SCULPT (Sanyal et al. [2024]) generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.
download BibTeX
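The identity-consistency idea behind RingNet can be sketched as a margin loss: shape estimates from images of the same person should be closer to each other than to another person's estimate. The embeddings and margin below are random placeholders, not the RingNet architecture or training data.

```python
import numpy as np

def ring_consistency_loss(same_subject_shapes, other_subject_shape, margin=0.5):
    """Margin loss encouraging shape agreement within one subject.

    same_subject_shapes: (K, D) shape estimates from K images of one person
    other_subject_shape: (D,) shape estimate from a different person

    Within-subject distances should be smaller, by a margin, than distances
    to the other subject (a simplified version of the consistency idea).
    """
    loss = 0.0
    for i in range(len(same_subject_shapes)):
        for j in range(i + 1, len(same_subject_shapes)):
            within = np.sum((same_subject_shapes[i] - same_subject_shapes[j]) ** 2)
            across = np.sum((same_subject_shapes[i] - other_subject_shape) ** 2)
            loss += max(0.0, within - across + margin)
    return loss

rng = np.random.default_rng(2)
subject_a = rng.normal(0.0, 0.1, size=(3, 10)) + 1.0   # three images, same person
subject_b = rng.normal(0.0, 0.1, size=10) - 1.0        # a different person
print("ring loss:", round(ring_consistency_loss(subject_a, subject_b), 4))
```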

Empirical Inference Ph.D. Thesis On Principled Modeling of Inductive Bias in Machine Learning Liu, W. University of Cambridge, Cambridge, UK, November 2024, (Cambridge-Tübingen-Fellowship-Program, ELLIS PhD student program) (Published) BibTeX

Perceiving Systems Ph.D. Thesis Learning Digital Humans from Vision and Language Feng, Y. ETH Zürich, October 2024 (Published)
The study of realistic digital humans has gained significant attention within the research communities of computer vision, computer graphics, and machine learning. This growing interest is motivated by the crucial understanding of human selves and the essential role digital humans play in enabling the metaverse. Applications span various sectors including virtual presence, fitness, digital fashion, entertainment, humanoid robots and healthcare. However, learning about 3D humans presents significant challenges due to data scarcity. In an era where scalability is crucial for AI, this raises the question: can we enhance the scalability of learning digital humans? To understand this, consider how humans interact: we observe and communicate, forming impressions of others through these interactions. This thesis proposes a similar potential for computers: could they be taught to understand humans by observing and listening? Such an approach would involve processing visual data, like images and videos, and linguistic data from text descriptions. Thus, this research endeavors to enable machines to learn about digital humans from vision and language, both of which are readily available and scalable sources of data. Our research begins by developing a framework to create detailed 3D faces from in-the-wild images. This framework, capable of generating highly realistic and animatable 3D faces from single images, is trained without paired 3D supervision and achieves state-of-the-art accuracy in shape reconstruction. It effectively disentangles identity and expression details, thereby enhancing facial animation. We then explore capturing the body, clothing, face, and hair from monocular videos, using a novel hybrid explicit-implicit 3D representation. This approach facilitates the disentangled learning of digital humans from monocular videos and allows for the easy transfer of hair and clothing to different bodies, as demonstrated through experiments in disentangled reconstruction, virtual try-ons, and hairstyle transfers. Next, we present a method that utilizes text-visual foundation models to generate highly realistic 3D faces, complete with hair and accessories, based on text descriptions. These foundation models are trained exclusively on in-the-wild images and efficiently produce detailed and realistic outputs, facilitating the creation of authentic avatars. Finally, we introduce a framework that employs Large Language Models (LLMs) to interpret and generate 3D human poses from both images and text. This method, inspired by how humans intuitively understand postures, merges image interpretation with body language analysis. By embedding SMPL poses into a multimodal LLM, our approach not only integrates semantic reasoning but also enhances the generation and understanding of 3D poses, utilizing the comprehensive capabilities of LLMs. Additionally, the use of LLMs facilitates interactive discussions with users about human poses, enriching human-computer interactions. Our research on digital humans significantly boosts scalability and controllability. By generating digital humans from images, videos, and text, we democratize their creation, making it broadly accessible through everyday imagery and straightforward text, while enhancing generalization. Disentangled modeling and interactive chatting with human poses increase the controllability of digital humans and improve user interactions and customizations, showcasing their potential to extend into various disciplines.
pdf DOI URL BibTeX

Perceiving Systems Ph.D. Thesis Realistic Digital Human Characters: Challenges, Models and Algorithms Osman, A. A. A. University of Tübingen, September 2024 (Published)
Statistical models for the body, head, and hands are essential in various computer vision tasks. However, popular models like SMPL, MANO, and FLAME produce unrealistic deformations due to inherent flaws in their modeling assumptions and how they are trained, which have become standard practices in constructing models for the body and its parts. This dissertation addresses these limitations by proposing new modeling and training algorithms to improve the realism and generalization of current models. We introduce a new model, STAR (Sparse Trained Articulated Human Body Regressor), which learns a sparse representation of the human body deformations, significantly reducing the number of model parameters compared to models like SMPL. This approach ensures that deformations are spatially localized, leading to more realistic deformations. STAR also incorporates shape-dependent pose deformations, accounting for variations in body shape to enhance overall model accuracy and realism. Additionally, we present a novel federated training algorithm for developing a comprehensive suite of models for the body and its parts. We train an expressive body model, SUPR (Sparse Unified Part-Based Representation), on a federated dataset of full-body scans, including detailed scans of the head, hands, and feet. We then separate SUPR into a full suite of state-of-the-art models for the head, hands, and foot. The new foot model captures complex foot deformations, addressing challenges related to foot shape, pose, and ground contact dynamics. The dissertation concludes by introducing AVATAR (Articulated Virtual Humans Trained By Bayesian Inference From a Single Scan), a novel, data-efficient training algorithm. AVATAR allows the creation of personalized, high-fidelity body models from a single scan by framing model construction as a Bayesian inference problem, thereby enabling training from small-scale datasets while reducing the risk of overfitting. These advancements push the state of the art in human body modeling and training techniques, making them more accessible for broader research and practical applications.
Thesis DOI BibTeX

Empirical Inference Ph.D. Thesis Advances in Probabilistic Methods for Deep Learning Immer, A. ETH Zurich, Switzerland, September 2024, CLS PhD Program (Published) BibTeX

Haptic Intelligence Ph.D. Thesis Engineering and Evaluating Naturalistic Vibrotactile Feedback for Telerobotic Assembly Gong, Y. University of Stuttgart, Stuttgart, Germany, August 2024, Faculty of Engineering Design, Production Engineering and Automotive Engineering (Published)
Teleoperation allows workers on a construction site to assemble pre-fabricated building components by controlling powerful machines from a safe distance. However, teleoperation's primary reliance on visual feedback limits the operator's efficiency in situations with stiff contact or poor visibility, compromising their situational awareness and thus increasing the difficulty of the task; it also makes construction machines more difficult to learn to operate. To bridge this gap, we propose that reliable, economical, and easy-to-implement naturalistic vibrotactile feedback could improve telerobotic control interfaces in construction and other application areas such as surgery. This type of feedback enables the operator to feel the natural vibrations experienced by the robot, which contain crucial information about its motions and its physical interactions with the environment. This dissertation explores how to deliver naturalistic vibrotactile feedback from a robot's end-effector to the hand of an operator performing telerobotic assembly tasks; furthermore, it seeks to understand the effects of such haptic cues. The presented research can be divided into four parts. We first describe the engineering of AiroTouch, a naturalistic vibrotactile feedback system tailored for use on construction sites but suitable for many other applications of telerobotics. Then we evaluate AiroTouch and explore the effects of the naturalistic vibrotactile feedback it delivers in three user studies conducted either in laboratory settings or on a construction site. We begin this dissertation by developing guidelines for creating a haptic feedback system that provides high-quality naturalistic vibrotactile feedback. These guidelines include three sections: component selection, component placement, and system evaluation. We detail each aspect with the parameters that need to be considered. Based on these guidelines, we adapt widely available commercial audio equipment to create our system called AiroTouch, which measures the vibration experienced by each robot tool with a high-bandwidth three-axis accelerometer and enables the user to feel this vibration in real time through a voice-coil actuator. Accurate haptic transmission is achieved by optimizing the positions of the system's off-the-shelf sensors and actuators and is then verified through measurements. The second part of this thesis presents our initial validation of AiroTouch. We explored how adding this naturalistic type of vibrotactile feedback affects the operator during small-scale telerobotic assembly. Due to the limited accessibility of teleoperated robots and to maintain safety, we conducted a user study in the lab with a commercial bimanual dexterous teleoperation system developed for surgery (Intuitive da Vinci Si). Thirty participants used this robot equipped with AiroTouch to assemble a small stiff structure under three randomly ordered haptic feedback conditions: no vibrations, one-axis vibrations, and summed three-axis vibrations. The results show that participants learn to take advantage of both tested versions of the haptic feedback in the given tasks, as significantly lower vibrations and forces are observed in the second trial. Subjective responses indicate that naturalistic vibrotactile feedback increases the realism of the interaction and reduces the perceived task duration, task difficulty, and fatigue.
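The summed three-axis feedback condition can be illustrated with a small signal-processing sketch: the three accelerometer axes are collapsed into a single waveform for the voice-coil actuator. The gain, clipping limit, and synthetic data below are illustrative assumptions, not AiroTouch's calibrated pipeline.

```python
import numpy as np

def summed_drive_signal(accel_xyz, gain=0.5, limit=1.0):
    """Toy reduction of a 3-axis acceleration stream to a 1-axis actuator signal.

    accel_xyz: (N, 3) accelerometer samples (DC/gravity assumed already removed)
    Returns an (N,) waveform suitable for driving a single voice-coil actuator.
    """
    summed = accel_xyz.sum(axis=1)        # sum the three axes into one channel
    drive = gain * summed                 # illustrative output gain
    return np.clip(drive, -limit, limit)  # protect the actuator from clipping

# Synthetic 1 kHz accelerometer data for demonstration only.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
accel = np.stack([np.sin(2 * np.pi * 60 * t),
                  0.3 * np.sin(2 * np.pi * 120 * t),
                  0.1 * np.random.default_rng(0).normal(size=t.size)], axis=1)
print(summed_drive_signal(accel).shape)  # (1000,)
```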
To test our approach on a real construction site, we enhanced AiroTouch using wireless signal-transmission technologies and waterproofing, and then we adapted it to a mini-crane construction robot. A study was conducted to evaluate how naturalistic vibrotactile feedback affects an observer's understanding of telerobotic assembly performed by this robot on a construction site. Seven adults without construction experience observed a mix of manual and autonomous assembly processes both with and without naturalistic vibrotactile feedback. Qualitative analysis of their survey responses and interviews indicates that all participants had positive responses to this technology and believed it would be beneficial for construction activities. Finally, we evaluated the effects of naturalistic vibrotactile feedback provided by wireless AiroTouch during live teleoperation of the mini-crane. Twenty-eight participants remotely controlled the mini-crane to complete three large-scale assembly-related tasks in the lab, both with and without this type of haptic feedback. Our results show that naturalistic vibrotactile feedback enhances the participants' awareness of both robot motion and contact between the robot and other objects, particularly in scenarios with limited visibility. These effects increase participants' confidence when controlling the robot. Moreover, there is a noticeable trend of reduced vibration magnitude in the conditions where this type of haptic feedback is provided. The primary contribution of this dissertation is the clear explanation of details that are essential for the effective implementation of naturalistic vibrotactile feedback. We demonstrate that our accessible, audio-based approach can enhance user performance and experience during telerobotic assembly in construction and other application domains. These findings lay the foundation for further exploration of the potential benefits of incorporating haptic cues to enhance user experience during teleoperation.
BibTeX

Perceiving Systems Ph.D. Thesis Modelling Dynamic 3D Human-Object Interactions: From Capture to Synthesis Taheri, O. University of Tübingen, July 2024 (Accepted)
Modeling digital humans that move and interact realistically with virtual 3D worlds has emerged as an essential research area recently, with significant applications in computer graphics, virtual and augmented reality, telepresence, the Metaverse, and assistive technologies. In particular, human-object interaction, encompassing full-body motion, hand-object grasping, and object manipulation, lies at the core of how humans execute tasks and represents the complex and diverse nature of human behavior. Therefore, accurate modeling of these interactions would enable us to simulate avatars to perform tasks, enhance animation realism, and develop applications that better perceive and respond to human behavior. Despite its importance, this remains a challenging problem, due to several factors such as the complexity of human motion, the variance of interaction based on the task, and the lack of rich datasets capturing the complexity of real-world interactions. Prior methods have made progress, but limitations persist as they often focus on individual aspects of interaction, such as body, hand, or object motion, without considering the holistic interplay among these components. This Ph.D. thesis addresses these challenges and contributes to the advancement of human-object interaction modeling through the development of novel datasets, methods, and algorithms.
BibTeX

Empirical Inference Ph.D. Thesis Advancing Normalising Flows to Model Boltzmann Distributions Stimper, V. University of Cambridge, UK, Cambridge, June 2024, (Cambridge-Tübingen-Fellowship-Program) (Published) BibTeX

Perceiving Systems Ph.D. Thesis Self- and Interpersonal Contact in 3D Human Mesh Reconstruction Müller, L. University of Tübingen, Tübingen, March 2024 (Published)
The ability to perceive tactile stimuli is of substantial importance for human beings in establishing a connection with the surrounding world. Humans rely on the sense of touch to navigate their environment and to engage in interactions with both themselves and other people. The field of computer vision has made great progress in estimating a person’s body pose and shape from an image; however, the investigation of self- and interpersonal contact has received little attention despite its considerable significance. Estimating contact from images is a challenging endeavor because it necessitates methodologies capable of predicting the full 3D human body surface, i.e. an individual’s pose and shape. The limitations of current methods become evident when considering the two primary datasets and labels employed within the community to supervise the task of human pose and shape estimation. First, the widely used 2D joint locations lack crucial information for representing the entire 3D body surface. Second, in datasets of 3D human bodies, e.g. collected from motion capture systems or body scanners, contact is usually avoided, since it naturally leads to occlusion which complicates data cleaning and can break the data processing pipelines. In this thesis, we first address the problem of estimating contact that humans make with themselves from RGB images. To do this, we introduce two novel methods that we use to create new datasets tailored for the task of human mesh estimation for poses with self-contact. We create (1) 3DCP, a dataset of 3D body scan and motion capture data of humans in poses with self-contact and (2) MTP, a dataset of images taken in the wild with accurate 3D reference data using pose mimicking. Next, we observe that 2D joint locations can be readily labeled at scale given an image; however, an equivalent label for self-contact does not exist. Consequently, we introduce (3) discrete self-contact (DSC) annotations indicating the pairwise contact of discrete regions on the human body. We annotate three existing image datasets with discrete self-contact and use these labels during mesh optimization to bring body parts supposed to touch into contact. Then we train TUCH, a human mesh regressor, on our new datasets. When evaluated on the task of human body pose and shape estimation on public benchmarks, our results show that knowing about self-contact not only improves mesh estimates for poses with self-contact, but also for poses without self-contact. Next, we study contact humans make with other individuals during close social interaction. Reconstructing these interactions in 3D is a significant challenge due to the mutual occlusion. Furthermore, the existing datasets of images taken in the wild with ground-truth contact labels are of insufficient size to facilitate the training of a robust human mesh regressor. In this work, we employ a generative model, BUDDI, to learn the joint distribution of 3D pose and shape of two individuals during their interaction and use this model as prior during an optimization routine. To construct training data we leverage pre-existing datasets, i.e. motion capture data and Flickr images with discrete contact annotations. Similar to discrete self-contact labels, we utilize discrete human-human contact to jointly fit two meshes to detected 2D joint locations. The majority of methods for generating 3D humans focus on the motion of a single person and operate on 3D joint locations.
While these methods can effectively generate motion, their representation of 3D humans is not sufficient for physical contact since they do not model the body surface. Our approach, in contrast, acts on the pose and shape parameters of a human body model, which enables us to sample 3D meshes of two people. We further demonstrate how the knowledge of human proxemics, incorporated in our model, can be used to guide an optimization routine. For this, in each optimization iteration, BUDDI takes the current mesh and proposes a refinement that we subsequently consider in the objective function. This procedure enables us to go beyond state of the art by forgoing ground-truth discrete human-human contact labels during optimization. Self- and interpersonal contact happen on the surface of the human body; however, the majority of existing art tends to predict bodies with similar, “average” body shape. This is due to a lack of training data of paired images taken in the wild and ground-truth 3D body shape and because 2D joint locations are not sufficient to explain body shape. The most apparent solution would be to collect body scans of people together with their photos. This is, however, a time-consuming and cost-intensive process that lacks scalability. Instead, we leverage the vocabulary humans use to describe body shape. First, we ask annotators to label how much a word like “tall” or “long legs” applies to a human body. We gather these ratings for rendered meshes of various body shapes, for which we have ground-truth body model shape parameters, and for images collected from model agency websites. Using this data, we learn a shape-to-attribute (A2S) model that predicts body shape ratings from body shape parameters. Then we train a human mesh regressor, SHAPY, on the model agency images wherein we supervise body shape via attribute annotations using A2S. Since no suitable test set of diverse 3D ground-truth body shape with images taken in natural settings exists, we introduce Human Bodies in the Wild (HBW). This novel dataset contains photographs of individuals together with their body scan. Our model predicts more realistic body shapes from an image and quantitatively improves body shape estimation on this new benchmark. In summary, we present novel datasets, optimization methods, a generative model, and regressors to advance the field of 3D human pose and shape estimation. Taken together, these methods open up ways to obtain more accurate and realistic 3D mesh estimates from images with multiple people in self- and mutual contact poses and with diverse body shapes. This line of research also enables generative approaches to create more natural, human-like avatars. We believe that knowing about self- and human-human contact through computer vision has wide-ranging implications in other fields such as robotics, fitness, or behavioral science.
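As a rough sketch of how discrete contact annotations can steer a mesh-fitting objective, the toy penalty below drives annotated region pairs toward contact by minimizing their closest-point distance; the region definitions, vertices, and weighting are placeholders rather than the optimization terms used in the thesis.

```python
import numpy as np

def contact_loss(vertices, regions, contact_pairs):
    """Toy contact term for mesh optimization.

    vertices:      (V, 3) current mesh vertices
    regions:       dict mapping region name -> array of vertex indices
    contact_pairs: list of (region_a, region_b) annotated as touching
    Returns the summed minimum distance between each annotated region pair,
    which an optimizer would drive toward zero.
    """
    loss = 0.0
    for a, b in contact_pairs:
        pa = vertices[regions[a]]                                   # (Na, 3)
        pb = vertices[regions[b]]                                   # (Nb, 3)
        dists = np.linalg.norm(pa[:, None, :] - pb[None, :, :], axis=-1)
        loss += dists.min()                                         # closest points should touch
    return loss

rng = np.random.default_rng(0)
verts = rng.normal(size=(200, 3))
regions = {"left_hand": np.arange(0, 20), "right_hip": np.arange(100, 120)}
print(contact_loss(verts, regions, [("left_hand", "right_hip")]))
```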
download Thesis DOI BibTeX

Haptic Intelligence Ph.D. Thesis Creating a Haptic Empathetic Robot Animal That Feels Touch and Emotion Burns, R. B. University of Tübingen, Tübingen, Germany, February 2024, Department of Computer Science (Published)
Social touch, such as a hug or a poke on the shoulder, is an essential aspect of everyday interaction. Humans use social touch to gain attention, communicate needs, express emotions, and build social bonds. Despite its importance, touch sensing is very limited in most commercially available robots. By endowing robots with social-touch perception, one can unlock a myriad of new interaction possibilities. In this thesis, I present my work on creating a Haptic Empathetic Robot Animal (HERA), a koala-like robot for children with autism. I demonstrate the importance of establishing design guidelines based on one's target audience, which we investigated through interviews with autism specialists. I share our work on creating full-body tactile sensing for the NAO robot using low-cost, do-it-yourself (DIY) methods, and I introduce an approach to model long-term robot emotions using second-order dynamics.
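The second-order dynamics idea for long-term robot emotion can be pictured as a damped spring: touch events push an internal emotion state, which then settles back toward neutral. The constants and one-dimensional state below are illustrative assumptions, not HERA's tuned model.

```python
import numpy as np

def simulate_emotion(touch_impulses, k=4.0, c=1.0, dt=0.05, steps=200):
    """Toy second-order (spring-damper) emotion state driven by touch events.

    touch_impulses: dict mapping time step -> input force (e.g., +5 gentle stroke,
                    -5 harsh poke). The state decays back to neutral on its own.
    """
    x, v = 0.0, 0.0                    # emotion value and its rate of change
    trace = []
    for t in range(steps):
        f = touch_impulses.get(t, 0.0)
        a = f - k * x - c * v          # spring pulls toward neutral, damper smooths
        v += a * dt
        x += v * dt
        trace.append(x)
    return np.array(trace)

trace = simulate_emotion({10: 5.0, 120: -5.0})
print(round(trace[15], 3), round(trace[125], 3))
```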
BibTeX

Empirical Inference Ph.D. Thesis Identifiable Causal Representation Learning: Unsupervised, Multi-View, and Multi-Environment von Kügelgen, J. University of Cambridge, UK, Cambridge, February 2024, (Cambridge-Tübingen-Fellowship) (Published) URL BibTeX

Perceiving Systems Ph.D. Thesis Natural Language Control for 3D Human Motion Synthesis Petrovich, M. LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, 2024 (Published)
3D human motions are at the core of many applications in the film industry, healthcare, augmented reality, virtual reality and video games. However, these applications often rely on expensive and time-consuming motion capture data. The goal of this thesis is to explore generative models as an alternative route to obtain 3D human motions. More specifically, our aim is to allow a natural language interface as a means to control the generation process. To this end, we develop a series of models that synthesize realistic and diverse motions following the semantic inputs. In our first contribution, described in Chapter 3, we address the challenge of generating human motion sequences conditioned on specific action categories. We introduce ACTOR, a conditional variational autoencoder (VAE) that learns an action-aware latent representation for human motions. We show significant gains over existing methods thanks to our new Transformer-based VAE formulation, encoding and decoding SMPL pose sequences through a single motion-level embedding. In our second contribution, described in Chapter 4, we go beyond categorical actions, and dive into the task of synthesizing diverse 3D human motions from textual descriptions allowing a larger vocabulary and potentially more fine-grained control. Our work stands out from previous research by not deterministically generating a single motion sequence, but by synthesizing multiple, varied sequences from a given text. We propose TEMOS, building on our VAE-based ACTOR architecture, but this time integrating a pretrained text encoder to handle large-vocabulary natural language inputs. In our third contribution, described in Chapter 5, we address the adjacent task of text-to-3D human motion retrieval, where the goal is to search in a motion collection by querying via text. We introduce a simple yet effective approach, named TMR, building on our earlier model TEMOS, by integrating a contrastive loss to enhance the structure of the cross-modal latent space. Our findings emphasize the importance of retaining the motion generation loss in conjunction with contrastive training for improved results. We establish a new evaluation benchmark and conduct analyses on several protocols. In our fourth contribution, described in Chapter 6, we introduce a new problem termed as “multi-track timeline control” for text-driven 3D human motion synthesis. Instead of a single textual prompt, users can organize multiple prompts in temporal intervals that may overlap. We introduce STMC, a test-time denoising method that can be integrated with any pre-trained motion diffusion model. Our evaluations demonstrate that our method generates motions that closely match the semantic and temporal aspects of the input timelines. In summary, our contributions in this thesis are as follows: (i) we develop a generative variational autoencoder, ACTOR, for action-conditioned generation of human motion sequences, (ii) we introduce TEMOS, a text-conditioned generative model that synthesizes diverse human motions from textual descriptions, (iii) we present TMR, a new approach for text-to-3D human motion retrieval, (iv) we propose STMC, a method for timeline control in text-driven motion synthesis, enabling the generation of detailed and complex motions.
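The contrastive loss used to structure the cross-modal latent space for retrieval can be sketched with a standard symmetric InfoNCE objective over paired text and motion embeddings; the batch size, temperature, and random embeddings below are placeholders, not TMR's actual training setup.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(text_emb, motion_emb, temperature=0.07):
    """Toy symmetric contrastive loss between paired text and motion embeddings.

    text_emb, motion_emb: (B, D) embeddings where row i of each modality forms a pair.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    motion_emb = F.normalize(motion_emb, dim=-1)
    logits = text_emb @ motion_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(text_emb.shape[0])          # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = symmetric_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```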
pdf YouTube Thesis BibTeX

Embodied Vision Ph.D. Thesis Investigating Shape Priors, Relationships, and Multi-Task Cues for Object-level Scene Understanding Elich, C. ETH Zürich, Zurich, 2024 (Published)
Humans are proficient at intuitively identifying objects and reasoning about their diverse properties from complex visual observations. Despite significant advances in artificial intelligence, computers have yet to achieve a comparable level of understanding, which is crucial for effective reasoning about tasks and interactions within an environment. In this thesis, we explore the benefits of various visual cues when dealing with key challenges in scene understanding, specifically focusing on weak supervision, finding view correspondence, and paradigms for simultaneously learning multiple tasks. We begin by investigating cues that reduce the need for full supervision. In particular, we propose an approach for learning multi-object 3D scene decomposition and object-wise properties from single images with only weak supervision. Our method utilizes a recurrent encoder to infer a latent representation for each object and a differentiable renderer to obtain a training signal. To guide the training process and constrain the search space of possible solutions, we leverage prior knowledge through pre-trained 3D shape spaces. Subsequently, we investigate the benefits of reasoning about relations between objects to learn more distinct object representations that allow for matching object detections across viewpoint changes. To address this, we introduce an approach that employs graph neural networks to learn matching features based on appearance as well as inter- and cross-frame relations. We conduct comparisons with keypoint-based methods and propose a methodology to combine these approaches, aiming to achieve overall improved performance. Finally, we consider the challenge of multi-task learning and analyze related paradigms in the context of basic single-task learning. In particular, we study the impact of the choice of optimizer, the role of gradient conflicts, and the effects on the transferability of features learned through either learning setup on common image corruptions. Our findings reveal surprising similarities between single-task and multi-task learning, suggesting that methods and techniques from one field could be advantageously applied to the other.
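One common way to quantify the gradient conflicts mentioned above is the cosine similarity between per-task gradients on the shared parameters; the toy two-head model below only illustrates that measurement and is not the experimental setup of the thesis.

```python
import torch

# Toy shared backbone with two task heads, used only to illustrate the measurement.
shared = torch.nn.Linear(10, 10)
head_a = torch.nn.Linear(10, 1)
head_b = torch.nn.Linear(10, 1)
x = torch.randn(32, 10)

def shared_grad(loss):
    """Flattened gradient of a task loss w.r.t. the shared parameters."""
    grads = torch.autograd.grad(loss, shared.parameters(), retain_graph=True)
    return torch.cat([g.flatten() for g in grads])

feats = shared(x)
g_a = shared_grad(head_a(feats).pow(2).mean())
g_b = shared_grad(head_b(feats).abs().mean())

# Negative cosine similarity indicates that the two tasks' gradients conflict.
cosine = torch.nn.functional.cosine_similarity(g_a, g_b, dim=0)
print(float(cosine))
```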
DOI URL BibTeX

Embodied Vision Ph.D. Thesis Methods for Learning Adaptive and Symbolic Forward Models for Control and Planning Achterhold, J. M. Eberhard Karls Universität Tübingen, Tübingen, 2024 (Published)
Learning-based methods for sequential decision making, i.e., methods which leverage data, have shown the ability to solve complex problems in recent years. This includes control of dynamical systems, as well as mastering games such as Go and StarCraft. In addition, these methods often promise to be applicable to a wide variety of problems. A subclass of these methods is model-based methods. They leverage data to learn a model which allows predicting the evolution of the dynamical system to be controlled. In recent research, it was shown that these methods, in contrast to model-free methods, require less data for training. In addition, model-based methods allow re-using the dynamics model when the task to be solved has changed, and straightforward adaptation to changes in the system's dynamics. One particular focus of this thesis is on learning dynamics models which can data-efficiently adapt to changes in the system's dynamics, as well as the efficient collection of data to adapt a learned model. In this regard, two novel methods are presented. In the application domain of autonomous robot navigation, in which both parameters of the robot and the terrain are subject to change, a novel method comprising an adaptive dynamics model is presented and evaluated in a simulated environment. A further advantage of model-based methods is the ability to incorporate physical prior knowledge for model design. In this thesis, we demonstrate that leveraging physical prior knowledge is advantageous for the task of tracking and predicting the motion of a table tennis ball, respecting its spin. However, model-based methods, in particular planning with learned models, have to cope with certain challenges. For long prediction horizons, which are required if the effect of an action is apparent only far in the future, model errors accumulate. In addition, model-based planning is commonly computationally intensive, which is problematic if high-frequency, reactive control is required. In this thesis, a method is presented to alleviate these problems. To this end, we propose a two-layered hierarchical method. Model-based planning is applied only on the higher layer, over symbolic abstractions. On the lower layer, model-free reactive control is used. We show successful application of this method to board games which can only be interacted with through a robotic manipulator, e.g., a robotic arm, which requires high-frequency reactive control.
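The two-layer idea, infrequent model-based planning over symbolic abstractions on top of fast model-free control, can be sketched as a simple loop; the planner, policy, and environment below are hypothetical stand-ins, not the systems developed in the thesis.

```python
def hierarchical_control(plan_symbolic, low_level_policy, env, replan_every=20, horizon=200):
    """Toy two-layer loop: a (slow) symbolic planner sets subgoals,
    a (fast) reactive policy acts at every step.

    plan_symbolic(state)        -> abstract subgoal
    low_level_policy(state, g)  -> continuous action toward subgoal g
    env.step(action)            -> next state
    """
    state = env.reset()
    subgoal = plan_symbolic(state)
    for t in range(horizon):
        if t % replan_every == 0:                 # planning is expensive, so do it rarely
            subgoal = plan_symbolic(state)
        action = low_level_policy(state, subgoal)  # cheap, high-frequency control
        state = env.step(action)
    return state

class ToyEnv:
    def reset(self): return 0.0
    def step(self, action): return action          # trivially follow the commanded value

final = hierarchical_control(
    plan_symbolic=lambda s: round(s) + 1.0,            # stand-in "symbolic" target
    low_level_policy=lambda s, g: s + 0.1 * (g - s),   # step a fraction toward the subgoal
    env=ToyEnv(),
)
print(final)
```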
DOI URL BibTeX

Empirical Inference Ph.D. Thesis Stochastic Predictive Control for Legged Robots Gazar, A. University of Tübingen, Germany, December 2023 (Published) DOI BibTeX

Perceiving Systems Ph.D. Thesis Neural Shape Modeling of 3D Clothed Humans Ma, Q. October 2023 (Published)
Parametric models for 3D human bodies play a crucial role in the synthesis and analysis of humans in visual computing. While current models effectively capture body pose and shape variations, a significant aspect has been overlooked – clothing. Existing 3D human models mostly produce a minimally-clothed body geometry, limiting their ability to represent the complexity of dressed people in real-world data sources. The challenge lies in the unique characteristics of garments, which make modeling clothed humans particularly difficult. Clothing exhibits diverse topologies, and as the body moves, it introduces wrinkles at various spatial scales. Moreover, pose-dependent clothing deformations are non-rigid and non-linear, exceeding the capabilities of classical body models constructed with fixed-topology surface meshes and linear approximations of pose-aware shape deformations. This thesis addresses these challenges by innovating in two key areas: the 3D shape representation and deformation modeling techniques. We demonstrate that the seemingly old-fashioned shape representation of point clouds – when equipped with deep learning and neural fields – can be a powerful tool for modeling clothed characters. Specifically, the thesis begins by introducing a large-scale dataset of dynamic 3D humans in various clothing, which serves as a foundation for training the models presented in this work. The first model we present is CAPE: a neural generative model for 3D clothed human meshes. Here, a clothed body is straightforwardly obtained by applying per-vertex offsets to a pre-defined, unclothed body template mesh. Sampling from the CAPE model generates plausible-looking digital humans wearing common garments, but the fixed-topology mesh representation limits its applicability to more complex garment types. To address this limitation, we present a series of point-based clothed human models: SCALE, PoP, and SkiRT. The SCALE model represents a clothed human using a collection of points organized into local patches. The patches can freely move and deform to represent garments of diverse topologies, unlocking the generalization to more challenging outfits such as dresses and jackets. Unlike traditional approaches based on physics simulations, SCALE learns pose-dependent cloth deformations from data with minimal manual intervention. To further improve the geometric quality, the PoP model eliminates the concept of patches and instead learns a continuous neural deformation field from the body surface. Densely querying this field results in a high-resolution point cloud of a dressed human, showcasing intricate clothing wrinkles. PoP can generalize across multiple subjects and outfits, and can even bring a single, static scan into animation. Finally, we tackle a long-standing challenge in learning-based digital human modeling: loose garments, in particular skirts and dresses. Building upon PoP, the SkiRT pipeline further learns a shape “template” and a neural field of linear-blend-skinning weights for clothed bodies, improving the models’ robustness for loose garments of varied topology. Our point-based human models are “interplicit”: the output point clouds capture surfaces explicitly at discrete points but implicitly in between. The explicit points are fast, topologically flexible, and compatible with existing graphics tools, while the implicit neural deformation field contributes to high-quality geometry.
This thesis primarily demonstrates these advantages in the context of clothed human shape modeling; future work can apply our representation and techniques to general 3D deformable shapes and neural rendering.
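The fixed-topology formulation described in this abstract, clothing as per-vertex offsets on an unclothed template mesh, comes down to one line of array arithmetic; the vertex count and random offsets below are placeholders for what a generative model such as CAPE would predict.

```python
import numpy as np

def clothe_template(template_verts, clothing_offsets):
    """Toy fixed-topology clothing: displace each template vertex by a learned offset.

    template_verts:   (V, 3) minimally-clothed body vertices
    clothing_offsets: (V, 3) per-vertex displacements (would come from a generative model)
    """
    return template_verts + clothing_offsets

rng = np.random.default_rng(0)
V = 6890                                   # e.g., a SMPL-sized mesh
body = rng.normal(size=(V, 3))
offsets = 0.01 * rng.normal(size=(V, 3))   # stand-in for sampled clothing displacements
clothed = clothe_template(body, offsets)
print(clothed.shape)                        # (6890, 3)
```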
download Thesis DOI BibTeX

Haptic Intelligence Ph.D. Thesis Gesture-Based Nonverbal Interaction for Exercise Robots Mohan, M. University of Tübingen, Tübingen, Germany, October 2023, Department of Computer Science (Published)
When teaching or coaching, humans augment their words with carefully timed hand gestures, head and body movements, and facial expressions to provide feedback to their students. Robots, however, rarely utilize these nuanced cues. A minimally supervised social robot equipped with these abilities could support people in exercising, physical therapy, and learning new activities. This thesis examines how the intuitive power of human gestures can be harnessed to enhance human-robot interaction. To address this question, this research explores gesture-based interactions to expand the capabilities of a socially assistive robotic exercise coach, investigating the perspectives of both novice users and exercise-therapy experts. This thesis begins by concentrating on the user's engagement with the robot, analyzing the feasibility of minimally supervised gesture-based interactions. This exploration seeks to establish a framework in which robots can interact with users in a more intuitive and responsive manner. The investigation then shifts its focus toward the professionals who are integral to the success of these innovative technologies: the exercise-therapy experts. Roboticists face the challenge of translating the knowledge of these experts into robotic interactions. We address this challenge by developing a teleoperation algorithm that can enable exercise therapists to create customized gesture-based interactions for a robot. Thus, this thesis lays the groundwork for dynamic gesture-based interactions in minimally supervised environments, with implications for not only exercise-coach robots but also broader applications in human-robot interaction.
BibTeX

Perceiving Systems Ph.D. Thesis Learning Clothed 3D Human Models with Articulated Neural Implicit Representations Chen, X. July 2023 (Published)
3D digital humans are important for a range of applications including movie and game production, virtual and augmented reality, and human-computer interaction. However, existing industrial solutions for creating 3D digital humans rely on expensive scanning devices and intensive manual labor, preventing their broader application. To address these challenges, the research community focuses on learning 3D parametric human models from data, aiming to automatically generate realistic digital humans based on input parameters that specify pose and shape attributes. Although recent advancements have enabled the generation of faithful 3D human bodies, modeling realistic humans that include additional features such as clothing, hair, and accessories remains an open research challenge. The goal of this thesis is to develop 3D parametric human models that can generate realistic digital humans including not only human bodies but also additional features, in particular clothing. The central challenge lies in the fundamental problem of how to represent non-rigid, articulated, and topology-varying shapes. Explicit geometric representations like polygon meshes lack the flexibility needed to model varying topology between clothing and human bodies, and across different clothing styles. On the other hand, implicit representations, such as signed distance functions, are topologically flexible but do not have a robust articulation algorithm yet. To tackle this problem, we first introduce a principled algorithm that models articulation for implicit representations, in particular the recently emerging neural implicit representations which have shown impressive modeling fidelity. Our algorithm, SNARF, generalizes linear blend skinning for polygon meshes to implicit representations and can faithfully articulate implicit shapes to any pose. SNARF is fully differentiable, which enables learning skinning weights and shapes jointly from posed observations. By leveraging this algorithm, we can learn single-subject clothed human models with realistic shapes and natural deformations from 3D scans. We further improve SNARF’s efficiency with several implementation and algorithmic optimizations, including a more compact representation of the skinning weights, the elimination of redundant computations, and custom CUDA kernel implementations. Collectively, these adaptations result in a speedup of 150 times while preserving accuracy, thereby enabling the efficient learning of 3D animatable humans. Next, we go beyond single-subject modeling and tackle the more challenging task of generative modeling of clothed 3D humans. By integrating our articulation module with deep generative models, we have developed a generative model capable of creating novel 3D humans with various clothing styles and identities, as well as geometric details such as wrinkles. Lastly, to eliminate the reliance on expensive 3D scans and to facilitate texture learning, we introduce a system that integrates our differentiable articulation module with differentiable volume rendering in an end-to-end manner, enabling the reconstruction of animatable 3D humans directly from 2D monocular videos. The contributions of this thesis significantly advance the realistic generation and reconstruction of clothed 3D humans and provide new tools for modeling non-rigid, articulated, and topology-varying shapes. We hope that this work will contribute to the development of 3D human modeling and pave the way for new applications in the future.
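The linear blend skinning operation that SNARF generalizes to implicit shapes can be written compactly for a set of query points; the skinning weights and bone transforms below are random or identity placeholders, not learned quantities.

```python
import numpy as np

def linear_blend_skinning(points, weights, bone_transforms):
    """Forward LBS: deform canonical points by a weighted sum of bone transforms.

    points:          (N, 3) canonical-space query points
    weights:         (N, K) per-point skinning weights (rows sum to 1)
    bone_transforms: (K, 4, 4) rigid transforms for each of the K bones
    """
    homog = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 4)
    # Blend the 4x4 transforms per point, then apply them to the points.
    blended = np.einsum("nk,kij->nij", weights, bone_transforms)             # (N, 4, 4)
    deformed = np.einsum("nij,nj->ni", blended, homog)                       # (N, 4)
    return deformed[:, :3]

rng = np.random.default_rng(0)
N, K = 5, 24
pts = rng.normal(size=(N, 3))
w = rng.random((N, K)); w /= w.sum(axis=1, keepdims=True)
T = np.tile(np.eye(4), (K, 1, 1))   # identity bone transforms leave points unchanged
print(np.allclose(linear_blend_skinning(pts, w, T), pts))  # True
```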
download DOI BibTeX