2024


Leveraging Unpaired Data for the Creation of Controllable Digital Humans

Sanyal, S.

Max Planck Institute for Intelligent Systems and Eberhard Karls Universität Tübingen, September 2024 (phdthesis) To be published

Abstract
Digital humans have grown increasingly popular, offering transformative potential across various fields such as education, entertainment, and healthcare. They enrich user experiences by providing immersive and personalized interactions. Enhancing these experiences involves making digital humans controllable, allowing for manipulation of aspects like pose and appearance, among others. Learning to create such controllable digital humans necessitates extensive data from diverse sources. This includes 2D human images alongside their corresponding 3D geometry and texture, 2D images showcasing similar appearances across a wide range of body poses, etc., for effective control over pose and appearance. However, the availability of such “paired data” is limited, making its collection both time-consuming and expensive. Despite these challenges, there is an abundance of unpaired 2D images with accessible, inexpensive labels, such as identity, type of clothing, and appearance of clothing. This thesis capitalizes on these affordable labels, employing informed observations from “unpaired data” to facilitate the learning of controllable digital humans through reconstruction, transposition, and generation processes. The presented methods, RingNet, SPICE, and SCULPT, each tackle different aspects of controllable digital human modeling. RingNet (Sanyal et al. [2019]) exploits the consistent facial geometry across different images of the same individual to estimate 3D face shapes and poses without 2D-to-3D supervision. This method illustrates how leveraging the inherent properties of unpaired images, such as identity consistency, can circumvent the need for expensive paired datasets. Similarly, SPICE (Sanyal et al. [2021]) employs a self-supervised learning framework that harnesses unpaired images to generate realistic transpositions of human poses by understanding the underlying 3D body structure and maintaining consistency in body shape and appearance features across different poses. 
Finally, SCULPT (Sanyal et al. [2024]) generates clothed and textured 3D meshes by integrating insights from unpaired 2D images and medium-sized 3D scans. This process employs an unpaired learning approach, conditioning texture and geometry generation on attributes easily derived from data, like the type and appearance of clothing. In conclusion, this thesis highlights how unpaired data and innovative learning techniques can address the challenges of data scarcity and high costs in developing controllable digital humans by advancing reconstruction, transposition, and generation techniques.

ps

[BibTex]



Realistic Digital Human Characters: Challenges, Models and Algorithms

Osman, A. A. A.

University of Tübingen, September 2024 (phdthesis)

Abstract
Statistical models for the body, head, and hands are essential in various computer vision tasks. However, popular models like SMPL, MANO, and FLAME produce unrealistic deformations due to inherent flaws in their modeling assumptions and how they are trained, which have become standard practices in constructing models for the body and its parts. This dissertation addresses these limitations by proposing new modeling and training algorithms to improve the realism and generalization of current models. We introduce a new model, STAR (Sparse Trained Articulated Human Body Regressor), which learns a sparse representation of the human body deformations, significantly reducing the number of model parameters compared to models like SMPL. This approach ensures that deformations are spatially localized, leading to more realistic deformations. STAR also incorporates shape-dependent pose deformations, accounting for variations in body shape to enhance overall model accuracy and realism. Additionally, we present a novel federated training algorithm for developing a comprehensive suite of models for the body and its parts. We train an expressive body model, SUPR (Sparse Unified Part-Based Representation), on a federated dataset of full-body scans, including detailed scans of the head, hands, and feet. We then separate SUPR into a full suite of state-of-the-art models for the head, hands, and foot. The new foot model captures complex foot deformations, addressing challenges related to foot shape, pose, and ground contact dynamics. The dissertation concludes by introducing AVATAR (Articulated Virtual Humans Trained By Bayesian Inference From a Single Scan), a novel, data-efficient training algorithm. AVATAR allows the creation of personalized, high-fidelity body models from a single scan by framing model construction as a Bayesian inference problem, thereby enabling training from small-scale datasets while reducing the risk of overfitting. 
These advancements push the state of the art in human body modeling and training techniques, making them more accessible for broader research and practical applications.

ps

[BibTex]


Advances in Probabilistic Methods for Deep Learning

Immer, A.

ETH Zurich, Switzerland, September 2024, CLS PhD Program (phdthesis)

ei

[BibTex]

Engineering and Evaluating Naturalistic Vibrotactile Feedback for Telerobotic Assembly

Gong, Y.

University of Stuttgart, Stuttgart, Germany, August 2024, Faculty of Design, Production Engineering and Automotive Engineering (phdthesis)

Abstract
Teleoperation allows workers on a construction site to assemble pre-fabricated building components by controlling powerful machines from a safe distance. However, teleoperation's primary reliance on visual feedback limits the operator's efficiency in situations with stiff contact or poor visibility, compromising their situational awareness and thus increasing the difficulty of the task; it also makes construction machines more difficult to learn to operate. To bridge this gap, we propose that reliable, economical, and easy-to-implement naturalistic vibrotactile feedback could improve telerobotic control interfaces in construction and other application areas such as surgery. This type of feedback enables the operator to feel the natural vibrations experienced by the robot, which contain crucial information about its motions and its physical interactions with the environment. This dissertation explores how to deliver naturalistic vibrotactile feedback from a robot's end-effector to the hand of an operator performing telerobotic assembly tasks; furthermore, it seeks to understand the effects of such haptic cues. The presented research can be divided into four parts. We first describe the engineering of AiroTouch, a naturalistic vibrotactile feedback system tailored for use on construction sites but suitable for many other applications of telerobotics. Then we evaluate AiroTouch and explore the effects of the naturalistic vibrotactile feedback it delivers in three user studies conducted either in laboratory settings or on a construction site. We begin this dissertation by developing guidelines for creating a haptic feedback system that provides high-quality naturalistic vibrotactile feedback. These guidelines include three sections: component selection, component placement, and system evaluation. We detail each aspect with the parameters that need to be considered. 
Based on these guidelines, we adapt widely available commercial audio equipment to create our system called AiroTouch, which measures the vibration experienced by each robot tool with a high-bandwidth three-axis accelerometer and enables the user to feel this vibration in real time through a voice-coil actuator. Accurate haptic transmission is achieved by optimizing the positions of the system's off-the-shelf sensors and actuators and is then verified through measurements. The second part of this thesis presents our initial validation of AiroTouch. We explored how adding this naturalistic type of vibrotactile feedback affects the operator during small-scale telerobotic assembly. Due to the limited accessibility of teleoperated robots and to maintain safety, we conducted a user study in lab with a commercial bimanual dexterous teleoperation system developed for surgery (Intuitive da Vinci Si). Thirty participants used this robot equipped with AiroTouch to assemble a small stiff structure under three randomly ordered haptic feedback conditions: no vibrations, one-axis vibrations, and summed three-axis vibrations. The results show that participants learn to take advantage of both tested versions of the haptic feedback in the given tasks, as significantly lower vibrations and forces are observed in the second trial. Subjective responses indicate that naturalistic vibrotactile feedback increases the realism of the interaction and reduces the perceived task duration, task difficulty, and fatigue. To test our approach on a real construction site, we enhanced AiroTouch using wireless signal-transmission technologies and waterproofing, and then we adapted it to a mini-crane construction robot. A study was conducted to evaluate how naturalistic vibrotactile feedback affects an observer's understanding of telerobotic assembly performed by this robot on a construction site. 
Seven adults without construction experience observed a mix of manual and autonomous assembly processes both with and without naturalistic vibrotactile feedback. Qualitative analysis of their survey responses and interviews indicates that all participants had positive responses to this technology and believed it would be beneficial for construction activities. Finally, we evaluated the effects of naturalistic vibrotactile feedback provided by wireless AiroTouch during live teleoperation of the mini-crane. Twenty-eight participants remotely controlled the mini-crane to complete three large-scale assembly-related tasks in lab, both with and without this type of haptic feedback. Our results show that naturalistic vibrotactile feedback enhances the participants' awareness of both robot motion and contact between the robot and other objects, particularly in scenarios with limited visibility. These effects increase participants' confidence when controlling the robot. Moreover, there is a noticeable trend of reduced vibration magnitude in the conditions where this type of haptic feedback is provided. The primary contribution of this dissertation is the clear explanation of details that are essential for the effective implementation of naturalistic vibrotactile feedback. We demonstrate that our accessible, audio-based approach can enhance user performance and experience during telerobotic assembly in construction and other application domains. These findings lay the foundation for further exploration of the potential benefits of incorporating haptic cues to enhance user experience during teleoperation.

hi

Project Page [BibTex]

A Measure-Theoretic Axiomatisation of Causality and Kernel Regression

Park, J.

University of Tübingen, Germany, July 2024 (phdthesis)

ei

[BibTex]



Modelling Dynamic 3D Human-Object Interactions: From Capture to Synthesis

Taheri, O.

University of Tübingen, July 2024 (phdthesis) To be published

Abstract
Modeling digital humans that move and interact realistically with virtual 3D worlds has emerged as an essential research area recently, with significant applications in computer graphics, virtual and augmented reality, telepresence, the Metaverse, and assistive technologies. In particular, human-object interaction, encompassing full-body motion, hand-object grasping, and object manipulation, lies at the core of how humans execute tasks and represents the complex and diverse nature of human behavior. Therefore, accurate modeling of these interactions would enable us to simulate avatars to perform tasks, enhance animation realism, and develop applications that better perceive and respond to human behavior. Despite its importance, this remains a challenging problem, due to several factors such as the complexity of human motion, the variance of interaction based on the task, and the lack of rich datasets capturing the complexity of real-world interactions. Prior methods have made progress, but limitations persist as they often focus on individual aspects of interaction, such as body, hand, or object motion, without considering the holistic interplay among these components. This Ph.D. thesis addresses these challenges and contributes to the advancement of human-object interaction modeling through the development of novel datasets, methods, and algorithms.

ps

[BibTex]

Advancing Normalising Flows to Model Boltzmann Distributions

Stimper, V.

University of Cambridge, UK, Cambridge, June 2024, (Cambridge-Tübingen-Fellowship-Program) (phdthesis)

ei

[BibTex]

Language Models Can Reduce Asymmetry in Information Markets

Rahaman, N., Weiss, M., Wüthrich, M., Bengio, Y., Li, E., Pal, C., Schölkopf, B.

arXiv:2403.14443, March 2024, Published as: Redesigning Information Markets in the Era of Language Models, Conference on Language Modeling (COLM) (techreport)

Abstract
This work addresses the buyer's inspection paradox for information markets. The paradox is that buyers need to access information to determine its value, while sellers need to limit access to prevent theft. To study this, we introduce an open-source simulated digital marketplace where intelligent agents, powered by language models, buy and sell information on behalf of external participants. The central mechanism enabling this marketplace is the agents' dual capabilities: they not only have the capacity to assess the quality of privileged information but also come equipped with the ability to forget. This ability to induce amnesia allows vendors to grant temporary access to proprietary information, significantly reducing the risk of unauthorized retention while enabling agents to accurately gauge the information's relevance to specific queries or tasks. To perform well, agents must make rational decisions, strategically explore the marketplace through generated sub-queries, and synthesize answers from purchased information. Concretely, our experiments (a) uncover biases in language models leading to irrational behavior and evaluate techniques to mitigate these biases, (b) investigate how price affects demand in the context of informational goods, and (c) show that inspection and higher budgets both lead to higher quality outcomes.

ei

link (url) [BibTex]

Interpreting How Large Language Models Handle Facts and Counterfactuals through Mechanistic Interpretability

Ortu, F.

University of Trieste, Italy, March 2024 (mastersthesis)

ei

[BibTex]


Identifiable Causal Representation Learning

von Kügelgen, J.

University of Cambridge, UK, Cambridge, February 2024, (Cambridge-Tübingen-Fellowship) (phdthesis)

ei

[BibTex]



Creating a Haptic Empathetic Robot Animal That Feels Touch and Emotion

Burns, R.

University of Tübingen, Tübingen, Germany, February 2024, Department of Computer Science (phdthesis)

Abstract
Social touch, such as a hug or a poke on the shoulder, is an essential aspect of everyday interaction. Humans use social touch to gain attention, communicate needs, express emotions, and build social bonds. Despite its importance, touch sensing is very limited in most commercially available robots. By endowing robots with social-touch perception, one can unlock a myriad of new interaction possibilities. In this thesis, I present my work on creating a Haptic Empathetic Robot Animal (HERA), a koala-like robot for children with autism. I demonstrate the importance of establishing design guidelines based on one's target audience, which we investigated through interviews with autism specialists. I share our work on creating full-body tactile sensing for the NAO robot using low-cost, do-it-yourself (DIY) methods, and I introduce an approach to model long-term robot emotions using second-order dynamics.

hi

Project Page [BibTex]

Learning a Terrain- and Robot-Aware Dynamics Model for Autonomous Mobile Robot Navigation

Achterhold, J., Guttikonda, S., Kreber, J. U., Li, H., Stueckler, J.

CoRR abs/2409.11452, 2024, Preprint submitted to Robotics and Autonomous Systems Journal. https://arxiv.org/abs/2409.11452 (techreport) Submitted

Abstract
Mobile robots should be capable of planning cost-efficient paths for autonomous navigation. Typically, the terrain and robot properties are subject to variations. For instance, properties of the terrain such as friction may vary across different locations. Also, properties of the robot may change such as payloads or wear and tear, e.g., causing changing actuator gains or joint friction. Autonomous navigation approaches should thus be able to adapt to such variations. In this article, we propose a novel approach for learning a probabilistic, terrain- and robot-aware forward dynamics model (TRADYN) which can adapt to such variations and demonstrate its use for navigation. Our learning approach extends recent advances in meta-learning forward dynamics models based on Neural Processes for mobile robot navigation. We evaluate our method in simulation for 2D navigation of a robot with uni-cycle dynamics with varying properties on terrain with spatially varying friction coefficients. In our experiments, we demonstrate that TRADYN has lower prediction error over long time horizons than model ablations which do not adapt to robot or terrain variations. We also evaluate our model for navigation planning in a model-predictive control framework and under various sources of noise. We demonstrate that our approach yields improved performance in planning control-efficient paths by taking robot and terrain properties into account.

ev

preprint [BibTex]

A Pontryagin Perspective on Reinforcement Learning

Eberhard, O., Vernade, C., Muehlebach, M.

Max Planck Institute for Intelligent Systems, 2024 (techreport)

lds

link (url) [BibTex]



Self- and Interpersonal Contact in 3D Human Mesh Reconstruction

Müller, L.

University of Tübingen, Tübingen, 2024 (phdthesis)

Abstract
The ability to perceive tactile stimuli is of substantial importance for human beings in establishing a connection with the surrounding world. Humans rely on the sense of touch to navigate their environment and to engage in interactions with both themselves and other people. The field of computer vision has made great progress in estimating a person’s body pose and shape from an image, however, the investigation of self- and interpersonal contact has received little attention despite its considerable significance. Estimating contact from images is a challenging endeavor because it necessitates methodologies capable of predicting the full 3D human body surface, i.e. an individual’s pose and shape. The limitations of current methods become evident when considering the two primary datasets and labels employed within the community to supervise the task of human pose and shape estimation. First, the widely used 2D joint locations lack crucial information for representing the entire 3D body surface. Second, in datasets of 3D human bodies, e.g. collected from motion capture systems or body scanners, contact is usually avoided, since it naturally leads to occlusion which complicates data cleaning and can break the data processing pipelines. In this thesis, we first address the problem of estimating contact that humans make with themselves from RGB images. To do this, we introduce two novel methods that we use to create new datasets tailored for the task of human mesh estimation for poses with self-contact. We create (1) 3DCP, a dataset of 3D body scan and motion capture data of humans in poses with self-contact and (2) MTP, a dataset of images taken in the wild with accurate 3D reference data using pose mimicking. Next, we observe that 2D joint locations can be readily labeled at scale given an image, however, an equivalent label for self-contact does not exist. 
Consequently, we introduce (3) discrete self-contact (DSC) annotations indicating the pairwise contact of discrete regions on the human body. We annotate three existing image datasets with discrete self-contact and use these labels during mesh optimization to bring body parts supposed to touch into contact. Then we train TUCH, a human mesh regressor, on our new datasets. When evaluated on the task of human body pose and shape estimation on public benchmarks, our results show that knowing about self-contact not only improves mesh estimates for poses with self-contact, but also for poses without self-contact. Next, we study contact humans make with other individuals during close social interaction. Reconstructing these interactions in 3D is a significant challenge due to the mutual occlusion. Furthermore, the existing datasets of images taken in the wild with ground-truth contact labels are of insufficient size to facilitate the training of a robust human mesh regressor. In this work, we employ a generative model, BUDDI, to learn the joint distribution of 3D pose and shape of two individuals during their interaction and use this model as a prior during an optimization routine. To construct training data we leverage pre-existing datasets, i.e. motion capture data and Flickr images with discrete contact annotations. Similar to discrete self-contact labels, we utilize discrete human-human contact to jointly fit two meshes to detected 2D joint locations. The majority of methods for generating 3D humans focus on the motion of a single person and operate on 3D joint locations. While these methods can effectively generate motion, their representation of 3D humans is not sufficient for physical contact since they do not model the body surface. Our approach, in contrast, acts on the pose and shape parameters of a human body model, which enables us to sample 3D meshes of two people. 
We further demonstrate how the knowledge of human proxemics, incorporated in our model, can be used to guide an optimization routine. For this, in each optimization iteration, BUDDI takes the current mesh and proposes a refinement that we subsequently consider in the objective function. This procedure enables us to go beyond state of the art by forgoing ground-truth discrete human-human contact labels during optimization. Self- and interpersonal contact happen on the surface of the human body, however, the majority of existing art tends to predict bodies with similar, “average” body shape. This is due to a lack of training data of paired images taken in the wild and ground-truth 3D body shape and because 2D joint locations are not sufficient to explain body shape. The most apparent solution would be to collect body scans of people together with their photos. This is, however, a time-consuming and cost-intensive process that lacks scalability. Instead, we leverage the vocabulary humans use to describe body shape. First, we ask annotators to label how much a word like “tall” or “long legs” applies to a human body. We gather these ratings for rendered meshes of various body shapes, for which we have ground-truth body model shape parameters, and for images collected from model agency websites. Using this data, we learn a shape-to-attribute (A2S) model that predicts body shape ratings from body shape parameters. Then we train a human mesh regressor, SHAPY, on the model agency images wherein we supervise body shape via attribute annotations using A2S. Since no suitable test set of diverse 3D ground-truth body shape with images taken in natural settings exists, we introduce Human Bodies in the Wild (HBW). This novel dataset contains photographs of individuals together with their body scan. Our model predicts more realistic body shapes from an image and quantitatively improves body shape estimation on this new benchmark. 
In summary, we present novel datasets, optimization methods, a generative model, and regressors to advance the field of 3D human pose and shape estimation. Taken together, these methods open up ways to obtain more accurate and realistic 3D mesh estimates from images with multiple people in self- and mutual contact poses and with diverse body shapes. This line of research also enables generative approaches to create more natural, human-like avatars. We believe that knowing about self- and human-human contact through computer vision has wide-ranging implications in other fields as for example robotics, fitness, or behavioral science.

ps

[BibTex]

Distributed Event-Based Learning via ADMM

Er, D., Trimpe, S., Muehlebach, M.

Max Planck Institute for Intelligent Systems, 2024 (techreport)

lds

link (url) [BibTex]



Natural Language Control for 3D Human Motion Synthesis

Petrovich, M.

LIGM, Ecole des Ponts, Univ Gustave Eiffel, CNRS, 2024 (phdthesis)

Abstract
3D human motions are at the core of many applications in the film industry, healthcare, augmented reality, virtual reality and video games. However, these applications often rely on expensive and time-consuming motion capture data. The goal of this thesis is to explore generative models as an alternative route to obtain 3D human motions. More specifically, our aim is to allow a natural language interface as a means to control the generation process. To this end, we develop a series of models that synthesize realistic and diverse motions following the semantic inputs. In our first contribution, described in Chapter 3, we address the challenge of generating human motion sequences conditioned on specific action categories. We introduce ACTOR, a conditional variational autoencoder (VAE) that learns an action-aware latent representation for human motions. We show significant gains over existing methods thanks to our new Transformer-based VAE formulation, encoding and decoding SMPL pose sequences through a single motion-level embedding. In our second contribution, described in Chapter 4, we go beyond categorical actions, and dive into the task of synthesizing diverse 3D human motions from textual descriptions allowing a larger vocabulary and potentially more fine-grained control. Our work stands out from previous research by not deterministically generating a single motion sequence, but by synthesizing multiple, varied sequences from a given text. We propose TEMOS, building on our VAE-based ACTOR architecture, but this time integrating a pretrained text encoder to handle large-vocabulary natural language inputs. In our third contribution, described in Chapter 5, we address the adjacent task of text-to-3D human motion retrieval, where the goal is to search in a motion collection by querying via text. 
We introduce a simple yet effective approach, named TMR, building on our earlier model TEMOS, by integrating a contrastive loss to enhance the structure of the cross-modal latent space. Our findings emphasize the importance of retaining the motion generation loss in conjunction with contrastive training for improved results. We establish a new evaluation benchmark and conduct analyses on several protocols. In our fourth contribution, described in Chapter 6, we introduce a new problem termed as “multi-track timeline control” for text-driven 3D human motion synthesis. Instead of a single textual prompt, users can organize multiple prompts in temporal intervals that may overlap. We introduce STMC, a test-time denoising method that can be integrated with any pre-trained motion diffusion model. Our evaluations demonstrate that our method generates motions that closely match the semantic and temporal aspects of the input timelines. In summary, our contributions in this thesis are as follows: (i) we develop a generative variational autoencoder, ACTOR, for action-conditioned generation of human motion sequences, (ii) we introduce TEMOS, a text-conditioned generative model that synthesizes diverse human motions from textual descriptions, (iii) we present TMR, a new approach for text-to-3D human motion retrieval, (iv) we propose STMC, a method for timeline control in text-driven motion synthesis, enabling the generation of detailed and complex motions.

ps

Thesis [BibTex]

Incremental Few-Shot Adaptation for Non-Prehensile Object Manipulation using Parallelizable Physics Simulators

Baumeister, F., Mack, L., Stueckler, J.

CoRR abs/2409.13228, 2024, Submitted to IEEE International Conference on Robotics and Automation (ICRA) 2025 (techreport) Submitted

Abstract
Few-shot adaptation is an important capability for intelligent robots that perform tasks in open-world settings such as everyday environments or flexible production. In this paper, we propose a novel approach for non-prehensile manipulation which iteratively adapts a physics-based dynamics model for model-predictive control. We adapt the parameters of the model incrementally with a few examples of robot-object interactions. This is achieved by sampling-based optimization of the parameters using a parallelizable rigid-body physics simulation as dynamic world model. In turn, the optimized dynamics model can be used for model-predictive control using efficient sampling-based optimization. We evaluate our few-shot adaptation approach in several object pushing experiments in simulation and with a real robot.

ev

preprint supplemental video link (url) [BibTex]


2023


Denoising Representation Learning for Causal Discovery

Sakenyte, U.

Université de Genève, Switzerland, December 2023, external supervision (mastersthesis)

ei

[BibTex]

Gesture-Based Nonverbal Interaction for Exercise Robots

Mohan, M.

University of Tübingen, Tübingen, Germany, October 2023, Department of Computer Science (phdthesis)

Abstract
When teaching or coaching, humans augment their words with carefully timed hand gestures, head and body movements, and facial expressions to provide feedback to their students. Robots, however, rarely utilize these nuanced cues. A minimally supervised social robot equipped with these abilities could support people in exercising, physical therapy, and learning new activities. This thesis examines how the intuitive power of human gestures can be harnessed to enhance human-robot interaction. To address this question, this research explores gesture-based interactions to expand the capabilities of a socially assistive robotic exercise coach, investigating the perspectives of both novice users and exercise-therapy experts. This thesis begins by concentrating on the user's engagement with the robot, analyzing the feasibility of minimally supervised gesture-based interactions. This exploration seeks to establish a framework in which robots can interact with users in a more intuitive and responsive manner. The investigation then shifts its focus toward the professionals who are integral to the success of these innovative technologies: the exercise-therapy experts. Roboticists face the challenge of translating the knowledge of these experts into robotic interactions. We address this challenge by developing a teleoperation algorithm that can enable exercise therapists to create customized gesture-based interactions for a robot. Thus, this thesis lays the groundwork for dynamic gesture-based interactions in minimally supervised environments, with implications for not only exercise-coach robots but also broader applications in human-robot interaction.

hi

Project Page [BibTex]

Efficient Sampling from Differentiable Matrix Elements

Kofler, A.

Technical University of Munich, Germany, September 2023 (mastersthesis)

ei

[BibTex]

Learning and Testing Powerful Hypotheses

Kübler, J. M.

University of Tübingen, Germany, July 2023 (phdthesis)

ei

[BibTex]

Learning Identifiable Representations: Independent Influences and Multiple Views

Gresele, L.

University of Tübingen, Germany, June 2023 (phdthesis)

ei

[BibTex]


Learning with and for discrete optimization

Paulus, M.

ETH Zurich, Switzerland, May 2023, CLS PhD Program (phdthesis)

ei

[BibTex]

Intrinsic complexity and mechanisms of expressivity of cortical neurons

Spieler, A. M.

University of Tübingen, Germany, March 2023 (mastersthesis)

ei

[BibTex]

Causal Effect Estimation by Combining Observational and Interventional Data

Kladny, K.

ETH Zurich, Switzerland, February 2023 (mastersthesis)

lds ei

[BibTex]

Towards Generative Machine Teaching

Qui, Z.

Technical University of Munich, Germany, February 2023 (mastersthesis)

ei

[BibTex]

ArchiSound: Audio Generation with Diffusion

Schneider, F.

ETH Zurich, Switzerland, January 2023, external supervision (mastersthesis)

ei

[BibTex]

Generation and Quantification of Spin in Robot Table Tennis

Dittrich, A.

University of Stuttgart, Germany, January 2023 (mastersthesis)

ei

[BibTex]

Natural Language Processing for Policymaking

Jin, Z., Mihalcea, R.

In Handbook of Computational Social Science for Policy, pages: 141-162, 7, (Editors: Bertoni, E. and Fontana, M. and Gabrielli, L. and Signorelli, S. and Vespe, M.), Springer International Publishing, 2023 (inbook)

ei

DOI [BibTex]

Object-Level Dynamic Scene Reconstruction With Physical Plausibility From RGB-D Images

Strecke, M. F.

Eberhard Karls Universität Tübingen, Tübingen, 2023 (phdthesis)

Abstract
Humans have the remarkable ability to perceive and interact with objects in the world around them. They can easily segment objects from visual data and have an intuitive understanding of how physics influences objects. By contrast, robots are so far often constrained to tailored environments for a specific task, due to their inability to reconstruct a versatile and accurate scene representation. In this thesis, we combine RGB-D video data with background knowledge of real-world physics to develop such a representation for robots.

Our contributions can be separated into two main parts: a dynamic object tracking tool and optimization frameworks that allow for improving shape reconstructions based on physical plausibility. The dynamic object tracking tool "EM-Fusion" detects, segments, reconstructs, and tracks objects from RGB-D video data. We propose a probabilistic data association approach for attributing the image pixels to the different moving objects in the scene. This allows us to track and reconstruct moving objects and the background scene with state-of-the-art accuracy and robustness towards occlusions.

We investigate two ways of further optimizing the reconstructed shapes of moving objects based on physical plausibility. The first of these, "Co-Section", includes physical plausibility by reasoning about the empty space around an object. We observe that no two objects can occupy the same space at the same time and that the depth images in the input video provide an estimate of observed empty space. Based on these observations, we propose intersection and hull constraints, which we combine with the observed surfaces in a global optimization approach. Compared to EM-Fusion, which only reconstructs the observed surface, Co-Section optimizes watertight shapes. These watertight shapes provide a rough estimate of unseen surfaces and could be useful as initialization for further refinement, e.g., by interactive perception. In the second optimization approach, "DiffSDFSim", we reason about object shapes based on physically plausible object motion. We observe that object trajectories after collisions depend on the object's shape, and extend a differentiable physics simulation for optimizing object shapes together with other physical properties (e.g., forces, masses, friction) based on the motion of the objects and their interactions. Our key contributions are using signed distance function models for representing shapes and a novel method for computing gradients that models the dependency of the time of contact on object shapes. We demonstrate that our approach recovers target shapes well by fitting to target trajectories and depth observations. Further, the ground-truth trajectories are recovered well in simulation using the resulting shape and physical properties. This enables predictions about the future motion of objects by physical simulation.

We anticipate that our contributions can be useful building blocks in the development of 3D environment perception for robots. The reconstruction of individual objects as in EM-Fusion is a key ingredient required for interactions with objects. Completed shapes such as those provided by Co-Section offer useful cues for planning interactions like grasping of objects. Finally, the recovery of shape and other physical parameters using differentiable simulation as in DiffSDFSim allows simulating objects and thus predicting the effects of interactions. Future work might extend the presented works for interactive perception of dynamic environments by comparing these predictions with observed real-world interactions to further improve the reconstructions and physical parameter estimations.

ev

link (url) DOI [BibTex]


Synchronizing Machine Learning Algorithms, Realtime Robotic Control and Simulated Environment with o80

Berenz, V., Widmaier, F., Guist, S., Schölkopf, B., Büchler, D.

Robot Software Architectures Workshop (RSA) 2023, ICRA, 2023 (techreport)

Abstract
Robotic applications require the integration of various modalities, encompassing perception, control of real robots and possibly the control of simulated environments. While the state-of-the-art robotic software solutions such as ROS 2 provide most of the required features, flexible synchronization between algorithms, data streams and control loops can be tedious. o80 is a versatile C++ framework for robotics which provides a shared memory model and a command framework for real-time critical systems. It enables expert users to set up complex robotic systems and generate Python bindings for scientists. o80's unique feature is its flexible synchronization between processes, including the traditional blocking commands and the novel "bursting mode", which allows user code to control the execution of the lower process control loop. This makes it particularly useful for setups that mix real and simulated environments.

ei

arxiv poster link (url) [BibTex]


Wave front shaping with zone plates: Fabrication and characterization of lenses for soft x-ray applications from standard to singular optics

Baluktsian, M.

Universität Stuttgart, Stuttgart (and Verlag Dr. Hut, München), 2023 (phdthesis)

mms

link (url) [BibTex]


Static and dynamic investigation of magnonic systems: materials, applications and modeling

Schulz, Frank Martin Ernst

Universität Stuttgart, Stuttgart, 2023 (phdthesis)

mms

link (url) DOI [BibTex]


2022


Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)

Biester, L., Demszky, D., Jin, Z., Sachan, M., Tetreault, J., Wilson, S., Xiao, L., Zhao, J.

Association for Computational Linguistics, December 2022 (proceedings)

ei

link (url) [BibTex]

Multi-Timescale Representation Learning of Human and Robot Haptic Interactions

Richardson, B.

University of Stuttgart, Stuttgart, Germany, December 2022, Faculty of Computer Science, Electrical Engineering and Information Technology (phdthesis)

Abstract
The sense of touch is one of the most crucial components of the human sensory system. It allows us to safely and intelligently interact with the physical objects and environment around us. By simply touching or dexterously manipulating an object, we can quickly infer a multitude of its properties. For more than fifty years, researchers have studied how humans physically explore and form perceptual representations of objects. Some of these works proposed the paradigm through which human haptic exploration is presently understood: humans use a particular set of exploratory procedures to elicit specific semantic attributes from objects. Others have sought to understand how physically measured object properties correspond to human perception of semantic attributes. Few, however, have investigated how specific explorations are perceived. As robots become increasingly advanced and more ubiquitous in daily life, they are beginning to be equipped with haptic sensing capabilities and algorithms for processing and structuring haptic information. Traditional haptics research has so far strongly influenced the introduction of haptic sensation and perception into robots but has not proven sufficient to give robots the necessary tools to become intelligent autonomous agents. The work presented in this thesis seeks to understand how single and sequential haptic interactions are perceived by both humans and robots. In our first study, we depart from the more traditional methods of studying human haptic perception and investigate how the physical sensations felt during single explorations are perceived by individual people. We treat interactions as probability distributions over a haptic feature space and train a model to predict how similarly a pair of surfaces is rated, predicting perceived similarity with a reasonable degree of accuracy. Our novel method also allows us to evaluate how individual people weigh different surface properties when they make perceptual judgments. 
The method is highly versatile and presents many opportunities for further studies into how humans form perceptual representations of specific explorations. Our next body of work explores how to improve robotic haptic perception of single interactions. We use unsupervised feature-learning methods to derive powerful features from raw robot sensor data and classify robot explorations into numerous haptic semantic property labels that were assigned from human ratings. Additionally, we provide robots with more nuanced perception by learning to predict graded ratings of a subset of properties. Our methods outperform previous attempts that all used hand-crafted features, demonstrating the limitations of such traditional approaches. To push robot haptic perception beyond evaluation of single explorations, our final work introduces and evaluates a method to give robots the ability to accumulate information over many sequential actions; our approach essentially takes advantage of object permanence by conditionally and recursively updating the representation of an object as it is sequentially explored. We implement our method on a robotic gripper platform that performs multiple exploratory procedures on each of many objects. As the robot explores objects with new procedures, it gains confidence in its internal representations and classification of object properties, thus moving closer to the marvelous haptic capabilities of humans and providing a solid foundation for future research in this domain.

hi

link (url) Project Page [BibTex]

Reconstructing Expressive 3D Humans from RGB Images

Choutas, V.

ETH Zurich, Max Planck Institute for Intelligent Systems and ETH Zurich, December 2022 (phdthesis)

Abstract
To interact with our environment, we need to adapt our body posture and grasp objects with our hands. During a conversation our facial expressions and hand gestures convey important non-verbal cues about our emotional state and intentions towards our fellow speakers. Thus, modeling and capturing 3D full-body shape and pose, hand articulation and facial expressions are necessary to create realistic human avatars for augmented and virtual reality. This is a complex task, due to the large number of degrees of freedom for articulation, body shape variance, occlusions from objects and self-occlusions from body parts, e.g. crossing our hands, and subject appearance. The community has thus far relied on expensive and cumbersome equipment, such as multi-view cameras or motion capture markers, to capture the 3D human body. While this approach is effective, it is limited to a small number of subjects and indoor scenarios. Using monocular RGB cameras would greatly simplify the avatar creation process, thanks to their lower cost and ease of use. These advantages come at a price, though, since RGB capture methods need to deal with occlusions, perspective ambiguity and large variations in subject appearance, in addition to all the challenges posed by full-body capture. In an attempt to simplify the problem, researchers generally adopt a divide-and-conquer strategy, estimating the body, face and hands with distinct methods using part-specific datasets and benchmarks. However, the hands and face constrain the body and vice-versa, e.g. the position of the wrist depends on the elbow, shoulder, etc.; the divide-and-conquer approach cannot utilize this constraint. In this thesis, we aim to reconstruct the full 3D human body, using only readily accessible monocular RGB images. In a first step, we introduce a parametric 3D body model, called SMPL-X, that can represent full-body shape and pose, hand articulation and facial expression. 
Next, we present an iterative optimization method, named SMPLify-X, that fits SMPL-X to 2D image keypoints. While SMPLify-X can produce plausible results if the 2D observations are sufficiently reliable, it is slow and susceptible to initialization. To overcome these limitations, we introduce ExPose, a neural network regressor, that predicts SMPL-X parameters from an image using body-driven attention, i.e. by zooming in on the hands and face, after predicting the body. From the zoomed-in part images, dedicated part networks predict the hand and face parameters. ExPose combines the independent body, hand, and face estimates by trusting them equally. This approach, however, does not fully exploit the correlation between parts and fails in the presence of challenges such as occlusion or motion blur. Thus, we need a better mechanism to aggregate information from the full body and part images. PIXIE uses neural networks called moderators that learn to fuse information from these two image sets before predicting the final part parameters. Overall, the addition of the hands and face leads to noticeably more natural and expressive reconstructions. Creating high-fidelity avatars from RGB images requires accurate estimation of 3D body shape. Although existing methods are effective at predicting body pose, they struggle with body shape. We identify the lack of proper training data as the cause. To overcome this obstacle, we propose to collect internet images from fashion-model websites, together with anthropometric measurements. At the same time, we ask human annotators to rate images and meshes according to a pre-defined set of linguistic attributes. We then define mappings between measurements, linguistic shape attributes and 3D body shape. Equipped with these mappings, we train a neural network regressor, SHAPY, that predicts accurate 3D body shapes from a single RGB image. We observe that existing 3D shape benchmarks lack subject variety and/or ground-truth shape. 
Thus, we introduce a new benchmark, Human Bodies in the Wild (HBW), which contains images of humans and their corresponding 3D ground-truth body shape. SHAPY shows how we can overcome the lack of in-the-wild images with 3D shape annotations through easy-to-obtain anthropometric measurements and linguistic shape attributes. Regressors that estimate 3D model parameters are robust and accurate, but often fail to tightly fit the observations. Optimization-based approaches tightly fit the data, by minimizing an energy function composed of a data term that penalizes deviations from the observations and priors that encode our knowledge of the problem. Finding the balance between these terms and implementing a performant version of the solver is a time-consuming and non-trivial task. Machine-learned continuous optimizers combine the benefits of both regression and optimization approaches. They learn the priors directly from data, avoiding the need for hand-crafted heuristics and loss term balancing, and benefit from optimized neural network frameworks for fast inference. Inspired by the classic Levenberg-Marquardt algorithm, we propose a neural optimizer that outperforms classic optimization, regression and hybrid optimization-regression approaches. Our proposed update rule uses a weighted combination of gradient descent and a network-predicted update. To show the versatility of the proposed method, we apply it to three other problems, namely full body estimation from (i) 2D keypoints, (ii) head and hand location from a head-mounted device and (iii) face tracking from dense 2D landmarks. Our method can easily be applied to new model fitting problems and offers a competitive alternative to well-tuned traditional model fitting pipelines, both in terms of accuracy and speed. To summarize, we propose a new and richer representation of the human body, SMPL-X, that is able to jointly model the 3D human body pose and shape, facial expressions and hand articulation. 
We propose methods, SMPLify-X, ExPose and PIXIE that estimate SMPL-X parameters from monocular RGB images, progressively improving the accuracy and realism of the predictions. To further improve reconstruction fidelity, we demonstrate how we can use easy-to-collect internet data and human annotations to overcome the lack of 3D shape data and train a model, SHAPY, that predicts accurate 3D body shape from a single RGB image. Finally, we propose a flexible learnable update rule for parametric human model fitting that outperforms both classic optimization and neural network approaches. This approach is easily applicable to a variety of problems, unlocking new applications in AR/VR scenarios.

ps

pdf [BibTex]

Mechanical Design, Development and Testing of Bioinspired Legged Robots for Dynamic Locomotion

Sarvestani, L. A.

Eberhard Karls Universität Tübingen, Tübingen, November 2022 (phdthesis)

dlg

DOI [BibTex]

Towards learning mechanistic models at the right level of abstraction

Neitz, A.

University of Tübingen, Germany, November 2022 (phdthesis)

ei

[BibTex]

Magnetic Micro-/Nanopropellers for Biomedicine

Qiu, T., Jeong, M., Goyal, R., Kadiri, V., Sachs, J., Fischer, P.

In Field-Driven Micro and Nanorobots for Biology and Medicine, pages: 389-410, 16, (Editors: Sun, Y. and Wang, X. and Yu, J.), Springer, Cham, 2022 (inbook)

Abstract
In nature, many bacteria swim by rotating their helical flagella. A particularly promising class of artificial micro- and nano-robots mimic this propeller-like propulsion mechanism to move through fluids and tissues for applications in minimally-invasive medicine. Several fundamental challenges have to be overcome in order to build micro-machines that move similarly to bacteria for in vivo applications. Here, we review recent advances in magnetically-powered micro-/nano-propellers. Four important aspects of the propellers – the geometrical shape, the fabrication method, the generation of magnetic fields for actuation, and the choice of biocompatible magnetic materials – are highlighted. First, the fundamental requirements are elucidated that arise due to hydrodynamics at low Reynolds (Re) number. We discuss the role that the propellers’ shape and symmetry play in realizing effective propulsion at low Re. Second, the additive nano-fabrication method Glancing Angle Deposition is discussed as a versatile technique to quickly grow large numbers of designer nano-helices. Third, systems to generate rotating magnetic fields via permanent magnets or electromagnetic coils are presented. And finally, the biocompatibility of the magnetic materials is discussed. Iron-platinum is highlighted due to its biocompatibility and its superior magnetic properties, which make it promising for targeted delivery, minimally-invasive magnetic nano-devices and biomedical applications.

pf

link (url) DOI [BibTex]

Life Improvement Science

Lieder, F., Prentice, M.

In Encyclopedia of Quality of Life and Well-Being Research, Springer, November 2022 (inbook)

re

DOI [BibTex]

Learning Causal Representations for Generalization and Adaptation in Supervised, Imitation, and Reinforcement Learning

Lu, C.

University of Cambridge, UK, Cambridge, October 2022, (Cambridge-Tübingen-Fellowship) (phdthesis)

ei

[BibTex]

Investigating Independent Mechanisms in Neural Networks

Liang, W.

Université Paris-Saclay, France, October 2022 (mastersthesis)

ei

[BibTex]

Understanding the Influence of Moisture on Fingerpad-Surface Interactions

Nam, S.

University of Tübingen, Tübingen, Germany, October 2022, Department of Computer Science (phdthesis)

Abstract
People frequently touch objects with their fingers. The physical deformation of a finger pressing an object surface stimulates mechanoreceptors, resulting in a perceptual experience. Through interactions between perceptual sensations and motor control, humans naturally acquire the ability to manage friction under various contact conditions. Many researchers have advanced our understanding of human fingers to this point, but their complex structure and the variations in friction they experience due to continuously changing contact conditions necessitate additional study. Moisture is a primary factor that influences many aspects of the finger. In particular, sweat excreted from the numerous sweat pores on the fingerprints modifies the finger's material properties and the contact conditions between the finger and a surface. Measuring changes of the finger's moisture over time and in response to external stimuli presents a challenge for researchers, as commercial moisture sensors do not provide continuous measurements. This dissertation investigates the influence of moisture on fingerpad-surface interactions from diverse perspectives. First, we examine the extent to which moisture on the finger contributes to the sensation of stickiness during contact with glass. Second, we investigate the representative material properties of a finger at three distinct moisture levels, since the softness of human skin varies significantly with moisture. The third perspective is friction; we examine how the contact conditions, including the moisture of a finger, determine the available friction force opposing lateral sliding on glass. Fourth, we have invented and prototyped a transparent in vivo moisture sensor for the continuous measurement of finger hydration. In the first part of this dissertation, we explore how the perceptual intensity of light stickiness relates to the physical interaction between the skin and the surface. 
We conducted a psychophysical experiment in which nine participants actively pressed their index finger on a flat glass plate with a normal force close to 1.5 N and then detached it after a few seconds. A custom-designed apparatus recorded the contact force vector and the finger contact area during each interaction as well as pre- and post-trial finger moisture. After detaching their finger, participants judged the stickiness of the glass using a nine-point scale. We explored how sixteen physical variables derived from the recorded data correlate with each other and with the stickiness judgments of each participant. These analyses indicate that stickiness perception mainly depends on the pre-detachment pressing duration, the time taken for the finger to detach, and the impulse in the normal direction after the normal force changes sign; finger-surface adhesion seems to build with pressing time, causing a larger normal impulse during detachment and thus a more intense stickiness sensation. We additionally found a strong between-subjects correlation between maximum real contact area and peak pull-off force, as well as between finger moisture and impulse. When a fingerpad presses into a hard surface, the development of the contact area depends on the pressing force and speed. Importantly, it also varies with the finger's moisture, presumably because hydration changes the tissue's material properties. Therefore, for the second part of this dissertation, we collected data from one finger repeatedly pressing a glass plate under three moisture conditions, and we constructed a finite element model that we optimized to simulate the same three scenarios. We controlled the moisture of the subject's finger to be dry, natural, or moist and recorded 15 pressing trials in each condition. The measurements include normal force over time plus finger-contact images that are processed to yield gross contact area. 
We defined the axially symmetric 3D model's lumped parameters to include an SLS-Kelvin model (spring in series with parallel spring and damper) for the bulk tissue, plus an elastic epidermal layer. Particle swarm optimization was used to find the parameter values that cause the simulation to best match the trials recorded in each moisture condition. The results show that the softness of the bulk tissue reduces as the finger becomes more hydrated. The epidermis of the moist finger model is softest, while the natural finger model has the highest viscosity. In the third part of this dissertation, we focused on friction between the fingerpad and the surface. The magnitude of finger-surface friction available at the onset of full slip is crucial for understanding how the human hand can grip and manipulate objects. Related studies revealed the significance of moisture and contact time in enhancing friction. Recent research additionally indicated that surface temperature may also affect friction. However, previously reported friction coefficients have been measured only in dynamic contact conditions, where the finger is already sliding across the surface. In this study, we repeatedly measured the initial friction before full slip under eight contact conditions with low and high finger moisture, pressing time, and surface temperature. Moisture and pressing time both independently increased finger-surface friction across our population of twelve participants, and the effect of surface temperature depended on the contact conditions. Furthermore, detailed analysis of the recorded measurements indicates that micro stick-slip during the partial-slip phase contributes to enhanced friction. For the fourth and final part of this dissertation, we designed a transparent moisture sensor for continuous measurement of fingerpad hydration. 
Because various stimuli cause the sweat pores on fingerprints to excrete sweat, many researchers want to quantify the flow and assess its impact on the formation of the contact area. Unfortunately, the most popular sensor for skin hydration is opaque and does not offer continuous measurements. Our capacitive moisture sensor consists of a pair of inter-digital electrodes covered by an insulating layer, enabling impedance measurements across a wide frequency range. This proposed sensor is made entirely of transparent materials, which allows us to simultaneously measure the finger's contact area. Electrochemical impedance spectroscopy identifies the equivalent electrical circuit and the electrical component parameters that are affected by the amount of moisture present on the surface of the sensor. Most notably, the impedance at 1 kHz seems to best reflect the relative amount of sweat.

hi

DOI Project Page [BibTex]

Learning Time-Continuous Dynamics Models with Gaussian-Process-Based Gradient Matching

Wenk, P.

ETH Zurich, Switzerland, October 2022, CLS PhD Program (phdthesis)

ei

[BibTex]

Multi-Target Multi-Object Manipulation using Relational Deep Reinforcement Learning

Feil, M.

Technical University of Munich, Germany, September 2022 (mastersthesis)

ei

[BibTex]

Independent Mechanism Analysis for High Dimensions

Sliwa, J.

University of Tübingen, Germany, September 2022, (Graduate Training Centre of Neuroscience) (mastersthesis)

ei

[BibTex]
