Publications

DEPARTMENTS

Empirical Inference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Conference Paper SimpleEgo: Predicting Probabilistic Body Pose from Egocentric Cameras Velasquez, H. C., Hewitt, C., Aliakbarian, S., Baltrušaitis, T. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Accepted)
Our work addresses the problem of egocentric human pose estimation from downwards-facing cameras on head-mounted devices (HMD). This presents a challenging scenario, as parts of the body often fall outside of the image or are occluded. Previous solutions minimize this problem by using fish-eye camera lenses to capture a wider view, but these can present hardware design issues. They also predict 2D heat-maps per joint and lift them to 3D space to deal with self-occlusions, but this requires large network architectures which are impractical to deploy on resource-constrained HMDs. We predict pose from images captured with conventional rectilinear camera lenses. This resolves hardware design issues, but means body parts are often out of frame. As such, we directly regress probabilistic joint rotations represented as matrix Fisher distributions for a parameterized body model. This allows us to quantify pose uncertainties and explain out-of-frame or occluded joints. This also removes the need to compute 2D heat-maps and allows for simplified DNN architectures which require less compute. Given the lack of egocentric datasets using rectilinear camera lenses, we introduce the SynthEgo dataset, a synthetic dataset with 60K stereo images containing high diversity of pose, shape, clothing and skin tone. Our approach achieves state-of-the-art results for this challenging configuration, reducing mean per-joint position error by 23% overall and 58% for the lower body. Our architecture also has eight times fewer parameters and runs twice as fast as the current state-of-the-art. Experiments show that training on our synthetic dataset leads to good generalization to real world images without fine-tuning.
Home Dataset BibTeX
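The key modeling choice in the abstract above, regressing joint rotations as matrix Fisher distributions, can be illustrated with a short sketch. This is not the authors' code; it only shows the standard density p(R; F) ∝ exp(tr(FᵀR)) over rotations and its SVD-based mode, which are what make the uncertainty quantification possible:

```python
import numpy as np

def matrix_fisher_log_density(R, F):
    """Unnormalized log-density of a matrix Fisher distribution on SO(3):
    log p(R; F) = tr(F^T R) + const, where F is the 3x3 parameter matrix
    a network would regress per joint."""
    return np.trace(F.T @ R)

def matrix_fisher_mode(F):
    """Most likely rotation: project F onto SO(3) via an SVD, flipping the
    sign of the last singular vector if needed so that det(R) = +1."""
    U, _, Vt = np.linalg.svd(F)
    D = np.diag([1.0, 1.0, np.linalg.det(U @ Vt)])
    return U @ D @ Vt
```

The singular values of F encode concentration: large values correspond to a confident prediction, near-zero values (e.g., for an out-of-frame joint) to a nearly uniform distribution over rotations.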

Perceiving Systems Conference Paper ArtiGrasp: Physically Plausible Synthesis of Bi-Manual Dexterous Grasping and Articulation Zhang, H., Christen, S., Fan, Z., Zheng, L., Hwangbo, J., Song, J., Hilliges, O. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We present ArtiGrasp, a novel method to synthesize bi-manual hand-object interactions that include grasping and articulation. This task is challenging due to the diversity of the global wrist motions and the precise finger control that are necessary to articulate objects. ArtiGrasp leverages reinforcement learning and physics simulations to train a policy that controls the global and local hand pose. Our framework unifies grasping and articulation within a single policy guided by a single hand pose reference. Moreover, to facilitate the training of the precise finger control required for articulation, we present a learning curriculum with increasing difficulty. It starts with single-hand manipulation of stationary objects and continues with multi-agent training including both hands and non-stationary objects. To evaluate our method, we introduce Dynamic Object Grasping and Articulation, a task that involves bringing an object into a target articulated pose. This task requires grasping, relocation, and articulation. We show our method's efficacy towards this task. We further demonstrate that our method can generate motions with noisy hand-object pose estimates from an off-the-shelf image-based regressor.
pdf project code BibTeX

Perceiving Systems Conference Paper GRIP: Generating Interaction Poses Using Spatial Cues and Latent Consistency Taheri, O., Zhou, Y., Tzionas, D., Zhou, Y., Ceylan, D., Pirk, S., Black, M. J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
Hands are dexterous and highly versatile manipulators that are central to how humans interact with objects and their environment. Consequently, modeling realistic hand-object interactions, including the subtle motion of individual fingers, is critical for applications in computer graphics, computer vision, and mixed reality. Prior work on capturing and modeling humans interacting with objects in 3D focuses on the body and object motion, often ignoring hand pose. In contrast, we introduce GRIP, a learning-based method that takes, as input, the 3D motion of the body and the object, and synthesizes realistic motion for both hands before, during, and after object interaction. As a preliminary step before synthesizing the hand motion, we first use a network, ANet, to denoise the arm motion. Then, we leverage the spatio-temporal relationship between the body and the object to extract novel temporal interaction cues, and use them in a two-stage inference pipeline to generate the hand motion. In the first stage, we introduce a new approach to encourage motion temporal consistency in the latent space (LTC) and generate consistent interaction motions. In the second stage, GRIP generates refined hand poses to avoid hand-object penetrations. Given sequences of noisy body and object motion, GRIP “upgrades” them to include hand-object interaction. Quantitative experiments and perceptual studies demonstrate that GRIP outperforms baseline methods and generalizes to unseen objects and motions from different motion-capture datasets. Our models and code are available for research purposes.
Paper SupMat Poster URL BibTeX

Perceiving Systems Conference Paper PACE: Human and Camera Motion Estimation from in-the-wild Videos Kocabas, M., Yuan, Y., Molchanov, P., Guo, Y., Black, M. J., Hilliges, O., Kautz, J., Iqbal, U. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We present a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the coupling of human and camera motions in the video. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use SLAM as initialization, we propose to tightly integrate SLAM and human motion priors in an optimization that is inspired by bundle adjustment. Specifically, we optimize human and camera motions to match both the observed human pose and scene features. This design combines the strengths of SLAM and motion priors, which leads to significant improvements in human and camera motion estimation. We additionally introduce a motion prior that is suitable for batch optimization, making our approach significantly more efficient than existing approaches. Finally, we propose a novel synthetic dataset that enables evaluating camera motion in addition to human motion from dynamic videos. Experiments on the synthetic and real-world RICH datasets demonstrate that our approach substantially outperforms prior art in recovering both human and camera motions.
arXiv Project Page YouTube Poster BibTeX

Perceiving Systems Conference Paper POCO: 3D Pose and Shape Estimation using Confidence Dwivedi, S. K., Schmid, C., Yi, H., Black, M. J., Tzionas, D. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
The regression of 3D Human Pose and Shape (HPS) from an image is becoming increasingly accurate. This makes the results useful for downstream tasks like human action recognition or 3D graphics. Yet, no regressor is perfect, and accuracy can be affected by ambiguous image evidence or by poses and appearance that are unseen during training. Most current HPS regressors, however, do not report the confidence of their outputs, meaning that downstream tasks cannot differentiate accurate estimates from inaccurate ones. To address this, we develop POCO, a novel framework for training HPS regressors to estimate not only a 3D human body but also its confidence, in a single feed-forward pass. Specifically, POCO estimates both the 3D body pose and a per-sample variance. The key idea is to introduce a Dual Conditioning Strategy (DCS) for regressing uncertainty that is highly correlated to pose reconstruction quality. The POCO framework can be applied to any HPS regressor, and here we evaluate it by modifying HMR, PARE, and CLIFF. In all cases, training the network to reason about uncertainty helps it learn to more accurately estimate 3D pose. While this was not our goal, the improvement is modest but consistent. Our main motivation is to provide uncertainty estimates for downstream tasks; we demonstrate this in two ways: (1) We use the confidence estimates to bootstrap HPS training. Given unlabelled image data, we take the confident estimates of a POCO-trained regressor as pseudo ground truth. Retraining with this automatically-curated data improves accuracy. (2) We exploit uncertainty in video pose estimation by automatically identifying uncertain frames (e.g. due to occlusion) and inpainting these from confident frames.
Code Paper SupMat Poster URL BibTeX
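POCO's specific Dual Conditioning Strategy is not reproduced here, but the general idea named in the abstract, jointly regressing a prediction and a per-sample variance, is commonly implemented as a heteroscedastic Gaussian negative log-likelihood. A minimal sketch under that assumption (the function name and usage are illustrative):

```python
import numpy as np

def heteroscedastic_nll(pred, target, log_var):
    """Gaussian NLL with a learned per-sample variance: the network outputs
    both the pose `pred` and `log_var`, its own log-uncertainty.
    Errors are down-weighted by exp(-log_var), while the + log_var term
    keeps the network from claiming unbounded uncertainty everywhere."""
    sq_err = (pred - target) ** 2
    return 0.5 * np.sum(np.exp(-log_var) * sq_err + log_var)
```

At inference, the predicted variance serves as the confidence score used, for example, to select confident estimates as pseudo ground truth for retraining.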

Perceiving Systems Conference Paper TADA! Text to Animatable Digital Avatars Liao, T., Yi, H., Xiu, Y., Tang, J., Huang, Y., Thies, J., Black, M. J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
We introduce TADA, a simple-yet-effective approach that takes textual descriptions and produces expressive 3D avatars with high-quality geometry and lifelike textures that can be animated and rendered with traditional graphics pipelines. Existing text-based character generation methods are limited in terms of geometry and texture quality, and cannot be realistically animated due to inconsistent alignment between the geometry and the texture, particularly in the face region. To overcome these limitations, TADA leverages the synergy of a 2D diffusion model and an animatable parametric body model. Specifically, we derive an optimizable high-resolution body model from SMPL-X with 3D displacements and a texture map, and use hierarchical rendering with score distillation sampling (SDS) to create high-quality, detailed, holistic 3D avatars from text. To ensure alignment between the geometry and texture, we render normals and RGB images of the generated character and exploit their latent embeddings in the SDS training process. We further introduce various expression parameters to deform the generated character during training, ensuring that the semantics of our generated character remain consistent with the original SMPL-X model, resulting in an animatable character. Comprehensive evaluations demonstrate that TADA significantly surpasses existing approaches on both qualitative and quantitative measures. TADA enables creation of large-scale digital character assets that are ready for animation and rendering, while also being easily editable through natural language. The code will be public for research purposes.
Home Code Video URL BibTeX

Perceiving Systems Neural Capture and Synthesis Conference Paper TECA: Text-Guided Generation and Editing of Compositional 3D Avatars Zhang, H., Feng, Y., Kulits, P., Wen, Y., Thies, J., Black, M. J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
Our goal is to create a realistic 3D facial avatar with hair and accessories using only a text description. While this challenge has attracted significant recent interest, existing methods either lack realism, produce unrealistic shapes, or do not support editing, such as modifications to the hairstyle. We argue that existing methods are limited because they employ a monolithic modeling approach, using a single representation for the head, face, hair, and accessories. Our observation is that the hair and face, for example, have very different structural qualities that benefit from different representations. Building on this insight, we generate avatars with a compositional model, in which the head, face, and upper body are represented with traditional 3D meshes, and the hair, clothing, and accessories with neural radiance fields (NeRF). The model-based mesh representation provides a strong geometric prior for the face region, improving realism while enabling editing of the person's appearance. By using NeRFs to represent the remaining components, our method is able to model and synthesize parts with complex geometry and appearance, such as curly hair and fluffy scarves. Our novel system synthesizes these high-quality compositional avatars from text descriptions. The experimental results demonstrate that our method, Text-guided generation and Editing of Compositional Avatars (TECA), produces avatars that are more realistic than those of recent methods while being editable because of their compositional nature. For example, our TECA enables the seamless transfer of compositional features like hairstyles, scarves, and other accessories between avatars. This capability supports applications such as virtual try-on.
arXiv project URL BibTeX

Perceiving Systems Conference Paper TeCH: Text-guided Reconstruction of Lifelike Clothed Humans Huang, Y., Yi, H., Xiu, Y., Liao, T., Tang, J., Cai, D., Thies, J. In International Conference on 3D Vision (3DV 2024), 3DV, March 2024 (Published)
Despite recent research advancements in reconstructing clothed humans from a single image, accurately restoring the "unseen regions" with high-level details remains an unsolved challenge that lacks attention. Existing methods often generate overly smooth back-side surfaces with a blurry texture. But how can we effectively capture, from a single image, the visual attributes of an individual that are sufficient to reconstruct unseen areas (e.g., the back view)? Motivated by the power of foundation models, TeCH reconstructs the 3D human by leveraging 1) descriptive text prompts (e.g., garments, colors, hairstyles) which are automatically generated via a garment parsing model and Visual Question Answering (VQA), and 2) a personalized fine-tuned Text-to-Image diffusion model (T2I) which learns the "indescribable" appearance. To represent high-resolution 3D clothed humans at an affordable cost, we propose a hybrid 3D representation based on DMTet, which consists of an explicit body shape grid and an implicit distance field. Guided by the descriptive prompts and the personalized T2I diffusion model, the geometry and texture of the 3D humans are optimized through multi-view Score Distillation Sampling (SDS) and reconstruction losses based on the original observation. TeCH produces high-fidelity 3D clothed humans with consistent and delicate texture, and detailed full-body geometry. Quantitative and qualitative experiments demonstrate that TeCH outperforms the state-of-the-art methods in terms of reconstruction accuracy and rendering quality.
Code Home Video arXiv URL BibTeX

Autonomous Learning Conference Paper Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements Charaja, J. P., Wochner, I., Schumacher, P., Ilg, W., Giese, M., Maufroy, C., Bulling, A., Schmitt, S., Martius, G., Haeufle, D. F. Proceedings 2024 10th IEEE RAS/EMBS International Conference for Biomedical Robotics and Biomechatronics (BioRob), 562-568, IEEE, February 2024 (Published)
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made of how realistic the arm movements generated by each factor are, or of whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices.
DOI URL BibTeX
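As a rough illustration of how the optimality principles named in the abstract above enter reinforcement learning, they become weighted terms in a scalar reward. The sketch below is schematic only; the weights and term names are invented, not those used in the paper:

```python
def movement_reward(distance, end_velocity, end_acceleration,
                    mech_work, hand_jerk, control_effort,
                    w_task=(1.0, 0.1, 0.1), w_cost=(0.01, 0.001, 0.01)):
    """Schematic reaching reward: the task terms reward reaching the target
    with near-zero end velocity and acceleration (task requirements), while
    the cost terms penalize mechanical work, hand jerk, and control effort
    (optimality principles). All weights are illustrative placeholders."""
    task = -(w_task[0] * distance
             + w_task[1] * abs(end_velocity)
             + w_task[2] * abs(end_acceleration))
    cost = (w_cost[0] * mech_work
            + w_cost[1] * hand_jerk
            + w_cost[2] * control_effort)
    return task - cost
```

A policy trained to maximize such a reward trades off task accuracy against effort, which is the comparison the paper quantifies.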

Deep Models and Optimization Conference Paper SDEs for Minimax Optimization Monzio Compagnoni, E., Orvieto, A., Kersting, H., Proske, F., Lucchi, A. PMLR, AISTATS, February 2024 (Published) URL BibTeX

Rationality Enhancement Software Workshop Article Optimal feedback improves behavioral focus during self-regulated computer-based work. Wirzberger, M., Lado, A., Prentice, M., Oreshnikov, I., Passy, J., Stock, A., Lieder, F. Scientific Reports, 14:3134-, February 2024 (Published)
Distractions are omnipresent and can derail our attention, which is a precious and very limited resource. To achieve their goals in the face of distractions, people need to regulate their attention, thoughts, and behavior; this is known as self-regulation. How can self-regulation be supported or strengthened in ways that are relevant for everyday work and learning activities? To address this question, we introduce and evaluate a desktop application that helps people stay focused on their work and train self-regulation at the same time. Our application lets the user set a goal for what they want to do during a defined period of focused work at their computer, then gives negative feedback when they get distracted, and positive feedback when they reorient their attention towards their goal. After this so-called focus session, the user receives overall feedback on how well they focused on their goal relative to previous sessions. While existing approaches to attention training often use artificial tasks, our approach transforms real-life challenges into opportunities for building strong attention control skills. Our results indicate that optimal attentional feedback can generate large increases in behavioral focus, task motivation, and self-control – benefitting users to successfully achieve their long-term goals.
DOI URL BibTeX

Haptic Intelligence Ph.D. Thesis Creating a Haptic Empathetic Robot Animal That Feels Touch and Emotion Burns, R. B. University of Tübingen, Tübingen, Germany, February 2024, Department of Computer Science (Published)
Social touch, such as a hug or a poke on the shoulder, is an essential aspect of everyday interaction. Humans use social touch to gain attention, communicate needs, express emotions, and build social bonds. Despite its importance, touch sensing is very limited in most commercially available robots. By endowing robots with social-touch perception, one can unlock a myriad of new interaction possibilities. In this thesis, I present my work on creating a Haptic Empathetic Robot Animal (HERA), a koala-like robot for children with autism. I demonstrate the importance of establishing design guidelines based on one's target audience, which we investigated through interviews with autism specialists. I share our work on creating full-body tactile sensing for the NAO robot using low-cost, do-it-yourself (DIY) methods, and I introduce an approach to model long-term robot emotions using second-order dynamics.
BibTeX

Empirical Inference Ph.D. Thesis Identifiable Causal Representation Learning: Unsupervised, Multi-View, and Multi-Environment von Kügelgen, J. University of Cambridge, Cambridge, UK, February 2024, (Cambridge-Tübingen-Fellowship) (Published) URL BibTeX

Empirical Inference Conference Paper Inferring Atmospheric Properties of Exoplanets with Flow Matching and Neural Importance Sampling Gebhard, T. D., Wildberger, J., Dax, M., Angerhausen, D., Quanz, S. P., Schölkopf, B. 3rd Annual AAAI Workshop on AI to Accelerate Science and Engineering (AI2ASE), February 2024 (Published) arXiv URL BibTeX

Haptic Intelligence Robotics Article IMU-Based Kinematics Estimation Accuracy Affects Gait Retraining Using Vibrotactile Cues Rokhmanova, N., Pearl, O., Kuchenbecker, K. J., Halilaj, E. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 32:1005-1012, February 2024 (Published)
Wearable sensing using inertial measurement units (IMUs) is enabling portable and customized gait retraining for knee osteoarthritis. However, the vibrotactile feedback that users receive directly depends on the accuracy of IMU-based kinematics. This study investigated how kinematic errors impact an individual's ability to learn a therapeutic gait using vibrotactile cues. Sensor accuracy was computed by comparing the IMU-based foot progression angle to marker-based motion capture, which was used as ground truth. Thirty subjects were randomized into three groups to learn a toe-in gait: one group received vibrotactile feedback during gait retraining in the laboratory, another received feedback outdoors, and the control group received only verbal instruction and proceeded directly to the evaluation condition. All subjects were evaluated on their ability to maintain the learned gait in a new outdoor environment. We found that subjects with high tracking errors exhibited more incorrect responses to vibrotactile cues and slower learning rates than subjects with low tracking errors. Subjects with low tracking errors outperformed the control group in the evaluation condition, whereas those with higher error did not. Errors were correlated with foot size and angle magnitude, which may indicate a non-random algorithmic bias. The accuracy of IMU-based kinematics has a cascading effect on feedback; ignoring this effect could lead researchers or clinicians to erroneously classify a patient as a non-responder if they did not improve after retraining. To use patient and clinician time effectively, future implementation of portable gait retraining will require assessment across a diverse range of patients.
DOI BibTeX

Perceiving Systems Conference Paper Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion Chhatre, K., Daněček, R., Athanasiou, N., Becherini, G., Peters, C., Black, M. J., Bolkart, T. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1942-1953, Piscataway, NJ, IEEE/CVF Conference on Computer Vision and Pattern Recognition, January 2024 (Published)
Existing methods for synthesizing 3D human gestures from speech have shown promising results, but they do not explicitly model the impact of emotions on the generated gestures. Instead, these methods directly output animations from speech without control over the expressed emotion. To address this limitation, we present AMUSE, an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e., gestures related to speech rhythm and word utterances), emotion, and personal style are separable. To account for this, AMUSE maps the driving audio to three disentangled latent vectors: one for content, one for emotion, and one for personal style. A latent diffusion model, trained to generate gesture motion sequences, is then conditioned on these latent vectors. Once trained, AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative, quantitative, and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art, the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech.
Project Paper Code DOI URL BibTeX

Physical Intelligence Article Hierarchical Nanostructures as Acoustically Manipulatable Multifunctional Agents in Dynamic Fluid Flow Kim, D. W., Wrede, P., Estrada, H., Yildiz, E., Lazovic, J., Bhargava, A., Razansky, D., Sitti, M. Advanced Materials, 36(50), January 2024 (Published)
Acoustic waves provide a biocompatible and deep-tissue-penetrating tool suitable for contactless manipulation in in vivo environments. Despite the prevalence of dynamic fluids within the body, previous studies have primarily focused on static fluids, and manipulatable agents in dynamic fluids are limited to gaseous core-shell particles. However, these gas-filled particles face challenges in fast-flow manipulation, complex setups, design versatility, and practical medical imaging, underscoring the need for effective alternatives. In this study, flower-like hierarchical nanostructures (HNS) are incorporated into microparticles (MPs), and it is demonstrated that various materials fabricated as HNS-MPs exhibit effective and reproducible acoustic trapping within high-velocity fluid flows. Through simulations, it is validated that the HNS-MPs are drawn to the focal point by acoustic streaming and form a trap through secondary acoustic streaming at the tips of the nanosheets comprising the HNS-MPs. Furthermore, the wide range of materials and modification options for HNS, combined with their high surface area and biocompatibility, enable them to serve as acoustically manipulatable multimodal imaging contrast agents and microrobots. They can perform intravascular multi-trap maneuvering with real-time imaging, purification of wastewater flow, and highly-loaded drug delivery. Given the diverse HNS materials developed to date, this study extends their applications to acoustofluidic and biomedical fields.
DOI URL BibTeX

Deep Models and Optimization Conference Paper Recurrent Distance Filtering for Graph Representation Learning Ding, Y., Orvieto, A., He, B., Hofmann, T. In PMLR, ICML, January 2024 (Published) URL BibTeX

Deep Models and Optimization Conference Paper Super Consistency of Neural Network Landscapes and Learning Rate Transfer Noci, L., Meterez, A., Hofmann, T., Orvieto, A. In Proceedings of the Thirty-Eighth Annual Conference on Neural Information Processing Systems, Thirty-Eighth Annual Conference on Neural Information Processing Systems, January 2024 (Published) URL BibTeX

Haptic Intelligence Miscellaneous Adapting a High-Fidelity Simulation of Human Skin for Comparative Touch Sensing in the Elephant Trunk Schulz, A., Serhat, G., Kuchenbecker, K. J. 64(Supplement_1):S458-S459, Abstract presented at the Society for Integrative and Comparative Biology Annual Meeting (SICB), Seattle, USA, January 2024 (Published)
Skin is a complex biological composite consisting of layers with distinct mechanical properties, morphologies, and mechanosensory capabilities. This work seeks to expand the comparative biomechanics field to comparative haptics, analyzing elephant trunk touch by redesigning a previously published human finger-pad model with morphological parameters measured from an elephant trunk. The dorsal surface of the elephant trunk has a thick, wrinkled epidermis covered with whiskers at the distal tip and deep folds at the proximal base. We hypothesize that this thick dorsal skin protects the trunk from mechanical damage but significantly dulls its tactile sensing ability. To facilitate safe and dexterous motion, the distributed dorsal whiskers might serve as pre-touch antennae, transmitting an amplified version of impending contact to the mechanoreceptors beneath the elephant's armor. We tested these hypotheses by simulating soft tissue deformation through high-fidelity finite element analyses involving representative skin layers and whiskers, modeled based on frozen African elephant trunk (Loxodonta africana) morphology. For a typical contact force, quintupling the stratum corneum thickness to match dorsal trunk skin reduces the von Mises stress communicated to the dermis by 18%. However, adding a whisker offsets this dulled sensing, as hypothesized, amplifying the stress by a factor of more than 15 at the same location. We hope this work will motivate further investigations of mammalian touch using approaches and models from the ample literature on human touch.
DOI BibTeX

Perceiving Systems Conference Paper Adversarial Likelihood Estimation With One-Way Flows Ben-Dov, O., Gupta, P. S., Abrevaya, V., Black, M. J., Ghosh, P. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 3779-3788, January 2024 (Published)
Generative Adversarial Networks (GANs) can produce high-quality samples, but do not provide an estimate of the probability density around the samples. However, it has been noted that maximizing the log-likelihood within an energy-based setting can lead to an adversarial framework where the discriminator provides unnormalized density (often called energy). We further develop this perspective, incorporate importance sampling, and show that 1) the Wasserstein GAN yields a biased estimate of the partition function, and we propose instead to use an unbiased estimator; and 2) when optimizing for likelihood, one must maximize generator entropy. This is hypothesized to provide a better mode coverage. Different from previous works, we explicitly compute the density of the generated samples. This is the key enabler to designing an unbiased estimator of the partition function and computation of the generator entropy term. The generator density is obtained via a new type of flow network, called one-way flow network, that is less constrained in terms of architecture, as it does not require a tractable inverse function. Our experimental results show that our method converges faster, produces comparable sample quality to GANs with similar architecture, successfully avoids over-fitting to commonly used datasets and produces smooth low-dimensional latent representations of the training data.
pdf arXiv BibTeX
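The unbiased partition-function estimation mentioned in the abstract above can be sketched with plain importance sampling, Ẑ = (1/n) Σᵢ exp(-E(xᵢ))/q(xᵢ) with xᵢ ~ q. The paper uses the generator itself as the proposal; here q is an arbitrary stand-in density, so this is only an illustration of the estimator, not the method:

```python
import numpy as np

def log_partition_estimate(energy, sample_q, log_q, n=10_000, seed=0):
    """Unbiased importance-sampling estimate of Z = integral exp(-E(x)) dx,
    using Z = E_{x~q}[exp(-E(x)) / q(x)]; computed in log space for
    numerical stability. `sample_q(rng, n)` draws n proposal samples and
    `log_q(x)` evaluates the proposal's log-density."""
    rng = np.random.default_rng(seed)
    x = sample_q(rng, n)
    log_w = -energy(x) - log_q(x)          # log importance weights
    return np.logaddexp.reduce(log_w) - np.log(n)
```

For example, with the Gaussian energy E(x) = x²/2 the true value is log Z = log √(2π) ≈ 0.919, and a wider Gaussian proposal recovers it to within Monte Carlo error.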

Social Foundations of Computation Algorithms and Society Conference Paper Causal Inference out of Control: Estimating Performativity without Treatment Randomization Cheng, G., Hardt, M., Mendler-Dünner, C. In Proceedings of the 41st International Conference on Machine Learning (ICML 2024), PMLR, The Forty-First International Conference on Machine Learning (ICML), January 2024 (Published)
Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on user consumption. In pursuit of estimating this effect from observational data, we identify a set of assumptions that permit causal identifiability without assuming randomized platform actions. Our results are applicable to platforms that rely on machine-learning-powered predictions and leverage knowledge from historical data. The key novelty of our approach is to explicitly model the dynamics of consumption over time, exploiting the repeated interaction of digital platforms with their participants to prove our identifiability results. By viewing the platform as a controller acting on a dynamical system, we can show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying the causal effect of interest. We complement our claims with an analysis of ready-to-use finite sample estimators and empirical investigations. More broadly, our results deriving identifiability conditions tailored to digital platform settings illustrate a fruitful interplay of control theory and causal inference.
arXiv URL BibTeX

Robust Machine Learning Conference Paper Effective pruning of web-scale datasets based on complexity of concept clusters Abbas, A., Rusak, E., Tirumala, K., Brendel, W., Chaudhuri, K., Morcos, A. S. In January 2024 (Published) arXiv BibTeX

Haptic Intelligence Article How Should Robots Exercise with People? Robot-Mediated Exergames Win with Music, Social Analogues, and Gameplay Clarity Fitter, N. T., Mohan, M., Preston, R. C., Johnson, M. J., Kuchenbecker, K. J. Frontiers in Robotics and AI, 10(1155837):1-18, January 2024 (Published)
The modern worldwide trend toward sedentary behavior comes with significant health risks. An accompanying wave of health technologies has tried to encourage physical activity, but these approaches often yield limited use and retention. Due to their unique ability to serve as both a health-promoting technology and a social peer, we propose robots as a game-changing solution for encouraging physical activity. This article analyzes the eight exergames we previously created for the Rethink Baxter Research Robot in terms of four key components that are grounded in the video-game literature: repetition, pattern matching, music, and social design. We use these four game facets to assess gameplay data from 40 adult users who each experienced the games in balanced random order. In agreement with prior research, our results show that relevant musical cultural references, recognizable social analogues, and gameplay clarity are good strategies for taking an otherwise highly repetitive physical activity and making it engaging and popular among users. Others who study socially assistive robots and rehabilitation robotics can benefit from this work by considering the presented design attributes to generate future hypotheses and by using our eight open-source games to pursue follow-up work on social-physical exercise with robots.
DOI BibTeX

Perceiving Systems Conference Paper Human Hair Reconstruction with Strand-Aligned 3D Gaussians Zakharov, E., Sklyarova, V., Black, M. J., Nam, G., Thies, J., Hilliges, O. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, January 2024 (Published)
We introduce a new hair modeling method that uses a dual representation of classical hair strands and 3D Gaussians to produce accurate and realistic strand-based reconstructions from multi-view data. In contrast to recent approaches that leverage unstructured Gaussians to model human avatars, our method reconstructs the hair using 3D polylines, or strands. This fundamental difference allows the use of the resulting hairstyles out-of-the-box in modern computer graphics engines for editing, rendering, and simulation. Our 3D lifting method relies on unstructured Gaussians to generate multi-view ground truth data to supervise the fitting of hair strands. The hairstyle itself is represented in the form of the so-called strand-aligned 3D Gaussians. This representation allows us to combine strand-based hair priors, which are essential for realistic modeling of the inner structure of hairstyles, with the differentiable rendering capabilities of 3D Gaussian Splatting. Our method, named Gaussian Haircut, is evaluated on synthetic and real scenes and demonstrates state-of-the-art performance in the task of strand-based hair reconstruction.
pdf project code video arXiv DOI URL BibTeX

Empirical Inference Conference Paper Multi-channel free space optical convolutions with incoherent light Song, A., Kottapalli, S. N. M., Schölkopf, B., Fischer, P. AI and Optical Data Sciences V, PC12903:PC129030I, (Editors: Ken-ichi Kitayama and Volker J. Sorger), SPIE, January 2024 (Published) DOI BibTeX

Empirical Inference Article Network propagation for GWAS analysis: a practical guide to leveraging molecular networks for disease gene discovery Visonà, G., Bouzigon, E., Demenais, F., Schweikert, G. Briefings in Bioinformatics, 25(2), January 2024 (Published) DOI BibTeX

Empirical Inference Optics and Sensing Laboratory Conference Paper Polarization-based non-linear deep diffractive neural networks Kottapalli, S. N. M., Schlieder, L., Song, A., Volchkov, V., Schölkopf, B., Fischer, P. AI and Optical Data Sciences V, PC12903:PC129030B, (Editors: Ken-ichi Kitayama and Volker J. Sorger), SPIE, January 2024 (Published) DOI BibTeX

Haptic Intelligence Article Robust Surface Recognition with the Maximum Mean Discrepancy: Degrading Haptic-Auditory Signals Through Bandwidth and Noise Khojasteh, B., Shao, Y., Kuchenbecker, K. J. IEEE Transactions on Haptics, 17(1):58-65, January 2024, Presented at the IEEE Haptics Symposium (Published)
Sliding a tool across a surface generates rich sensations that can be analyzed to recognize what is being touched. However, the optimal configuration for capturing these signals remains unclear. To bridge this gap, we consider haptic-auditory data as a human explores surfaces with different steel tools, including accelerations of the tool and finger, force and torque applied to the surface, and contact sounds. Our classification pipeline uses the maximum mean discrepancy (MMD) to quantify differences between data distributions in a high-dimensional space for inference. With recordings from three hemispherical tool diameters and ten diverse surfaces, we conducted two degradation studies by decreasing sensing bandwidth and increasing added noise. We evaluate the haptic-auditory recognition performance achieved with the MMD by comparing newly gathered data to each surface in our known library. The results indicate that acceleration signals alone have great potential for high-accuracy surface recognition and are robust against noise contamination. The optimal accelerometer bandwidth exceeds 1000 Hz, suggesting that useful vibrotactile information extends beyond the range of human perception. Finally, smaller tool tips generate contact vibrations with better noise robustness. The provided sensing guidelines may enable superhuman performance in portable surface recognition, which could benefit quality control, material documentation, and robotics.
DOI BibTeX
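As an illustration of the comparison step described above, a squared MMD between two sample sets can be estimated directly from kernel evaluations. This is a generic sketch with made-up feature vectors, not the authors' recording pipeline; the Gaussian kernel, its bandwidth, and the feature dimension are all assumptions.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise Gaussian kernel values between the rows of x and y
    d2 = (x**2).sum(1)[:, None] + (y**2).sum(1)[None, :] - 2.0 * x @ y.T
    return np.exp(-d2 / (2.0 * sigma**2))

def mmd2(x, y, sigma=1.0):
    # Biased (V-statistic) estimate of the squared MMD between two
    # sample sets; near zero when both come from the same distribution
    k = gaussian_kernel
    return k(x, x, sigma).mean() + k(y, y, sigma).mean() - 2.0 * k(x, y, sigma).mean()

rng = np.random.default_rng(0)
library = rng.normal(0.0, 1.0, size=(200, 8))   # features of a known surface
same = rng.normal(0.0, 1.0, size=(200, 8))      # new recording, same surface
other = rng.normal(2.0, 1.0, size=(200, 8))     # recording of a different surface

# The matched surface yields the smaller discrepancy
print(mmd2(library, same) < mmd2(library, other))  # True
```

Classifying a new recording then amounts to computing this discrepancy against each surface in the known library and picking the smallest value.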

Empirical Inference Article Towards fully covariant machine learning Villar, S., Hogg, D. W., Yao, W., Kevrekidis, G. A., Schölkopf, B. Transactions on Machine Learning Research, January 2024 (Published) URL BibTeX

Haptic Intelligence Materials Miscellaneous Whiskers That Don’t Whisk: Unique Structure From the Absence of Actuation in Elephant Whiskers Schulz, A., Kaufmann, L., Brecht, M., Richter, G., Kuchenbecker, K. J. 64(Supplement_1):S459, Abstract presented at the Society for Integrative and Comparative Biology Annual Meeting (SICB), Seattle, USA, January 2024 (Published)
Whiskers are so named because these hairs often actuate circularly, whisking, via collagen wrapping at the root of the hair follicle to increase their sensing volumes. Elephant trunks are a unique case study for whiskers, as the dorsal and lateral sections of the elephant proboscis have scattered sensory hairs that lack individual actuation. We hypothesize that the actuation limitations of these non-whisking whiskers led to anisotropic morphology and non-homogeneous composition to meet the animal's sensory needs. To test these hypotheses, we examined trunk whiskers from a 35-year-old female African savannah elephant (Loxodonta africana). Whisker morphology was evaluated through micro-CT and polarized light microscopy. The whiskers from the distal tip of the trunk were found to be axially asymmetric, with an oval cross-section at the root, shifting to a near-square cross-section at the point. Nanoindentation and additional microscopy revealed that elephant whiskers have a composition unlike any other mammalian hair ever studied: we recorded an elastic modulus of 3 GPa at the root and 0.05 GPa at the point of a single 4-cm-long whisker. This work challenges the assumption that hairs have circular cross-sections and isotropic mechanical properties. With such striking differences compared to other mammals, including the mouse (Mus musculus), rat (Rattus norvegicus), and cat (Felis catus), we conclude that whisker morphology and composition play distinct and complementary roles in elephant trunk mechanosensing.
DOI BibTeX

Perceiving Systems Article HMP: Hand Motion Priors for Pose and Shape Estimation from Video Duran, E., Kocabas, M., Choutas, V., Fan, Z., Black, M. J. Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), January 2024 (Published)
Understanding how humans interact with the world necessitates accurate 3D hand pose estimation, a task complicated by the hand's high degree of articulation, frequent occlusions, self-occlusions, and rapid motions. While most existing methods rely on single-image inputs, videos offer useful cues for addressing the aforementioned issues. However, existing video-based 3D hand datasets are insufficient for training feedforward models to generalize to in-the-wild scenarios. On the other hand, we have access to large human motion capture datasets which also include hand motions, e.g. AMASS. Therefore, we develop a generative motion prior specific to hands, trained on the AMASS dataset, which features diverse and high-quality hand motions. This motion prior is then employed for video-based 3D hand motion estimation following a latent optimization approach. Our integration of a robust motion prior significantly enhances performance, especially in occluded scenarios. It produces stable, temporally consistent results that surpass conventional single-frame methods. We demonstrate our method's efficacy via qualitative and quantitative evaluations on the HO3D and DexYCB datasets, with special emphasis on an occlusion-focused subset of HO3D.
webpage pdf code BibTeX

Haptic Intelligence Miscellaneous MPI-10: Haptic-Auditory Measurements from Tool-Surface Interactions Khojasteh, B., Shao, Y., Kuchenbecker, K. J. Dataset published as a companion to the journal article "Robust Surface Recognition with the Maximum Mean Discrepancy: Degrading Haptic-Auditory Signals through Bandwidth and Noise" in IEEE Transactions on Haptics, January 2024 (Published) DOI BibTeX

Learning and Dynamical Systems Article A Pontryagin Perspective on Reinforcement Learning Eberhard, O., Vernade, C., Muehlebach, M. Learning for Dynamics and Control Conference, 2024 (Submitted) URL BibTeX

Physical Intelligence Article A simple quantitative model of neuromodulation, Part I: Ion flow through neural ion channels Werneck, L., Han, M., Yildiz, E., Keip, M., Sitti, M., Ortiz, M. Journal of the Mechanics and Physics of Solids, 182:105457, 2024 (Published)
We develop a simple model of ionic current through neuronal membranes as a function of membrane potential and extracellular ion concentration. The model combines a simplified Poisson–Nernst–Planck (PNP) model of ion transport through individual ion channels with channel activation functions calibrated from ad hoc in-house experimental data. The simplified PNP model is validated against bacterial gramicidin A ion channel data. The calibrated model accounts for the transport of calcium, sodium, potassium, and chloride and exhibits remarkable agreement with the experimentally measured current–voltage curves for the differentiated human neural cells.
DOI BibTeX
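For orientation, the equilibrium (Nernst) potential links the two quantities the model above takes as inputs, membrane potential and extracellular ion concentration. The snippet below computes it for potassium with textbook concentration values; it is background context for ion-channel modeling, not the paper's PNP solver, and the concentration values are illustrative assumptions.

```python
import numpy as np

R = 8.314       # gas constant, J/(mol*K)
F = 96485.0     # Faraday constant, C/mol
T = 310.0       # absolute temperature, K (~body temperature)

def nernst_potential(c_out, c_in, z):
    """Equilibrium (Nernst) potential in volts for an ion of valence z,
    given extracellular and intracellular concentrations (same units)."""
    return (R * T) / (z * F) * np.log(c_out / c_in)

# Textbook potassium concentrations: ~5 mM outside, ~140 mM inside
v_k = nernst_potential(5.0, 140.0, z=1)
print(f"{v_k * 1000:.1f} mV")  # about -89 mV
```

At this potential the diffusive and electrical driving forces on the ion balance, so it sets the zero-current point that any current-voltage model of a single-ion channel must pass through.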

Materials Article Adsorption on Inkjet-Printable Polyelectrolyte Hydrogels Allows Refractive Index Sensing of Diclofenac and Metoprolol in Aqueous Solution Southan, A., Tan, J., Schuster, F., Rotenberger, J., Tovar, G. E. M. ACS Applied Polymer Materials, 6(10):6010-6021, 2024 (Published) pdf DOI URL BibTeX

Embodied Vision Conference Paper Analytical Uncertainty-Based Loss Weighting in Multi-Task Learning Kirchdorfer, L., Elich, C., Kutsche, S., Stuckenschmidt, H., Schott, L., Köhler, J. M. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2024, to appear (To be published) BibTeX

Embodied Vision Article Attention Normalization Impacts Cardinality Generalization in Slot Attention Krimmel, M., Achterhold, J., Stueckler, J. In Transactions on Machine Learning Research (TMLR), 2024 (Published)
Object-centric scene decompositions are important representations for downstream tasks in fields such as computer vision and robotics. The recently proposed Slot Attention module, already leveraged by several derivative works for image segmentation and object tracking in videos, is a deep learning component which performs unsupervised object-centric scene decomposition on input images. It is based on an attention architecture in which latent slot vectors, which hold compressed information on objects, attend to localized perceptual features from the input image. In this paper, we demonstrate that design decisions on normalizing the aggregated values in the attention architecture have considerable impact on the ability of Slot Attention to generalize to a higher number of slots and objects than seen during training. We propose and investigate alternatives to the original normalization scheme which increase the generalization capabilities of Slot Attention to varying slot and object counts, resulting in performance gains on the task of unsupervised image segmentation. The newly proposed normalizations represent minimal, easy-to-implement modifications of the usual Slot Attention module, changing the value aggregation mechanism from a weighted mean operation to a scaled weighted sum operation.
preprint video source code URL BibTeX
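The change in value aggregation described above can be sketched schematically: the original module normalizes each slot's attention weights over the inputs (a weighted mean), whereas a scaled weighted sum skips that per-slot normalization and divides by a fixed count instead. The toy NumPy sketch below only contrasts the two operations; the paper's exact scaling and attention layout may differ.

```python
import numpy as np

def aggregate_weighted_mean(attn, values):
    # Original Slot Attention update: normalize each slot's attention
    # weights over the inputs, then take the weighted mean of the values
    w = attn / (attn.sum(axis=1, keepdims=True) + 1e-8)
    return w @ values

def aggregate_scaled_sum(attn, values, num_inputs):
    # Alternative aggregation: an unnormalized weighted sum, scaled by a
    # fixed input count so slot updates stay on a comparable scale
    return (attn @ values) / num_inputs

# One slot attending to two input features with attention weights 1 and 3
attn = np.array([[1.0, 3.0]])       # shape: (slots, inputs)
values = np.array([[1.0], [2.0]])   # shape: (inputs, feature_dim)
print(aggregate_weighted_mean(attn, values))   # [[1.75]]
print(aggregate_scaled_sum(attn, values, 2))   # [[3.5]]
```

The weighted mean makes each slot's update independent of the total attention mass it receives, while the scaled sum lets that mass flow through, which is the kind of difference that matters when the number of slots or objects changes at test time.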

Learning and Dynamical Systems Article Balancing a 3D Inverted Pendulum using Remote Magnetic Manipulation Zughaibi, J., Nelson, B. J., Muehlebach, M. Robotics and Automation Letters, 2024 (In revision) URL BibTeX

Physical Intelligence Article Clinical translation of wireless soft robotic medical devices Wang, T., Wu, Y., Yildiz, E., Kanyas, S., Sitti, M. Nature Reviews Bioengineering, 2024 (Published)
Small-scale wireless soft robotics can be designed as implantable, interventional or wearable devices for various biomedical applications. Their flexibility, dexterity, adaptability and safe interactions with biological environments make them promising candidates for enabling precise and remote healthcare and disease diagnosis. However, the clinical translation of wireless soft robotic medical devices remains challenging. In this Review, we provide a comprehensive overview of the robotic technologies, the navigation methods, the dexterous functions and the translational challenges of wireless soft robotic medical devices. We first discuss safety and biocompatibility from a biological and technical perspective and then examine navigation methods for overcoming biological barriers for delivery, mobility and retrieval, highlighting dexterous medical functions at small scales. Finally, we identify key product development challenges, as well as the regulatory and ethical considerations that should be addressed to enable the clinical translation of wireless soft robotic medical devices.
DOI BibTeX

Modern Magnetic Systems Article Coherent magnons with giant nonreciprocity at nanoscale wavelengths Gallardo, R., Weigand, M., Schultheiss, K., Kakay, A., Mattheis, R., Raabe, J., Schütz, G., Deac, A., Lindner, J., Wintz, S. ACS Nano, 18(7):5249-5257, American Chemical Society, Washington, DC, 2024 DOI BibTeX

Learning and Dynamical Systems Conference Paper Conformal Performance Range Prediction for Segmentation Output Quality Control Wundram, A., Fischer, P., Muehlebach, M., Koch, L., Baumgartner, C. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2024 (Published) BibTeX