Publications

DEPARTMENTS

Emperical Interference

Haptic Intelligence

Modern Magnetic Systems

Perceiving Systems

Physical Intelligence

Robotic Materials

Social Foundations of Computation


Research Groups

Autonomous Vision

Autonomous Learning

Bioinspired Autonomous Miniature Robots

Dynamic Locomotion

Embodied Vision

Human Aspects of Machine Learning

Intelligent Control Systems

Learning and Dynamical Systems

Locomotion in Biorobotic and Somatic Systems

Micro, Nano, and Molecular Systems

Movement Generation and Control

Neural Capture and Synthesis

Physics for Inference and Optimization

Organizational Leadership and Diversity

Probabilistic Learning Group


Topics

Robot Learning

Conference Paper

2022

Autonomous Learning

Robotics

AI

Career

Award


Empirical Inference Perceiving Systems Conference Paper Can Large Language Models Understand Symbolic Graphics Programs? Qiu, Z., Liu, W., Feng, H., Liu, Z., Xiao, T. Z., Collins, K. M., Tenenbaum, J. B., Weller, A., Black, M. J., Schölkopf, B. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
Against the backdrop of enthusiasm for large language models (LLMs), there is a growing need to scientifically assess their capabilities and shortcomings. This is nontrivial in part because it is difficult to find tasks which the models have not encountered during training. Utilizing symbolic graphics programs, we propose a domain well-suited to test multiple spatial-semantic reasoning skills of LLMs. Popular in computer graphics, these programs procedurally generate visual data. While LLMs exhibit impressive skills in general program synthesis and analysis, symbolic graphics programs offer a new layer of evaluation: they allow us to test an LLM’s ability to answer semantic questions about the images or 3D geometries without a vision encoder. To semantically understand the symbolic programs, LLMs would need to possess the ability to “imagine” and reason how the corresponding graphics content would look with only the symbolic description of the local curvatures and strokes. We use this task to evaluate LLMs by creating a large benchmark for the semantic visual understanding of symbolic graphics programs, built procedurally with minimal human effort. Particular emphasis is placed on transformations of images that leave the image level semantics invariant while introducing significant changes to the underlying program. We evaluate commercial and open-source LLMs on our benchmark to assess their ability to reason about visual output of programs, finding that LLMs considered stronger at reasoning generally perform better. Lastly, we introduce a novel method to improve this ability – Symbolic Instruction Tuning (SIT), in which the LLM is finetuned with pre-collected instruction data on symbolic graphics programs. Interestingly, we find that SIT not only improves LLM’s understanding on symbolic programs, but it also improves general reasoning ability on various other benchmarks.
arXiv Paper BibTeX

Empirical Inference Perceiving Systems Conference Paper Efficient Diversity-Preserving Diffusion Alignment via Gradient-Informed GFlowNets Liu, Z., Xiao, T. Z., Liu, W., Bengio, Y., Zhang, D. The Thirteenth International Conference on Learning Representations (ICLR), April 2025 (Published)
While one commonly trains large diffusion models by collecting datasets on target downstream tasks, it is often desired to align and finetune pretrained diffusion models with some reward functions that are either designed by experts or learned from small-scale datasets. Existing post-training methods for reward finetuning of diffusion models typically suffer from lack of diversity in generated samples, lack of prior preservation, and/or slow convergence in finetuning. Inspired by recent successes in generative flow networks (GFlowNets), a class of probabilistic models that sample with the unnormalized density of a reward function, we propose a novel GFlowNet method dubbed Nabla-GFlowNet (abbreviated as ∇-GFlowNet), the first GFlowNet method that leverages the rich signal in reward gradients, together with an objective called ∇-DB plus its variant residual ∇-DB designed for prior-preserving diffusion finetuning. We show that our proposed method achieves fast yet diversity- and prior-preserving finetuning of Stable Diffusion, a large-scale text-conditioned image diffusion model, on different realistic reward functions.
arXiv BibTeX

Perceiving Systems Conference Paper Airship Formations for Animal Motion Capture and Behavior Analysis Price, E., Ahmad, A. Proceedings 2nd International Conference on Design and Engineering of Lighter-Than-Air systems (DELTAS2024), 2nd International Conference on Design and Engineering of Lighter-Than-Air systems (DELTAS2024), June 2024 (Published)
Using UAVs for wildlife observation and motion capture offers manifold advantages for studying animals in the wild, especially grazing herds in open terrain. The aerial perspective allows observation at a scale and depth that is not possible on the ground, offering new insights into group behavior. However, the very nature of wildlife field-studies puts traditional fixed wing and multi-copter systems to their limits: limited flight time, noise and safety aspects affect their efficacy, where lighter than air systems can remain on station for many hours. Nevertheless, airships are challenging from a ground handling perspective as well as from a control point of view, being voluminous and highly affected by wind. In this work, we showcase a system designed to use airship formations to track, follow, and visually record wild horses from multiple angles, including airship design, simulation, control, on board computer vision, autonomous operation and practical aspects of field experiments.
arXiv URL BibTeX

Perceiving Systems Conference Paper Optimizing the 3D Plate Shape for Proximal Humerus Fractures Keller, M., Krall, M., Smith, J., Clement, H., Kerner, A. M., Gradischar, A., Schäfer, Ü., Black, M. J., Weinberg, A., Pujades, S. International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 487-496, Springer, Cham, MICCAI, October 2023 (Published)
To treat bone fractures, implant manufacturers produce 2D anatomically contoured plates. Unfortunately, existing plates only fit a limited segment of the population and/or require manual bending during surgery. Patient-specific implants would provide major benefits such as reducing surgery time and improving treatment outcomes but they are still rare in clinical practice. In this work, we propose a patient-specific design for the long helical 2D PHILOS (Proximal Humeral Internal Locking System) plate, used to treat humerus shaft fractures. Our method automatically creates a custom plate from a CT scan of a patient's bone. We start by designing an optimal plate on a template bone and, with an anatomy-aware registration method, we transfer this optimal design to any bone. In addition, for an arbitrary bone, our method assesses if a given plate is fit for surgery by automatically positioning it on the bone. We use this process to generate a compact set of plate shapes capable of fitting the bones within a given population. This plate set can be pre-printed in advance and readily available, removing the fabrication time between the fracture occurrence and the surgery. Extensive experiments on ex-vivo arms and 3D-printed bones show that the generated plate shapes (personalized and plate-set) faithfully match the individual bone anatomy and are suitable for clinical practice.
Project page Code Paper Poster DOI URL BibTeX

Perceiving Systems Conference Paper Synthetic Data-Based Detection of Zebras in Drone Imagery Bonetto, E., Ahmad, A. 2023 European Conference on Mobile Robots (ECMR), 1-8, IEEE, ECMR, September 2023 (Published)
Nowadays, there is a wide availability of datasets that enable the training of common object detectors or human detectors. These come in the form of labelled real-world images and require either a significant amount of human effort, with a high probability of errors such as missing labels, or very constrained scenarios, e.g. VICON systems. On the other hand, uncommon scenarios, like aerial views, animals, like wild zebras, or difficult-to-obtain information, such as human shapes, are hardly available. To overcome this, synthetic data generation with realistic rendering technologies has recently gained traction and advanced research areas such as target tracking and human pose estimation. However, subjects such as wild animals are still usually not well represented in such datasets. In this work, we first show that a pre-trained YOLO detector can not identify zebras in real images recorded from aerial viewpoints. To solve this, we present an approach for training an animal detector using only synthetic data. We start by generating a novel synthetic zebra dataset using GRADE, a state-of-the-art framework for data generation. The dataset includes RGB, depth, skeletal joint locations, pose, shape and instance segmentations for each subject. We use this to train a YOLO detector from scratch. Through extensive evaluations of our model with real-world data from i) limited datasets available on the internet and ii) a new one collected and manually labelled by us, we show that we can detect zebras by using only synthetic data during training. The code, results, trained models, and both the generated and training data are provided as open-source at https://eliabntt.github.io/grade-rr.
Generation code pdf DOI URL BibTeX

Perceiving Systems Empirical Inference Conference Paper MeshDiffusion: Score-based Generative 3D Mesh Modeling Liu, Z., Feng, Y., Black, M. J., Nowrouzezahrai, D., Paull, L., Liu, W. The Eleventh International Conference on Learning Representations (ICLR), ICLR, May 2023 (Published)
We consider the task of generating realistic 3D shapes, which is useful for a variety of applications such as automatic scene generation and physical simulation. Compared to other 3D representations like voxels and point clouds, meshes are more desirable in practice, because (1) they enable easy and arbitrary manipulation of shapes for relighting and simulation, and (2) they can fully leverage the power of modern graphics pipelines which are mostly optimized for meshes. Previous scalable methods for generating meshes typically rely on sub-optimal post-processing, and they tend to produce overly-smooth or noisy surfaces without fine-grained geometric details. To overcome these shortcomings, we take advantage of the graph structure of meshes and use a simple yet very effective generative modeling method to generate 3D meshes. Specifically, we represent meshes with deformable tetrahedral grids, and then train a diffusion model on this direct parametrization. We demonstrate the effectiveness of our model on multiple generative tasks.
Home Code URL BibTeX

Perceiving Systems Conference Paper DART: Articulated Hand Model with Diverse Accessories and Rich Textures Gao, D., Xiu, Y., Li, K., Yang, L., Wang, F., Zhang, P., Zhang, B., Lu, C., Tan, P. Thirty-sixth Conference on Neural Information Processing Systems (NeurIPS), NeurIPS 2022-Datasets and Benchmarks Track, November 2022 (Published)
Hand, the bearer of human productivity and intelligence, is receiving much attention due to the recent fever of 3D digital avatars. Among different hand morphable models, MANO has been widely used in various vision & graphics tasks. However, MANO disregards textures and accessories, which largely limits its power to synthesize photorealistic & lifestyle hand data. In this paper, we extend MANO with more Diverse Accessories and Rich Textures, namely DART. DART is comprised of 325 exquisite hand-crafted texture maps which vary in appearance and cover different kinds of blemishes, make-ups, and accessories. We also provide the Unity GUI which allows people to render hands with user-specific settings, e.g. pose, camera, background, lighting, and DART textures. In this way, we generate large-scale (800K), diverse, and high-fidelity hand images, paired with perfect-aligned 3D labels, called DARTset. Experiments demonstrate its superiority in generalization and diversity. As a great complement to existing datasets, DARTset could boost hand pose estimation & surface reconstruction tasks. DART and Unity software will be publicly available for research purposes.
Home Code Video BibTeX

Perceiving Systems Conference Paper Deep Residual Reinforcement Learning based Autonomous Blimp Control Liu, Y. T., Price, E., Black, M. J., Ahmad, A. 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022), 12566-12573, IEEE, Piscataway, NJ, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2022) , October 2022 (Published)
Blimps are well suited to perform long-duration aerial tasks as they are energy efficient, relatively silent and safe. To address the blimp navigation and control task, in previous work we developed a hardware and software-in-the-loop framework and a PID-based controller for large blimps in the presence of wind disturbance. However, blimps have a deformable structure and their dynamics are inherently non-linear and time-delayed, making PID controllers difficult to tune. Thus, often resulting in large tracking errors. Moreover, the buoyancy of a blimp is constantly changing due to variations in ambient temperature and pressure. To address these issues, in this paper we present a learning-based framework based on deep residual reinforcement learning (DRRL), for the blimp control task. Within this framework, we first employ a PID controller to provide baseline performance. Subsequently, the DRRL agent learns to modify the PID decisions by interaction with the environment. We demonstrate in simulation that DRRL agent consistently improves the PID performance. Through rigorous simulation experiments, we show that the agent is robust to changes in wind speed and buoyancy. In real-world experiments, we demonstrate that the agent, trained only in simulation, is sufficiently robust to control an actual blimp in windy conditions. We openly provide the source code of our approach at https://github.com/robot-perception-group/AutonomousBlimpDRL .
DOI BibTeX

Perceiving Systems Conference Paper Collaborative Regression of Expressive Bodies using Moderation Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M. J. 2021 International Conference on 3D Vision (3DV 2021), 792-804, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2021), December 2021 (Published)
Recovering expressive humans from images is essential for understanding human behavior. Methods that estimate 3D bodies, faces, or hands have progressed significantly, yet separately. Face methods recover accurate 3D shape and geometric details, but need a tight crop and struggle with extreme views and low resolution. Whole-body methods are robust to a wide range of poses and resolutions, but provide only a rough 3D face shape without details like wrinkles. To get the best of both worlds, we introduce PIXIE, which produces animatable, whole-body 3D avatars with realistic facial detail, from a single image. For this, PIXIE uses two key observations. First, existing work combines independent estimates from body, face, and hand experts, by trusting them equally. PIXIE introduces a novel moderator that merges the features of the experts, weighted by their confidence. All part experts can contribute to the whole, using SMPL-X’s shared shape space across all body parts. Second, human shape is highly correlated with gender, but existing work ignores this. We label training images as male, female, or non-binary, and train PIXIE to infer “gendered” 3D body shapes with a novel shape loss. In addition to 3D body pose and shape parameters, PIXIE estimates expression, illumination, albedo and 3D facial surface displacements. Quantitative and qualitative evaluation shows that PIXIE estimates more accurate whole-body shape and detailed face shape than the state of the art. Models and code are available at https://pixie.is.tue.mpg.de.
arXiv project pdf suppl DOI BibTeX

Perceiving Systems Conference Paper Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation Fan, Z., Spurr, A., Kocabas, M., Tang, S., Black, M. J., Hilliges, O. 2021 International Conference on 3D Vision (3DV 2021), 1-10, IEEE, Piscataway, NJ, International Conference on 3D Vision (3DV 2021), December 2021 (Published)
In natural conversation and interaction, our hands often overlap or are in contact with each other. Due to the homogeneous appearance of hands, this makes estimating the 3D pose of interacting hands from images difficult. In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image. The method consists of two interwoven branches that process the input imagery into a per-pixel semantic part segmentation mask and a visual feature volume. In contrast to prior work, we do not decouple the segmentation from the pose estimation stage, but rather leverage the per-pixel probabilities directly in the downstream pose estimation task. To do so, the part probabilities are merged with the visual features and processed via fully-convolutional layers. We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset for both single and interacting hands across all metrics. We provide detailed ablation studies to demonstrate the efficacy of our method and to provide insights into how the modelling of pixel ownership affects single and interacting hand pose estimation. Our code will be released for research purposes.
arXiv project code video DOI BibTeX

Perceiving Systems Conference Paper Enhancing Deep Semantic Segmentation of RGB-D Data with Entangled Forests Terreran, M., Bonetto, E., Ghidoni, S. 2020 25th International Conference on Pattern Recognition (ICPR 2020), 4634-4641, IEEE, Piscataway, NJ, 25th International Conference on Pattern Recognition (ICPR 2020), January 2021 (Published)
Semantic segmentation is a problem which is getting more and more attention in the computer vision community. Nowadays, deep learning methods represent the state of the art to solve this problem, and the trend is to use deeper networks to get higher performance. The drawback with such models is a higher computational cost, which makes it difficult to integrate them on mobile robot platforms. In this work we want to explore how to obtain lighter deep learning models without compromising performance. To do so we will consider the features used in the 3D Entangled Forests algorithm and we will study the best strategies to integrate these within FuseNet deep network. Such new features allow us to shrink the network size without loosing performance, obtaining hence a lighter model which achieves state-of-the-art performance on the semantic segmentation task and represents an interesting alternative for mobile robotics applications, where computational power and energy are limited.
DOI URL BibTeX

Perceiving Systems Conference Paper GENTEL: GENerating Training data Efficiently for Learning to segment medical images Thakur, R. P., Pujades, S., Goel, L., Pohmann, R., Machann, J., Black, M. J. RFIAP 2020 - Congrés Reconnaissance des Formes, Image, Apprentissage et Perception , Congrès Reconnaissance des Formes, Image, Apprentissage et Perception (RFAIP 2020) , June 2020 (Published)
Accurately segmenting MRI images is crucial for many clinical applications. However, manually segmenting images with accurate pixel precision is a tedious and time consuming task. In this paper we present a simple, yet effective method to improve the efficiency of the image segmentation process. We propose to transform the image annotation task into a binary choice task. We start by using classical image processing algorithms with different parameter values to generate multiple, different segmentation masks for each input MRI image. Then, instead of segmenting the pixels of the images, the user only needs to decide whether a segmentation is acceptable or not. This method allows us to efficiently obtain high quality segmentations with minor human intervention. With the selected segmentations, we train a state-of-the-art neural network model. For the evaluation, we use a second MRI dataset (1.5T Dataset), acquired with a different protocol and containing annotations. We show that the trained network i) is able to automatically segment cases where none of the classical methods obtain a high quality result ; ii) generalizes to the second MRI dataset, which was acquired with a different protocol and was never seen at training time ; and iii) enables detection of miss-annotations in this second dataset. Quantitatively, the trained network obtains very good results: DICE score - mean 0.98, median 0.99- and Hausdorff distance (in pixels) - mean 4.7, median 2.0-.
Project Page PDF URL BibTeX

Empirical Inference Perceiving Systems Probabilistic Learning Group Conference Paper From Variational to Deterministic Autoencoders Ghosh*, P., Sajjadi*, M. S. M., Vergari, A., Black, M. J., Schölkopf, B. 8th International Conference on Learning Representations (ICLR) , April 2020, *equal contribution (Published)
Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce “blurry” images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.
arXiv URL BibTeX

Perceiving Systems Conference Paper Learning to Reconstruct 3D Human Pose and Shape via Model-fitting in the Loop Kolotouros, N., Pavlakos, G., Black, M. J., Daniilidis, K. Proceedings International Conference on Computer Vision (ICCV), 2252-2261, IEEE, 2019 IEEE/CVF International Conference on Computer Vision (ICCV), October 2019, ISSN: 2380-7504 (Published)
Model-based human pose estimation is currently approached through two different paradigms. Optimization-based methods fit a parametric body model to 2D observations in an iterative manner, leading to accurate image-model alignments, but are often slow and sensitive to the initialization. In contrast, regression-based methods, that use a deep network to directly estimate the model parameters from pixels, tend to provide reasonable, but not pixel accurate, results while requiring huge amounts of supervision. In this work, instead of investigating which approach is better, our key insight is that the two paradigms can form a strong collaboration. A reasonable, directly regressed estimate from the network can initialize the iterative optimization making the fitting faster and more accurate. Similarly, a pixel accurate fit from iterative optimization can act as strong supervision for the network. This is the core of our proposed approach SPIN (SMPL oPtimization IN the loop). The deep network initializes an iterative optimization routine that fits the body model to 2D joints within the training loop, and the fitted estimate is subsequently used to supervise the network. Our approach is self-improving by nature, since better network estimates can lead the optimization to better solutions, while more accurate optimization fits provide better supervision for the network. We demonstrate the effectiveness of our approach in different settings, where 3D ground truth is scarce, or not available, and we consistently outperform the state-of-the-art model-based pose estimation approaches by significant margins.
pdf code project DOI BibTeX

Perceiving Systems Conference Paper Markerless Outdoor Human Motion Capture Using Multiple Autonomous Micro Aerial Vehicles Saini, N., Price, E., Tallamraju, R., Enficiaud, R., Ludwig, R., Martinović, I., Ahmad, A., Black, M. Proceedings 2019 IEEE/CVF International Conference on Computer Vision (ICCV), 823-832, IEEE, International Conference on Computer Vision (ICCV), October 2019 (Published)
Capturing human motion in natural scenarios means moving motion capture out of the lab and into the wild. Typical approaches rely on fixed, calibrated, cameras and reflective markers on the body, significantly limiting the motions that can be captured. To make motion capture truly unconstrained, we describe the first fully autonomous outdoor capture system based on flying vehicles. We use multiple micro-aerial-vehicles(MAVs), each equipped with a monocular RGB camera, an IMU, and a GPS receiver module. These detect the person, optimize their position, and localize themselves approximately. We then develop a markerless motion capture method that is suitable for this challenging scenario with a distant subject, viewed from above, with approximately calibrated and moving cameras. We combine multiple state-of-the-art 2D joint detectors with a 3D human body model and a powerful prior on human pose. We jointly optimize for 3D body pose and camera pose to robustly fit the 2D measurements. To our knowledge, this is the first successful demonstration of outdoor, full-body, markerless motion capture from autonomous flying vehicles.
Code Data Video Paper Manuscript DOI BibTeX

Perceiving Systems Conference Paper Energy Conscious Over-actuated Multi-Agent Payload Transport Robot: Simulations and Preliminary Physical Validation Tallamraju, R., Verma, P., Sripada, V., Agrawal, S., Karlapalem, K. 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), 1-7, IEEE, 2019 28th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN), October 2019 (Published)
In this work, we consider a multi-wheeled payload transport system. Each of the wheels can be selectively actuated. When they are not actuated, wheels are free moving and do not consume battery power. The payload transport system is modeled as an actuated multi-agent system, with each wheel-motor pair as an agent. Kinematic and dynamic models are developed to ensure that the payload transport system moves as desired. We design optimization formulations to decide on the number of wheels to be active and which of the wheels to be active so that the battery is conserved and the wear on the motors is reduced. Our multi-level control framework over the agents ensures that near-optimal number of agents is active for the payload transport system to function. Through simulation studies we show that our solution ensures energy efficient operation and increases the distance traveled by the payload transport system, for the same battery power. We have built the payload transport system and provide results for preliminary experimental validation.
DOI BibTeX

Perceiving Systems Conference Paper Efficient Learning on Point Clouds With Basis Point Sets Prokudin, S., Lassner, C., Romero, J. International Conference on Computer Vision, 4332-4341, October 2019
With an increased availability of 3D scanning technology, point clouds are moving into the focus of computer vision as a rich representation of everyday scenes. However, they are hard to handle for machine learning algorithms due to the unordered structure. One common approach is to apply voxelization, which dramatically increases the amount of data stored and at the same time loses details through discretization. Recently, deep learning models with hand-tailored architectures were proposed to handle point clouds directly and achieve input permutation invariance. However, these architectures use an increased number of parameters and are computationally inefficient. In this work we propose basis point sets as a highly efficient and fully general way to process point clouds with machine learning algorithms. Basis point sets are a residual representation that can be computed efficiently and can be used with standard neural network architectures. Using the proposed representation as the input to a relatively simple network allows us to match the performance of PointNet on a shape classification task while using three order of magnitudes less floating point operations. In a second experiment, we show how proposed representation can be used for obtaining high resolution meshes from noisy 3D scans. Here, our network achieves performance comparable to the state-of-the-art computationally intense multi-step frameworks, in one network pass that can be done in less than 1ms.
code pdf BibTeX

Perceiving Systems Conference Paper Distributed, Collaborative Virtual Reality Application for Product Development with Simple Avatar Calibration Method Dixken, M., Diers, D., Wingert, B., Hatzipanayioti, A., Mohler, B. J., Riedel, O., Bues, M. IEEE Conference on Virtual Reality and 3D User Interfaces, (VR), 1299-1300, IEEE, March 2019 (Published)
In this work we present a collaborative virtual reality application for distributed engineering tasks, including a simple avatar calibration method. Use cases for the application include CAD review, ergonomics analyses or virtual training. Full-body avatars with approximately natural motion characteristics are used to improve collaboration in terms of co-presence and embodiment. Through a calibration process, scaled avatars based on body measurements can be generated by users from within the VR application. For motion tracking, only the tracking capabilities of the VR devices used (head mounted display and hand controllers) are required, in order to achieve a low initialization effort for usage of the application. We demonstrate a simple, yet effective method for measuring the human body which requires these VR devices only. Tracking only the user's head and hands, we require inverse kinematics (IK) for avatar motion reconstruction. On the other hand, this requires a calibrated avatar, otherwise body postures and gestures will be misrepresented. In the first step of the demo, two users can create their calibrated avatar at the same time and then carry out a collaborative CAD review in the second step.
DOI BibTeX

Perceiving Systems Conference Paper Deep Directional Statistics: Pose Estimation with Uncertainty Quantification Prokudin, S., Gehler, P., Nowozin, S. European Conference on Computer Vision (ECCV), 9:542-559, September 2018
Modern deep learning systems successfully solve many perception tasks such as object pose estimation when the input image is of high quality. However, in challenging imaging conditions such as on low resolution images or when the image is corrupted by imaging artifacts, current systems degrade considerably in accuracy. While a loss in performance is unavoidable we would like our models to quantify their uncertainty in order to achieve robustness against images of varying quality. Probabilistic deep learning models combine the expressive power of deep learning with uncertainty quantification. In this paper, we propose a novel probabilistic deep learning model for the task of angular regression. Our model uses von Mises distributions to predict a distribution over object pose angle. Whereas a single von Mises distribution is making strong assumptions about the shape of the distribution, we extend the basic model to predict a mixture of von Mises distributions. We show how to learn a mixture model using a finite and infinite number of mixture components. Our model allow for likelihood-based training and efficient inference at test time. We demonstrate on a number of challenging pose estimation datasets that our model produces calibrated probability predictions and competitive or superior point estimates compared to the current state-of-the-art.
code pdf BibTeX

Perceiving Systems Conference Paper Decentralized MPC based Obstacle Avoidance for Multi-Robot Target Tracking Scenarios Tallamraju, R., Rajappa, S., Black, M. J., Karlapalem, K., Ahmad, A. 2018 IEEE International Symposium on Safety, Security, and Rescue Robotics (SSRR), 1-8, IEEE, August 2018 (Published)
In this work, we consider the problem of decentralized multi-robot target tracking and obstacle avoidance in dynamic environments. Each robot executes a local motion planning algorithm which is based on model predictive control (MPC). The planner is designed as a quadratic program, subject to constraints on robot dynamics and obstacle avoidance. Repulsive potential field functions are employed to avoid obstacles. The novelty of our approach lies in embedding these non-linear potential field functions as constraints within a convex optimization framework. Our method convexifies nonconvex constraints and dependencies, by replacing them as pre-computed external input forces in robot dynamics. The proposed algorithm additionally incorporates different methods to avoid field local minima problems associated with using potential field functions in planning. The motion planner does not enforce predefined trajectories or any formation geometry on the robots and is a comprehensive solution for cooperative obstacle avoidance in the context of multi-robot target tracking. We perform simulation studies for different scenarios to showcase the convergence and efficacy of the proposed algorithm.
Published Version DOI URL BibTeX

Perceiving Systems Conference Paper Effects of animation retargeting on perceived action outcomes Kenny, S., Mahmood, N., Honda, C., Black, M. J., Troje, N. F. Proceedings of the ACM Symposium on Applied Perception (SAP’17), 2:1-2:7, September 2017
The individual shape of the human body, including the geometry of its articulated structure and the distribution of weight over that structure, influences the kinematics of a person's movements. How sensitive is the visual system to inconsistencies between shape and motion introduced by retargeting motion from one person onto the shape of another? We used optical motion capture to record five pairs of male performers with large differences in body weight, while they pushed, lifted, and threw objects. Based on a set of 67 markers, we estimated both the kinematics of the actions as well as the performer's individual body shape. To obtain consistent and inconsistent stimuli, we created animated avatars by combining the shape and motion estimates from either a single performer or from different performers. In a virtual reality environment, observers rated the perceived weight or thrown distance of the objects. They were also asked to explicitly discriminate between consistent and hybrid stimuli. Observers were unable to accomplish the latter, but hybridization of shape and motion influenced their judgements of action outcome in systematic ways. Inconsistencies between shape and motion were assimilated into an altered perception of the action outcome.
pdf DOI BibTeX

Perceiving Systems Autonomous Vision Conference Paper Deep Discrete Flow Güney, F., Geiger, A. Asian Conference on Computer Vision (ACCV), 2016 (Accepted) pdf suppmat BibTeX

Perceiving Systems Conference Paper Hough-based Object Detection with Grouped Features Srikantha, A., Gall, J. International Conference on Image Processing, 1653-1657, Paris, France, IEEE International Conference on Image Processing , October 2014
Hough-based voting approaches have been successfully applied to object detection. While these methods can be efficiently implemented by random forests, they estimate the probability for an object hypothesis for each feature independently. In this work, we address this problem by grouping features in a local neighborhood to obtain a better estimate of the probability. To this end, we propose oblique classification-regression forests that combine features of different trees. We further investigate the benefit of combining independent and grouped features and evaluate the approach on RGB and RGB-D datasets.
pdf poster DOI BibTeX

Perceiving Systems Autonomous Vision Conference Paper Omnidirectional 3D Reconstruction in Augmented Manhattan Worlds Schoenbein, M., Geiger, A. International Conference on Intelligent Robots and Systems, 716 - 723, IEEE, Chicago, IL, USA, IEEE/RSJ International Conference on Intelligent Robots and System, October 2014
This paper proposes a method for high-quality omnidirectional 3D reconstruction of augmented Manhattan worlds from catadioptric stereo video sequences. In contrast to existing works we do not rely on constructing virtual perspective views, but instead propose to optimize depth jointly in a unified omnidirectional space. Furthermore, we show that plane-based prior models can be applied even though planes in 3D do not project to planes in the omnidirectional domain. Towards this goal, we propose an omnidirectional slanted-plane Markov random field model which relies on plane hypotheses extracted using a novel voting scheme for 3D planes in omnidirectional space. To quantitatively evaluate our method we introduce a dataset which we have captured using our autonomous driving platform AnnieWAY which we equipped with two horizontally aligned catadioptric cameras and a Velodyne HDL-64E laser scanner for precise ground truth depth measurements. As evidenced by our experiments, the proposed method clearly benefits from the unified view and significantly outperforms existing stereo matching techniques both quantitatively and qualitatively. Furthermore, our method is able to reduce noise and the obtained depth maps can be represented very compactly by a small number of image segments and plane parameters.
pdf DOI BibTeX

Perceiving Systems Autonomous Vision Conference Paper Calibrating and Centering Quasi-Central Catadioptric Cameras Schoenbein, M., Strauss, T., Geiger, A. IEEE International Conference on Robotics and Automation, 4443 - 4450, Hong Kong, China, IEEE International Conference on Robotics and Automation, June 2014
Non-central catadioptric models are able to cope with irregular camera setups and inaccuracies in the manufacturing process but are computationally demanding and thus not suitable for robotic applications. On the other hand, calibrating a quasi-central (almost central) system with a central model introduces errors due to a wrong relationship between the viewing ray orientations and the pixels on the image sensor. In this paper, we propose a central approximation to quasi-central catadioptric camera systems that is both accurate and efficient. We observe that the distance to points in 3D is typically large compared to deviations from the single viewpoint. Thus, we first calibrate the system using a state-of-the-art non-central camera model. Next, we show that by remapping the observations we are able to match the orientation of the viewing rays of a much simpler single viewpoint model with the true ray orientations. While our approximation is general and applicable to all quasi-central camera systems, we focus on one of the most common cases in practice: hypercatadioptric cameras. We compare our model to a variety of baselines in synthetic and real localization and motion estimation experiments. We show that by using the proposed model we are able to achieve near non-central accuracy while obtaining speed-ups of more than three orders of magnitude compared to state-of-the-art non-central models.
pdf DOI BibTeX

Perceiving Systems Autonomous Vision Conference Paper Simultaneous Underwater Visibility Assessment, Enhancement and Improved Stereo Roser, M., Dunbabin, M., Geiger, A. IEEE International Conference on Robotics and Automation, 3840 - 3847 , Hong Kong, China, IEEE International Conference on Robotics and Automation, June 2014
Vision-based underwater navigation and obstacle avoidance demands robust computer vision algorithms, particularly for operation in turbid water with reduced visibility. This paper describes a novel method for the simultaneous underwater image quality assessment, visibility enhancement and disparity computation to increase stereo range resolution under dynamic, natural lighting and turbid conditions. The technique estimates the visibility properties from a sparse 3D map of the original degraded image using a physical underwater light attenuation model. Firstly, an iterated distance-adaptive image contrast enhancement enables a dense disparity computation and visibility estimation. Secondly, using a light attenuation model for ocean water, a color corrected stereo underwater image is obtained along with a visibility distance estimate. Experimental results in shallow, naturally lit, high-turbidity coastal environments show the proposed technique improves range estimation over the original images as well as image quality and color for habitat classification. Furthermore, the recursiveness and robustness of the technique allows real-time implementation onboard an Autonomous Underwater Vehicles for improved navigation and obstacle avoidance performance.
pdf DOI BibTeX

Perceiving Systems Conference Paper Multi-View Priors for Learning Detectors from Sparse Viewpoint Data Pepik, B., Stark, M., Gehler, P., Schiele, B. International Conference on Learning Representations, International Conference on Learning Representations (ICLR), April 2014
While the majority of today's object class models provide only 2D bounding boxes, far richer output hypotheses are desirable including viewpoint, fine-grained category, and 3D geometry estimate. However, models trained to provide richer output require larger amounts of training data, preferably well covering the relevant aspects such as viewpoint and fine-grained categories. In this paper, we address this issue from the perspective of transfer learning, and design an object class model that explicitly leverages correlations between visual features. Specifically, our model represents prior distributions over permissible multi-view detectors in a parametric way -- the priors are learned once from training data of a source object class, and can later be used to facilitate the learning of a detector for a target class. As we show in our experiments, this transfer is not only beneficial for detectors based on basic-level category representations, but also enables the robust learning of detectors that represent classes at finer levels of granularity, where training data is typically even scarcer and more unbalanced. As a result, we report largely improved performance in simultaneous 2D object localization and viewpoint estimation on a recent dataset of challenging street scenes.
reviews pdf BibTeX