Header logo is


2020


Grasping Field: Learning Implicit Representations for Human Grasps
Grasping Field: Learning Implicit Representations for Human Grasps

Karunratanakul, K., Yang, J., Zhang, Y., Black, M., Muandet, K., Tang, S.

In International Conference on 3D Vision (3DV), November 2020 (inproceedings)

Abstract
Robotic grasping of house-hold objects has made remarkable progress in recent years. Yet, human grasps are still difficult to synthesize realistically. There are several key reasons: (1) the human hand has many degrees of freedom (more than robotic manipulators); (2) the synthesized hand should conform to the surface of the object; and (3) it should interact with the object in a semantically and physically plausible manner. To make progress in this direction, we draw inspiration from the recent progress on learning-based implicit representations for 3D object reconstruction. Specifically, we propose an expressive representation for human grasp modelling that is efficient and easy to integrate with deep neural networks. Our insight is that every point in a three-dimensional space can be characterized by the signed distances to the surface of the hand and the object, respectively. Consequently, the hand, the object, and the contact area can be represented by implicit surfaces in a common space, in which the proximity between the hand and the object can be modelled explicitly. We name this 3D to 2D mapping as Grasping Field, parameterize it with a deep neural network, and learn it from data. We demonstrate that the proposed grasping field is an effective and expressive representation for human grasp generation. Specifically, our generative model is able to synthesize high-quality human grasps, given only on a 3D object point cloud. The extensive experiments demonstrate that our generative model compares favorably with a strong baseline and approaches the level of natural human grasps. Furthermore, based on the grasping field representation, we propose a deep network for the challenging task of 3D hand-object interaction reconstruction from a single RGB image. Our method improves the physical plausibility of the hand-object contact reconstruction and achieves comparable performance for 3D hand reconstruction compared to state-of-the-art methods. Our model and code are available for research purpose at https://github.com/korrawe/grasping_field.

ei ps

pdf arXiv code [BibTex]

2020



{GIF}: Generative Interpretable Faces
GIF: Generative Interpretable Faces

Ghosh, P., Gupta, P. S., Uziel, R., Ranjan, A., Black, M. J., Bolkart, T.

In International Conference on 3D Vision (3DV), November 2020 (inproceedings)

Abstract
Photo-realistic visualization and animation of expressive human faces have been a long standing challenge. 3D face modeling methods provide parametric control but generates unrealistic images, on the other hand, generative 2D models like GANs (Generative Adversarial Networks) output photo-realistic face images, but lack explicit control. Recent methods gain partial control, either by attempting to disentangle different factors in an unsupervised manner, or by adding control post hoc to a pre-trained model. Unconditional GANs, however, may entangle factors that are hard to undo later. We condition our generative model on pre-defined control parameters to encourage disentanglement in the generation process. Specifically, we condition StyleGAN2 on FLAME, a generative 3D face model. While conditioning on FLAME parameters yields unsatisfactory results, we find that conditioning on rendered FLAME geometry and photometric details works well. This gives us a generative 2D face model named GIF (Generative Interpretable Faces) that offers FLAME's parametric control. Here, interpretable refers to the semantic meaning of different parameters. Given FLAME parameters for shape, pose, expressions, parameters for appearance, lighting, and an additional style vector, GIF outputs photo-realistic face images. We perform an AMT based perceptual study to quantitatively and qualitatively evaluate how well GIF follows its conditioning. The code, data, and trained model are publicly available for research purposes at http://gif.is.tue.mpg.de

ps

pdf project code [BibTex]

pdf project code [BibTex]


{PLACE}: Proximity Learning of Articulation and Contact in {3D} Environments
PLACE: Proximity Learning of Articulation and Contact in 3D Environments

Zhang, S., Zhang, Y., Ma, Q., Black, M. J., Tang, S.

In International Conference on 3D Vision (3DV), November 2020 (inproceedings)

Abstract
High fidelity digital 3D environments have been proposed in recent years, however, it remains extremely challenging to automatically equip such environment with realistic human bodies. Existing work utilizes images, depth or semantic maps to represent the scene, and parametric human models to represent 3D bodies. While being straight-forward, their generated human-scene interactions often lack of naturalness and physical plausibility. Our key observation is that humans interact with the world through body-scene contact. To synthesize realistic human-scene interactions, it is essential to effectively represent the physical contact and proximity between the body and the world. To that end, we propose a novel interaction generation method, named PLACE(Proximity Learning of Articulation and Contact in 3D Environments), which explicitly models the proximity between the human body and the 3D scene around it. Specifically, given a set of basis points on a scene mesh, we leverage a conditional variational autoencoder to synthesize the minimum distances from the basis points to the human body surface. The generated proximal relationship exhibits which region of the scene is in contact with the person. Furthermore, based on such synthesized proximity, we are able to effectively obtain expressive 3D human bodies that interact with the 3D scene naturally. Our perceptual study shows that PLACE significantly improves the state-of-the-art method, approaching the realism of real human-scene interaction. We believe our method makes an important step towards the fully automatic synthesis of realistic 3D human bodies in 3D scenes. The code and model are available for research at https://sanweiliti.github.io/PLACE/PLACE.html

ps

pdf arXiv project code [BibTex]

pdf arXiv project code [BibTex]


Label Efficient Visual Abstractions for Autonomous Driving
Label Efficient Visual Abstractions for Autonomous Driving

Behl, A., Chitta, K., Prakash, A., Ohn-Bar, E., Geiger, A.

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, October 2020 (conference)

Abstract
It is well known that semantic segmentation can be used as an effective intermediate representation for learning driving policies. However, the task of street scene semantic segmentation requires expensive annotations. Furthermore, segmentation algorithms are often trained irrespective of the actual driving task, using auxiliary image-space loss functions which are not guaranteed to maximize driving metrics such as safety or distance traveled per intervention. In this work, we seek to quantify the impact of reducing segmentation annotation costs on learned behavior cloning agents. We analyze several segmentation-based intermediate representations. We use these visual abstractions to systematically study the trade-off between annotation efficiency and driving performance, ie, the types of classes labeled, the number of image samples used to learn the visual abstraction model, and their granularity (eg, object masks vs. 2D bounding boxes). Our analysis uncovers several practical insights into how segmentation-based visual abstractions can be exploited in a more label efficient manner. Surprisingly, we find that state-of-the-art driving performance can be achieved with orders of magnitude reduction in annotation cost. Beyond label efficiency, we find several additional training benefits when leveraging visual abstractions, such as a significant reduction in the variance of the learned policy when compared to state-of-the-art end-to-end driving models.

avg

pdf slides video Project Page [BibTex]

pdf slides video Project Page [BibTex]


Learning a statistical full spine model from partial observations
Learning a statistical full spine model from partial observations

Meng, D., Keller, M., Boyer, E., Black, M., Pujades, S.

In Shape in Medical Imaging, pages: 122,133, (Editors: Reuter, Martin and Wachinger, Christian and Lombaert, Hervé and Paniagua, Beatriz and Goksel, Orcun and Rekik, Islem), Springer International Publishing, October 2020 (inproceedings)

Abstract
The study of the morphology of the human spine has attracted research attention for its many potential applications, such as image segmentation, bio-mechanics or pathology detection. However, as of today there is no publicly available statistical model of the 3D surface of the full spine. This is mainly due to the lack of openly available 3D data where the full spine is imaged and segmented. In this paper we propose to learn a statistical surface model of the full-spine (7 cervical, 12 thoracic and 5 lumbar vertebrae) from partial and incomplete views of the spine. In order to deal with the partial observations we use probabilistic principal component analysis (PPCA) to learn a surface shape model of the full spine. Quantitative evaluation demonstrates that the obtained model faithfully captures the shape of the population in a low dimensional space and generalizes to left out data. Furthermore, we show that the model faithfully captures the global correlations among the vertebrae shape. Given a partial observation of the spine, i.e. a few vertebrae, the model can predict the shape of unseen vertebrae with a mean error under 3 mm. The full-spine statistical model is trained on the VerSe 2019 public dataset and is publicly made available to the community for non-commercial purposes. (https://gitlab.inria.fr/spine/spine_model)

ps

Gitlab Code PDF DOI [BibTex]

Gitlab Code PDF DOI [BibTex]


Convolutional Occupancy Networks
Convolutional Occupancy Networks

Peng, S., Niemeyer, M., Mescheder, L., Pollefeys, M., Geiger, A.

In European Conference on Computer Vision (ECCV), Springer International Publishing, Cham, August 2020 (inproceedings)

Abstract
Recently, implicit neural representations have gained popularity for learning-based 3D reconstruction. While demonstrating promising results, most implicit approaches are limited to comparably simple geometry of single objects and do not scale to more complicated or large-scale scenes. The key limiting factor of implicit methods is their simple fully-connected network architecture which does not allow for integrating local information in the observations or incorporating inductive biases such as translational equivariance. In this paper, we propose Convolutional Occupancy Networks, a more flexible implicit representation for detailed reconstruction of objects and 3D scenes. By combining convolutional encoders with implicit occupancy decoders, our model incorporates inductive biases, enabling structured reasoning in 3D space. We investigate the effectiveness of the proposed representation by reconstructing complex geometry from noisy point clouds and low-resolution voxel representations. We empirically find that our method enables the fine-grained implicit 3D reconstruction of single objects, scales to large indoor scenes, and generalizes well from synthetic to real data.

avg

pdf suppmat video Project Page [BibTex]

pdf suppmat video Project Page [BibTex]


STAR: Sparse Trained Articulated Human Body Regressor
STAR: Sparse Trained Articulated Human Body Regressor

Osman, A. A. A., Bolkart, T., Black, M. J.

In European Conference on Computer Vision (ECCV) , August 2020 (inproceedings)

Abstract
The SMPL body model is widely used for the estimation, synthesis, and analysis of 3D human pose and shape. While popular, we show that SMPL has several limitations and introduce STAR, which is quantitatively and qualitatively superior to SMPL. First, SMPL has a huge number of parameters resulting from its use of global blend shapes. These dense pose-corrective offsets relate every vertex on the mesh to all the joints in the kinematic tree, capturing spurious long-range correlations. To address this, we define per-joint pose correctives and learn the subset of mesh vertices that are influenced by each joint movement. This sparse formulation results in more realistic deformations and significantly reduces the number of model parameters to 20% of SMPL. When trained on the same data as SMPL, STAR generalizes better despite having many fewer parameters. Second, SMPL factors pose-dependent deformations from body shape while, in reality, people with different shapes deform differently. Consequently, we learn shape-dependent pose-corrective blend shapes that depend on both body pose and BMI. Third, we show that the shape space of SMPL is not rich enough to capture the variation in the human population. We address this by training STAR with an additional 10,000 scans of male and female subjects, and show that this results in better model generalization. STAR is compact, generalizes better to new bodies and is a drop-in replacement for SMPL. STAR is publicly available for research purposes at http://star.is.tue.mpg.de.

ps

Project Page Code Video paper supplemental [BibTex]


Monocular Expressive Body Regression through Body-Driven Attention
Monocular Expressive Body Regression through Body-Driven Attention

Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M. J.

In Computer Vision – ECCV 2020, Springer International Publishing, Cham, August 2020 (inproceedings)

Abstract
To understand how people look, interact, or perform tasks,we need to quickly and accurately capture their 3D body, face, and hands together from an RGB image. Most existing methods focus only on parts of the body. A few recent approaches reconstruct full expressive 3D humans from images using 3D body models that include the face and hands. These methods are optimization-based and thus slow, prone to local optima, and require 2D keypoints as input. We address these limitations by introducing ExPose (EXpressive POse and Shape rEgression), which directly regresses the body, face, and hands, in SMPL-X format, from an RGB image. This is a hard problem due to the high dimensionality of the body and the lack of expressive training data. Additionally, hands and faces are much smaller than the body, occupying very few image pixels. This makes hand and face estimation hard when body images are downscaled for neural networks. We make three main contributions. First, we account for the lack of training data by curating a dataset of SMPL-X fits on in-the-wild images. Second, we observe that body estimation localizes the face and hands reasonably well. We introduce body-driven attention for face and hand regions in the original image to extract higher-resolution crops that are fed to dedicated refinement modules. Third, these modules exploit part-specific knowledge from existing face and hand-only datasets. ExPose estimates expressive 3D humans more accurately than existing optimization methods at a small fraction of the computational cost. Our data, model and code are available for research at https://expose.is.tue.mpg.de.

ps

code Short video Long video arxiv pdf suppl link (url) Project Page [BibTex]


Category Level Object Pose Estimation via Neural Analysis-by-Synthesis
Category Level Object Pose Estimation via Neural Analysis-by-Synthesis

Chen, X., Dong, Z., Song, J., Geiger, A., Hilliges, O.

In European Conference on Computer Vision (ECCV), Springer International Publishing, Cham, August 2020 (inproceedings)

Abstract
Many object pose estimation algorithms rely on the analysis-by-synthesis framework which requires explicit representations of individual object instances. In this paper we combine a gradient-based fitting procedure with a parametric neural image synthesis module that is capable of implicitly representing the appearance, shape and pose of entire object categories, thus rendering the need for explicit CAD models per object instance unnecessary. The image synthesis network is designed to efficiently span the pose configuration space so that model capacity can be used to capture the shape and local appearance (i.e., texture) variations jointly. At inference time the synthesized images are compared to the target via an appearance based loss and the error signal is backpropagated through the network to the input parameters. Keeping the network parameters fixed, this allows for iterative optimization of the object pose, shape and appearance in a joint manner and we experimentally show that the method can recover orientation of objects with high accuracy from 2D images alone. When provided with depth measurements, to overcome scale ambiguities, the method can accurately recover the full 6DOF pose successfully.

avg

Project Page pdf suppmat [BibTex]

Project Page pdf suppmat [BibTex]


GRAB: A Dataset of Whole-Body Human Grasping of Objects
GRAB: A Dataset of Whole-Body Human Grasping of Objects

Taheri, O., Ghorbani, N., Black, M. J., Tzionas, D.

In Computer Vision – ECCV 2020, Springer International Publishing, Cham, August 2020 (inproceedings)

Abstract
Training computers to understand, model, and synthesize human grasping requires a rich dataset containing complex 3D object shapes, detailed contact information, hand pose and shape, and the 3D body motion over time. While "grasping" is commonly thought of as a single hand stably lifting an object, we capture the motion of the entire body and adopt the generalized notion of "whole-body grasps". Thus, we collect a new dataset, called GRAB (GRasping Actions with Bodies), of whole-body grasps, containing full 3D shape and pose sequences of 10 subjects interacting with 51 everyday objects of varying shape and size. Given MoCap markers, we fit the full 3D body shape and pose, including the articulated face and hands, as well as the 3D object pose. This gives detailed 3D meshes over time, from which we compute contact between the body and object. This is a unique dataset, that goes well beyond existing ones for modeling and understanding how humans grasp and manipulate objects, how their full body is involved, and how interaction varies with the task. We illustrate the practical value of GRAB with an example application; we train GrabNet, a conditional generative network, to predict 3D hand grasps for unseen 3D object shapes. The dataset and code are available for research purposes at https://grab.is.tue.mpg.de.

ps

pdf suppl video (long) video (short) link (url) DOI [BibTex]

pdf suppl video (long) video (short) link (url) DOI [BibTex]


Learning to Dress 3D People in Generative Clothing
Learning to Dress 3D People in Generative Clothing

Ma, Q., Yang, J., Ranjan, A., Pujades, S., Pons-Moll, G., Tang, S., Black, M. J.

In Computer Vision and Pattern Recognition (CVPR), pages: 6468-6477, IEEE, June 2020 (inproceedings)

Abstract
Three-dimensional human body models are widely used in the analysis of human pose and motion. Existing models, however, are learned from minimally-clothed 3D scans and thus do not generalize to the complexity of dressed people in common images and videos. Additionally, current models lack the expressive power needed to represent the complex non-linear geometry of pose-dependent clothing shape. To address this, we learn a generative 3D mesh model of clothed people from 3D scans with varying pose and clothing. Specifically, we train a conditional Mesh-VAE-GAN to learn the clothing deformation from the SMPL body model, making clothing an additional term on SMPL. Our model is conditioned on both pose and clothing type, giving the ability to draw samples of clothing to dress different body shapes in a variety of styles and poses. To preserve wrinkle detail, our Mesh-VAE-GAN extends patchwise discriminators to 3D meshes. Our model, named CAPE, represents global shape and fine local structure, effectively extending the SMPL body model to clothing. To our knowledge, this is the first generative model that directly dresses 3D human body meshes and generalizes to different poses.

ps

Project page Code Short video Long video arXiv DOI [BibTex]

Project page Code Short video Long video arXiv DOI [BibTex]


{GENTEL : GENerating Training data Efficiently for Learning to segment medical images}
GENTEL : GENerating Training data Efficiently for Learning to segment medical images

Thakur, R. P., Rocamora, S. P., Goel, L., Pohmann, R., Machann, J., Black, M. J.

Congrès Reconnaissance des Formes, Image, Apprentissage et Perception (RFAIP), June 2020 (conference)

Abstract
Accurately segmenting MRI images is crucial for many clinical applications. However, manually segmenting images with accurate pixel precision is a tedious and time consuming task. In this paper we present a simple, yet effective method to improve the efficiency of the image segmentation process. We propose to transform the image annotation task into a binary choice task. We start by using classical image processing algorithms with different parameter values to generate multiple, different segmentation masks for each input MRI image. Then, instead of segmenting the pixels of the images, the user only needs to decide whether a segmentation is acceptable or not. This method allows us to efficiently obtain high quality segmentations with minor human intervention. With the selected segmentations, we train a state-of-the-art neural network model. For the evaluation, we use a second MRI dataset (1.5T Dataset), acquired with a different protocol and containing annotations. We show that the trained network i) is able to automatically segment cases where none of the classical methods obtain a high quality result ; ii) generalizes to the second MRI dataset, which was acquired with a different protocol and was never seen at training time ; and iii) enables detection of miss-annotations in this second dataset. Quantitatively, the trained network obtains very good results: DICE score - mean 0.98, median 0.99- and Hausdorff distance (in pixels) - mean 4.7, median 2.0-.

ps

Project Page PDF [BibTex]

Project Page PDF [BibTex]


Generating 3D People in Scenes without People
Generating 3D People in Scenes without People

Zhang, Y., Hassan, M., Neumann, H., Black, M. J., Tang, S.

In Computer Vision and Pattern Recognition (CVPR), pages: 6194-6204, June 2020 (inproceedings)

Abstract
We present a fully automatic system that takes a 3D scene and generates plausible 3D human bodies that are posed naturally in that 3D scene. Given a 3D scene without people, humans can easily imagine how people could interact with the scene and the objects in it. However, this is a challenging task for a computer as solving it requires that (1) the generated human bodies to be semantically plausible within the 3D environment (e.g. people sitting on the sofa or cooking near the stove), and (2) the generated human-scene interaction to be physically feasible such that the human body and scene do not interpenetrate while, at the same time, body-scene contact supports physical interactions. To that end, we make use of the surface-based 3D human model SMPL-X. We first train a conditional variational autoencoder to predict semantically plausible 3D human poses conditioned on latent scene representations, then we further refine the generated 3D bodies using scene constraints to enforce feasible physical interaction. We show that our approach is able to synthesize realistic and expressive 3D human bodies that naturally interact with 3D environment. We perform extensive experiments demonstrating that our generative framework compares favorably with existing methods, both qualitatively and quantitatively. We believe that our scene-conditioned 3D human generation pipeline will be useful for numerous applications; e.g. to generate training data for human pose estimation, in video games and in VR/AR. Our project page for data and code can be seen at: \url{https://vlg.inf.ethz.ch/projects/PSI/}.

ps

Code PDF DOI [BibTex]

Code PDF DOI [BibTex]


Learning Physics-guided Face Relighting under Directional Light
Learning Physics-guided Face Relighting under Directional Light

Nestmeyer, T., Lalonde, J., Matthews, I., Lehrmann, A. M.

In Conference on Computer Vision and Pattern Recognition, pages: 5123-5132, IEEE/CVF, June 2020 (inproceedings) Accepted

Abstract
Relighting is an essential step in realistically transferring objects from a captured image into another environment. For example, authentic telepresence in Augmented Reality requires faces to be displayed and relit consistent with the observer's scene lighting. We investigate end-to-end deep learning architectures that both de-light and relight an image of a human face. Our model decomposes the input image into intrinsic components according to a diffuse physics-based image formation model. We enable non-diffuse effects including cast shadows and specular highlights by predicting a residual correction to the diffuse render. To train and evaluate our model, we collected a portrait database of 21 subjects with various expressions and poses. Each sample is captured in a controlled light stage setup with 32 individual light sources. Our method creates precise and believable relighting results and generalizes to complex illumination conditions and challenging poses, including when the subject is not looking straight at the camera.

ps

Paper [BibTex]

Paper [BibTex]


{VIBE}: Video Inference for Human Body Pose and Shape Estimation
VIBE: Video Inference for Human Body Pose and Shape Estimation

Kocabas, M., Athanasiou, N., Black, M. J.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), pages: 5252-5262, IEEE, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, June 2020 (inproceedings)

Abstract
Human motion is fundamental to understanding behavior. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methodsfail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, we propose “Video Inference for Body Pose and Shape Estimation” (VIBE), which makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild, 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training, at the sequence level, produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. We perform extensive experimentation to analyze the importance of motion and demonstrate the effectiveness of VIBE on challenging 3D pose estimation datasets, achieving state-of-the-art performance. Code and pretrained models are available at https://github.com/mkocabas/VIBE

ps

arXiv code video supplemental video DOI Project Page [BibTex]

arXiv code video supplemental video DOI Project Page [BibTex]


Machine learning systems and methods of estimating body shape from images
Machine learning systems and methods of estimating body shape from images

Black, M., Rachlin, E., Heron, N., Loper, M., Weiss, A., Hu, K., Hinkle, T., Kristiansen, M.

(US Patent 10,679,046), June 2020 (patent)

Abstract
Disclosed is a method including receiving an input image including a human, predicting, based on a convolutional neural network that is trained using examples consisting of pairs of sensor data, a corresponding body shape of the human and utilizing the corresponding body shape predicted from the convolutional neural network as input to another convolutional neural network to predict additional body shape metrics.

ps

[BibTex]

[BibTex]


FootTile: a Rugged Foot Sensor for Force and Center of Pressure Sensing in Soft Terrain
FootTile: a Rugged Foot Sensor for Force and Center of Pressure Sensing in Soft Terrain

Felix Ruppert, , Badri-Spröwitz, A.

In Proceedings of the IEEE International Conference on Robotics and Automation, IEEE, International Conference on Robotics and Automation, May 2020 (inproceedings) Accepted

Abstract
In this paper, we present FootTile, a foot sensor for reaction force and center of pressure sensing in challenging terrain. We compare our sensor design to standard biomechanical devices, force plates and pressure plates. We show that FootTile can accurately estimate force and pressure distribution during legged locomotion. FootTile weighs 0.9g, has a sampling rate of 330 Hz, a footprint of 10×10 mm and can easily be adapted in sensor range to the required load case. In three experiments, we validate: first, the performance of the individual sensor, second an array of FootTiles for center of pressure sensing and third the ground reaction force estimation during locomotion in granular substrate. We then go on to show the accurate sensing capabilities of the waterproof sensor in liquid mud, as a showcase for real world rough terrain use.

dlg

Youtube1 Youtube2 Presentation link (url) [BibTex]

Youtube1 Youtube2 Presentation link (url) [BibTex]


From Variational to Deterministic Autoencoders
From Variational to Deterministic Autoencoders

Ghosh*, P., Sajjadi*, M. S. M., Vergari, A., Black, M. J., Schölkopf, B.

8th International Conference on Learning Representations (ICLR) , April 2020, *equal contribution (conference)

Abstract
Variational Autoencoders (VAEs) provide a theoretically-backed framework for deep generative models. However, they often produce “blurry” images, which is linked to their training objective. Sampling in the most popular implementation, the Gaussian VAE, can be interpreted as simply injecting noise to the input of a deterministic decoder. In practice, this simply enforces a smooth latent space structure. We challenge the adoption of the full VAE framework on this specific point in favor of a simpler, deterministic one. Specifically, we investigate how substituting stochasticity with other explicit and implicit regularization schemes can lead to a meaningful latent space without having to force it to conform to an arbitrarily chosen prior. To retrieve a generative mechanism for sampling new data points, we propose to employ an efficient ex-post density estimation step that can be readily adopted both for the proposed deterministic autoencoders as well as to improve sample quality of existing VAEs. We show in a rigorous empirical study that regularized deterministic autoencoding achieves state-of-the-art sample quality on the common MNIST, CIFAR-10 and CelebA datasets.

ei ps

arXiv link (url) [BibTex]

arXiv link (url) [BibTex]


Attractiveness and Confidence in Walking Style of Male and Female Virtual Characters
Attractiveness and Confidence in Walking Style of Male and Female Virtual Characters

Thaler, A., Bieg, A., Mahmood, N., Black, M. J., Mohler, B. J., Troje, N. F.

In IEEE Conference on Virtual Reality and 3D User Interfaces Abstracts and Workshops (VRW), pages: 678-679, March 2020 (inproceedings)

Abstract
Animated virtual characters are essential to many applications. Little is known so far about biological and personality inferences made from a virtual character’s body shape and motion. Here, we investigated how sex-specific differences in walking style relate to the perceived attractiveness and confidence of male and female virtual characters. The characters were generated by reconstructing body shape and walking motion from optical motion capture data. The results suggest that sexual dimorphism in walking style plays a different role in attributing biological and personality traits to male and female virtual characters. This finding has important implications for virtual character animation.

ps

pdf DOI [BibTex]

pdf DOI [BibTex]


Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations
Chained Representation Cycling: Learning to Estimate 3D Human Pose and Shape by Cycling Between Representations

Rueegg, N., Lassner, C., Black, M. J., Schindler, K.

In Thirty-Fourth AAAI Conference on Artificial Intelligence (AAAI-20), pages: 5561-5569, Febuary 2020 (inproceedings)

Abstract
The goal of many computer vision systems is to transform image pixels into 3D representations. Recent popular models use neural networks to regress directly from pixels to 3D object parameters. Such an approach works well when supervision is available, but in problems like human pose and shape estimation, it is difficult to obtain natural images with 3D ground truth. To go one step further, we propose a new architecture that facilitates unsupervised, or lightly supervised, learning. The idea is to break the problem into a series of transformations between increasingly abstract representations. Each step involves a cycle designed to be learnable without annotated training data, and the chain of cycles delivers the final solution. Specifically, we use 2D body part segments as an intermediate representation that contains enough information to be lifted to 3D, and at the same time is simple enough to be learned in an unsupervised way. We demonstrate the method by learning 3D human pose and shape from un-paired and un-annotated images. We also explore varying amounts of paired data and show that cycling greatly alleviates the need for paired data. While we present results for modeling humans, our formulation is general and can be applied to other vision problems.

ps

pdf [BibTex]

pdf [BibTex]


Machine learning systems and methods for augmenting images
Machine learning systems and methods for augmenting images

Black, M., Rachlin, E., Lee, E., Heron, N., Loper, M., Weiss, A., Smith, D.

(US Patent 10,529,137 B1), January 2020 (patent)

Abstract
Disclosed is a method including receiving visual input comprising a human within a scene, detecting a pose associated with the human using a trained machine learning model that detects human poses to yield a first output, estimating a shape (and optionally a motion) associated with the human using a trained machine learning model associated that detects shape (and optionally motion) to yield a second output, recognizing the scene associated with the visual input using a trained convolutional neural network which determines information about the human and other objects in the scene to yield a third output, and augmenting reality within the scene by leveraging one or more of the first output, the second output, and the third output to place 2D and/or 3D graphics in the scene.

ps

[BibTex]

[BibTex]


Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image
Learning Unsupervised Hierarchical Part Decomposition of 3D Objects from a Single RGB Image

Paschalidou, D., Gool, L., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Humans perceive the 3D world as a set of distinct objects that are characterized by various low-level (geometry, reflectance) and high-level (connectivity, adjacency, symmetry) properties. Recent methods based on convolutional neural networks (CNNs) demonstrated impressive progress in 3D reconstruction, even when using a single 2D image as input. However, the majority of these methods focuses on recovering the local 3D geometry of an object without considering its part-based decomposition or relations between parts. We address this challenging problem by proposing a novel formulation that allows to jointly recover the geometry of a 3D object as a set of primitives as well as their latent hierarchical structure without part-level supervision. Our model recovers the higher level structural decomposition of various objects in the form of a binary tree of primitives, where simple parts are represented with fewer primitives and more complex parts are modeled with more components. Our experiments on the ShapeNet and D-FAUST datasets demonstrate that considering the organization of parts indeed facilitates reasoning about 3D geometry.

avg

pdf suppmat Video 2 Project Page Slides Poster Video 1 [BibTex]

pdf suppmat Video 2 Project Page Slides Poster Video 1 [BibTex]


GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis
GRAF: Generative Radiance Fields for 3D-Aware Image Synthesis

Schwarz, K., Liao, Y., Niemeyer, M., Geiger, A.

In Advances in Neural Information Processing Systems (NeurIPS), 2020 (inproceedings)

Abstract
While 2D generative adversarial networks have enabled high-resolution image synthesis, they largely lack an understanding of the 3D world and the image formation process. Thus, they do not provide precise control over camera viewpoint or object pose. To address this problem, several recent approaches leverage intermediate voxel-based representations in combination with differentiable rendering. However, existing methods either produce low image resolution or fall short in disentangling camera and scene properties, eg, the object identity may vary with the viewpoint. In this paper, we propose a generative model for radiance fields which have recently proven successful for novel view synthesis of a single scene. In contrast to voxel-based representations, radiance fields are not confined to a coarse discretization of the 3D space, yet allow for disentangling camera and scene properties while degrading gracefully in the presence of reconstruction ambiguity. By introducing a multi-scale patch-based discriminator, we demonstrate synthesis of high-resolution images while training our model from unposed 2D images alone. We systematically analyze our approach on several challenging synthetic and real-world datasets. Our experiments reveal that radiance fields are a powerful representation for generative image synthesis, leading to 3D consistent models that render with high fidelity.

avg

pdf suppmat video Project Page [BibTex]

pdf suppmat video Project Page [BibTex]


Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis
Towards Unsupervised Learning of Generative Models for 3D Controllable Image Synthesis

Liao, Y., Schwarz, K., Mescheder, L., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
In recent years, Generative Adversarial Networks have achieved impressive results in photorealistic image synthesis. This progress nurtures hopes that one day the classical rendering pipeline can be replaced by efficient models that are learned directly from images. However, current image synthesis models operate in the 2D domain where disentangling 3D properties such as camera viewpoint or object pose is challenging. Furthermore, they lack an interpretable and controllable representation. Our key hypothesis is that the image generation process should be modeled in 3D space as the physical world surrounding us is intrinsically three-dimensional. We define the new task of 3D controllable image synthesis and propose an approach for solving it by reasoning both in 3D space and in the 2D image domain. We demonstrate that our model is able to disentangle latent 3D factors of simple multi-object scenes in an unsupervised fashion from raw images. Compared to pure 2D baselines, it allows for synthesizing scenes that are consistent wrt. changes in viewpoint or object pose. We further evaluate various 3D representations in terms of their usefulness for this challenging task.

avg

pdf suppmat Video 2 Project Page Video 1 Slides Poster [BibTex]

pdf suppmat Video 2 Project Page Video 1 Slides Poster [BibTex]


Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving
Exploring Data Aggregation in Policy Learning for Vision-based Urban Autonomous Driving

Prakash, A., Behl, A., Ohn-Bar, E., Chitta, K., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Data aggregation techniques can significantly improve vision-based policy learning within a training environment, e.g., learning to drive in a specific simulation condition. However, as on-policy data is sequentially sampled and added in an iterative manner, the policy can specialize and overfit to the training conditions. For real-world applications, it is useful for the learned policy to generalize to novel scenarios that differ from the training conditions. To improve policy learning while maintaining robustness when training end-to-end driving policies, we perform an extensive analysis of data aggregation techniques in the CARLA environment. We demonstrate how the majority of them have poor generalization performance, and develop a novel approach with empirically better generalization performance compared to existing techniques. Our two key ideas are (1) to sample critical states from the collected on-policy data based on the utility they provide to the learned policy in terms of driving behavior, and (2) to incorporate a replay buffer which progressively focuses on the high uncertainty regions of the policy's state distribution. We evaluate the proposed approach on the CARLA NoCrash benchmark, focusing on the most challenging driving scenarios with dense pedestrian and vehicle traffic. Our approach improves driving success rate by 16% over state-of-the-art, achieving 87% of the expert performance while also reducing the collision rate by an order of magnitude without the use of any additional modality, auxiliary tasks, architectural modifications or reward from the environment.

avg

pdf suppmat Video 2 Project Page Slides Video 1 [BibTex]

pdf suppmat Video 2 Project Page Slides Video 1 [BibTex]


Learning Situational Driving
Learning Situational Driving

Ohn-Bar, E., Prakash, A., Behl, A., Chitta, K., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Human drivers have a remarkable ability to drive in diverse visual conditions and situations, e.g., from maneuvering in rainy, limited visibility conditions with no lane markings to turning in a busy intersection while yielding to pedestrians. In contrast, we find that state-of-the-art sensorimotor driving models struggle when encountering diverse settings with varying relationships between observation and action. To generalize when making decisions across diverse conditions, humans leverage multiple types of situation-specific reasoning and learning strategies. Motivated by this observation, we develop a framework for learning a situational driving policy that effectively captures reasoning under varying types of scenarios. Our key idea is to learn a mixture model with a set of policies that can capture multiple driving modes. We first optimize the mixture model through behavior cloning, and show it to result in significant gains in terms of driving performance in diverse conditions. We then refine the model by directly optimizing for the driving task itself, i.e., supervised with the navigation task reward. Our method is more scalable than methods assuming access to privileged information, e.g., perception labels, as it only assumes demonstration and reward-based supervision. We achieve over 98% success rate on the CARLA driving benchmark as well as state-of-the-art performance on a newly introduced generalization benchmark.

avg

pdf suppmat Video 2 Project Page Video 1 Slides [BibTex]

pdf suppmat Video 2 Project Page Video 1 Slides [BibTex]


On Joint Estimation of Pose, Geometry and svBRDF from a Handheld Scanner
On Joint Estimation of Pose, Geometry and svBRDF from a Handheld Scanner

Schmitt, C., Donne, S., Riegler, G., Koltun, V., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
We propose a novel formulation for joint recovery of camera pose, object geometry and spatially-varying BRDF. The input to our approach is a sequence of RGB-D images captured by a mobile, hand-held scanner that actively illuminates the scene with point light sources. Compared to previous works that jointly estimate geometry and materials from a hand-held scanner, we formulate this problem using a single objective function that can be minimized using off-the-shelf gradient-based solvers. By integrating material clustering as a differentiable operation into the optimization process, we avoid pre-processing heuristics and demonstrate that our model is able to determine the correct number of specular materials independently. We provide a study on the importance of each component in our formulation and on the requirements of the initial geometry. We show that optimizing over the poses is crucial for accurately recovering fine details and that our approach naturally results in a semantically meaningful material segmentation.

avg

pdf Project Page Slides Video Poster [BibTex]

pdf Project Page Slides Video Poster [BibTex]


Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition
Intrinsic Autoencoders for Joint Neural Rendering and Intrinsic Image Decomposition

Hassan Alhaija, Siva Mustikovela, Varun Jampani, Justus Thies, Matthias Niessner, Andreas Geiger, Carsten Rother

In International Conference on 3D Vision (3DV), 2020 (inproceedings)

Abstract
Neural rendering techniques promise efficient photo-realistic image synthesis while providing rich control over scene parameters by learning the physical image formation process. While several supervised methods have been pro-posed for this task, acquiring a dataset of images with accurately aligned 3D models is very difficult. The main contribution of this work is to lift this restriction by training a neural rendering algorithm from unpaired data. We pro-pose an auto encoder for joint generation of realistic images from synthetic 3D models while simultaneously decomposing real images into their intrinsic shape and appearance properties. In contrast to a traditional graphics pipeline, our approach does not require to specify all scene properties, such as material parameters and lighting by hand.Instead, we learn photo-realistic deferred rendering from a small set of 3D models and a larger set of unaligned real images, both of which are easy to acquire in practice. Simultaneously, we obtain accurate intrinsic decompositions of real images while not requiring paired ground truth. Our experiments confirm that a joint treatment of rendering and de-composition is indeed beneficial and that our approach out-performs state-of-the-art image-to-image translation base-lines both qualitatively and quantitatively.

avg

pdf suppmat [BibTex]

pdf suppmat [BibTex]


Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision
Differentiable Volumetric Rendering: Learning Implicit 3D Representations without 3D Supervision

Niemeyer, M., Mescheder, L., Oechsle, M., Geiger, A.

In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2020, 2020 (inproceedings)

Abstract
Learning-based 3D reconstruction methods have shown impressive results. However, most methods require 3D supervision which is often hard to obtain for real-world datasets. Recently, several works have proposed differentiable rendering techniques to train reconstruction models from RGB images. Unfortunately, these approaches are currently restricted to voxel- and mesh-based representations, suffering from discretization or low resolution. In this work, we propose a differentiable rendering formulation for implicit shape and texture representations. Implicit representations have recently gained popularity as they represent shape and texture continuously. Our key insight is that depth gradients can be derived analytically using the concept of implicit differentiation. This allows us to learn implicit shape and texture representations directly from RGB images. We experimentally show that our single-view reconstructions rival those learned with full 3D supervision. Moreover, we find that our method can be used for multi-view 3D reconstruction, directly resulting in watertight meshes.

avg

pdf suppmat Video 2 Project Page Video 1 Video 3 Slides Poster [BibTex]

pdf suppmat Video 2 Project Page Video 1 Video 3 Slides Poster [BibTex]


Learning Implicit Surface Light Fields
Learning Implicit Surface Light Fields

Oechsle, M., Niemeyer, M., Reiser, C., Mescheder, L., Strauss, T., Geiger, A.

In International Conference on 3D Vision (3DV), 2020 (inproceedings)

Abstract
Implicit representations of 3D objects have recently achieved impressive results on learning-based 3D reconstruction tasks. While existing works use simple texture models to represent object appearance, photo-realistic image synthesis requires reasoning about the complex interplay of light, geometry and surface properties. In this work, we propose a novel implicit representation for capturing the visual appearance of an object in terms of its surface light field. In contrast to existing representations, our implicit model represents surface light fields in a continuous fashion and independent of the geometry. Moreover, we condition the surface light field with respect to the location and color of a small light source. Compared to traditional surface light field models, this allows us to manipulate the light source and relight the object using environment maps. We further demonstrate the capabilities of our model to predict the visual appearance of an unseen object from a single real RGB image and corresponding 3D shape information. As evidenced by our experiments, our model is able to infer rich visual appearance including shadows and specular reflections. Finally, we show that the proposed representation can be embedded into a variational auto-encoder for generating novel appearances that conform to the specified illumination conditions.

avg

pdf suppmat Project Page [BibTex]

pdf suppmat Project Page [BibTex]

2017


The Numerics of GANs
The Numerics of GANs

Mescheder, L., Nowozin, S., Geiger, A.

In Proceedings from the conference "Neural Information Processing Systems 2017., (Editors: Guyon I. and Luxburg U.v. and Bengio S. and Wallach H. and Fergus R. and Vishwanathan S. and Garnett R.), Curran Associates, Inc., Advances in Neural Information Processing Systems 30 (NIPS), December 2017 (inproceedings)

Abstract
In this paper, we analyze the numerics of common algorithms for training Generative Adversarial Networks (GANs). Using the formalism of smooth two-player games we analyze the associated gradient vector field of GAN training objectives. Our findings suggest that the convergence of current algorithms suffers due to two factors: i) presence of eigenvalues of the Jacobian of the gradient vector field with zero real-part, and ii) eigenvalues with big imaginary part. Using these findings, we design a new algorithm that overcomes some of these limitations and has better convergence properties. Experimentally, we demonstrate its superiority on training common GAN architectures and show convergence on GAN architectures that are known to be notoriously hard to train.

avg

pdf Project Page [BibTex]

2017


pdf Project Page [BibTex]


A Generative Model of People in Clothing
A Generative Model of People in Clothing

Lassner, C., Pons-Moll, G., Gehler, P. V.

In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, IEEE International Conference on Computer Vision (ICCV), October 2017 (inproceedings)

Abstract
We present the first image-based generative model of people in clothing in a full-body setting. We sidestep the commonly used complex graphics rendering pipeline and the need for high-quality 3D scans of dressed people. Instead, we learn generative models from a large image database. The main challenge is to cope with the high variance in human pose, shape and appearance. For this reason, pure image-based approaches have not been considered so far. We show that this challenge can be overcome by splitting the generating process in two parts. First, we learn to generate a semantic segmentation of the body and clothing. Second, we learn a conditional model on the resulting segments that creates realistic images. The full model is differentiable and can be conditioned on pose, shape or color. The result are samples of people in different clothing items and styles. The proposed model can generate entirely new people with realistic clothing. In several experiments we present encouraging results that suggest an entirely data-driven approach to people generation is possible.

ps

link (url) Project Page [BibTex]

link (url) Project Page [BibTex]


Semantic Video CNNs through Representation Warping
Semantic Video CNNs through Representation Warping

Gadde, R., Jampani, V., Gehler, P. V.

In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, IEEE International Conference on Computer Vision (ICCV), October 2017 (inproceedings) Accepted

Abstract
In this work, we propose a technique to convert CNN models for semantic segmentation of static images into CNNs for video data. We describe a warping method that can be used to augment existing architectures with very lit- tle extra computational cost. This module is called Net- Warp and we demonstrate its use for a range of network architectures. The main design principle is to use optical flow of adjacent frames for warping internal network repre- sentations across time. A key insight of this work is that fast optical flow methods can be combined with many different CNN architectures for improved performance and end-to- end training. Experiments validate that the proposed ap- proach incurs only little extra computational cost, while im- proving performance, when video streams are available. We achieve new state-of-the-art results on the standard CamVid and Cityscapes benchmark datasets and show reliable im- provements over different baseline networks. Our code and models are available at http://segmentation.is. tue.mpg.de

ps

pdf Supplementary Project Page [BibTex]

pdf Supplementary Project Page [BibTex]


Bounding Boxes, Segmentations and Object Coordinates: How Important is Recognition for 3D Scene Flow Estimation in Autonomous Driving Scenarios?
Bounding Boxes, Segmentations and Object Coordinates: How Important is Recognition for 3D Scene Flow Estimation in Autonomous Driving Scenarios?

Behl, A., Jafari, O. H., Mustikovela, S. K., Alhaija, H. A., Rother, C., Geiger, A.

In Proceedings IEEE International Conference on Computer Vision (ICCV), IEEE, Piscataway, NJ, USA, IEEE International Conference on Computer Vision (ICCV), October 2017 (inproceedings)

Abstract
Existing methods for 3D scene flow estimation often fail in the presence of large displacement or local ambiguities, e.g., at texture-less or reflective surfaces. However, these challenges are omnipresent in dynamic road scenes, which is the focus of this work. Our main contribution is to overcome these 3D motion estimation problems by exploiting recognition. In particular, we investigate the importance of recognition granularity, from coarse 2D bounding box estimates over 2D instance segmentations to fine-grained 3D object part predictions. We compute these cues using CNNs trained on a newly annotated dataset of stereo images and integrate them into a CRF-based model for robust 3D scene flow estimation - an approach we term Instance Scene Flow. We analyze the importance of each recognition cue in an ablation study and observe that the instance segmentation cue is by far strongest, in our setting. We demonstrate the effectiveness of our method on the challenging KITTI 2015 scene flow benchmark where we achieve state-of-the-art performance at the time of submission.

avg

pdf suppmat Poster Project Page [BibTex]

pdf suppmat Poster Project Page [BibTex]


Sparsity Invariant CNNs
Sparsity Invariant CNNs

Uhrig, J., Schneider, N., Schneider, L., Franke, U., Brox, T., Geiger, A.

International Conference on 3D Vision (3DV) 2017, International Conference on 3D Vision (3DV), October 2017 (conference)

Abstract
In this paper, we consider convolutional neural networks operating on sparse inputs with an application to depth upsampling from sparse laser scan data. First, we show that traditional convolutional networks perform poorly when applied to sparse data even when the location of missing data is provided to the network. To overcome this problem, we propose a simple yet effective sparse convolution layer which explicitly considers the location of missing data during the convolution operation. We demonstrate the benefits of the proposed network architecture in synthetic and real experiments \wrt various baseline approaches. Compared to dense baselines, the proposed sparse convolution network generalizes well to novel datasets and is invariant to the level of sparsity in the data. For our evaluation, we derive a novel dataset from the KITTI benchmark, comprising 93k depth annotated RGB images. Our dataset allows for training and evaluating depth upsampling and depth prediction techniques in challenging real-world settings.

avg

pdf suppmat Project Page Project Page [BibTex]

pdf suppmat Project Page Project Page [BibTex]


OctNetFusion: Learning Depth Fusion from Data
OctNetFusion: Learning Depth Fusion from Data

Riegler, G., Ulusoy, A. O., Bischof, H., Geiger, A.

International Conference on 3D Vision (3DV) 2017, International Conference on 3D Vision (3DV), October 2017 (conference)

Abstract
In this paper, we present a learning based approach to depth fusion, i.e., dense 3D reconstruction from multiple depth images. The most common approach to depth fusion is based on averaging truncated signed distance functions, which was originally proposed by Curless and Levoy in 1996. While this method is simple and provides great results, it is not able to reconstruct (partially) occluded surfaces and requires a large number frames to filter out sensor noise and outliers. Motivated by the availability of large 3D model repositories and recent advances in deep learning, we present a novel 3D CNN architecture that learns to predict an implicit surface representation from the input depth maps. Our learning based method significantly outperforms the traditional volumetric fusion approach in terms of noise reduction and outlier suppression. By learning the structure of real world 3D objects and scenes, our approach is further able to reconstruct occluded regions and to fill in gaps in the reconstruction. We demonstrate that our learning based approach outperforms both vanilla TSDF fusion as well as TV-L1 fusion on the task of volumetric fusion. Further, we demonstrate state-of-the-art 3D shape completion results.

avg

pdf Video 1 Video 2 Project Page Project Page [BibTex]

pdf Video 1 Video 2 Project Page Project Page [BibTex]


Direct Visual Odometry for a Fisheye-Stereo Camera
Direct Visual Odometry for a Fisheye-Stereo Camera

Liu, P., Heng, L., Sattler, T., Geiger, A., Pollefeys, M.

In Proceedings IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, Piscataway, NJ, USA, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), September 2017 (inproceedings)

Abstract
We present a direct visual odometry algorithm for a fisheye-stereo camera. Our algorithm performs simultaneous camera motion estimation and semi-dense reconstruction. The pipeline consists of two threads: a tracking thread and a mapping thread. In the tracking thread, we estimate the camera pose via semi-dense direct image alignment. To have a wider field of view (FoV) which is important for robotic perception, we use fisheye images directly without converting them to conventional pinhole images which come with a limited FoV. To address the epipolar curve problem, plane-sweeping stereo is used for stereo matching and depth initialization. Multiple depth hypotheses are tracked for selected pixels to better capture the uncertainty characteristics of stereo matching. Temporal motion stereo is then used to refine the depth and remove false positive depth hypotheses. Our implementation runs at an average of 20 Hz on a low-end PC. We run experiments in outdoor environments to validate our algorithm, and discuss the experimental results. We experimentally show that we are able to estimate 6D poses with low drift, and at the same time, do semi-dense 3D reconstruction with high accuracy.

avg

pdf Project Page [BibTex]

pdf Project Page [BibTex]


Augmented Reality Meets Deep Learning for Car Instance Segmentation in Urban Scenes
Augmented Reality Meets Deep Learning for Car Instance Segmentation in Urban Scenes

Alhaija, H. A., Mustikovela, S. K., Mescheder, L., Geiger, A., Rother, C.

In Proceedings of the British Machine Vision Conference 2017, Proceedings of the British Machine Vision Conference, September 2017 (inproceedings)

Abstract
The success of deep learning in computer vision is based on the availability of large annotated datasets. To lower the need for hand labeled images, virtually rendered 3D worlds have recently gained popularity. Unfortunately, creating realistic 3D content is challenging on its own and requires significant human effort. In this work, we propose an alternative paradigm which combines real and synthetic data for learning semantic instance segmentation models. Exploiting the fact that not all aspects of the scene are equally important for this task, we propose to augment real-world imagery with virtual objects of the target category. Capturing real-world images at large scale is easy and cheap, and directly provides real background appearances without the need for creating complex 3D models of the environment. We present an efficient procedure to augment these images with virtual objects. This allows us to create realistic composite images which exhibit both realistic background appearance as well as a large number of complex object arrangements. In contrast to modeling complete 3D environments, our data augmentation approach requires only a few user interactions in combination with 3D shapes of the target object category. We demonstrate the utility of the proposed approach for training a state-of-the-art high-capacity deep model for semantic instance segmentation. In particular, we consider the task of segmenting car instances on the KITTI dataset which we have annotated with pixel-accurate ground truth. Our experiments demonstrate that models trained on augmented imagery generalize better than those trained on synthetic data or models trained on limited amounts of annotated real data.

avg

pdf Project Page [BibTex]

pdf Project Page [BibTex]


 Effects of animation retargeting on perceived action outcomes
Effects of animation retargeting on perceived action outcomes

Kenny, S., Mahmood, N., Honda, C., Black, M. J., Troje, N. F.

Proceedings of the ACM Symposium on Applied Perception (SAP’17), pages: 2:1-2:7, September 2017 (conference)

Abstract
The individual shape of the human body, including the geometry of its articulated structure and the distribution of weight over that structure, influences the kinematics of a person's movements. How sensitive is the visual system to inconsistencies between shape and motion introduced by retargeting motion from one person onto the shape of another? We used optical motion capture to record five pairs of male performers with large differences in body weight, while they pushed, lifted, and threw objects. Based on a set of 67 markers, we estimated both the kinematics of the actions as well as the performer's individual body shape. To obtain consistent and inconsistent stimuli, we created animated avatars by combining the shape and motion estimates from either a single performer or from different performers. In a virtual reality environment, observers rated the perceived weight or thrown distance of the objects. They were also asked to explicitly discriminate between consistent and hybrid stimuli. Observers were unable to accomplish the latter, but hybridization of shape and motion influenced their judgements of action outcome in systematic ways. Inconsistencies between shape and motion were assimilated into an altered perception of the action outcome.

ps

pdf DOI [BibTex]

pdf DOI [BibTex]


Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks
Adversarial Variational Bayes: Unifying Variational Autoencoders and Generative Adversarial Networks

Mescheder, L., Nowozin, S., Geiger, A.

In Proceedings of the 34th International Conference on Machine Learning, 70, Proceedings of Machine Learning Research, (Editors: Doina Precup, Yee Whye Teh), PMLR, International Conference on Machine Learning (ICML), August 2017 (inproceedings)

Abstract
Variational Autoencoders (VAEs) are expressive latent variable models that can be used to learn complex probability distributions from training data. However, the quality of the resulting model crucially relies on the expressiveness of the inference model. We introduce Adversarial Variational Bayes (AVB), a technique for training Variational Autoencoders with arbitrarily expressive inference models. We achieve this by introducing an auxiliary discriminative network that allows to rephrase the maximum-likelihood-problem as a two-player game, hence establishing a principled connection between VAEs and Generative Adversarial Networks (GANs). We show that in the nonparametric limit our method yields an exact maximum-likelihood assignment for the parameters of the generative model, as well as the exact posterior distribution over the latent variables given an observation. Contrary to competing approaches which combine VAEs with GANs, our approach has a clear theoretical justification, retains most advantages of standard Variational Autoencoders and is easy to implement.

avg

pdf suppmat Project Page arxiv-version Project Page [BibTex]

pdf suppmat Project Page arxiv-version Project Page [BibTex]


Coupling Adaptive Batch Sizes with Learning Rates
Coupling Adaptive Batch Sizes with Learning Rates

Balles, L., Romero, J., Hennig, P.

In Proceedings Conference on Uncertainty in Artificial Intelligence (UAI) 2017, pages: 410-419, (Editors: Gal Elidan and Kristian Kersting), Association for Uncertainty in Artificial Intelligence (AUAI), Conference on Uncertainty in Artificial Intelligence (UAI), August 2017 (inproceedings)

Abstract
Mini-batch stochastic gradient descent and variants thereof have become standard for large-scale empirical risk minimization like the training of neural networks. These methods are usually used with a constant batch size chosen by simple empirical inspection. The batch size significantly influences the behavior of the stochastic optimization algorithm, though, since it determines the variance of the gradient estimates. This variance also changes over the optimization process; when using a constant batch size, stability and convergence is thus often enforced by means of a (manually tuned) decreasing learning rate schedule. We propose a practical method for dynamic batch size adaptation. It estimates the variance of the stochastic gradients and adapts the batch size to decrease the variance proportionally to the value of the objective function, removing the need for the aforementioned learning rate decrease. In contrast to recent related work, our algorithm couples the batch size to the learning rate, directly reflecting the known relationship between the two. On three image classification benchmarks, our batch size adaptation yields faster optimization convergence, while simultaneously simplifying learning rate tuning. A TensorFlow implementation is available.

ps pn

Code link (url) Project Page [BibTex]

Code link (url) Project Page [BibTex]


Joint Graph Decomposition and Node Labeling by Local Search
Joint Graph Decomposition and Node Labeling by Local Search

Levinkov, E., Uhrig, J., Tang, S., Omran, M., Insafutdinov, E., Kirillov, A., Rother, C., Brox, T., Schiele, B., Andres, B.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 1904-1912, IEEE, IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

ps

PDF Supplementary DOI Project Page [BibTex]

PDF Supplementary DOI Project Page [BibTex]


Dynamic {FAUST}: Registering Human Bodies in Motion
Dynamic FAUST: Registering Human Bodies in Motion

Bogo, F., Romero, J., Pons-Moll, G., Black, M. J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
While the ready availability of 3D scan data has influenced research throughout computer vision, less attention has focused on 4D data; that is 3D scans of moving nonrigid objects, captured over time. To be useful for vision research, such 4D scans need to be registered, or aligned, to a common topology. Consequently, extending mesh registration methods to 4D is important. Unfortunately, no ground-truth datasets are available for quantitative evaluation and comparison of 4D registration methods. To address this we create a novel dataset of high-resolution 4D scans of human subjects in motion, captured at 60 fps. We propose a new mesh registration method that uses both 3D geometry and texture information to register all scans in a sequence to a common reference topology. The approach exploits consistency in texture over both short and long time intervals and deals with temporal offsets between shape and texture capture. We show how using geometry alone results in significant errors in alignment when the motions are fast and non-rigid. We evaluate the accuracy of our registration and provide a dataset of 40,000 raw and aligned meshes. Dynamic FAUST extends the popular FAUST dataset to dynamic 4D data, and is available for research purposes at http://dfaust.is.tue.mpg.de.

ps

pdf video Project Page Project Page Project Page [BibTex]

pdf video Project Page Project Page Project Page [BibTex]


Learning from Synthetic Humans
Learning from Synthetic Humans

Varol, G., Romero, J., Martin, X., Mahmood, N., Black, M. J., Laptev, I., Schmid, C.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Estimating human pose, shape, and motion from images and videos are fundamental challenges with many applications. Recent advances in 2D human pose estimation use large amounts of manually-labeled training data for learning convolutional neural networks (CNNs). Such data is time consuming to acquire and difficult to extend. Moreover, manual labeling of 3D pose, depth and motion is impractical. In this work we present SURREAL (Synthetic hUmans foR REAL tasks): a new large-scale dataset with synthetically-generated but realistic images of people rendered from 3D sequences of human motion capture data. We generate more than 6 million frames together with ground truth pose, depth maps, and segmentation masks. We show that CNNs trained on our synthetic dataset allow for accurate human depth estimation and human part segmentation in real RGB images. Our results and the new dataset open up new possibilities for advancing person analysis using cheap and large-scale synthetic data.

ps

arXiv project data Project Page Project Page [BibTex]

arXiv project data Project Page Project Page [BibTex]


On human motion prediction using recurrent neural networks
On human motion prediction using recurrent neural networks

Martinez, J., Black, M. J., Romero, J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Human motion modelling is a classical problem at the intersection of graphics and computer vision, with applications spanning human-computer interaction, motion synthesis, and motion prediction for virtual and augmented reality. Following the success of deep learning methods in several computer vision tasks, recent work has focused on using deep recurrent neural networks (RNNs) to model human motion, with the goal of learning time-dependent representations that perform tasks such as short-term motion prediction and long-term human motion synthesis. We examine recent work, with a focus on the evaluation methodologies commonly used in the literature, and show that, surprisingly, state-of-the-art performance can be achieved by a simple baseline that does not attempt to model motion at all. We investigate this result, and analyze recent RNN methods by looking at the architectures, loss functions, and training procedures used in state-of-the-art approaches. We propose three changes to the standard RNN models typically used for human motion, which result in a simple and scalable RNN architecture that obtains state-of-the-art performance on human motion prediction.

ps

arXiv Project Page [BibTex]

arXiv Project Page [BibTex]


Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data
Slow Flow: Exploiting High-Speed Cameras for Accurate and Diverse Optical Flow Reference Data

Janai, J., Güney, F., Wulff, J., Black, M., Geiger, A.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 1406-1416, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Existing optical flow datasets are limited in size and variability due to the difficulty of capturing dense ground truth. In this paper, we tackle this problem by tracking pixels through densely sampled space-time volumes recorded with a high-speed video camera. Our model exploits the linearity of small motions and reasons about occlusions from multiple frames. Using our technique, we are able to establish accurate reference flow fields outside the laboratory in natural environments. Besides, we show how our predictions can be used to augment the input images with realistic motion blur. We demonstrate the quality of the produced flow fields on synthetic and real-world datasets. Finally, we collect a novel challenging optical flow dataset by applying our technique on data from a high-speed camera and analyze the performance of the state-of-the-art in optical flow under various levels of motion blur.

avg ps

pdf suppmat Project page Video DOI Project Page [BibTex]

pdf suppmat Project page Video DOI Project Page [BibTex]


Optical Flow in Mostly Rigid Scenes
Optical Flow in Mostly Rigid Scenes

Wulff, J., Sevilla-Lara, L., Black, M. J.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, pages: 6911-6920, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
The optical flow of natural scenes is a combination of the motion of the observer and the independent motion of objects. Existing algorithms typically focus on either recovering motion and structure under the assumption of a purely static world or optical flow for general unconstrained scenes. We combine these approaches in an optical flow algorithm that estimates an explicit segmentation of moving objects from appearance and physical constraints. In static regions we take advantage of strong constraints to jointly estimate the camera motion and the 3D structure of the scene over multiple frames. This allows us to also regularize the structure instead of the motion. Our formulation uses a Plane+Parallax framework, which works even under small baselines, and reduces the motion estimation to a one-dimensional search problem, resulting in more accurate estimation. In moving regions the flow is treated as unconstrained, and computed with an existing optical flow method. The resulting Mostly-Rigid Flow (MR-Flow) method achieves state-of-the-art results on both the MPISintel and KITTI-2015 benchmarks.

ps

pdf SupMat video code Project Page [BibTex]

pdf SupMat video code Project Page [BibTex]


OctNet: Learning Deep 3D Representations at High Resolutions
OctNet: Learning Deep 3D Representations at High Resolutions

Riegler, G., Ulusoy, O., Geiger, A.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
We present OctNet, a representation for deep learning with sparse 3D data. In contrast to existing models, our representation enables 3D convolutional networks which are both deep and high resolution. Towards this goal, we exploit the sparsity in the input data to hierarchically partition the space using a set of unbalanced octrees where each leaf node stores a pooled feature representation. This allows to focus memory allocation and computation to the relevant dense regions and enables deeper networks without compromising resolution. We demonstrate the utility of our OctNet representation by analyzing the impact of resolution on several 3D tasks including 3D object classification, orientation estimation and point cloud labeling.

avg ps

pdf suppmat Project Page Video Project Page [BibTex]

pdf suppmat Project Page Video Project Page [BibTex]


Reflectance Adaptive Filtering Improves Intrinsic Image Estimation
Reflectance Adaptive Filtering Improves Intrinsic Image Estimation

Nestmeyer, T., Gehler, P. V.

In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages: 1771-1780, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

ps

pre-print DOI Project Page Project Page [BibTex]

pre-print DOI Project Page Project Page [BibTex]


A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos
A Multi-View Stereo Benchmark with High-Resolution Images and Multi-Camera Videos

Schöps, T., Schönberger, J. L., Galliani, S., Sattler, T., Schindler, K., Pollefeys, M., Geiger, A.

In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, IEEE, Piscataway, NJ, USA, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017 (inproceedings)

Abstract
Motivated by the limitations of existing multi-view stereo benchmarks, we present a novel dataset for this task. Towards this goal, we recorded a variety of indoor and outdoor scenes using a high-precision laser scanner and captured both high-resolution DSLR imagery as well as synchronized low-resolution stereo videos with varying fields-of-view. To align the images with the laser scans, we propose a robust technique which minimizes photometric errors conditioned on the geometry. In contrast to previous datasets, our benchmark provides novel challenges and covers a diverse set of viewpoints and scene types, ranging from natural scenes to man-made indoor and outdoor environments. Furthermore, we provide data at significantly higher temporal and spatial resolution. Our benchmark is the first to cover the important use case of hand-held mobile devices while also providing high-resolution DSLR camera images. We make our datasets and an online evaluation server available at http://www.eth3d.net.

avg

pdf suppmat Project Page Project Page [BibTex]

pdf suppmat Project Page Project Page [BibTex]