Animal Shape and Pose

In the past 20 years, impressive advances have been made in capturing, modeling, and tracking the human body in 3D, thanks to the availability of large amounts of 3D body scans and mocap data. Animals have received much less attention, despite many applications in biomechanics, biology, neuroscience, robotics, smart farming, and entertainment. The main reason is that methods developed for the human body cannot be easily applied to animals: animals are not cooperative, cannot be brought to the lab in large numbers, and current scanners cannot be taken into the wild. Capturing significant animal motion with motion capture equipment is also challenging.

In this project we develop methods to learn 3D articulated statistical shape models that can represent a wide variety of species in the animal kingdom, allowing intra- and inter-species analysis of 3D shape and the automatic and non-invasive assessment of animal shape and pose from images and video.

From scans of toy animals, we learn the SMAL (Skinned Multi Animal Linear) model [], a 3D articulated statistical shape model able to represent animal shapes for different species: big cats, dogs, cows, horses, zebras, and hippos. To capture animals outside the SMAL space, we developed SMALR (SMAL with Refinement) []. SMALR estimates a detailed 3D textured mesh using a small set of uncalibrated, non-simultaneous images of the animal. We are also developing species-specific models for dogs, rats and horses exploiting different modalities.
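
As an illustration of what a linear statistical shape model involves, the following is a minimal sketch, not the SMAL implementation. It assumes a hypothetical array of meshes that are already registered to a common template and pose-normalized, and it uses plain PCA. All function names are invented for this example.

```python
import numpy as np

def learn_shape_space(meshes, num_components):
    """Fit a PCA shape space to registered meshes.

    meshes: array of shape (N, V, 3), N meshes sharing the same V vertices
    (hypothetical input; registration and pose normalization are assumed done).
    """
    n, v, _ = meshes.shape
    flat = meshes.reshape(n, v * 3)
    mean = flat.mean(axis=0)
    # SVD of the centered data gives the principal directions of shape variation.
    _, singular_values, vt = np.linalg.svd(flat - mean, full_matrices=False)
    basis = vt[:num_components]  # (K, 3V), orthonormal rows
    std = singular_values[:num_components] / np.sqrt(max(n - 1, 1))
    return mean, basis, std

def shape_from_coefficients(mean, basis, std, betas):
    """Reconstruct vertices from coefficients expressed in standard deviations."""
    flat = mean + (np.asarray(betas) * std) @ basis
    return flat.reshape(-1, 3)

def project_to_shape_space(mean, basis, std, vertices):
    """Project a registered mesh onto the learned shape space."""
    coeffs = basis @ (vertices.reshape(-1) - mean)
    # Avoid dividing by zero for directions with no variance.
    return coeffs / np.where(std > 0, std, 1.0)
```

Setting all coefficients to zero returns the mean shape. Sampling coefficients from a standard normal distribution produces new shapes within the learned space, which then have to be posed by a separate articulation step that this sketch omits.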

Today, animal motion is mostly captured indoors, for domestic species, with marker-based systems. We are using our 3D articulated animal shape models to develop markerless motion capture systems that can capture the shape and articulated motion of wild animals in their natural environment. In this context, we have applied our technology to capture the shape, pose, and texture of the Grevy's zebra from in-the-wild images [] using a novel neural-network regressor. The approach learns the zebra shape space during training using a photometric loss.
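
For intuition, a photometric loss compares a rendered image of the predicted model with the observed image. The sketch below is a plain masked mean squared error in NumPy. The exact formulation used in the zebra work may differ, and the function name, array shapes, and mask are assumptions made for this example.

```python
import numpy as np

def photometric_loss(rendered, target, mask):
    """Mean squared color difference over foreground pixels.

    rendered, target: (H, W, 3) float arrays with values in [0, 1]
    mask: (H, W) array, 1 where the animal is visible and 0 elsewhere
    """
    mask = mask.astype(rendered.dtype)[..., None]  # (H, W, 1), broadcasts over color
    squared_error = mask * (rendered - target) ** 2
    # Normalize by the number of foreground color values; guard against an empty mask.
    num_values = np.maximum(mask.sum() * rendered.shape[-1], 1.0)
    return squared_error.sum() / num_values
```

In a training pipeline, the rendered image would come from a differentiable renderer, so that the gradient of this loss can reach the shape, pose, and texture predictions. That part is outside the scope of this sketch.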

We have also focused on domestic animal species of particular relevance. Specifically, we have addressed the problem of 3D dog reconstruction from a single image in two works. In BARC [], we explored how incorporating dog breed information for training images enables learning a network that better estimates 3D shapes. In a subsequent work, BITE [], we tackled the challenge of 3D dog reconstruction for complex poses, such as sitting and lying down, by leveraging contact information.

Among domestic animals, horses are arguably the most interesting and widely studied. We have created the first 4D scanner for horses and, in collaboration with the Swedish University of Agricultural Sciences (SLU), have so far scanned more than 150 subjects. We also developed VAREN [], the first 3D articulated parametric shape model for animals learned from real subjects; it can also simulate muscle deformation during motion. Furthermore, we captured PFERD [], a unique dataset of horse motion, using dense motion capture on a diverse set of horses of varying shapes performing both common motions and complex, uncommon poses.

In AWOL [], we addressed how to control parametric models of animals and trees using language, so that 3D models of species not seen during training can be easily created.
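
One way to picture controlling a parametric model with language is as a regression from text embeddings to model parameters. The sketch below fits an affine ridge-regression map with NumPy as a deliberately simple stand-in. It is not the AWOL network, and the embeddings are assumed to be precomputed by some vision-language model (any fixed-length vectors work here). Function names and the `ridge` parameter are hypothetical.

```python
import numpy as np

def fit_embedding_to_params(embeddings, params, ridge=1e-3):
    """Fit an affine map from embeddings (N, D) to model parameters (N, P)."""
    # Append a constant column so the map includes a bias term.
    x = np.hstack([embeddings, np.ones((embeddings.shape[0], 1))])
    # Regularized normal equations; the ridge term keeps the solve stable
    # when there are fewer training pairs than embedding dimensions.
    gram = x.T @ x + ridge * np.eye(x.shape[1])
    return np.linalg.solve(gram, x.T @ params)  # (D + 1, P)

def predict_params(weights, embedding):
    """Map a single embedding (D,) to model parameters (P,)."""
    return np.append(embedding, 1.0) @ weights
```

The predicted parameter vector would then be passed to the parametric shape model to produce a mesh. Whether an unseen description yields a plausible shape depends on how smooth the learned mapping is, which a linear map only approximates.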

Members

Silvia Zuffi, Guest Scientist (Perceiving Systems)

Angjoo Kanazawa (Perceiving Systems)

Michael Black, Emeritus / Acting Director (Perceiving Systems)

Nadine Rueegg, Doctoral Researcher (Perceiving Systems)

Senya Polikovsky (Optics and Sensing Laboratory)

Peter Kulits, Doctoral Researcher (Perceiving Systems)

Publications

VAREN: Very Accurate and Realistic Equine Network. Zuffi, S., Mellbin, Y., Li, C., Hoeschle, M., Kjellström, H., Polikovsky, S., Hernlund, E., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 5374-5383, Piscataway, NJ, September 2024 (Published). Perceiving Systems, Conference Paper.

Abstract

Data-driven three-dimensional parametric shape models of the human body have gained enormous popularity both for the analysis of visual data and for the generation of synthetic humans. Following a similar approach for animals does not scale to the multitude of existing animal species, not to mention the difficulty of accessing subjects to scan in 3D. However, we argue that for domestic species of great importance, like the horse, it is a highly valuable investment to put effort into gathering a large dataset of real 3D scans, and learn a realistic 3D articulated shape model. We introduce VAREN, a novel 3D articulated parametric shape model learned from 3D scans of many real horses. VAREN bridges synthesis and analysis tasks, as the generated model instances have unprecedented realism, while being able to represent horses of different sizes and shapes. Unlike previous body models, VAREN has two resolutions, an anatomical skeleton, and interpretable, learned pose-dependent deformations, which are related to the body muscles. We show with experiments that this formulation has superior performance with respect to previous strategies for modeling pose-dependent deformations in the human body case, while also being more compact and allowing an analysis of the relationship between articulation and muscle deformation during articulated motion.


AWOL: Analysis WithOut synthesis using Language. Zuffi, S., Black, M. J. In European Conference on Computer Vision (ECCV 2024), LNCS, Springer Cham, September 2024 (Published). Perceiving Systems, Conference Paper.

Abstract

Many classical parametric 3D shape models exist, but creating novel shapes with such models requires expert knowledge of their parameters. For example, imagine creating a specific type of tree using procedural graphics or a new kind of animal from a statistical shape model. Our key idea is to leverage language to control such existing models to produce novel shapes. This involves learning a mapping between the latent space of a vision-language model and the parameter space of the 3D model, which we do using a small set of shape and text pairs. Our hypothesis is that mapping from language to parameters allows us to generate parameters for objects that were never seen during training. If the mapping between language and parameters is sufficiently smooth, then interpolation or generalization in language should translate appropriately into novel 3D shapes. We test our approach with two very different types of parametric shape models (quadrupeds and arboreal trees). We use a learned statistical shape model of quadrupeds and show that we can use text to generate new animals not present during training. In particular, we demonstrate state-of-the-art shape estimation of 3D dogs. This work also constitutes the first language-driven method for generating 3D trees. Finally, embedding images in the CLIP latent space enables us to generate animals and trees directly from images.


The Poses for Equine Research Dataset (PFERD). Li, C., Mellbin, Y., Krogager, J., Polikovsky, S., Holmberg, M., Ghorbani, N., Black, M. J., Kjellström, H., Zuffi, S., Hernlund, E. Nature Scientific Data, 11, May 2024 (Published). Perceiving Systems, Article.

Abstract

Studies of quadruped animal motion help us to identify diseases, understand behavior and unravel the mechanics behind gaits in animals. The horse is likely the best-studied animal in this aspect, but data capture is challenging and time-consuming. Computer vision techniques improve animal motion extraction, but the development relies on reference datasets, which are scarce, not open-access and often provide data from only a few anatomical landmarks. Addressing this data gap, we introduce PFERD, a video and 3D marker motion dataset from horses using a full-body set-up of over 100 densely placed skin-attached markers and synchronized videos from ten camera angles. Five horses of diverse conformations provide data for various motions from basic poses (e.g., walking, trotting) to advanced motions (e.g., rearing, kicking). We further express the 3D motions with current techniques and a 3D parameterized model, the hSMAL model, establishing a baseline for 3D horse markerless motion capture. PFERD enables advanced biomechanical studies and provides a resource of ground truth data for the methodological development of markerless motion capture.


Three-D Safari: Learning to Estimate Zebra Pose, Shape, and Texture from Images "In the Wild". Zuffi, S., Kanazawa, A., Berger-Wolf, T., Black, M. J. In International Conference on Computer Vision, 5358-5367, IEEE, October 2019 (Published). Perceiving Systems, Conference Paper.

Abstract

We present the first method to perform automatic 3D pose, shape and texture capture of animals from images acquired in-the-wild. In particular, we focus on the problem of capturing 3D information about Grevy's zebras from a collection of images. The Grevy's zebra is one of the most endangered species in Africa, with only a few thousand individuals left. Capturing the shape and pose of these animals can provide biologists and conservationists with information about animal health and behavior. In contrast to research on human pose, shape and texture estimation, training data for endangered species is limited, the animals are in complex natural scenes with occlusion, they are naturally camouflaged, travel in herds, and look similar to each other. To overcome these challenges, we integrate the recent SMAL animal model into a network-based regression pipeline, which we train end-to-end on synthetically generated images with pose, shape, and background variation. Going beyond state-of-the-art methods for human shape and pose estimation, our method learns a shape space for zebras during training. Learning such a shape space from images using only a photometric loss is novel, and the approach can be used to learn shape in other settings with limited 3D supervision. Moreover, we couple 3D pose and shape prediction with the task of texture synthesis, obtaining a full texture map of the animal from a single image. We show that the predicted texture map allows a novel per-instance unsupervised optimization over the network features. This method, SMALST (SMAL with learned Shape and Texture), goes beyond previous work, which assumed manual keypoints and/or segmentation, to regress directly from pixels to 3D animal shape, pose and texture. Code and data are available at https://github.com/silviazuffi/smalst.


Lions and Tigers and Bears: Capturing Non-Rigid, 3D, Articulated Shape from Images. Zuffi, S., Kanazawa, A., Black, M. J. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3955-3963, IEEE Computer Society, 2018. Perceiving Systems, Conference Paper.

Abstract

Animals are widespread in nature and the analysis of their shape and motion is important in many fields and industries. Modeling 3D animal shape, however, is difficult because the 3D scanning methods used to capture human shape are not applicable to wild animals or natural settings. Consequently, we propose a method to capture the detailed 3D shape of animals from images alone. The articulated and deformable nature of animals makes this problem extremely challenging, particularly in unconstrained environments with moving and uncalibrated cameras. To make this possible, we use a strong prior model of articulated animal shape that we fit to the image data. We then deform the animal shape in a canonical reference pose such that it matches image evidence when articulated and projected into multiple images. Our method extracts significantly more 3D shape detail than previous methods and is able to model new species, including the shape of an extinct animal, using only a few video frames. Additionally, the projected 3D shapes are accurate enough to facilitate the extraction of a realistic texture map from multiple frames.


3D Menagerie: Modeling the 3D Shape and Pose of Animals. Zuffi, S., Kanazawa, A., Jacobs, D., Black, M. J. In Proceedings IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, 5524-5532, IEEE, July 2017. Perceiving Systems, Conference Paper.

Abstract

There has been significant work on learning realistic, articulated, 3D models of the human body. In contrast, there are few such models of animals, despite many applications. The main challenge is that animals are much less cooperative than humans. The best human body models are learned from thousands of 3D scans of people in specific poses, which is infeasible with live animals. Consequently, we learn our model from a small set of 3D scans of toy figurines in arbitrary poses. We employ a novel part-based shape model to compute an initial registration to the scans. We then normalize their pose, learn a statistical shape model, and refine the registrations and the model together. In this way, we accurately align animal scans from different quadruped families with very different shapes and poses. With the registration to a common template we learn a shape space representing animals including lions, cats, dogs, horses, cows and hippos. Animal shapes can be sampled from the model, posed, animated, and fit to data. We demonstrate generalization by fitting it to images of real animals including species not seen in training.
