Predictors from causal features do not generalize better to new domains

Nastl, V. Y., Hardt, M.

Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (conference) Submitted

We study how well machine learning models trained on causal features generalize across domains. We consider 16 prediction tasks on tabular datasets covering applications in health, employment, education, social benefits, and politics. Each dataset comes with multiple domains, allowing us to test how well a model trained in one domain performs in another. For each prediction task, we select features that have a causal influence on the target of prediction. Our goal is to test the hypothesis that models trained on causal features generalize better across domains. Without exception, we find that predictors using all available features, regardless of causality, have better in-domain and out-of-domain accuracy than predictors using causal features. Moreover, even the absolute drop in accuracy from one domain to the other is no better for causal predictors than for models that use all features. If the goal is to generalize to new domains, practitioners might as well train the best possible model on all available features.


ArXiv [BibTex]


ImageNot: A contrast with ImageNet preserves model rankings

Salaudeen, O., Hardt, M.

Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (conference) Submitted

We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. We further give evidence that ImageNot has a similar utility as ImageNet for transfer learning purposes. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.


ArXiv [BibTex]

Questioning the survey responses of large language models

Dominguez-Olmedo, R., Hardt, M., Mendler-Dünner, C.

Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (conference) Submitted

As large language models increase in capability, researchers have started to conduct surveys of all kinds on these models in order to investigate the population represented by their responses. In this work, we critically examine language models' survey responses on the basis of the well-established American Community Survey by the U.S. Census Bureau and investigate whether they elicit faithful representations of any human population. Using a de facto standard multiple-choice prompting technique and evaluating 39 different language models using systematic experiments, we establish two dominant patterns: First, models' responses are governed by ordering and labeling biases, leading to variations across models that do not persist after adjusting for systematic biases. Second, models' responses do not contain the entropy variations and statistical signals typically found in human populations. As a result, a binary classifier can almost perfectly differentiate model-generated data from the responses of the U.S. census. At the same time, models' relative alignment with different demographic subgroups can be predicted from the subgroups' entropy, irrespective of the model's training data or training strategy. Taken together, our findings suggest caution in treating models' survey responses as equivalent to those of human populations.


ArXiv [BibTex]

Training on the Test Task Confounds Evaluation and Emergence

Dominguez-Olmedo, R., Dorner, F. E., Hardt, M.

Thirty-eighth Annual Conference on Neural Information Processing Systems (NeurIPS), December 2024 (conference) Submitted

We study a fundamental problem in the evaluation of large language models that we call training on the test task. Unlike wrongful practices like training on the test data, leakage, or data contamination, training on the test task is not malpractice. Rather, the term describes a growing set of techniques to include task-relevant data in the pretraining stage of a language model. We demonstrate that training on the test task confounds both relative model evaluations and claims about emergent capabilities. We argue that the seeming superiority of one model family over another may be explained by a different degree of training on the test task. To this end, we propose an effective method to adjust for training on the test task by fine-tuning each model under comparison on the same task-relevant data before evaluation. We then show that instances of emergent behavior largely vanish once we adjust for training on the test task. This also applies to reported instances of emergent behavior that cannot be explained by the choice of evaluation metric. Our work promotes a new perspective on the evaluation of large language models with broad implications for benchmarking and the study of emergent capabilities.


ArXiv [BibTex]

On predicting 3D bone locations inside the human body

Dakri, A., Arora, V., Challier, L., Keller, M., Black, M. J., Pujades, S.

In 26th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI), October 2024 (inproceedings)

Knowing the precise location of the bones inside the human body is key in several medical tasks, such as patient placement inside an imaging device or surgical navigation inside a patient. Our goal is to predict the bone locations using only an external 3D body surface observation. Existing approaches either validate their predictions on 2D data (X-rays) or with pseudo-ground truth computed from motion capture using biomechanical models. Thus, methods either suffer from a 3D-2D projection ambiguity or directly lack validation on clinical imaging data. In this work, we start with a dataset of segmented skin and long bones obtained from 3D full body MRI images that we refine into individual bone segmentations. To learn the skin-to-bone correlations, one needs to register the paired data. Few anatomical models allow registering a skeleton and the skin simultaneously. One such method, SKEL, has a skin and skeleton that are jointly rigged with the same pose parameters. However, it lacks the flexibility to adjust the bone locations inside its skin. To address this, we extend SKEL into SKEL-J to allow its bones to fit the segmented bones while its skin fits the segmented skin. These precise fits allow us to train SKEL-J to more accurately infer the anatomical joint locations from the skin surface. Our qualitative and quantitative results show how our bone location predictions are more accurate than all existing approaches. To foster future research, we make available for research purposes the individual bone segmentations, the fitted SKEL-J models as well as the new inference methods.


Project page [BibTex]

GraspXL: Generating Grasping Motions for Diverse Objects at Scale

Zhang, H., Christen, S., Fan, Z., Hilliges, O., Song, J.

European Conference on Computer Vision (ECCV), ECCV 2024, September 2024 (conference) Accepted


Code Video Paper [BibTex]

Benchmarks and Challenges in Pose Estimation for Egocentric Hand Interactions with Objects

Fan, Z., Ohkawa, T., Yang, L., Lin, N., Zhou, Z., Zhou, S., Liang, J., Gao, Z., Zhang, X., Zhang, X., Li, F., Zheng, L., Lu, F., Zeid, K. A., Leibe, B., On, J., Baek, S., Prakash, A., Gupta, S., He, K., Sato, Y., Hilliges, O., Chang, H. J., Yao, A.

European Conference on Computer Vision (ECCV), ECCV 2024, September 2024 (conference) Accepted


Paper Leaderboard [BibTex]

Learning to Control Emulated Muscles in Real Robots: Towards Exploiting Bio-Inspired Actuator Morphology

Schumacher, P., Krause, L., Schneider, J., Büchler, D., Martius, G., Haeufle, D.

In 10th International Conference on Biomedical Robotics and Biomechatronics (BioRob), September 2024 (inproceedings) Accepted


arXiv [BibTex]

Modelling Variability in Human Annotator Simulation

Wu*, W., Chen*, W., Zhang, C., Woodland, P. C.

62nd Annual Meeting of the Association for Computational Linguistics (ACL), August 2024, *equal contribution (conference) Accepted




On the Growth of Mistakes in Differentially Private Online Learning: A Lower Bound Perspective

Dmitriev, D., Szabó, K., Sanyal, A.

Proceedings of the 37th Annual Conference on Learning Theory (COLT), 247, pages: 1379-1398, Proceedings of Machine Learning Research, (Editors: Agrawal, Shipra and Roth, Aaron), PMLR, July 2024, (talk) (conference)


link (url) [BibTex]

Robustness of Nonlinear Representation Learning

Buchholz, S., Schölkopf, B.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Diffusive Gibbs Sampling

Chen*, W., Zhang*, M., Paige, B., Hernández-Lobato, J. M., Barber, D.

41st International Conference on Machine Learning (ICML), July 2024, *equal contribution (conference) Accepted




Simultaneous identification of models and parameters of scientific simulators

Schröder, C., Macke, J. H.

41st International Conference on Machine Learning (ICML), Vienna, Austria, July 2024 (conference) Accepted




Causal Action Influence Aware Counterfactual Data Augmentation

Urpi, N. A., Bagatella, M., Vlastelica, M., Martius, G.

In 41st International Conference on Machine Learning (ICML), July 2024 (inproceedings) Accepted




Diffusion Tempering Improves Parameter Estimation with Probabilistic Integrators for ODEs

Beck, J., Bosch, N., Deistler, M., Kadhim, K. L., Macke, J. H., Hennig, P., Berens, P.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




What Makes Safety Fine-tuning Methods Safe? A Mechanistic Study

Jain, S., Lubana, E. S., Oksuz, K., Joy, T., Torr, P. H. S., Sanyal, A., Dokania, P. K.

ICML 2024 Workshop on Mechanistic Interpretability (Spotlight), July 2024 (conference) Accepted


link (url) [BibTex]

Inherent Trade-Offs between Diversity and Stability in Multi-Task Benchmarks

Zhang, G., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), 235, pages: 58984-59002, Proceedings of Machine Learning Research, (Editors: Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix), PMLR, July 2024 (inproceedings)

We examine multi-task benchmarks in machine learning through the lens of social choice theory. We draw an analogy between benchmarks and electoral systems, where models are candidates and tasks are voters. This suggests a distinction between cardinal and ordinal benchmark systems. The former aggregate numerical scores into one model ranking; the latter aggregate rankings for each task. We apply Arrow's impossibility theorem to ordinal benchmarks to highlight the inherent limitations of ordinal systems, particularly their sensitivity to the inclusion of irrelevant models. Inspired by Arrow's theorem, we empirically demonstrate a strong trade-off between diversity and sensitivity to irrelevant changes in existing multi-task benchmarks. Our result is based on new quantitative measures of diversity and sensitivity that we introduce. Sensitivity quantifies the impact that irrelevant changes to tasks have on a benchmark. Diversity captures the degree of disagreement in model rankings across tasks. We develop efficient approximation algorithms for both measures, as exact computation is computationally challenging. Through extensive experiments on seven cardinal benchmarks and eleven ordinal benchmarks, we demonstrate a clear trade-off between diversity and stability: The more diverse a multi-task benchmark, the more sensitive to trivial changes it is. Additionally, we show that the aggregated rankings of existing benchmarks are highly unstable under irrelevant changes. The code and data are available at https://socialfoundations.github.io/benchbench/.
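
The distinction between cardinal and ordinal aggregation, and the sensitivity of ordinal systems to irrelevant models, can be illustrated with a toy calculation. The sketch below is not the paper's method or its sensitivity measure. It assumes mean score as the cardinal rule and Borda count as the ordinal rule, and the model names and scores are invented for illustration.

```python
from typing import Dict, List

Scores = Dict[str, List[float]]  # model name -> per-task scores (higher is better)


def cardinal_ranking(scores: Scores) -> List[str]:
    """Cardinal benchmark: rank models by their mean score across tasks."""
    return sorted(scores, key=lambda m: -sum(scores[m]) / len(scores[m]))


def ordinal_ranking(scores: Scores) -> List[str]:
    """Ordinal benchmark (Borda count, an assumed rule): on every task a model
    earns one point for each other model it strictly outperforms."""
    models = list(scores)
    n_tasks = len(scores[models[0]])
    points = {m: 0 for m in models}
    for t in range(n_tasks):
        for m in models:
            points[m] += sum(scores[m][t] > scores[o][t] for o in models if o != m)
    return sorted(models, key=lambda m: -points[m])


# Hypothetical scores on five tasks.
two_models: Scores = {
    "A": [0.80, 0.80, 0.80, 0.50, 0.50],
    "B": [0.70, 0.70, 0.70, 0.90, 0.90],
}
# Model C never ranks first on any task and finishes last overall.
three_models: Scores = {**two_models, "C": [0.60, 0.60, 0.60, 0.60, 0.60]}

print(ordinal_ranking(two_models))     # ['A', 'B']
print(ordinal_ranking(three_models))   # ['B', 'A', 'C']
print(cardinal_ranking(two_models))    # ['B', 'A']
print(cardinal_ranking(three_models))  # ['B', 'A', 'C']
```

In this toy case, adding the last-place model C flips the ordinal order of A and B, while the mean-score order of A and B stays the same.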


ArXiv link (url) [BibTex]

Allocation Requires Prediction Only if Inequality Is Low

Shirali, A., Abebe, R., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), pages: 45114-45153, Proceedings of Machine Learning Research, (Editors: Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix), PMLR, July 2024, equal contribution (inproceedings)

Algorithmic predictions are emerging as a promising solution concept for efficiently allocating societal resources. Fueling their use is an underlying assumption that such systems are necessary to identify individuals for interventions. We propose a principled framework for assessing this assumption: Using a simple mathematical model, we evaluate the efficacy of prediction-based allocations in settings where individuals belong to larger units such as hospitals, neighborhoods, or schools. We find that prediction-based allocations outperform baseline methods using aggregate unit-level statistics only when between-unit inequality is low and the intervention budget is high. Our results hold for a wide range of settings for the price of prediction, treatment effect heterogeneity, and unit-level statistics’ learnability. Combined, we highlight the potential limits to improving the efficacy of interventions through prediction.


ArXiv link (url) [BibTex]

Position: Understanding LLMs Requires More Than Statistical Generalization

Reizinger, P., Ujváry, S., Mészáros, A., Kerekes, A., Brendel, W., Huszár, F.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted


arXiv [BibTex]

ContourCraft: Learning to Resolve Intersections in Neural Multi-Garment Simulations

Grigorev, A., Becherini, G., Black, M., Hilliges, O., Thomaszewski, B.

In ACM SIGGRAPH 2024 Conference Papers, pages: 1-10, SIGGRAPH ’24, Association for Computing Machinery, New York, NY, USA, July 2024 (inproceedings)

Learning-based approaches to cloth simulation have started to show their potential in recent years. However, handling collisions and intersections in neural simulations remains a largely unsolved problem. In this work, we present ContourCraft, a learning-based solution for handling intersections in neural cloth simulations. Unlike conventional approaches that critically rely on intersection-free inputs, ContourCraft robustly recovers from intersections introduced through missed collisions, self-penetrating bodies, or errors in manually designed multi-layer outfits. The technical core of ContourCraft is a novel intersection contour loss that penalizes interpenetrations and encourages rapid resolution thereof. We integrate our intersection loss with a collision-avoiding repulsion objective into a neural cloth simulation method based on graph neural networks (GNNs). We demonstrate our method’s ability across a challenging set of diverse multi-layer outfits under dynamic human motions. Our extensive analysis indicates that ContourCraft significantly improves collision handling for learned simulation and produces visually compelling results.


paper arXiv DOI [BibTex]

Learning with 3D rotations, a hitchhiker’s guide to SO(3)

Geist, A. R., Frey, J., Zhobro, M., Levina, A., Martius, G.

In 41st International Conference on Machine Learning (ICML), July 2024 (inproceedings) Accepted




LPGD: A General Framework for Backpropagation through Embedded Optimization Layers

Paulus, A., Martius, G., Musil, V.

In 41st International Conference on Machine Learning (ICML), July 2024 (inproceedings) Accepted




Improving Neural Additive Models with Bayesian Principles

Bouchiat, K., Immer, A., Yèche, H., Rätsch, G., Fortuin, V.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Unveiling CLIP Dynamics: Linear Mode Connectivity and Generalization

Abdolahpourrostam, A., Sanyal, A., Moosavi-Dezfooli, S.

ICML 2024 Workshop on Foundation Models in the Wild, July 2024 (conference) Accepted


link (url) [BibTex]

A Sparsity Principle for Partially Observable Causal Representation Learning

Xu, D., Yao, D., Lachapelle, S., Taslakian, P., von Kügelgen, J., Locatello, F., Magliacane, S.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Don’t Label Twice: Quantity Beats Quality when Comparing Binary Classifiers on a Budget

Dorner, F. E., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), 235, pages: 11544-11572, (Editors: Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix), PMLR, July 2024 (inproceedings)

We study how to best spend a budget of noisy labels to compare the accuracy of two binary classifiers. It's common practice to collect and aggregate multiple noisy labels for a given data point into a less noisy label via a majority vote. We prove a theorem that runs counter to conventional wisdom. If the goal is to identify the better of two classifiers, we show it's best to spend the budget on collecting a single label for more samples. Our result follows from a non-trivial application of Cramér's theorem, a staple in the theory of large deviations. We discuss the implications of our work for the design of machine learning benchmarks, where they overturn some time-honored recommendations. In addition, our results provide sample size bounds superior to what follows from Hoeffding's bound.


ArXiv link (url) [BibTex]

Targeted Reduction of Causal Models

Kekic, A., Schölkopf, B., Besserve, M.

40th Conference on Uncertainty in Artificial Intelligence (UAI), July 2024 (conference) Accepted




Geometry-Aware Instrumental Variable Regression

Kremer, H., Schölkopf, B.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Stitching Manifolds: Leveraging Interaction to Compose Object Representations into Scenes

Keurti, H., Schölkopf, B., Aceituno, P. V., Grewe, B.

ICML 2024 Workshop on Geometry-grounded Representation Learning and Generative Modeling (GRaM), July 2024 (conference) Accepted


link (url) [BibTex]

Causal Inference from Competing Treatments

Stoica, A., Nastl, V. Y., Hardt, M.

In Proceedings of the 41st International Conference on Machine Learning (ICML), 235, pages: 46657-46691, Proceedings of Machine Learning Research, (Editors: Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix), PMLR, July 2024 (inproceedings)

Many applications of RCTs involve the presence of multiple treatment administrators -- from field experiments to online advertising -- that compete for the subjects' attention. In the face of competition, estimating a causal effect becomes difficult, as the position at which a subject sees a treatment influences their response, and thus the treatment effect. In this paper, we build a game-theoretic model of agents who wish to estimate causal effects in the presence of competition, through a bidding system and a utility function that minimizes estimation error. Our main technical result establishes an approximation with a tractable objective that maximizes the sample value obtained through strategically allocating budget on subjects. This allows us to find an equilibrium in our model: we show that the tractable objective has a pure Nash equilibrium, and that any Nash equilibrium is an approximate equilibrium for our general objective that minimizes estimation error under broad conditions. Conceptually, our work successfully combines elements from causal inference and game theory to shed light on the equilibrium behavior of experimentation under competition.


ArXiv link (url) [BibTex]

Reflectance Outperforms Force and Position in Model-Free Needle Puncture Detection

L’Orsa, R., Bisht, A., Yu, L., Murari, K., Westwick, D. T., Sutherland, G. R., Kuchenbecker, K. J.

In Proceedings of the Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Orlando, USA, July 2024 (inproceedings) Accepted

The surgical procedure of needle thoracostomy temporarily corrects accidental over-pressurization of the space between the chest wall and the lungs. However, failure rates of up to 94.1% have been reported, likely because this procedure is done blind: operators estimate by feel when the needle has reached its target. We believe instrumented needles could help operators discern entry into the target space, but limited success has been achieved using force and/or position to try to discriminate needle puncture events during simulated surgical procedures. We thus augmented our needle insertion system with a novel in-bore double-fiber optical setup. Tissue reflectance measurements as well as 3D force, torque, position, and orientation were recorded while two experimenters repeatedly inserted a bevel-tipped percutaneous needle into ex vivo porcine ribs. We applied model-free puncture detection to various filtered time derivatives of each sensor data stream offline. In the held-out test set of insertions, puncture-detection precision improved substantially using reflectance measurements compared to needle insertion force alone (3.3-fold increase) or position alone (11.6-fold increase).


Project Page [BibTex]

Products, Abstractions and Inclusions of Causal Spaces

Buchholz, S., Park, J., Schölkopf, B.

40th Conference on Uncertainty in Artificial Intelligence (UAI), July 2024 (conference) Accepted




Provable Privacy with Non-Private Pre-Processing

Hu, Y., Sanyal, A., Schölkopf, B.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Do Language Models Exhibit the Same Cognitive Biases in Problem Solving as Human Learners?

Opedal, A., Stolfo, A., Shirakami, H., Jiao, Y., Cotterell, R., Schölkopf, B., Saparov, A., Sachan, M.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




The Role of Learning Algorithms in Collective Action

Ben-Dov*, O., Fawkes*, J., Samadi, S., Sanyal, A.

41st International Conference on Machine Learning (ICML), July 2024, *equal contribution (conference) Accepted

Accuracy on the wrong line: On the pitfalls of noisy data for OOD generalisation

Sanyal, A., Hu, Y., Yu, Y., Ma, Y., Wang, Y., Schölkopf, B.

ICML 2024 Next Generation of AI Safety Workshop (Oral), July 2024 (conference) Accepted


arXiv [BibTex]

Detecting and Identifying Selection Structure in Sequential Data

Zheng, Y., Tang, Z., Qiu, Y., Schölkopf, B., Zhang, K.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Modelling Microbial Communities with Graph Neural Networks

Ruaud, A., Sancaktar, C., Bagatella, M., Ratzke, C., Martius, G.

In 41st International Conference on Machine Learning (ICML), July 2024 (inproceedings) Accepted




All-in-one simulation-based inference

Gloeckler, M., Deistler, M., Weilbach, C. D., Wood, F., Macke, J. H.

41st International Conference on Machine Learning (ICML), July 2024 (conference) Accepted




Causal Inference out of Control: Estimating Performativity without Treatment Randomization

Cheng, G., Hardt, M., Mendler-Dünner, C.

In Proceedings of the 41st International Conference on Machine Learning, 235, pages: 8077-8103, Proceedings of Machine Learning Research, (Editors: Salakhutdinov, Ruslan and Kolter, Zico and Heller, Katherine and Weller, Adrian and Oliver, Nuria and Scarlett, Jonathan and Berkenkamp, Felix), PMLR, July 2024 (inproceedings)

Regulators and academics are increasingly interested in the causal effect that algorithmic actions of a digital platform have on user consumption. In pursuit of estimating this effect from observational data, we identify a set of assumptions that permit causal identifiability without assuming randomized platform actions. Our results are applicable to platforms that rely on machine-learning-powered predictions and leverage knowledge from historical data. The key novelty of our approach is to explicitly model the dynamics of consumption over time, exploiting the repeated interaction of digital platforms with their participants to prove our identifiability results. By viewing the platform as a controller acting on a dynamical system, we can show that exogenous variation in consumption and appropriately responsive algorithmic control actions are sufficient for identifying the causal effect of interest. We complement our claims with an analysis of ready-to-use finite sample estimators and empirical investigations. More broadly, our results deriving identifiability conditions tailored to digital platform settings illustrate a fruitful interplay of control theory and causal inference.


ArXiv link (url) [BibTex]

HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video

(Accepted as Highlight: Top 11.9%)

Fan, Z., Parelli, M., Kadoglou, M. E., Kocabas, M., Chen, X., Black, M. J., Hilliges, O.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)


Paper Project Code [BibTex]

EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling

Liu, H., Zhu, Z., Becherini, G., Peng, Y., Su, M., Zhou, Y., Zhe, X., Iwamoto, N., Zheng, B., Black, M. J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

We propose EMAGE, a framework to generate full-body human gestures from audio and masked gestures, encompassing facial, local body, hands, and global movements. To achieve this, we first introduce BEAT2 (BEAT-SMPLX-FLAME), a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head, neck, and finger movements, offering a community-standardized, high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer, facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover, EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs, generating complete, audio-synchronized results. Our code and dataset are available.


arXiv project dataset code gradio colab video [BibTex]

HUGS: Human Gaussian Splats

Kocabas, M., Chang, R., Gabriel, J., Tuzel, O., Ranjan, A.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed, they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work, we introduce Human Gaussian Splats (HUGS), which represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of (50-100) frames, and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g., cloth, hair), we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges, including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of the human and novel view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being ∼100× faster to train than previous work.


arXiv Github Project Page YouTube Poster [BibTex]

SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes

Sanyal, S., Ghosh, P., Yang, J., Black, M. J., Thies, J., Bolkart, T.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE, as well as large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically, we learn a pose-dependent geometry space from 3D scan data. We represent this as per-vertex displacements w.r.t. the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type, and pose and clothing appearance, we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry, and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question answering models BLIP and CLIP. We validate our method on the SCULPT dataset, and compare to state-of-the-art 3D generative models for clothed human bodies.


Project page Data Code Video Arxiv [BibTex]

GraphDreamer: Compositional 3D Scene Synthesis from Scene Graphs

Gao, G., Liu, W., Chen, A., Geiger, A., Schölkopf, B.

The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2024 (conference) Accepted




HIT: Estimating Internal Human Implicit Tissues from the Body Surface

Keller, M., Arora, V., Dakri, A., Chandhok, S., Machann, J., Fritsche, A., Black, M. J., Pujades, S.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

The creation of personalized anatomical digital twins is important in the fields of medicine, computer graphics, sports science, and biomechanics. To observe a subject's anatomy, expensive medical devices (MRI or CT) are required and the creation of the digital model is often time-consuming and involves manual effort. Instead, we leverage the fact that the shape of the body surface is correlated with the internal anatomy; for example, from surface observations alone, one can predict body composition and skeletal structure. In this work, we go further and learn to infer the 3D location of three important anatomic tissues: subcutaneous adipose tissue (fat), lean tissue (muscles and organs), and long bones. To learn to infer these tissues, we tackle several key challenges. We first create a dataset of human tissues by segmenting full-body MRI scans and registering the SMPL body mesh to the body surface. With this dataset, we train HIT (Human Implicit Tissues), an implicit function that, given a point inside a body, predicts its tissue class. HIT leverages the SMPL body model shape and pose parameters to canonicalize the medical data. Unlike SMPL, which is trained from upright 3D scans, the MRI scans are taken of subjects lying on a table, resulting in significant soft-tissue deformation. Consequently, HIT uses a learned volumetric deformation field that undoes these deformations. Since HIT is parameterized by SMPL, we can repose bodies or change the shape of subjects and the internal structures deform appropriately. We perform extensive experiments to validate HIT's ability to predict plausible internal structure for novel subjects. The dataset and HIT model are publicly available to foster future research in this direction.


Project page Paper [BibTex]


WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion

Shin, S., Kim, J., Halilaj, E., Black, M. J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First, most methods estimate the human in camera coordinates. Second, prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third, the most accurate methods rely on computationally expensive optimization pipelines, limiting their use to offline applications. Finally, existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion), which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features, integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions, such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. Code will be available for research purposes.


arXiv project code [BibTex]


WANDR: Intention-guided Human Motion Generation

Diomataris, M., Athanasiou, N., Taheri, O., Wang, X., Hilliges, O., Black, M. J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this, we introduce WANDR, a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this, we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially, intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE), which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations.


project website arXiv YouTube Video Code [BibTex]


Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles

Sklyarova, V., Zakharov, E., Hilliges, O., Black, M. J., Thies, J.

In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR 2024, June 2024 (inproceedings)

We present HAAR, a new strand-based generative model for 3D human hairstyles. Specifically, based on textual inputs, HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds, meshes, or volumetric functions. However, by using the 2D priors, they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures cannot be reconstructed with those methods, and they only model the "outer shell", which is not ready to be used in physics-based rendering or simulation pipelines. In contrast, we propose a first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems, we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies, we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches.


ArXiv Code link (url) [BibTex]
