Publications

Empirical Inference Conference Paper Analyzing the Role of Semantic Representations in the Era of Large Language Models Jin*, Z., Chen*, Y., Gonzalez*, F., Liu, J., Zhang, J., Michael, J., Schölkopf, B., Diab, M. Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), Volume 1: Long Papers:3781-3798, (Editors: Duh, Kevin and Gomez, Helena and Bethard, Steven), Association for Computational Linguistics, June 2024, *equal contribution (Published)

Empirical Inference Conference Paper Automatic Generation of Model and Data Cards: A Step Towards Responsible AI Liu, J., Li, W., Jin, Z., Diab, M. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), Volume 1: Long Papers:1975-1997, (Editors: Duh, Kevin and Gomez, Helena and Bethard, Steven), Association for Computational Linguistics, June 2024 (Published)

Haptic Intelligence Bachelor Thesis Kalman Filter Approach to Sensor Fusion of Ultra-Wideband Positioning and IMU Readings for Enhanced Indoor Tracking of Collaborating Humans Hudhud Mughrabi, M. Kadir Has University, Istanbul, Turkey, June 2024, Bachelor of Science (BSc) in Mechatronics Engineering (Published)
The question of how humans collaborate to perform complex tasks such as surgery has previously been investigated via multimodal sensing and analysis. Ultra-wideband (UWB) localization systems can be deployed to track collaborating team members due to good maneuverability even in cramped environments. However, UWB systems' sampling rate is inversely proportional to the number of people tracked, and their accuracy is hindered by electromagnetic occlusion. This thesis combines UWB positioning with measurements from a wearable inertial measurement unit (IMU) by applying an error-state extended Kalman filter (ES-EKF) to improve position and orientation estimation during team collaborative studies. ES-EKF offers faster and more consistent estimation and can produce estimates even without UWB input. Single-human and multi-human sessions were recorded and filtered for evaluation in comparison to ground truth from optical motion capture. By integrating the IMU, the ES-EKF increases the sampling rate from 0.5–20 Hz to 100 Hz. As it is corrected in only 2 degrees of freedom (DOF), the ES-EKF yields improved results over UWB in 4 out of 6 DOF: lateral and longitudinal position and yaw and pitch orientation. Further filter design implications are suggested for future application of ES-EKF in position and orientation estimation of collaborating humans.
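The predict-and-correct idea in this abstract can be illustrated with a much simpler filter. The sketch below is not the thesis' ES-EKF: it assumes a 1D linear Kalman filter with a position-velocity state, and the function name `fuse` and the noise parameters `accel_var` and `fix_var` are hypothetical. High-rate acceleration samples (standing in for the IMU) drive every prediction step, while sparse position fixes (standing in for UWB) correct the state only when they arrive.

```python
import numpy as np

def fuse(accels, fixes, dt=0.01, accel_var=0.5, fix_var=0.05):
    """Toy 1D Kalman filter: inertial prediction plus sparse position fixes.

    accels: one acceleration sample per step (IMU-like, high rate).
    fixes:  dict mapping step index -> position fix (UWB-like, low rate).
    Returns the filtered position after every step.
    """
    F = np.array([[1.0, dt], [0.0, 1.0]])   # transition for state [pos, vel]
    B = np.array([0.5 * dt**2, dt])         # how acceleration enters the state
    Q = accel_var * np.outer(B, B)          # process noise from accel noise
    H = np.array([[1.0, 0.0]])              # fixes observe position only
    x = np.zeros(2)
    P = np.eye(2)
    out = []
    for k, a in enumerate(accels):
        # Predict with the inertial sample (runs at every step).
        x = F @ x + B * a
        P = F @ P @ F.T + Q
        # Correct only when a position fix is available at this step.
        if k in fixes:
            S = H @ P @ H.T + fix_var
            K = (P @ H.T) / S
            x = x + (K * (fixes[k] - H @ x)).ravel()
            P = (np.eye(2) - K @ H) @ P
        out.append(x[0])
    return np.array(out)
```

Because prediction runs at the inertial rate, the output rate follows the IMU rather than the position fixes, and the filter keeps producing estimates when `fixes` is empty, although uncorrected sensor bias then accumulates as drift.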

Perceiving Systems Conference Paper Multi-Track Timeline Control for Text-Driven 3D Human Motion Generation Petrovich, M., Litany, O., Iqbal, U., Black, M. J., Varol, G., Peng, X. B., Rempe, D. In CVPR Workshop on Human Motion Generation, Seattle, CVPR, June 2024 (Published)
Recent advances in generative modeling have led to promising progress on synthesizing 3D human motion from text, with methods that can generate character animations from short prompts and specified durations. However, using a single text prompt as input lacks the fine-grained control needed by animators, such as composing multiple actions and defining precise durations for parts of the motion. To address this, we introduce the new problem of timeline control for text-driven motion synthesis, which provides an intuitive, yet fine-grained, input interface for users. Instead of a single prompt, users can specify a multi-track timeline of multiple prompts organized in temporal intervals that may overlap. This enables specifying the exact timings of each action and composing multiple actions in sequence or at overlapping intervals. To generate composite animations from a multi-track timeline, we propose a new test-time denoising method. This method can be integrated with any pre-trained motion diffusion model to synthesize realistic motions that accurately reflect the timeline. At every step of denoising, our method processes each timeline interval (text prompt) individually, subsequently aggregating the predictions with consideration for the specific body parts engaged in each action. Experimental comparisons and ablations validate that our method produces realistic motions that respect the semantics and timing of given text prompts.
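The aggregation step described above, in which per-prompt predictions are merged according to the body parts each action engages, might look roughly like the following. This is only an illustrative sketch: the function name `aggregate`, the array shapes, the boolean joint masks, and the choice to average overlapping claims are all assumptions, and the paper's actual denoising procedure is not reproduced here.

```python
import numpy as np

def aggregate(predictions, intervals, part_masks, num_frames):
    """Merge per-prompt predictions into one array of shape (frames, joints).

    predictions: list of arrays, one per prompt, each (end - start, joints).
    intervals:   list of (start, end) frame ranges, end exclusive; may overlap.
    part_masks:  list of boolean arrays (joints,), True where the prompt's
                 action engages that joint.
    """
    num_joints = part_masks[0].shape[0]
    total = np.zeros((num_frames, num_joints))
    count = np.zeros((num_frames, num_joints))
    for pred, (start, end), mask in zip(predictions, intervals, part_masks):
        # Each prompt only writes to its own frames and its own joints.
        total[start:end, mask] += pred[:, mask]
        count[start:end, mask] += 1
    # Average where several prompts claim the same joint and frame;
    # entries claimed by no prompt stay at zero.
    return np.divide(total, count, out=np.zeros_like(total), where=count > 0)
```

In a diffusion setting, a merge of this kind would be applied to the model's predictions at every denoising step, so that the combined motion stays consistent across overlapping intervals.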

Neural Capture and Synthesis Perceiving Systems Conference Paper Neuropostors: Neural Geometry-aware 3D Crowd Character Impostors Ostrek, M., Mitra, N. J., O’Sullivan, C. In 27th International Conference on Pattern Recognition (ICPR), Springer, June 2024 (Published)
Crowd rendering and animation was a very active research area over a decade ago, but in recent years this has lessened, mainly due to improvements in graphics acceleration hardware. Nevertheless, there is still a high demand for generating varied crowd appearances and animation for games, movie production, and mixed-reality applications. Current approaches are still limited in terms of both the behavioral and appearance aspects of virtual characters due to (i) high memory and computational demands; and (ii) person-hours needed of skilled artists in the context of short production cycles. A promising previous approach to generating varied crowds was the use of pre-computed impostor representations for crowd characters, which could replace an animation of a 3D mesh with a simplified 2D impostor for every frame of an animation sequence, e.g., Geopostors [1]. However, with their high memory demands at a time when improvements in consumer graphics accelerators were outpacing memory availability, the practicality of such methods was limited. Inspired by this early work and recent advances in the field of Neural Rendering, we present a new character representation: Neuropostors. We train a Convolutional Neural Network as a means of compressing both the geometric properties and animation key-frames for a 3D character, thereby allowing for constant-time rendering of animated characters from arbitrary camera views. Our method also allows for explicit illumination and material control, by utilizing a flexible rendering equation that is connected to the outputs of the neural network.

Robust Machine Learning Article Translational symmetry in convolutions with localized kernels causes an implicit bias toward high frequency adversarial examples Caro, J. O., Ju, Y., Pyle, R., Dey, S., Brendel, W., Anselmi, F., Patel, A. B. Frontiers in Computational Neuroscience, 18:1387077, June 2024 (Published)

Perceiving Systems Conference Paper 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations Wang, W., Ho, H., Guo, C., Rong, B., Grigorev, A., Song, J., Zarate, J. J., Hilliges, O. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), CVPR, June 2024 (Published)
The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect, synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap, we introduce 4D-DRESS, the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences amounting to a total of 78k textured scans. Creating a real-world clothing dataset is challenging, particularly in annotating and segmenting the extensive and complex 4D human scans. To address this, we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes, we establish a number of benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources, paving the way for advancements in research of lifelike human clothing.

Perceiving Systems Conference Paper MonoHair: High-Fidelity Hair Modeling from a Monocular Video Wu, K., Yang, L., Kuang, Z., Feng, Y., Han, X., Shen, Y., Fu, H., Zhou, K., Zheng, Y. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 24164-24173, CVPR, June 2024 (Published)
Undoubtedly, high-fidelity 3D hair is crucial for achieving realism, artistic expression, and immersion in computer graphics. While existing 3D hair modeling methods have achieved impressive performance, the challenge of achieving high-quality hair reconstruction persists: they either require strict capture conditions, making practical applications difficult, or heavily rely on learned prior data, obscuring fine-grained details in images. To address these challenges, we propose a generic framework to achieve high-fidelity hair reconstruction from a monocular video, without specific requirements for environments. Our approach bifurcates the hair modeling process into two main stages: precise exterior reconstruction and interior structure inference. The exterior is meticulously crafted using our Patch-based Multi-View Optimization (PMVO). This method strategically collects and integrates hair information from multiple views, independent of prior data, to produce a high-fidelity exterior 3D line map. This map not only captures intricate details but also facilitates the inference of the hair’s inner structure. For the interior, we employ a data-driven, multi-view 3D hair reconstruction method. This method utilizes 2D structural renderings derived from the reconstructed exterior, mirroring the synthetic 2D inputs used during training. This alignment effectively bridges the domain gap between our training data and real-world data, thereby enhancing the accuracy and reliability of our interior structure inference. Lastly, we generate a strand model and resolve the directional ambiguity by our hair growth algorithm. Our experiments demonstrate that our method exhibits robustness across diverse hairstyles and achieves state-of-the-art performance.

Perceiving Systems Conference Paper TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation Dwivedi, S. K., Sun, Y., Patel, P., Feng, Y., Black, M. J. In IEEE/CVF Conf. on Computer Vision and Pattern Recognition (CVPR), 1323-1333, CVPR, June 2024 (Published)
We address the problem of regressing 3D human pose and shape from a single image, with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints, leading to robust performance. With such methods, we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss Threshold-Adaptive Loss Scaling (TALS) that penalizes gross 2D and p-GT losses but not smaller ones. With such a loss, there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this, we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses, effectively providing a uniform prior. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated keypoint loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art.

Haptic Intelligence Article AiroTouch: Enhancing Telerobotic Assembly through Naturalistic Haptic Feedback of Tool Vibrations Gong, Y., Mat Husin, H., Erol, E., Ortenzi, V., Kuchenbecker, K. J. Frontiers in Robotics and AI, 11(1355205):1-15, May 2024 (Published)
Teleoperation allows workers to safely control powerful construction machines; however, its primary reliance on visual feedback limits the operator's efficiency in situations with stiff contact or poor visibility, hindering its use for assembly of pre-fabricated building components. Reliable, economical, and easy-to-implement haptic feedback could fill this perception gap and facilitate the broader use of robots in construction and other application areas. Thus, we adapted widely available commercial audio equipment to create AiroTouch, a naturalistic haptic feedback system that measures the vibration experienced by each robot tool and enables the operator to feel a scaled version of this vibration in real time. Accurate haptic transmission was achieved by optimizing the positions of the system's off-the-shelf accelerometers and voice-coil actuators. A study was conducted to evaluate how adding this naturalistic type of vibrotactile feedback affects the operator during telerobotic assembly. Thirty participants used a bimanual dexterous teleoperation system (Intuitive da Vinci Si) to build a small rigid structure under three randomly ordered haptic feedback conditions: no vibrations, one-axis vibrations, and summed three-axis vibrations. The results show that users took advantage of both tested versions of the naturalistic haptic feedback after gaining some experience with the task, causing significantly lower vibrations and forces in the second trial. Subjective responses indicate that haptic feedback increased the realism of the interaction and reduced the perceived task duration, task difficulty, and fatigue. As hypothesized, higher haptic feedback gains were chosen by users with larger hands and for the smaller sensed vibrations in the one-axis condition. These results elucidate important details for effective implementation of naturalistic vibrotactile feedback and demonstrate that our accessible audio-based approach could enhance user performance and experience during telerobotic assembly in construction and other application domains.

Empirical Inference Conference Paper Can Large Language Models Infer Causation from Correlation? Jin, Z., Liu, J., Lyu, Z., Poff, S., Sachan, M., Mihalcea, R., Diab*, M., Schölkopf*, B. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal supervision (Published)

Empirical Inference Conference Paper Causal Modeling with Stationary Diffusions Lorch, L., Krause*, A., Schölkopf*, B. Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 238:1927-1935, Proceedings of Machine Learning Research, (Editors: Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen), PMLR, May 2024, *equal supervision (Published)

Empirical Inference Conference Paper Certified private data release for sparse Lipschitz functions Donhauser, K., Lokna, J., Sanyal, A., Boedihardjo, M., Hönig, R., Yang, F. Proceedings of the 27th International Conference on Artificial Intelligence and Statistics (AISTATS), 238:1396-1404, Proceedings of Machine Learning Research, (Editors: Dasgupta, Sanjoy and Mandt, Stephan and Li, Yingzhen), PMLR, May 2024 (Published)

Haptic Intelligence Article Closing the Loop in Minimally Supervised Human-Robot Interaction: Formative and Summative Feedback Mohan, M., Nunez, C. M., Kuchenbecker, K. J. Scientific Reports, 14(1):10564, May 2024 (Published)
Human instructors fluidly communicate with hand gestures, head and body movements, and facial expressions, but robots rarely leverage these complementary cues. A minimally supervised social robot with such skills could help people exercise and learn new activities. Thus, we investigated how nonverbal feedback from a humanoid robot affects human behavior. Inspired by the education literature, we evaluated formative feedback (real-time corrections) and summative feedback (post-task scores) for three distinct tasks: positioning in the room, mimicking the robot's arm pose, and contacting the robot's hands. Twenty-eight adults completed seventy-five 30-second-long trials with no explicit instructions or experimenter help. Motion-capture data analysis shows that both formative and summative feedback from the robot significantly aided user performance. Additionally, formative feedback improved task understanding. These results show the power of nonverbal cues based on human movement and the utility of viewing feedback through formative and summative lenses.

Empirical Inference Conference Paper Delphic Offline Reinforcement Learning under Nonidentifiable Hidden Confounding Pace, A., Yèche, H., Schölkopf, B., Rätsch, G., Tennenholtz, G. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Autonomous Learning Conference Paper Emergent mechanisms for long timescales depend on training curriculum and affect performance in memory tasks Khajehabdollahi, S., Zeraati, R., Giannakakis, E., Schäfer, T. J., Martius, G., Levina, A. In The Twelfth International Conference on Learning Representations, ICLR 2024, May 2024 (Published)

Perceiving Systems Article Exploring Weight Bias and Negative Self-Evaluation in Patients with Mood Disorders: Insights from the BodyTalk Project Meneguzzo, P., Behrens, S. C., Pavan, C., Toffanin, T., Quiros-Ramirez, M. A., Black, M. J., Giel, K., Tenconi, E., Favaro, A. Frontiers in Psychiatry, 15, Sec. Psychopathology, May 2024 (Published)
Background: Negative body image and adverse body self-evaluation represent key psychological constructs within the realm of weight bias (WB), potentially intertwined with the negative self-evaluation characteristic of depressive symptomatology. Although WB encapsulates an implicit form of self-critical assessment, its exploration among people with mood disorders (MD) has been under-investigated. Our primary goal is to comprehensively assess both explicit and implicit WB, seeking to reveal specific dimensions that could interconnect with the symptoms of MDs. Methods: A cohort comprising 25 MD patients and 35 demographically matched healthy peers (with 83% female representation) participated in a series of tasks designed to evaluate the congruence between various computer-generated body representations and a spectrum of descriptive adjectives. Our analysis delved into multiple facets of body image evaluation, scrutinizing the associations between different body sizes and emotionally charged adjectives (e.g., active, apple-shaped, attractive). Results: No discernible differences emerged concerning body dissatisfaction or the correspondence of different body sizes with varying adjectives. Interestingly, MD patients exhibited a markedly higher tendency to overestimate their body weight (p = 0.011). Explicit WB did not show significant variance between the two groups, but MD participants demonstrated a notable implicit WB within a specific weight rating task for BMI between 18.5 and 25 kg/m² (p = 0.012). Conclusions: Despite the striking similarities in the assessment of participants’ body weight, our investigation revealed an implicit WB among individuals grappling with MD. This bias potentially assumes a role in fostering self-directed negative evaluations, shedding light on a previously unexplored facet of the interplay between WB and mood disorders.

Social Foundations of Computation Conference Paper Fairness Rising from the Ranks: HITS and PageRank on Homophilic Networks Stoica, A., Litvak, N., Chaintreau, A. In Proceedings of the Association for Computing Machinery (ACM) Web Conference 2024, ACM, May 2024 (Published)
In this paper, we investigate the conditions under which link analysis algorithms prevent minority groups from reaching high-ranking slots. We find that the most common link-based algorithms using centrality metrics, such as PageRank and HITS, can reproduce and even amplify bias against minority groups in networks. Yet, their behavior differs: on the one hand, we empirically show that PageRank mirrors the degree distribution for most of the ranking positions and it can equalize representation of minorities among the top-ranked nodes; on the other hand, we find that HITS amplifies pre-existing bias in homophilic networks through a novel theoretical analysis, supported by empirical results. We find the root cause of bias amplification in HITS to be the level of homophily present in the network, modeled through an evolving network model with two communities. We illustrate our theoretical analysis on both synthetic and real datasets and we present directions for future work.
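For readers unfamiliar with the two link analysis algorithms compared above, the following is a minimal power-iteration sketch of each. It assumes a dense adjacency matrix where `adj[i, j] = 1` means node `i` links to node `j`; the damping value, iteration counts, and function names are illustrative choices, and none of the paper's network model or fairness analysis is reproduced.

```python
import numpy as np

def pagerank(adj, damping=0.85, iters=100):
    """Power iteration for PageRank on a dense adjacency matrix."""
    n = adj.shape[0]
    out_deg = adj.sum(axis=1, keepdims=True)
    # Row-normalize out-links; nodes without out-links jump uniformly.
    trans = np.where(out_deg > 0, adj / np.maximum(out_deg, 1), 1.0 / n)
    rank = np.full(n, 1.0 / n)
    for _ in range(iters):
        rank = (1 - damping) / n + damping * (trans.T @ rank)
    return rank

def hits(adj, iters=100):
    """Alternating hub/authority updates with L2 normalization."""
    n = adj.shape[0]
    hub = np.ones(n)
    auth = np.ones(n)
    for _ in range(iters):
        auth = adj.T @ hub          # good authorities are linked by good hubs
        auth /= np.linalg.norm(auth)
        hub = adj @ auth            # good hubs link to good authorities
        hub /= np.linalg.norm(hub)
    return hub, auth
```

PageRank mixes link-following with a uniform random jump, whereas HITS converges to the dominant eigenvector of the co-citation structure, which gives some intuition for why the two can treat a tightly connected majority group differently.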

Haptic Intelligence Robotics Miscellaneous GaitGuide: A Wearable Device for Vibrotactile Motion Guidance Rokhmanova, N., Martus, J., Faulkner, R., Fiene, J., Kuchenbecker, K. J. Workshop paper (3 pages) presented at the ICRA Workshop on Advancing Wearable Devices and Applications Through Novel Design, Sensing, Actuation, and AI, Yokohama, Japan, May 2024 (Published)
Wearable vibrotactile devices can provide salient sensations that attract the user's attention or guide them to change. The future integration of such feedback into medical or consumer devices would benefit from understanding how vibrotactile cues vary in amplitude and perceived strength across the heterogeneity of human skin. Here, we developed an adhesive vibrotactile device (the GaitGuide) that uses two individually mounted linear resonant actuators to deliver directional motion guidance. By measuring the mechanical vibrations of the actuators via small on-board accelerometers, we compared vibration amplitudes and perceived signal strength across 20 subjects at five signal voltages and four sites around the shank. Vibrations were consistently smallest in amplitude—but perceived to be strongest—at the site located over the tibia. We created a fourth-order linear dynamic model to capture differences in tissue properties across subjects and sites via optimized stiffness and damping parameters. The anterior site had significantly higher skin stiffness and damping; these values also correlate with subject-specific body-fat percentages. Surprisingly, our study shows that the perception of vibrotactile stimuli does not solely depend on the vibration magnitude delivered to the skin. These findings also help to explain the clinical practice of evaluating vibrotactile sensitivity over a bony prominence.

Perceiving Systems Empirical Inference Conference Paper Ghost on the Shell: An Expressive Representation of General 3D Shapes Liu, Z., Feng, Y., Xiu, Y., Liu, W., Paull, L., Black, M. J., Schölkopf, B. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)
The creation of photorealistic virtual worlds requires the accurate modeling of 3D surface geometry for a wide range of objects. For this, meshes are appealing since they 1) enable fast physics-based rendering with realistic material and lighting, 2) support physical simulation, and 3) are memory-efficient for modern graphics pipelines. Recent work on reconstructing and statistically modeling 3D shape, however, has critiqued meshes as being topologically inflexible. To capture a wide range of object shapes, any 3D representation must be able to model solid, watertight, shapes as well as thin, open, surfaces. Recent work has focused on the former, and methods for reconstructing open surfaces do not support fast reconstruction with material and lighting or unconditional generative modelling. Inspired by the observation that open surfaces can be seen as islands floating on watertight surfaces, we parameterize open surfaces by defining a manifold signed distance field on watertight templates. With this parameterization, we further develop a grid-based and differentiable representation that parameterizes both watertight and non-watertight meshes of arbitrary topology. Our new representation, called Ghost-on-the-Shell (G-Shell), enables two important applications: differentiable rasterization-based reconstruction from multiview images and generative modelling of non-watertight meshes. We empirically demonstrate that G-Shell achieves state-of-the-art performance on non-watertight mesh reconstruction and generation tasks, while also performing effectively for watertight meshes.

Empirical Inference Conference Paper Identifying Policy Gradient Subspaces Schneider, J., Schumacher, P., Guist, S., Chen, L., Häufle, D., Schölkopf, B., Büchler, D. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Autonomous Learning Conference Paper Learning Hierarchical World Models with Adaptive Temporal Abstractions from Discrete Latent Dynamics Gumbsch, C., Sajid, N., Martius, G., Butz, M. V. In The Twelfth International Conference on Learning Representations, ICLR 2024, May 2024

Empirical Inference Autonomous Learning Conference Paper Multi-View Causal Representation Learning with Partial Observability Yao, D., Xu, D., Lachapelle, S., Magliacane, S., Taslakian, P., Martius, G., von Kügelgen, J., Locatello, F. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Empirical Inference Conference Paper Open X-Embodiment: Robotic Learning Datasets and RT-X Models Open X-Embodiment Collaboration (incl. Guist, S., Schneider, J., Schölkopf, B., Büchler, D.). IEEE International Conference on Robotics and Automation (ICRA), 6892-6903, May 2024 (Published)

Empirical Inference Conference Paper Out-of-Variable Generalization for Discriminative Models Guo, S., Wildberger, J., Schölkopf, B. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Empirical Inference Perceiving Systems Conference Paper Parameter-Efficient Orthogonal Finetuning via Butterfly Factorization Liu, W., Qiu, Z., Feng, Y., Xiu, Y., Xue, Y., Yu, L., Feng, H., Liu, Z., Heo, J., Peng, S., Wen, Y., Black, M. J., Weller, A., Schölkopf, B. In Proceedings of the Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)
Large foundation models are becoming ubiquitous, but training them from scratch is prohibitively expensive. Thus, efficiently adapting these powerful models to downstream tasks is increasingly important. In this paper, we study a principled finetuning paradigm -- Orthogonal Finetuning (OFT) -- for downstream task adaptation. Despite demonstrating good generalizability, OFT still uses a fairly large number of trainable parameters due to the high dimensionality of orthogonal matrices. To address this, we start by examining OFT from an information transmission perspective, and then identify a few key desiderata that enable better parameter-efficiency. Inspired by how the Cooley-Tukey fast Fourier transform algorithm enables efficient information transmission, we propose an efficient orthogonal parameterization using butterfly structures. We apply this parameterization to OFT, creating a novel parameter-efficient finetuning method, called Orthogonal Butterfly (BOFT). By subsuming OFT as a special case, BOFT introduces a generalized orthogonal finetuning framework. Finally, we conduct an extensive empirical study of adapting large vision transformers, large language models, and text-to-image diffusion models to various downstream tasks in vision and language.
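To give a feel for why butterfly structures save parameters, the sketch below builds a dense orthogonal matrix from a product of sparse butterfly factors. It is an illustration under stated assumptions rather than the paper's parameterization: each factor here is made of 2x2 Givens rotations (an assumed choice), the dimension must be a power of two, and the function names are hypothetical.

```python
import numpy as np

def butterfly_factor(angles, stride):
    """One sparse factor: d/2 independent 2x2 rotations pairing i with i + stride."""
    d = 2 * len(angles)
    factor = np.zeros((d, d))
    pair = 0
    for i in range(d):
        if (i // stride) % 2 == 0:          # i is the lower index of a pair
            j = i + stride
            c, s = np.cos(angles[pair]), np.sin(angles[pair])
            factor[i, i], factor[i, j] = c, -s
            factor[j, i], factor[j, j] = s, c
            pair += 1
    return factor

def butterfly_orthogonal(angles):
    """Multiply log2(d) butterfly factors; angles has shape (log2(d), d // 2)."""
    levels, half = angles.shape
    d = 2 * half
    assert d == 2 ** levels, "dimension must be a power of two"
    result = np.eye(d)
    for level in range(levels):
        result = butterfly_factor(angles[level], 2 ** level) @ result
    return result
```

For `d = 8` this uses 12 angles instead of the 28 free parameters of a general 8x8 orthogonal matrix, and the count grows as (d/2)·log2(d) rather than d(d-1)/2, the same communication pattern as the Cooley-Tukey FFT mentioned in the abstract.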

Empirical Inference Conference Paper Skill or Luck? Return Decomposition via Advantage Functions Pan, H., Schölkopf, B. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published)

Empirical Inference Conference Paper Some Intriguing Aspects about Lipschitz Continuity of Neural Networks Khromov*, G., Singh*, S. P. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published)

Empirical Inference Conference Paper Stochastic Gradient Descent for Gaussian Processes Done Right Lin*, J. A., Padhy*, S., Antorán*, J., Tripp, A., Terenin, A., Szepesvari, C., Hernández-Lobato, J. M., Janz, D. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published)

Empirical Inference Conference Paper Targeted Reduction of Causal Models Kekić, A., Schölkopf, B., Besserve, M. ICLR 2024 Workshop on AI4DifferentialEquations In Science, May 2024 (Published)

Social Foundations of Computation Conference Paper Test-Time Training on Nearest Neighbors for Large Language Models Hardt, M., Sun, Y. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Many recent efforts augment language models with retrieval, by adding retrieved data to the input context. For this approach to succeed, the retrieved data must be added at both training and test time. Moreover, as input length grows linearly with the size of retrieved data, cost in computation and memory grows quadratically for modern Transformers. To avoid these complications, we simply fine-tune the model on retrieved data at test time, using its standard training setup. We build a large-scale distributed index based on text embeddings of the Pile dataset. For each test input, our system retrieves its neighbors and fine-tunes the model on their text. Surprisingly, retrieving and training on as few as 20 neighbors, each for only one gradient iteration, drastically improves performance across more than 20 language modeling tasks in the Pile. For example, test-time training with nearest neighbors significantly narrows the performance gap between a small GPT-2 and a GPT-Neo model more than 10 times larger. Sufficient index quality and size, however, are necessary. Our work establishes a first baseline of test-time training for language modeling.

Perceiving Systems Article The Poses for Equine Research Dataset (PFERD) Li, C., Mellbin, Y., Krogager, J., Polikovsky, S., Holmberg, M., Ghorbani, N., Black, M. J., Kjellström, H., Zuffi, S., Hernlund, E. Nature Scientific Data, 11, May 2024 (Published)
Studies of quadruped animal motion help us to identify diseases, understand behavior and unravel the mechanics behind gaits in animals. The horse is likely the best-studied animal in this aspect, but data capture is challenging and time-consuming. Computer vision techniques improve animal motion extraction, but the development relies on reference datasets, which are scarce, not open-access and often provide data from only a few anatomical landmarks. Addressing this data gap, we introduce PFERD, a video and 3D marker motion dataset from horses using a full-body set-up of over 100 densely placed skin-attached markers and synchronized videos from ten camera angles. Five horses of diverse conformations provide data for various motions from basic poses (e.g., walking, trotting) to advanced motions (e.g., rearing, kicking). We further express the 3D motions with current techniques and a 3D parameterized model, the hSMAL model, establishing a baseline for 3D horse markerless motion capture. PFERD enables advanced biomechanical studies and provides a resource of ground truth data for the methodological development of markerless motion capture.
paper DOI URL BibTeX

Haptic Intelligence Robotic Materials Miscellaneous Three-Dimensional Surface Reconstruction of a Soft System via Distributed Magnetic Sensing Sundaram, V. H., Smith, L., Turin, Z., Rentschler, M. E., Gonzalez Welker, C. Workshop paper (3 pages) presented at the ICRA Workshop on Advancing Wearable Devices and Applications Through Novel Design, Sensing, Actuation, and AI, Yokohama, Japan, May 2024 (Published)
This study presents a new method for reconstructing continuous 3D surface deformations for a soft pneumatic actuation system using embedded magnetic sensors. A finite element analysis (FEA) model was developed to quantify the surface deformation given the magnetometer readings, with a relative error between the experimental and the simulated sensor data of 7.8%. Using the FEA simulation solutions and a basic model-based mapping, our method achieves sub-millimeter accuracy in measuring deformation from sensor data with an absolute error between the experimental and simulated sensor data of 13.5%. These results show promise for real-time adjustments to deformation, crucial in environments like prosthetic and orthotic interfaces with human limbs.
URL BibTeX

Empirical Inference Conference Paper Towards Meta-Pruning via Optimal Transport Theus, A., Geimer, O., Wicke, F., Hofmann, T., Anagnostidis, S., Singh, S. P. The Twelfth International Conference on Learning Representations (ICLR), May 2024 (Published) arXiv BibTeX

Empirical Inference Conference Paper Towards Training Without Depth Limits: Batch Normalization Without Gradient Explosion Meterez*, A., Joudaki*, A., Orabona, F., Immer, A., Rätsch, G., Daneshmand, H. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published) arXiv BibTeX

Empirical Inference Conference Paper Transformer Fusion with Optimal Transport Imfeld*, M., Graldi*, J., Giordano*, M., Hofmann, T., Anagnostidis, S., Singh, S. P. The Twelfth International Conference on Learning Representations (ICLR), May 2024, *equal contribution (Published) arXiv BibTeX

Social Foundations of Computation Conference Paper Unprocessing Seven Years of Algorithmic Fairness Cruz, A. F., Hardt, M. In The Twelfth International Conference on Learning Representations (ICLR 2024), May 2024 (Published)
Seven years ago, researchers proposed a postprocessing method to equalize the error rates of a model across different demographic groups. The work launched hundreds of papers purporting to improve over the postprocessing baseline. We empirically evaluate these claims through thousands of model evaluations on several tabular datasets. We find that the fairness-accuracy Pareto frontier achieved by postprocessing contains all other methods we were feasibly able to evaluate. In doing so, we address two common methodological errors that have confounded previous observations. One relates to the comparison of methods with different unconstrained base models. The other concerns methods achieving different levels of constraint relaxation. At the heart of our study is a simple idea we call unprocessing that roughly corresponds to the inverse of postprocessing. Unprocessing allows for a direct comparison of methods using different underlying models and levels of relaxation. Interpreting our findings, we recall a widely overlooked theoretical argument, presented seven years ago, that accurately predicted what we observe.
ArXiv Code URL BibTeX
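
The quantity that the abstract above says postprocessing equalizes, the error rates of a model across demographic groups, can be made concrete with a short sketch. It only measures per-group false negative and false positive rates and their largest gap; it does not implement postprocessing or unprocessing. The function names, the choice of the two rates, and the toy labels are assumptions made for illustration and do not come from the paper's code.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def group_error_rates(
    y_true: List[int], y_pred: List[int], groups: List[str]
) -> Dict[str, Tuple[float, float]]:
    """Return {group: (false_negative_rate, false_positive_rate)}."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "fp": 0, "tn": 0})
    for truth, pred, group in zip(y_true, y_pred, groups):
        if truth:
            counts[group]["tp" if pred else "fn"] += 1
        else:
            counts[group]["fp" if pred else "tn"] += 1
    rates = {}
    for group, c in counts.items():
        positives = c["tp"] + c["fn"]
        negatives = c["fp"] + c["tn"]
        fnr = c["fn"] / positives if positives else 0.0
        fpr = c["fp"] / negatives if negatives else 0.0
        rates[group] = (fnr, fpr)
    return rates


def error_rate_gap(rates: Dict[str, Tuple[float, float]]) -> float:
    # Largest between-group difference in either error rate; 0.0 means equalized.
    fnrs = [r[0] for r in rates.values()]
    fprs = [r[1] for r in rates.values()]
    return max(max(fnrs) - min(fnrs), max(fprs) - min(fprs))


# Hypothetical labels and predictions for two groups, "a" and "b".
toy_true = [1, 1, 0, 0, 1, 1, 0, 0]
toy_pred = [1, 0, 0, 0, 1, 1, 1, 0]
toy_groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
toy_rates = group_error_rates(toy_true, toy_pred, toy_groups)
toy_gap = error_rate_gap(toy_rates)
```

A fairness-constrained method would aim to push this gap toward zero, and allowing a nonzero gap corresponds to the constraint relaxation that the abstract mentions.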

Autonomous Learning Conference Paper Wild Visual Navigation: Fast Traversability Learning via Pre-Trained Models and Online Self-Supervision Mattamala, M., Frey, J., Libera, P., Chebrolu, N., Martius, G., Cadena, C., Hutter, M., Fallon, M. April 2024 (Accepted)
Natural environments such as forests and grasslands are challenging for robotic navigation because of the false perception of rigid obstacles from high grass, twigs, or bushes. In this work, we present Wild Visual Navigation (WVN), an online self-supervised learning system for visual traversability estimation. The system is able to continuously adapt from a short human demonstration in the field, only using onboard sensing and computing. One of the key ideas to achieve this is the use of high-dimensional features from pre-trained self-supervised models, which implicitly encode semantic information that massively simplifies the learning task. Further, the development of an online scheme for supervision generation enables concurrent training and inference of the learned model in the wild. We demonstrate our approach through diverse real-world deployments in forests, parks, and grasslands. Our system is able to bootstrap the traversable terrain segmentation in less than 5 min of in-field training time, enabling the robot to navigate in complex, previously unseen outdoor terrains.
URL BibTeX

Empirical Inference Master Thesis Algorithmic Compositional Learning of Language Models Thomm, J. ETH Zurich, Switzerland, April 2024 (Published) BibTeX

Haptic Intelligence Robotic Materials Miscellaneous Cutaneous Electrohydraulic (CUTE) Wearable Devices for Multimodal Haptic Feedback Sanchez-Tamayo, N., Yoder, Z., Ballardini, G., Rothemund, P., Keplinger, C., Kuchenbecker, K. J. Extended abstract (1 page) presented at the IEEE RoboSoft Workshop on Multimodal Soft Robots for Multifunctional Manipulation, Locomotion, and Human-Machine Interaction, San Diego, USA, April 2024 (Published) BibTeX

Empirical Inference Miscellaneous Evidence for eccentricity in the population of binary black holes observed by LIGO-Virgo-KAGRA Gupte, N., Ramos-Buades, A., Buonanno, A., Gair, J., Miller, M. C., Dax, M., Green, S. R., Pürrer, M., Wildberger, J., Macke, J. H., Romero-Shaw, I. M., Schölkopf, B. April 2024 (Published) URL BibTeX

Social Foundations of Computation Conference Paper ImageNot: A Contrast with ImageNet Preserves Model Rankings Salaudeen, O., Hardt, M. April 2024 (Submitted)
We introduce ImageNot, a dataset designed to match the scale of ImageNet while differing drastically in other aspects. We show that key model architectures developed for ImageNet over the years rank identically when trained and evaluated on ImageNot to how they rank on ImageNet. This is true when training models from scratch or fine-tuning them. Moreover, the relative improvements of each model over earlier models strongly correlate in both datasets. We further give evidence that ImageNot has similar utility to ImageNet for transfer learning purposes. Our work demonstrates a surprising degree of external validity in the relative performance of image classification models. This stands in contrast with absolute accuracy numbers that typically drop sharply even under small changes to a dataset.
ArXiv BibTeX

Empirical Inference Conference Paper PILLAR: How to make semi-private learning more effective Pinto, F., Hu, Y., Yang, F., Sanyal, A. 2nd IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), 110-139, April 2024 (Published) DOI BibTeX