@phdthesis{human_cam_motion,
  title = {Estimating Human and Camera Motion From RGB Data},
  abstract = {This thesis presents a unified framework for markerless 3D human motion analysis from monocular videos, addressing three interrelated challenges that have limited the fidelity of existing approaches: (i) achieving temporally consistent and physically plausible human motion estimation, (ii) accurately modeling perspective camera effects in unconstrained settings, and (iii) disentangling human motion from camera motion in dynamic scenes. Our contributions are realized through three complementary methods. First, we introduce VIBE (Video Inference for Body Pose and Shape Estimation), a novel video pose and shape estimation framework. Despite progress on single-image 3D pose and shape estimation, existing video-based state-of-the-art methods fail to produce accurate and natural motion sequences due to a lack of ground-truth 3D motion data for training. To address this problem, VIBE makes use of an existing large-scale motion capture dataset (AMASS) together with unpaired, in-the-wild 2D keypoint annotations. Our key novelty is an adversarial learning framework that leverages AMASS to discriminate between real human motions and those produced by our temporal pose and shape regression networks. We define a temporal network architecture and show that adversarial training at the sequence level produces kinematically plausible motion sequences without in-the-wild ground-truth 3D labels. Second, we propose SPEC (Seeing People in the wild with Estimated Cameras), the first in-the-wild 3D human pose and shape (HPS) method that estimates the perspective camera from a single image and employs this estimate to reconstruct 3D human bodies more accurately. Due to the lack of camera parameter information for in-the-wild images, existing 3D HPS estimation methods make several simplifying assumptions: weak-perspective projection, a large constant focal length, and zero camera rotation. These assumptions often do not hold, and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. SPEC addresses this in two stages. First, we train a neural network to estimate the field of view, camera pitch, and roll from an input image, employing novel losses that improve camera calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as on two new datasets with more challenging camera views and varying focal lengths: a photorealistic synthetic dataset (SPEC-SYN) with ground-truth 3D bodies and an in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Third, we develop PACE (Person And Camera Estimation), a method to estimate human motion in a global scene from moving cameras. This is a highly challenging task due to the entanglement of human and camera motions in the video. Existing works assume the camera is static and focus on estimating human motion in camera space. To address this problem, we propose a joint optimization framework that disentangles human and camera motions using both foreground human motion priors and background scene features. Unlike existing methods that use Simultaneous Localization and Mapping (SLAM) only as initialization, we tightly integrate SLAM and human motion priors in an optimization inspired by bundle adjustment: we jointly optimize human and camera motions to match both the observed human poses and the scene features. This design combines the strengths of SLAM and motion priors, leading to significant improvements in human and camera motion estimation. We additionally introduce a motion prior suitable for batch optimization, making our approach significantly more efficient than existing ones. Finally, we propose a novel synthetic dataset that enables evaluating camera motion, in addition to human motion, from dynamic videos. Extensive experiments on standard benchmarks and the new datasets we introduce demonstrate that our methods substantially outperform prior work in temporal consistency, reconstruction accuracy, and global motion estimation. While these results represent a significant advance in markerless human motion analysis, further work is needed to extend these techniques to multi-person scenarios, severe occlusions, and real-time applications. Overall, this thesis lays a strong foundation for more robust and accurate human motion analysis in unconstrained environments, with promising applications in robotics, augmented reality, sports analysis, and beyond.},
  degree_type = {PhD},
  month = apr,
  year = {2025},
  author = {Kocabas, Muhammed},
  month_numeric = {4}
}