Robot Arm Pose Estimation as a Learning Problem

Purposeful and robust manipulation requires good hand-eye coordination. To a certain extent this can be achieved using information from joint encoders and known kinematics. For many robots, however, a significant error of several centimeters remains in the pose of the end-effector and fingers, which is a challenge especially for fine manipulation tasks.
To achieve the desired accuracy, we aim to visually track the arm in the camera frame, the same frame in which we usually detect the target object. Given these estimates, we can then control manipulation tasks with techniques such as visual servoing.
In this project, we frame marker-less robot arm pose estimation as a learning problem. The only input to the method is a depth image from an RGB-D sensor; the output is the joint configuration of the robot arm. We learn this mapping from a large number of synthetically generated and labeled depth images.
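The following is a minimal sketch of this learning setup, not the actual pipeline: it regresses joint angles directly from flattened depth images with an off-the-shelf random forest, trained on synthetic renderings. The renderer stub, image size, and number of joints are illustrative assumptions.

```python
# Sketch: learn a mapping from depth images to joint configurations
# using synthetically generated, labeled training data.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

N_JOINTS = 7      # assumed degrees of freedom of the arm
N_SAMPLES = 1000  # number of synthetic training images (toy scale)

def render_depth(joint_angles):
    """Placeholder for a synthetic depth renderer of the robot arm.
    In practice this would ray-cast the arm's model at the given joint
    configuration; here it returns random data for illustration."""
    return np.random.rand(64, 64).astype(np.float32)

# Generate labeled training data: random joint configurations and
# their rendered depth images, flattened into feature vectors.
angles = np.random.uniform(-np.pi, np.pi, size=(N_SAMPLES, N_JOINTS))
images = np.stack([render_depth(a).ravel() for a in angles])

# Learn the mapping: depth image -> joint configuration.
model = RandomForestRegressor(n_estimators=50, n_jobs=-1)
model.fit(images, angles)

# At test time, a single depth frame yields a joint-angle estimate.
test_frame = render_depth(np.zeros(N_JOINTS)).ravel()
predicted_angles = model.predict(test_frame[None, :])
```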
In [], we treat this as a pixel-wise classification problem using a random decision forest. From all training samples that end up at a leaf node, a set of offsets is learned that votes for relative joint positions. Pooling these votes over all foreground pixels and clustering them gives an estimate of the true joint positions. Thanks to the intrinsic parallelism of pixel-wise classification, the approach runs at more than 30 Hz. It operates frame by frame and does not require any initialization, unlike, for example, ICP-style or tracking methods.
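To illustrate the vote-pooling step, here is a toy sketch under stated assumptions: a trained per-pixel forest is assumed to exist, so its leaf offsets are mocked; each foreground pixel casts 3D offset votes for a joint position, the votes are pooled over all pixels and clustered (mean shift is used here as one possible choice), and the center of the largest cluster is taken as the joint estimate. The bandwidth and all names are hypothetical.

```python
# Sketch: pool per-pixel offset votes for one joint and cluster them.
import numpy as np
from sklearn.cluster import MeanShift

def pixel_votes(pixel_xyz, leaf_offsets):
    """Each pixel votes for absolute joint positions by adding the
    offsets stored at its leaf node to its own 3D location."""
    return pixel_xyz + leaf_offsets  # (n_offsets, 3)

def estimate_joint(foreground_xyz, offsets_per_pixel):
    # Pool votes from all foreground pixels into one set.
    votes = np.concatenate(
        [pixel_votes(p, o) for p, o in zip(foreground_xyz, offsets_per_pixel)]
    )
    # Cluster the pooled votes; the largest cluster's center is the estimate.
    ms = MeanShift(bandwidth=0.05).fit(votes)  # bandwidth in meters, assumed
    labels, counts = np.unique(ms.labels_, return_counts=True)
    return ms.cluster_centers_[labels[np.argmax(counts)]]

# Toy usage: 200 foreground pixels, each with 2 mocked leaf offsets that
# roughly point at a joint located at (0.3, 0.0, 0.5).
true_joint = np.array([0.3, 0.0, 0.5])
pixels = np.random.rand(200, 3)
offsets = [true_joint - p + 0.01 * np.random.randn(2, 3) for p in pixels]
print(estimate_joint(pixels, offsets))
```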