Detecting small objects in images is a challenging problem, particularly when they are occluded by hands or other body parts.
Recently, joint modelling of human pose and objects has been proposed to improve both pose estimation and object detection.
These approaches, however, focus on explicit interaction with an object and lack the flexibility to combine both modalities when interaction is not obvious.
We therefore propose to use human pose as additional context information for object detection.
To this end, we represent an object category by a tree model and train regression forests that localize parts of an object for each modality separately.
Predictions of the two modalities are then combined to detect the bounding box of the object.
We evaluate our approach on three challenging datasets which vary in the amount of object interactions and the quality of automatically extracted human poses.
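
The combination of the two modalities can be illustrated with a minimal sketch. It assumes that each modality (appearance and human pose) yields one location estimate with a confidence per object part, that the estimates are fused by a confidence-weighted average, and that the bounding box is the tightest box enclosing the fused parts. The fusion rule, the part names, and the functions `fuse_part` and `bounding_box` are hypothetical and chosen for illustration only; the abstract does not specify how the predictions are combined.

```python
def fuse_part(appearance, pose):
    """Fuse two (x, y, confidence) estimates of one object part.

    Assumption: a confidence-weighted average; the actual combination
    rule is not given in the abstract.
    """
    (xa, ya, wa), (xp, yp, wp) = appearance, pose
    total = wa + wp
    if total == 0:
        raise ValueError("both estimates have zero confidence")
    return ((wa * xa + wp * xp) / total, (wa * ya + wp * yp) / total)


def bounding_box(points):
    """Return (x_min, y_min, x_max, y_max) enclosing all (x, y) points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical per-part predictions: part name -> (x, y, confidence).
appearance_parts = {"top": (10.0, 20.0, 3.0), "bottom": (12.0, 60.0, 1.0)}
pose_parts = {"top": (14.0, 24.0, 1.0), "bottom": (16.0, 64.0, 1.0)}

fused = {
    name: fuse_part(appearance_parts[name], pose_parts[name])
    for name in appearance_parts
}
box = bounding_box(list(fused.values()))
```

In this toy setting the more confident appearance estimate dominates the "top" part, while the "bottom" part lands halfway between the two modalities.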
Hough-based voting approaches have been successfully applied to object detection. While these methods can be efficiently implemented by random forests, they estimate the probability for an object hypothesis for each feature independently. In this work, we address this problem by grouping features in a local neighborhood to obtain a better estimate of the probability. To this end, we propose oblique classification-regression forests that combine features of different trees. We further investigate the benefit of combining independent and grouped features and evaluate the approach on RGB and RGB-D datasets.
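
As a point of reference, the baseline that treats each feature independently can be sketched as follows. This is a minimal illustration of generic Hough voting for an object centre, not the proposed oblique classification-regression forest: each feature casts one weighted vote at its position plus a predicted offset, and the accumulator peak is taken as the detection. The function names `hough_vote` and `peak`, the input layout, and the toy values are assumptions made for this example; the grouping of neighboring features described above is not shown.

```python
def hough_vote(features, height, width):
    """Accumulate independent, weighted votes for the object centre.

    features: iterable of ((row, col), (d_row, d_col), weight), where the
    offset points from the feature to the predicted centre and the weight
    is the per-feature object probability (hypothetical input layout).
    """
    acc = [[0.0] * width for _ in range(height)]
    for (row, col), (d_row, d_col), weight in features:
        r, c = row + d_row, col + d_col
        # Votes that fall outside the image are discarded.
        if 0 <= r < height and 0 <= c < width:
            acc[r][c] += weight
    return acc


def peak(acc):
    """Return (row, col, score) of the strongest accumulator cell."""
    score, r, c = max(
        (value, r, c)
        for r, line in enumerate(acc)
        for c, value in enumerate(line)
    )
    return r, c, score


# Toy example: three features agree on the centre, one votes out of bounds.
features = [
    ((2, 2), (2, 3), 0.9),
    ((6, 5), (-2, 0), 0.8),
    ((4, 8), (0, -3), 0.7),
    ((0, 0), (-1, -1), 0.5),
]
acc = hough_vote(features, height=10, width=10)
centre_row, centre_col, score = peak(acc)
```

Because every vote is added without regard to its neighbors, a single misleading feature contributes as much as a consistent one with the same weight, which is the limitation that grouping features in a local neighborhood is meant to address.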
In order to avoid an expensive manual labeling process or to learn object classes autonomously without human intervention, object discovery techniques have been proposed that extract visually similar objects from weakly labelled videos. However, the problem of discovering small- or medium-sized objects is largely unexplored. We observe that videos with activities involving human-object interactions can serve as weakly labelled data for such cases. Since neither object appearance nor motion is distinct enough to discover objects in these videos, we propose a framework that samples from a space of algorithms and their parameters to extract sequences of object proposals. Furthermore, we model similarity of objects based on appearance and functionality, which is derived from human and object motion. We show that functionality is an important cue for discovering objects from activities and demonstrate the generality of the model on three challenging RGB-D and RGB datasets.