Haptic technologies, in both their kinesthetic and tactile aspects, offer brand-new opportunities for recent human-machine interactive applications. In this talk, I, who believe that one of the essential roles of a researcher is pioneering new insights and knowledge, will present my previous research on haptic technologies and human-machine interactive applications in two branches: laser-based mid-air haptics and sensorimotor skill learning. For the former branch, I will introduce our approach, named indirect laser radiation, and its application. Indirect laser radiation utilizes a laser and a light-absorbing elastic medium to evoke a tapping-like tactile sensation. For the latter, I will introduce our data-driven approach for both modeling and learning of sensorimotor skills (especially driving) with kinesthetic assistance and artificial neural networks; I call it human-like haptic assistance. To unify these two branches of my earlier studies, which explore the feasibility of the sensory channel named "touch", I will present a general research paradigm for human-machine interactive applications toward which current haptic technologies can aim in the future.
Organizers: Katherine J. Kuchenbecker
Needle insertion is the most essential skill in medical care; training has to be provided not only to physicians but also to nurses and paramedics. In most needle insertion procedures, haptic feedback from the needle is the main stimulus that novices must be trained to interpret. For better patient safety, the classical methods of training haptic skills have to be replaced with simulators based on new robotic and graphics technologies. The main objective of this work is to develop analytical models of needle insertion (specifically, the special case of epidural anesthesia), including the biomechanical and psychophysical concepts, that simulate the needle-tissue interaction forces in linear heterogeneous tissues, and to validate the models with a series of experiments. The biomechanical and perception models were validated with experiments in two stages: with and without human intervention. The second stage is the validation using the Turing test with two different experiments: 1) to observe the perceptual difference between the simulated and the physical phantom model, and 2) to verify the effectiveness of the perceptual filter by comparing the unfiltered and filtered model responses. The results showed that the model could replicate the physical phantom tissues with good accuracy. This can be further extended to a non-linear heterogeneous model. The proposed needle-tissue interaction force models can be used to improve realism and performance and to enable future applications in needle simulators for heterogeneous tissue. A needle insertion training simulator was developed with the simulated models using the Omni Phantom, and clinical trials were conducted to assess face validity and construct validity. The face validity results showed that the degree of realism of the virtual environments and instruments had the lowest overall mean score, while ease of use and training in hand-eye coordination had the highest mean score.
The construct validity results showed that the simulator was able to successfully differentiate the force and psychomotor signatures of anesthesiologists with less than 5 years of experience from those with more than 5 years. As a performance index for trainees, a novel measure, the Just Controllable Difference (JCD), was proposed, and a preliminary study of the JCD measure was carried out using two experiments with novices. A preliminary study on the use of clinical training simulations, especially needle insertion procedures in virtual environments, focused on two objectives: first, measuring the force just noticeable difference (JND) with three fingers, and second, comparing these measures in Non-Immersive Virtual Reality (NIVR) with those in Immersive Virtual Reality (IVR), using a psychophysical study with the Force Matching task, the Constant Stimuli method, and Isometric Force Probing stimuli. The results showed a better force JND in the IVR than in the NIVR. In addition, a simple state observer model was proposed to explain the improvement of the force JND in the IVR. This study quantitatively reinforces the use of the IVR in the design of various medical simulators.
Organizers: Katherine J. Kuchenbecker
Functional polymers can be easily tailored for their interaction with living organisms. In our group, we have worked for the last 15 years on the development of this kind of polymeric material, with different functionalities, high biocompatibility, and in different forms. In this talk, we will describe the synthesis of thermosensitive thin films that can be used to prevent biofilm formation on medical devices, the preparation of biodegradable polymers specially designed as vectors for gene transfection, and a new family of zwitterionic polymers that are able to cross the intestinal mucus for oral delivery applications. The structure-functionality-application relationship will be discussed for every example.
Organizers: Metin Sitti
Since Hubel and Wiesel's seminal findings in the primary visual cortex (V1) more than 50 years ago, progress in vision science has been very limited along previous frameworks and schools of thought on understanding vision. Have we been asking the right questions? I will show observations motivating a new path. First, a drastic information bottleneck forces the brain to process only a tiny fraction of the massive visual input information; this selection is called attentional selection, and how to select this tiny fraction is critical. Second, a large body of evidence has been accumulating to suggest that the primary visual cortex (V1) is where this selection starts, suggesting that the visual cortical areas along the visual pathway beyond V1 must be investigated in light of this selection in V1. Placing attentional selection at center stage, a new path to understanding vision is proposed (articulated in my book "Understanding vision: theory, models, and data", Oxford University Press 2014). I will show a first example of using this new path, which aims to ask new questions and make fresh progress. I will relate our insights to artificial vision systems to discuss issues like top-down feedback in hierarchical processing, analysis-by-synthesis, and image understanding.
Studying the interface between artificial and biological vision has long been a heavily promoted area of research. It seems promising that cognitive science can provide new ideas to interface computer vision and human perception, yet no established design principles exist. In the first part of my talk I am going to introduce the novel concept of 'object detectability'. Object detectability refers to a measure of how likely a human observer is to be visually aware of the location and presence of specific object types in a complex, dynamic, urban scene.
We have shown a proof of concept of how to maximize human observers' scene awareness in a dynamic driving context. Nonlinear functions that map a combined feature vector of human gaze and visual features to object detectabilities are learnt from experimental samples. We obtain object detectabilities through a detection experiment that simulates a proxy task of distracted real-world driving. In order to specifically enhance overall pedestrian detectability in a dynamic scene, the sum of the individual detectability predictors defines a complex cost function that we seek to optimize with respect to human gaze. Results show significantly increased human scene awareness in hazardous test situations when comparing optimized gaze with random fixation. Thus, our approach can potentially help a driver to save reaction time and resolve a risky maneuver. In our framework, the remarkable ability of the human visual system to detect specific objects in the periphery has been implicitly characterized by our perceptual detectability task and has thus been taken into account.
The framework may provide a foundation for future work to determine what kind of information a Computer Vision system should process reliably, e.g. certain pose or motion features, in order to optimally alert a driver in time-critical situations. Dynamic image data was taken from the Caltech Pedestrian database. I will conclude with a brief overview of recent work, including a new circular output random regression forest for continuous object viewpoint estimation and a novel learning-based, monocular odometry approach based on robust LVMs and sensorimotor learning, offering stable 3D information integration. Last but not least, I present results of a perception experiment to quantify emotion in estimated facial movement synergy components that can be exploited to control emotional content of 3D avatars in a perceptually meaningful way.
This work was done in particular with David Engel (now a Post-Doc at M.I.T.), Christian Herdtweck (a PhD student at MPI Biol. Cybernetics), and in collaboration with Prof. Martin A. Giese and Dr. Enrico Chiovetto, Center for Integrated Neuroscience, Tübingen.
We present a supervised-learning-based method to estimate a per-pixel confidence for optical flow vectors. Regions of low texture and pixels close to occlusion boundaries are known to be difficult for optical flow algorithms. Using a spatiotemporal feature vector, we estimate whether a flow algorithm is likely to fail in a given region.
Our method is not restricted to any specific class of flow algorithm and does not make any scene-specific assumptions. By automatically learning this confidence, we can combine the flow fields computed by several different algorithms and select the best-performing algorithm per pixel. Our optical flow confidence measure allows one to achieve better overall results by discarding the most troublesome pixels. We illustrate the effectiveness of our method on four different optical flow algorithms over a variety of real and synthetic sequences. For algorithm selection, we achieve the top overall results on a large test set, and at times even surpass the results of the best algorithm among the candidates.
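The per-pixel selection and discarding steps described above can be illustrated with a minimal sketch. It assumes that each candidate algorithm already comes with a confidence map in [0, 1]; how that confidence is learned from the spatiotemporal features is not shown here. The function names `select_per_pixel` and `discard_unreliable`, the data layout, and the threshold are hypothetical choices for illustration, not the authors' implementation.

```python
from typing import List, Optional, Tuple

Vector = Tuple[float, float]        # one flow vector (u, v)
FlowField = List[List[Vector]]      # H x W grid of flow vectors
ConfidenceMap = List[List[float]]   # H x W grid of confidences in [0, 1]


def select_per_pixel(flows: List[FlowField],
                     confidences: List[ConfidenceMap]):
    """Keep, at every pixel, the flow vector of the most confident algorithm.

    Returns the fused flow field and a map of the chosen algorithm indices.
    Ties go to the algorithm with the lower index.
    """
    height, width = len(flows[0]), len(flows[0][0])
    fused, chosen = [], []
    for y in range(height):
        fused_row, chosen_row = [], []
        for x in range(width):
            best = max(range(len(flows)), key=lambda k: confidences[k][y][x])
            fused_row.append(flows[best][y][x])
            chosen_row.append(best)
        fused.append(fused_row)
        chosen.append(chosen_row)
    return fused, chosen


def discard_unreliable(flow: FlowField, confidence: ConfidenceMap,
                       threshold: float) -> List[List[Optional[Vector]]]:
    """Replace vectors whose confidence is below the threshold with None."""
    return [[vec if conf >= threshold else None
             for vec, conf in zip(flow_row, conf_row)]
            for flow_row, conf_row in zip(flow, confidence)]


# Toy 1 x 2 image with two hypothetical algorithms A and B.
flow_a = [[(1.0, 0.0), (2.0, 0.0)]]
flow_b = [[(0.0, 1.0), (0.0, 2.0)]]
conf_a = [[0.9, 0.2]]
conf_b = [[0.4, 0.7]]

fused, chosen = select_per_pixel([flow_a, flow_b], [conf_a, conf_b])
print(chosen)                                     # which algorithm won where
print(discard_unreliable(flow_a, conf_a, 0.5))    # A alone, low confidence masked
```

In this toy example, algorithm A is trusted at the first pixel and algorithm B at the second, which is how a fused field can in principle outperform every single candidate.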
Semantic image segmentation is the task of assigning semantic labels to the pixels of a natural image. It is an important step towards general scene understanding and has lately received much attention in the computer vision community. It was found that detailed annotations of images are helpful for solving this task, but obtaining accurate and consistent annotations still proves to be difficult on a large scale. One possible way forward is to work with partial supervision and latent variable models to infer semantic annotations from the data during training.
The talk will present two approaches working with partial supervision for image segmentation. The first uses an efficient multi-instance formulation to obtain object class segmentations when trained on class labels alone. The second uses a latent CRF formulation to extract object parts based on object class segmentation.
In this talk I will present two lines of research, both of which are applied to the problem of stereo matching. The first line of research tries to make progress on the very traditional problem of stereo matching. In BMVC 11 we presented the PatchmatchStereo work, which achieves surprisingly good results with a simple energy function consisting of unary terms only. As the optimization engine we used the PatchMatch method, which was designed for image editing purposes. In BMVC 12 we extended this work by adding the standard pairwise smoothness terms to the energy function. The main contribution of this work is the optimization technique, which we call PatchMatch-BeliefPropagation (PMBP). It is a special case of max-product Particle Belief Propagation, with a new sampling scheme motivated by PatchMatch.
The method may be suitable for many energy minimization problems in computer vision that have a non-convex, continuous and potentially high-dimensional label space. The second line of research combines the problem of stereo matching with the problem of object extraction in the scene. We show that both tasks can be solved jointly and that doing so boosts the performance of each individual task. In particular, stereo matching improves since objects have to obey physical properties, e.g. they are not allowed to fly in the air. Object extraction improves, as expected, since we have additional information about depth in the scene.
Three-dimensional object shape is commonly represented in terms of deformations of a triangular mesh from an exemplar shape. In particular, statistical generative models of human shape deformation are widely used in computer vision, graphics, ergonomics, and anthropometry. Existing statistical models, however, are based on a Euclidean representation of shape deformations. In contrast, we argue that shape has a manifold structure: For example, averaging the shape deformations for two people does not necessarily yield a meaningful shape deformation, nor does the Euclidean difference of these two deformations provide a meaningful measure of shape dissimilarity. Consequently, we define a novel manifold for shape representation, with emphasis on body shapes, using a new Lie group of deformations. This has several advantages.
First, we define triangle deformations exactly, removing non-physical deformations and redundant degrees of freedom common to previous methods. Second, the Riemannian structure of Lie Bodies enables a more meaningful definition of body shape similarity by measuring distance between bodies on the manifold of body shape deformations. Third, the group structure allows the valid composition of deformations.
This is important for models that factor body shape deformations into multiple causes or represent shape as a linear combination of basis shapes. Similarly, interpolation between two mesh deformations results in a meaningful third deformation. Finally, body shape variation is modeled using statistics on manifolds. Instead of modeling Euclidean shape variation with Principal Component Analysis, we capture shape variation on the manifold using Principal Geodesic Analysis. Our experiments show consistent visual and quantitative advantages of Lie Bodies over traditional Euclidean models of shape deformation, and our representation can be easily incorporated into existing methods. This project is part of a larger effort that brings together statistics and geometry to model statistics on manifolds.
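The claim that Euclidean averaging of deformations need not yield a meaningful deformation can be illustrated with a much simpler Lie group than the one used for Lie Bodies. The following sketch uses planar rotations, the group SO(2), purely as a stand-in chosen for illustration; the helper names are hypothetical and the code is not the Lie Bodies representation itself. It shows that the entry-wise mean of two rotation matrices is no longer a rotation, while interpolation along the group geodesic is.

```python
import math


def rotation(theta):
    """2 x 2 rotation matrix for angle theta (an element of SO(2))."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def transpose(a):
    return [[a[j][i] for j in range(2)] for i in range(2)]


def det(a):
    return a[0][0] * a[1][1] - a[0][1] * a[1][0]


def log_map(r):
    """Rotation angle of r, i.e. its coordinate in the Lie algebra of SO(2)."""
    return math.atan2(r[1][0], r[0][0])


def euclidean_mean(a, b):
    """Entry-wise average; in general this leaves the group."""
    return [[(a[i][j] + b[i][j]) / 2.0 for j in range(2)] for i in range(2)]


def geodesic_interpolate(a, b, t):
    """Point at fraction t along the geodesic from a to b; stays in SO(2)."""
    relative = matmul(transpose(a), b)            # rotation taking a to b
    return matmul(a, rotation(t * log_map(relative)))


a, b = rotation(0.0), rotation(math.pi / 2)
flat = euclidean_mean(a, b)
curved = geodesic_interpolate(a, b, 0.5)
print("determinant of Euclidean mean:", det(flat))
print("determinant of geodesic midpoint:", det(curved))
print("angle of geodesic midpoint:", log_map(curved))
```

The Euclidean mean has a determinant below one, so it shrinks the shape in addition to rotating it, whereas the geodesic midpoint is a proper rotation halfway between the two inputs. Principal Geodesic Analysis builds on the same idea by computing statistics in the tangent space via the log map rather than on raw matrix entries.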
Our research on manifold-valued statistics addresses the problem of modeling statistics in curved feature spaces. We try to find the geometrically most natural representations that respect the constraints; e.g. by modeling the data as belonging to a Lie group or a Riemannian manifold. We take a geometric approach as this keeps the focus on good distance measures, which are essential for good statistics. I will also present some recent unpublished results related to statistics on manifolds with broad application.
We first address the problem of large-scale image classification. We present and evaluate different ways of aggregating local image descriptors into a vector and show that the Fisher kernel achieves better performance than the reference bag-of-visual-words approach for any given vector dimension. We show and interpret the importance of an appropriate vector normalization.
Furthermore, we discuss how to learn with stochastic gradient descent when given a large number of classes and images, and show results on ImageNet10k. We then present a weakly supervised approach for learning human actions modeled as interactions between humans and objects.
Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated (only) with the action label.
Finally, we present work on learning object detectors from real-world web videos known only to contain objects of a target class. We propose a fully automatic pipeline that localizes objects in a set of videos of the class and learns a detector for it. The approach extracts candidate spatio-temporal tubes based on motion segmentation and then selects one tube per video jointly over all videos.
The grand goal of Computer Vision is to generate an automatic description of an image based on its visual content. Category level object detection is an important building block towards such capability. The first part of this talk deals with three established object detection techniques in Computer Vision, their shortcomings and how they are improved. i) Hough Voting methods efficiently handle the high complexity of multi-scale, category-level object detection in cluttered scenes.
However, the primary weakness of this approach is that mutually dependent local observations independently vote for intrinsically global object properties such as object scale. We model the feature dependencies by presenting an objective function that combines various intimately related problems in Hough Voting. ii) Shape is a highly prominent characteristic of objects that human vision utilizes for detecting objects. However, shape poses significant challenges for object detection in cluttered scenes: object form is an emergent property that cannot be perceived locally but becomes available only once the whole object has been detected. Thus we address the detection of objects and the assembly of their shape simultaneously in a Max-Margin Multiple Instance Learning framework, while avoiding fragile bottom-up grouping in query images altogether. iii) Chamfer matching is a widely used technique for detecting objects because of its speed. However, it treats objects as a mere sum of the distance transformation of all their contour pixels. Also, spurious matches in background clutter are a huge problem for chamfer matching. We address these two issues by a) applying a discriminative approach to distance transformation computation in chamfer matching and b) estimating the accidentalness of a foreground template match by a small dictionary of simple background contours.
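For readers unfamiliar with the baseline, classical chamfer matching as characterized in iii) can be sketched in a few lines. This is the plain, non-discriminative version that sums distance-transform values over all template contour pixels, not the improved method of the talk; the brute-force distance transform, the function names, and the toy data are assumptions made for illustration.

```python
import math


def distance_transform(edges, height, width):
    """Distance from every pixel to its nearest edge pixel (brute force)."""
    return [[min(math.hypot(y - ey, x - ex) for ey, ex in edges)
             for x in range(width)]
            for y in range(height)]


def chamfer_score(dt, template, offset):
    """Mean distance-transform value under the template placed at offset."""
    oy, ox = offset
    return sum(dt[ty + oy][tx + ox] for ty, tx in template) / len(template)


def best_match(dt, template, height, width):
    """Slide the template over the image and return the lowest-cost offset."""
    template_h = max(ty for ty, _ in template) + 1
    template_w = max(tx for _, tx in template) + 1
    offsets = [(oy, ox)
               for oy in range(height - template_h + 1)
               for ox in range(width - template_w + 1)]
    return min(offsets, key=lambda off: chamfer_score(dt, template, off))


# Toy 6 x 6 edge image containing an L-shaped contour at rows 2-3, columns 3-4.
image_edges = [(2, 3), (2, 4), (3, 3)]
template = [(0, 0), (0, 1), (1, 0)]      # the same L shape in template coordinates

dt = distance_transform(image_edges, 6, 6)
offset = best_match(dt, template, 6, 6)
print(offset, chamfer_score(dt, template, offset))
```

Because every contour pixel contributes equally to the score, any dense clutter that happens to lie near the template pixels also produces a low cost, which is the weakness that the discriminative weighting and the background-contour dictionary are meant to address.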
The second part of the talk explores the question: what insights can automatic object detection and intra-category object relationships bring to art historians? It turns out that techniques from Computer Vision have helped art historians in discovering different artistic workshops within an Upper German manuscript, understanding the variations of art within a particular school of design, and studying the transitions across artistic styles by 1-d ordering of objects. Obtaining such insights manually is a tedious task, and Computer Vision has made the job of art historians easier.
1. Pradeep Yarlagadda and Björn Ommer. From Meaningful Contours to Discriminative Object Shape. ECCV 2012.
2. Pradeep Yarlagadda, Angela Eigenstetter and Björn Ommer. Learning Discriminative Chamfer Regularization. BMVC 2012.
3. Pradeep Yarlagadda, Antonio Monroy and Björn Ommer. Voting by Grouping Dependent Parts. ECCV 2010.
4. Pradeep Yarlagadda, Antonio Monroy, Bernd Carque and Björn Ommer. Recognition and Analysis of Objects in Medieval Images. ACCV (e-heritage) 2010.
5. Pradeep Yarlagadda, Antonio Monroy, Bernd Carque and Björn Ommer. Top-down Analysis of Low-level Object Relatedness Leading to Semantic Understanding of Medieval Image Collections. Computer Vision and Image Analysis of Art, SPIE, 2010.
Navigating a car safely through complex environments is considered a relatively easy task for humans. Computer algorithms, however, cannot come close to matching human performance and often rely on 3D laser scanners or detailed maps. The reason for this is that the level and accuracy of current computer vision and scene understanding algorithms are still far from those of a human being. In this talk I will argue that pushing these limits requires solving a set of core computer vision problems, ranging from low-level tasks (stereo, optical flow) to high-level problems (object detection, 3D scene understanding).
First, I will introduce the KITTI datasets and benchmarks with accurate ground truth for evaluating stereo, optical flow, SLAM and 3D object detection/tracking on realistic video sequences. Results from state-of-the-art algorithms reveal that methods ranking high on established datasets such as Middlebury perform below average when moved outside the laboratory to the real world.
Second, I will propose a novel generative model for 3D scene understanding that is able to reason jointly about the scene layout (topology and geometry of streets) as well as the location and orientation of objects. By using context from this model, performance of state-of-the-art object detectors in terms of estimating object orientation can be significantly increased.
Finally, I will give an outlook on how prior information in the form of large-scale community-driven maps (OpenStreetMap) can be used in the context of 3D scene understanding.
Markov random fields (MRFs) have found widespread use as models of natural image and scene statistics. Despite progress in modeling image properties beyond gradient statistics with high-order cliques, and in learning image models from example data, existing MRFs exhibit only a limited ability to actually capture natural image statistics.
In this talk I will present recent work that investigates this limitation of previous filter-based MRF models, including Fields of Experts (FoEs). We found that these limitations are due to inadequacies in the learning procedure and suggest various modifications to address them. These "secrets of FoE learning" allow training more suitable potential functions, whose shape approaches that of a Dirac-delta function, as well as models with larger and more filters.
Our experiments not only indicate a substantial improvement of the models' ability to capture relevant statistical properties of natural images, but also demonstrate a significant performance increase in a denoising application to levels previously unattained by generative approaches. This is joint work with Qi Gao.
The great majority of object analysis methods are based on visual object properties: objects are categorized according to how they appear in images. Visual appearance is measured in terms of image features (e.g., SIFT) extracted from images or video. However, besides appearance, objects also have many other properties that can be of interest, e.g., for a robot that wants to employ them in activities: temperature, weight, surface softness, and also the functionalities or affordances of the object, i.e., how it is intended to be used. One example, recently addressed in the vision community, is chairs. Chairs can look vastly different, but have one thing in common: they afford sitting. At the Computer Vision and Active Perception Lab at KTH, we study the problem of inferring non-observable object properties in a number of ways. In this presentation I will describe some of this work.