2009

Solution Stability in Linear Programming Relaxations: Graph Partitioning and Unsupervised Learning

Nowozin, S., Jegelka, S.

In ICML 2009, pages: 769-776, (Editors: Danyluk, A. , L. Bottou, M. Littman), ACM Press, New York, NY, USA, 26th International Conference on Machine Learning, June 2009 (inproceedings)

Abstract
We propose a new method to quantify the solution stability of a large class of combinatorial optimization problems arising in machine learning. As practical examples, we apply the method to correlation clustering, clustering aggregation, modularity clustering, and relative performance significance clustering. Our method is motivated by the idea of linear programming relaxations. We prove that when a relaxation is used to solve the original clustering problem, the solution stability calculated by our method is conservative, that is, it never overestimates the solution stability of the true, unrelaxed problem. We also demonstrate how our method can be used to compute the entire path of optimal solutions as the optimization problem is increasingly perturbed. Experimentally, our method is shown to perform well on a number of benchmark problems.
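
To make the relaxation machinery concrete, here is a minimal, hypothetical sketch (not the authors' code) of the standard LP relaxation of correlation clustering on three nodes, solved with SciPy; the toy affinities and the variable ordering are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

# x_ij in [0, 1] indicates that nodes i and j are separated; minimizing
# sum_ij w_ij * x_ij joins attractive pairs (w > 0) and splits repulsive
# ones (w < 0), up to an additive constant. Variables: x01, x02, x12.
w = np.array([1.0, 1.0, -1.0])          # toy affinities for (0,1), (0,2), (1,2)
A_ub = np.array([[-1.0,  1.0, -1.0],    # triangle inequality: x02 <= x01 + x12
                 [ 1.0, -1.0, -1.0],    # x01 <= x02 + x12
                 [-1.0, -1.0,  1.0]])   # x12 <= x01 + x02
res = linprog(c=w, A_ub=A_ub, b_ub=np.zeros(3), bounds=[(0.0, 1.0)] * 3)
print(res.x, res.fun)
# Re-solving under perturbed w traces how the optimal partition changes.
```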

PDF Web DOI [BibTex]

Kernel Measures of Independence for Non-IID Data

Zhang, X., Song, L., Gretton, A., Smola, A.

In Advances in neural information processing systems 21, pages: 1937-1944, (Editors: Koller, D. , D. Schuurmans, Y. Bengio, L. Bottou), Curran, Red Hook, NY, USA, Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS), June 2009 (inproceedings)

Abstract
Many machine learning algorithms can be formulated in the framework of statistical independence, such as the Hilbert-Schmidt Independence Criterion. In this paper, we extend this criterion to deal with structured and interdependent observations. This is achieved by modeling the structures using undirected graphical models and comparing the Hilbert space embeddings of distributions. We apply this new criterion to independent component analysis and sequence clustering.
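
For readers unfamiliar with the criterion being extended, the following is a minimal sketch of the plain i.i.d. HSIC estimate; the Gaussian kernel, bandwidth, and the biased estimator trace(KHLH)/(n-1)^2 are our assumptions, not the structured variant proposed in the paper.

```python
import numpy as np

def gaussian_gram(z, sigma=1.0):
    sq = np.sum((z[:, None, :] - z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    n = len(x)
    K, L = gaussian_gram(x, sigma), gaussian_gram(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=(200, 1))
print(hsic(x, x ** 2), hsic(x, rng.normal(size=(200, 1))))  # dependent vs independent
```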

PDF Web [BibTex]

Using Bayesian Dynamical Systems for Motion Template Libraries

Chiappa, S., Kober, J., Peters, J.

In Advances in neural information processing systems 21, pages: 297-304, (Editors: Koller, D. , D. Schuurmans, Y. Bengio, L. Bottou), Curran, Red Hook, NY, USA, Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS), June 2009 (inproceedings)

Abstract
Motor primitives or motion templates have become an important concept both for modeling human motor control and for generating robot behaviors using imitation learning. Recent impressive results range from humanoid robot movement generation to timing models of human motions. The automatic generation of skill libraries containing multiple motion templates is an important step in robot learning. Such a skill learning system needs to cluster similar movements together and represent each resulting motion template as a generative model which is subsequently used for the execution of the behavior by a robot system. In this paper, we show how human trajectories captured as multidimensional time-series can be clustered using Bayesian mixtures of linear Gaussian state-space models based on the similarity of their dynamics. The appropriate number of templates is automatically determined by enforcing a parsimonious parametrization. As the resulting model is intractable, we introduce a novel approximation method based on variational Bayes, which is especially designed to enable the use of efficient inference algorithms. On recorded human Balero movements, this method is not only capable of finding reasonable motion templates but also yields a generative model which works well in the execution of this complex task on a simulated anthropomorphic SARCOS arm.

PDF Web [BibTex]

Multi-way set enumeration in real-valued tensors

Georgii, E., Tsuda, K., Schölkopf, B.

In Proceedings of the 2nd Workshop on Data Mining using Matrices and Tensors (DMMT 2009), pages: 32-41, (Editors: C Ding and T Li), ACM Press, New York, NY, USA, 2nd Workshop on Data Mining using Matrices and Tensors (DMMT/KDD), June 2009 (inproceedings)

Abstract
The analysis of n-ary relations receives attention in many different fields, for instance biology, web mining, and social studies. In the basic setting, there are n sets of instances, and each observation associates n instances, one from each set. A common approach to explore these n-way data is the search for n-set patterns. An n-set pattern consists of specific subsets of the n instance sets such that all possible n-ary associations between the corresponding instances are observed. This provides a higher-level view of the data, revealing associative relationships between groups of instances. Here, we generalize this approach in two respects. First, we tolerate missing observations to a certain degree, that is, we are also interested in n-sets where most (although not all) of the possible combinations have been recorded in the data. Second, we take association weights into account. More precisely, we propose a method to enumerate all n-sets that satisfy a minimum threshold with respect to the average association weight. Non-observed associations obtain by default a weight of zero. Technically, we solve the enumeration task using a reverse search strategy, which allows for effective pruning of the search space. In addition, our algorithm provides a ranking of the solutions and can consider further constraints. We show experimental results on artificial and real-world data sets from different domains.

PDF DOI [BibTex]

Non-parametric Regression between Riemannian Manifolds

Steinke, F., Hein, M.

In Advances in neural information processing systems 21, pages: 1561-1568, (Editors: Koller, D. , D. Schuurmans, Y. Bengio, L. Bottou), Curran, Red Hook, NY, USA, Twenty-Second Annual Conference on Neural Information Processing Systems (NIPS), June 2009 (inproceedings)

Abstract
This paper discusses non-parametric regression between Riemannian manifolds. This learning problem arises frequently in many application areas ranging from signal processing, computer vision, over robotics to computer graphics. We present a new algorithmic scheme for the solution of this general learning problem based on regularized empirical risk minimization. The regularization functional takes into account the geometry of input and output manifold, and we show that it implements a prior which is particularly natural. Moreover, we demonstrate that our algorithm performs well in a difficult surface registration problem.

PDF Web [BibTex]

Near-optimal supervised feature selection among frequent subgraphs

Thoma, M., Cheng, H., Gretton, A., Han, J., Kriegel, H., Smola, A., Song, L., Yu, P., Yan, X., Borgwardt, K.

In Proceedings of the 2009 SIAM Conference on Data Mining (SDM 2009), pages: 1076-1087, (Editors: Park, H., S. Parthasarathy, H. Liu), Philadelphia, PA, USA, Society for Industrial and Applied Mathematics, 9th SIAM Conference on Data Mining (SDM), May 2009 (inproceedings)

Abstract
Graph classification is an increasingly important step in numerous application domains, such as function prediction of molecules and proteins, computerised scene analysis, and anomaly detection in program flows. Among the various approaches proposed in the literature, graph classification based on frequent subgraphs is a popular branch: Graphs are represented as (usually binary) vectors, with components indicating whether a graph contains a particular subgraph that is frequent across the dataset. On large graphs, however, one faces the enormous problem that the number of these frequent subgraphs may grow exponentially with the size of the graphs, but only a few of them possess enough discriminative power to make them useful for graph classification. Efficient and discriminative feature selection among frequent subgraphs is hence a key challenge for graph mining. In this article, we propose an approach to feature selection on frequent subgraphs, called CORK, that combines two central advantages. First, it optimizes a submodular quality criterion, which means that we can yield a near-optimal solution using greedy feature selection. Second, our submodular quality criterion can be integrated into gSpan, the state-of-the-art tool for frequent subgraph mining, and help to prune the search space for discriminative frequent subgraphs even during frequent subgraph mining.
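
The near-optimality claim rests on greedy maximisation of a submodular score. The sketch below shows that mechanism on toy subgraph-indicator features; `greedy_select` and the simple coverage score are our submodular stand-ins, not the paper's CORK criterion.

```python
import numpy as np

def greedy_select(features, score, k):
    """features: boolean matrix (n_graphs x n_features); returns up to k column indices."""
    chosen = []
    for _ in range(k):
        gains = [(score(chosen + [j]) - score(chosen), j)
                 for j in range(features.shape[1]) if j not in chosen]
        gain, best = max(gains)
        if gain <= 0:          # submodularity: once gains vanish, stop
            break
        chosen.append(best)
    return chosen

X = np.random.default_rng(1).random((20, 8)) > 0.7   # toy subgraph indicators
coverage = lambda S: int(np.any(X[:, S], axis=1).sum()) if S else 0
print(greedy_select(X, coverage, k=3))
```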

PDF PDF [BibTex]

Link Propagation: A Fast Semi-supervised Learning Algorithm for Link Prediction

Kashima, H., Kato, T., Yamanishi, Y., Sugiyama, M., Tsuda, K.

In Proceedings of the 2009 SIAM International Conference on Data Mining, pages: 1099-1110, (Editors: Park, H. , S. Parthasarathy, H. Liu), Philadelphia, PA, USA, Society for Industrial and Applied Mathematics, SDM, May 2009 (inproceedings)

Abstract
We propose Link Propagation as a new semi-supervised learning method for link prediction problems, where the task is to predict unknown parts of the network structure by using auxiliary information such as node similarities. Since the proposed method can fill in missing parts of tensors, it is applicable to multi-relational domains, allowing us to handle multiple types of links simultaneously. We also give a novel efficient algorithm for Link Propagation based on an accelerated conjugate gradient method.

PDF Web [BibTex]

Learning motor primitives for robotics

Kober, J., Peters, J.

In Proceedings of the 2009 IEEE International Conference on Robotics and Automation (ICRA 2009), pages: 2112-2118, IEEE Service Center, Piscataway, NJ, USA, IEEE International Conference on Robotics and Automation (ICRA '09), May 2009 (inproceedings)

Abstract
The acquisition and self-improvement of novel motor skills is among the most important problems in robotics. Motor primitives offer one of the most promising frameworks for the application of machine learning techniques in this context. Employing an improved form of the dynamic systems motor primitives originally introduced by Ijspeert et al. [2], we show how both discrete and rhythmic tasks can be learned using a concerted approach of imitation and reinforcement learning. To do so, we present both learning algorithms and representations targeted at practical application in robotics. Furthermore, we show that it is possible to include a start-up phase in rhythmic primitives. We show that two new motor skills, i.e., Ball-in-a-Cup and Ball-Paddling, can be learned on a real Barrett WAM robot arm at a pace similar to human learning while achieving a significantly more reliable final performance.
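
A minimal sketch of a discrete dynamic-systems motor primitive of the Ijspeert type the paper builds on: a spring-damper pulled toward the goal, modulated by a learned forcing term. The function `run_dmp`, the gains, the Gaussian basis, and the Euler integration are illustrative choices, not the paper's exact parametrisation.

```python
import numpy as np

def run_dmp(y0, goal, weights, centers, widths, dt=0.001, alpha=25.0):
    beta, alpha_x = alpha / 4.0, alpha / 3.0
    y, yd, x, path = y0, 0.0, 1.0, []
    for _ in range(int(1.0 / dt)):
        psi = np.exp(-widths * (x - centers) ** 2)
        f = x * (goal - y0) * (psi @ weights) / psi.sum()  # learned forcing term
        ydd = alpha * (beta * (goal - y) - yd) + f         # spring-damper + forcing
        yd += ydd * dt
        y += yd * dt
        x += -alpha_x * x * dt                             # canonical system decays 1 -> 0
        path.append(y)
    return np.array(path)

centers = np.exp(-25.0 / 3.0 * np.linspace(0, 1, 10))      # basis centers along x
traj = run_dmp(y0=0.0, goal=1.0, weights=np.zeros(10), centers=centers,
               widths=np.full(10, 100.0))
print(traj[-1])  # converges near the goal even with zero forcing weights
```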

PDF Web DOI [BibTex]

A robust Bayesian two-sample test for detecting intervals of differential gene expression in microarray time series

Stegle, O., Denby, K., Wild, DL., Ghahramani, Z., Borgwardt, KM.

In Research in Computational Molecular Biology, pages: 201-216, (Editors: Batzoglou, S. ), Springer, Berlin, Germany, 13th Annual International Conference on Research in Computational Molecular Biology (RECOMB), May 2009 (inproceedings)

Abstract
Understanding the regulatory mechanisms that are responsible for an organism’s response to environmental changes is an important question in molecular biology. A first and important step towards this goal is to detect genes whose expression levels are affected by altered external conditions. A range of methods to test for differential gene expression, both in static as well as in time-course experiments, have been proposed. While these tests answer the question whether a gene is differentially expressed, they do not explicitly address the question when a gene is differentially expressed, although this information may provide insights into the course and causal structure of regulatory programs. In this article, we propose a two-sample test for identifying intervals of differential gene expression in microarray time series. Our approach is based on Gaussian process regression, can deal with arbitrary numbers of replicates and is robust with respect to outliers. We apply our algorithm to study the response of Arabidopsis thaliana genes to an infection by a fungal pathogen using a microarray time series dataset covering 30,336 gene probes at 24 time points. In classification experiments our test compares favorably with existing methods and provides additional insights into time-dependent differential expression.

Web DOI [BibTex]

A Bayesian Approach to Graph Regression with Relevant Subgraph Selection

Chiappa, S., Saigo, H., Tsuda, K.

In SIAM International Conference on Data Mining, pages: 295-304, (Editors: Park, H. , S. Parthasarathy, H. Liu), Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, SDM, May 2009 (inproceedings)

Abstract
Many real-world applications with graph data require the efficient solution of a given regression task as well as the identification of the subgraphs which are relevant for the task. In these cases graphs are commonly represented as binary vectors of indicators of subgraphs, giving rise to an intractable input dimensionality. An efficient solution to this problem was recently proposed by a Lasso-type method where the objective function optimization over an intractable number of variables is reformulated as a dual mathematical programming problem over a small number of variables but a large number of constraints. The dual problem is then solved by column generation where the subgraphs corresponding to the most violated constraints are found by weighted subgraph mining. This paper proposes an extension of this method to a fully Bayesian approach which defines a prior distribution on the parameters and integrates them out of the model, thus providing a posterior distribution on the target variable as opposed to a single estimate. The advantage of this approach is that the extra information given by the target posterior distribution can be used for improving the model in several ways. In this paper, we use the target posterior variance as a measure of uncertainty in the prediction and show that, by rejecting unconfident predictions, we can improve state-of-the-art performance on several molecular graph datasets.

PDF Web [BibTex]

Efficient data reuse in value function approximation

Hachiya, H., Akiyama, T., Sugiyama, M., Peters, J.

In IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages: 8-15, IEEE Service Center, Piscataway, NJ, USA, IEEE ADPRL, May 2009 (inproceedings)

Abstract
Off-policy reinforcement learning is aimed at efficiently using data samples gathered from a policy that is different from the currently optimized policy. A common approach is to use importance sampling techniques for compensating for the bias of value function estimators caused by the difference between the data-sampling policy and the target policy. However, existing off-policy methods often do not take the variance of the value function estimators explicitly into account and therefore their performance tends to be unstable. To cope with this problem, we propose using an adaptive importance sampling technique which allows us to actively control the trade-off between bias and variance. We further provide a method for optimally determining the trade-off parameter based on a variant of cross-validation. The usefulness of the proposed approach is demonstrated through a simulated swing-up inverted-pendulum problem.
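
The bias-variance control described above can be illustrated with a flattening exponent nu on the importance weights: nu = 0 ignores the policy mismatch (low variance, biased), nu = 1 is ordinary importance sampling (unbiased, high variance). The two-action bandit-style policies below are our toy assumption, not the paper's pendulum setup.

```python
import numpy as np

rng = np.random.default_rng(0)
pi_target = np.array([0.8, 0.2])      # target policy over two actions
pi_behave = np.array([0.5, 0.5])      # data-collecting behaviour policy

actions = rng.choice(2, size=5000, p=pi_behave)
rewards = np.where(actions == 0, 1.0, 0.0) + rng.normal(0, 0.1, size=5000)

for nu in (0.0, 0.5, 1.0):
    w = (pi_target[actions] / pi_behave[actions]) ** nu   # flattened weights
    est = np.mean(w * rewards) / np.mean(w)               # normalised estimator
    print(nu, est)   # true target-policy value is 0.8
```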

PDF Web DOI [BibTex]

Using reward-weighted imitation for robot Reinforcement Learning

Peters, J., Kober, J.

In IEEE ADPRL 2009, pages: 226-232, IEEE Service Center, Piscataway, NJ, USA, 2009 IEEE International Symposium on Adaptive Dynamic Programming and Reinforcement Learning, May 2009 (inproceedings)

Abstract
Reinforcement Learning is an essential ability for robots to learn new motor skills. Nevertheless, few methods scale to the domain of anthropomorphic robotics. To improve efficiency, we reduce the problem to reward-weighted imitation. By doing so, we are able to generate a framework for policy learning which both unifies previous reinforcement learning approaches and allows the derivation of novel algorithms. We present our two most relevant applications, both for motor primitive learning (e.g., a complex Ball-in-a-Cup task using a real Barrett WAM robot arm) and for learning task-space control.
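
A toy sketch of the reward-weighted imitation idea: the policy update becomes a weighted maximum-likelihood step in which samples are reweighted by their (transformed) rewards. The one-dimensional Gaussian policy and the reward function are our illustration, not the paper's robot experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = 0.0                                    # current policy mean
for it in range(20):
    a = rng.normal(mu, 1.0, size=500)       # explore around the mean
    r = np.exp(-(a - 2.0) ** 2)             # unknown reward, peaked at a = 2
    mu = np.sum(r * a) / np.sum(r)          # reward-weighted ML update
print(mu)  # approaches the optimum at 2.0
```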

Web DOI [BibTex]

Denoising photographs using dark frames optimized by quadratic programming

Gomez Rodriguez, M., Kober, J., Schölkopf, B.

In Proceedings of the First IEEE International Conference on Computational Photography (ICCP 2009), pages: 1-9, IEEE, Piscataway, NJ, USA, First IEEE International Conference on Computational Photography (ICCP), April 2009 (inproceedings)

Abstract
Photographs taken with long exposure or high ISO setting may contain substantial amounts of noise, drastically reducing the Signal-To-Noise Ratio (SNR). This paper presents a novel optimization approach for denoising. It is based on a library of dark frames previously taken under varying conditions of temperature, ISO setting and exposure time, and a quality measure or prior for the class of images to denoise. The method automatically computes a synthetic dark frame that, when subtracted from an image, optimizes the quality measure. For specific choices of the quality measure, the denoising problem reduces to a quadratic programming (QP) problem that can be solved efficiently. We show experimentally that it is sufficient to consider a limited subsample of pixels when evaluating the quality measure in the optimization, in which case the complexity of the procedure does not depend on the size of the images but only on the number of dark frames. We provide quantitative experimental results showing that our method automatically computes dark frames that are competitive with those taken under idealized conditions (controlled temperature, ISO setting, exposure time, and averaging of multiple exposures). We provide application examples in astronomical image denoising. The method is validated on two CMOS SLRs.
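
One way to see the QP reduction: if the quality measure is the summed squared intensity of a region known to be dark, then choosing non-negative weights over the dark-frame library becomes non-negative least squares, one instance of the QP family the paper solves. Everything below is synthetic, and the library, weights, and noise model are our assumptions.

```python
import numpy as np
from scipy.optimize import nnls

rng = np.random.default_rng(0)
n_pix, n_frames = 10000, 5
library = rng.gamma(2.0, 1.0, size=(n_pix, n_frames))     # dark-frame library
true_w = np.array([0.7, 0.0, 0.3, 0.0, 0.0])
image = library @ true_w + rng.normal(0, 0.05, n_pix)     # noise-only image region

w, _ = nnls(library, image)       # min ||image - library @ w||^2 with w >= 0
print(np.round(w, 2))             # recovers weights close to true_w
denoised_region = image - library @ w
```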

PDF Web DOI [BibTex]

On Pairwise Kernels: An Efficient Alternative and Generalization Analysis

Kashima, H., Oyama, S., Yamanishi, Y., Tsuda, K.

In Advances in Knowledge Discovery and Data Mining: 13th Pacific-Asia Conference, pages: 1030-1037, (Editors: Theeramunkong, T. , B. Kijsirikul, N. Cercone, T. B. Ho), Springer, Berlin, Germany, PAKDD, April 2009 (inproceedings)

Abstract
Pairwise classification has many applications including network prediction, entity resolution, and collaborative filtering. The pairwise kernel has been proposed for those purposes by several research groups independently, and has become successful in various fields. In this paper, we propose an efficient alternative which we call the Cartesian kernel. While the existing pairwise kernel (which we refer to as the Kronecker kernel) can be interpreted as the weighted adjacency matrix of the Kronecker product graph of two graphs, the Cartesian kernel can be interpreted as that of the Cartesian graph, which is more sparse than the Kronecker product graph. Experimental results show the Cartesian kernel is much faster than the existing pairwise kernel and, at the same time, competitive with it in predictive performance. We discuss generalization bounds for the two pairwise kernels using eigenvalue analysis of the kernel matrices.
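
The two pairwise kernels contrasted in the paper, sketched for base kernels given as Gram matrices K1 and K2 over two instance sets: the Kronecker kernel couples all coordinates, while the Cartesian kernel only couples pairs sharing one element, which is why its matrix is much sparser. The toy Gram matrices are our assumption.

```python
import numpy as np

def kronecker_kernel(K1, K2):
    # k((a,b),(c,d)) = k1(a,c) * k2(b,d)
    return np.kron(K1, K2)

def cartesian_kernel(K1, K2):
    # k((a,b),(c,d)) = k1(a,c)*delta(b,d) + delta(a,c)*k2(b,d)
    n1, n2 = K1.shape[0], K2.shape[0]
    return np.kron(K1, np.eye(n2)) + np.kron(np.eye(n1), K2)

K1 = np.array([[1.0, 0.5], [0.5, 1.0]])
K2 = np.array([[1.0, 0.2], [0.2, 1.0]])
print(np.count_nonzero(kronecker_kernel(K1, K2)),   # dense: 16 entries
      np.count_nonzero(cartesian_kernel(K1, K2)))   # sparser: 12 entries
```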

PDF Web DOI [BibTex]

Convex Perturbations for Scalable Semidefinite Programming

Kulis, B., Sra, S., Dhillon, I.

In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages: 296-303, (Editors: van Dyk, D. , M. Welling), MIT Press, Cambridge, MA, USA, Twelfth International Conference on Artificial Intelligence and Statistics, April 2009 (inproceedings)

Abstract
Many important machine learning problems are modeled and solved via semidefinite programs; examples include metric learning, nonlinear embedding, and certain clustering problems. Often, off-the-shelf software is invoked for the associated optimization, which can be inappropriate due to excessive computational and storage requirements. In this paper, we introduce the use of convex perturbations for solving semidefinite programs (SDPs), and for a specific perturbation we derive an algorithm that has several advantages over existing techniques: a) it is simple, requiring only a few lines of Matlab, b) it is a first-order method, and thereby scalable, and c) it can easily exploit the structure of a given SDP (e.g., when the constraint matrices are low-rank, a situation common to several machine learning SDPs). A pleasant byproduct of our method is a fast, kernelized version of the large-margin nearest neighbor metric learning algorithm. We demonstrate that our algorithm is effective in finding fast approximations to large-scale SDPs arising in some machine learning applications.

PDF Web [BibTex]

Block Jacobi-type methods for non-orthogonal joint diagonalisation

Shen, H., Hüper, K.

In ICASSP09, pages: 3285-3288, IEEE Service Center, Piscataway, NJ, USA, 34th International Conference on Acoustics, Speech, and Signal Processing, April 2009 (inproceedings)

Abstract
In this paper, we study the problem of non-orthogonal joint diagonalisation of a set of real symmetric matrices via simultaneous conjugation. A family of block Jacobi-type methods are proposed to optimise two popular cost functions for the non-orthogonal joint diagonalisation, namely, the off-norm function and the log-likelihood function. By exploiting the appropriate underlying manifold, namely the so-called oblique manifold, rigorous analysis shows that, under the exact non-orthogonal joint diagonalisation setting, the proposed methods converge locally quadratically fast to a joint diagonaliser. Finally, performance of our methods is investigated by numerical experiments for both exact and approximate non-orthogonal joint diagonalisation.

PDF Web DOI [BibTex]

An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Reward

Hoffman, M., Freitas, N., Doucet, A., Peters, J.

In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages: 232-239, (Editors: van Dyk, D. , M. Welling), MIT Press, Cambridge, MA, USA, Twelfth International Conference on Artificial Intelligence and Statistics, April 2009 (inproceedings)

Abstract
We derive a new expectation maximization algorithm for policy optimization in linear Gaussian Markov decision processes, where the reward function is parameterised in terms of a flexible mixture of Gaussians. This approach exploits both analytical tractability and numerical optimization. Consequently, on the one hand, it is more flexible and general than closed-form solutions, such as the widely used linear quadratic Gaussian (LQG) controllers. On the other hand, it is more accurate and faster than optimization methods that rely on approximation and simulation. Partial analytical solutions (though costly) eliminate the need for simulation and, hence, avoid approximation error. Our experiments show that, for the same computational cost, policy optimization methods that rely on analytical tractability achieve higher value than those that rely on simulation.

PDF Web [BibTex]

Efficient Graphlet Kernels for Large Graph Comparison

Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., Borgwardt, K.

In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages: 488-495, (Editors: Van Dyk, D. , M. Welling), MIT Press, Cambridge, MA, USA, Twelfth International Conference on Artificial Intelligence and Statistics, April 2009 (inproceedings)

Abstract
State-of-the-art graph kernels do not scale to large graphs with hundreds of nodes and thousands of edges. In this article we propose to compare graphs by counting graphlets, i.e., subgraphs with $k$ nodes where $k \in \{3, 4, 5\}$. Exhaustive enumeration of all graphlets being prohibitively expensive, we introduce two theoretically grounded speedup schemes, one based on sampling and the second one specifically designed for bounded degree graphs. In our experimental evaluation, our novel kernels allow us to efficiently compare large graphs that cannot be tackled by existing graph kernels.
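
A sketch of the sampling speedup for $k = 3$: a random 3-node subgraph is classified by its edge count (the four 3-node graphlet types), and two graphs are compared via the distributions of sampled types. The adjacency matrices, edge densities, and sample size below are toy choices, not the paper's benchmarks.

```python
import numpy as np

def sampled_graphlet_histogram(A, n_samples=5000, rng=None):
    rng = rng or np.random.default_rng(0)
    n, counts = A.shape[0], np.zeros(4)
    for _ in range(n_samples):
        i, j, k = rng.choice(n, size=3, replace=False)
        counts[int(A[i, j] + A[i, k] + A[j, k])] += 1   # 0..3 edges among the triple
    return counts / n_samples

rng = np.random.default_rng(1)
A = (rng.random((30, 30)) < 0.2).astype(int); A = np.triu(A, 1); A = A + A.T
B = (rng.random((30, 30)) < 0.5).astype(int); B = np.triu(B, 1); B = B + B.T
hA, hB = sampled_graphlet_histogram(A), sampled_graphlet_histogram(B)
print(hA @ hB)   # graphlet kernel value: inner product of type distributions
```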

PDF Web [BibTex]

Online blind deconvolution for astronomical imaging

Harmeling, S., Hirsch, M., Sra, S., Schölkopf, B.

In Proceedings of the First IEEE International Conference Computational Photography (ICCP 2009), pages: 1-7, IEEE, Piscataway, NJ, USA, First IEEE International Conference on Computational Photography (ICCP), April 2009 (inproceedings)

Abstract
Atmospheric turbulences blur astronomical images taken by earth-based telescopes. Taking many short-time exposures in such a situation provides noisy images of the same object, where each noisy image has a different blur. Commonly astronomers apply a technique called “Lucky Imaging” that selects a few of the recorded frames that fulfill certain criteria, such as reaching a certain peak intensity (“Strehl ratio”). The selected frames are then averaged to obtain a better image. In this paper we introduce and analyze a new method that exploits all the frames and generates an improved image in an online fashion. Our initial experiments with controlled artificial data and real-world astronomical datasets yield promising results.

PDF Web DOI [BibTex]

A kernel method for unsupervised structured network inference

Lippert, C., Stegle, O., Ghahramani, Z., Borgwardt, KM.

In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages: 368-375, (Editors: Van Dyk, D. , M. Welling), MIT Press, Cambridge, MA, USA, Twelfth International Conference on Artificial Intelligence and Statistics, April 2009 (inproceedings)

Abstract
Network inference is the problem of inferring edges between a set of real-world objects, for instance, interactions between pairs of proteins in bioinformatics. Current kernel-based approaches to this problem share a set of common features: (i) they are supervised and hence require labeled training data; (ii) edges in the network are treated as mutually independent and hence topological properties are largely ignored; (iii) they lack a statistical interpretation. We argue that these common assumptions are often undesirable for network inference, and propose (i) an unsupervised kernel method (ii) that takes the global structure of the network into account and (iii) is statistically motivated. We show that our approach can explain commonly used heuristics in statistical terms. In experiments on social networks, different variants of our method demonstrate appealing predictive performance.

PDF Web [BibTex]

PAC-Bayesian Generalization Bound for Density Estimation with Application to Co-clustering

Seldin, Y., Tishby, N.

In JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, pages: 472-479, MIT Press, Cambridge, MA, USA, 12th International Conference on Artificial Intelligence and Statistics, April 2009 (inproceedings)

Abstract
We derive a PAC-Bayesian generalization bound for density estimation. Similar to the PAC-Bayesian generalization bound for classification, the result has the appealingly simple form of a tradeoff between empirical performance and the KL-divergence of the posterior from the prior. Moreover, the PAC-Bayesian generalization bound for classification can be derived as a special case of the bound for density estimation. To illustrate a possible application of our bound we derive a generalization bound for co-clustering. The bound provides a criterion to evaluate the ability of co-clustering to predict new co-occurrences, thus introducing a supervised flavor to this traditionally unsupervised task.

PDF Web [BibTex]

Efficient Bregman Range Search

Cayton, L.

In Advances in Neural Information Processing Systems 22, pages: 243-251, (Editors: Bengio, Y. , D. Schuurmans, J. Lafferty, C. Williams, A. Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

Abstract
We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences.
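
A sketch of Bregman range search in its naive linear-scan form, with the divergence generated from a convex function phi via d(x, y) = phi(x) - phi(y) - <grad phi(y), x - y>; the paper's contribution is a space decomposition that avoids this linear scan. The generator (which yields the relative entropy), the database, and the radius are our assumptions.

```python
import numpy as np

def bregman(phi, grad_phi, x, y):
    return phi(x) - phi(y) - grad_phi(y) @ (x - y)

# phi(x) = sum x log x generates the relative entropy on the simplex.
phi = lambda x: np.sum(x * np.log(x))
grad_phi = lambda x: np.log(x) + 1.0

rng = np.random.default_rng(0)
db = rng.dirichlet(np.ones(5), size=1000)    # database of distributions
q = rng.dirichlet(np.ones(5))                # query
in_range = [i for i, p in enumerate(db) if bregman(phi, grad_phi, q, p) < 0.05]
print(len(in_range))                         # all points within the query radius
```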

PDF Web [BibTex]

Kernel Choice and Classifiability for RKHS Embeddings of Probability Distributions

Sriperumbudur, B., Fukumizu, K., Gretton, A., Lanckriet, G., Schölkopf, B.

In Advances in Neural Information Processing Systems 22, pages: 1750-1758, (Editors: Y Bengio and D Schuurmans and J Lafferty and C Williams and A Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

Abstract
Embeddings of probability measures into reproducing kernel Hilbert spaces have been proposed as a straightforward and practical means of representing and comparing probabilities. In particular, the distance between embeddings (the maximum mean discrepancy, or MMD) has several key advantages over many classical metrics on distributions, namely easy computability, fast convergence and low bias of finite sample estimates. An important requirement of the embedding RKHS is that it be characteristic: in this case, the MMD between two distributions is zero if and only if the distributions coincide. Three new results on the MMD are introduced in the present study. First, it is established that MMD corresponds to the optimal risk of a kernel classifier, thus forming a natural link between the distance between distributions and their ease of classification. An important consequence is that a kernel must be characteristic to guarantee classifiability between distributions in the RKHS. Second, the class of characteristic kernels is broadened to incorporate all strictly positive definite kernels: these include non-translation invariant kernels and kernels on non-compact domains. Third, a generalization of the MMD is proposed for families of kernels, as the supremum over MMDs on a class of kernels (for instance the Gaussian kernels with different bandwidths). This extension is necessary to obtain a single distance measure if a large selection or class of characteristic kernels is potentially appropriate. This generalization is reasonable, given that it corresponds to the problem of learning the kernel by minimizing the risk of the corresponding kernel classifier. The generalized MMD is shown to have consistent finite sample estimates, and its performance is demonstrated on a homogeneity testing example.
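
A sketch of the (biased) empirical MMD between two samples under a Gaussian kernel, the distance whose kernel-choice and classifiability properties the paper studies; the bandwidth and the toy data are our assumptions.

```python
import numpy as np

def mmd2_biased(X, Y, sigma=1.0):
    def k(A, B):
        sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
        return np.exp(-sq / (2 * sigma ** 2))
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1, size=(200, 2))
Y = rng.normal(0.5, 1, size=(200, 2))
print(mmd2_biased(X, X[::-1]), mmd2_biased(X, Y))   # ~0 vs clearly positive
```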

PDF Web [BibTex]

Nonlinear directed acyclic structure learning with weakly additive noise models

Tillman, R., Gretton, A., Spirtes, P.

In Advances in Neural Information Processing Systems 22, pages: 1847-1855, (Editors: Bengio, Y. , D. Schuurmans, J. Lafferty, C. Williams, A. Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

Abstract
The recently proposed additive noise model has advantages over previous structure learning algorithms, when attempting to recover some true data generating mechanism, since it (i) does not assume linearity or Gaussianity and (ii) can recover a unique DAG rather than an equivalence class. However, its original extension to the multivariate case required enumerating all possible DAGs, and for some special distributions, e.g. linear Gaussian, the model is invertible and thus cannot be used for structure learning. We present a new approach which combines a PC style search using recent advances in kernel measures of conditional dependence with local searches for additive noise models in substructures of the equivalence class. This results in a more computationally efficient approach that is useful for arbitrary distributions even when additive noise models are invertible. Experiments with synthetic and real data show that this method is more accurate than previous methods when data are nonlinear and/or non-Gaussian.
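
A sketch of the bivariate additive-noise test such searches rely on: regress each variable on the other non-parametrically and check which direction leaves residuals (closer to) independent of the input. The binned regression and the squared-value correlation used as a dependence proxy below are crude stand-ins of our own, not the paper's kernel dependence measure.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 2000)
y = x ** 3 + rng.normal(0, 1, 2000)          # true direction: x -> y

def residual_dependence(cause, effect, bins=20):
    # crude non-parametric regression: bin the cause, subtract bin means
    edges = np.quantile(cause, np.linspace(0, 1, bins)[1:-1])
    idx = np.digitize(cause, edges)
    bin_means = np.array([effect[idx == b].mean() for b in range(bins - 1)])
    resid = effect - bin_means[idx]
    return abs(np.corrcoef(cause ** 2, resid ** 2)[0, 1])   # dependence proxy

# smaller value = residuals look more independent = more plausible direction
print(residual_dependence(x, y), residual_dependence(y, x))
```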

PDF Web [BibTex]

A Fast, Consistent Kernel Two-Sample Test

Gretton, A., Fukumizu, K., Harchaoui, Z., Sriperumbudur, B.

In Advances in Neural Information Processing Systems 22, pages: 673-681, (Editors: Bengio, Y. , D. Schuurmans, J. Lafferty, C. Williams, A. Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

Abstract
A kernel embedding of probability distributions into reproducing kernel Hilbert spaces (RKHS) has recently been proposed, which allows the comparison of two probability measures P and Q based on the distance between their respective embeddings: for a sufficiently rich RKHS, this distance is zero if and only if P and Q coincide. In using this distance as a statistic for a test of whether two samples are from different distributions, a major difficulty arises in computing the significance threshold, since the empirical statistic has as its null distribution (where P = Q) an infinite weighted sum of $\chi^2$ random variables. Prior finite sample approximations to the null distribution include using bootstrap resampling, which yields a consistent estimate but is computationally costly; and fitting a parametric model with the low order moments of the test statistic, which can work well in practice but has no consistency or accuracy guarantees. The main result of the present work is a novel estimate of the null distribution, computed from the eigenspectrum of the Gram matrix on the aggregate sample from P and Q, and having lower computational cost than the bootstrap. A proof of consistency of this estimate is provided. The performance of the null distribution estimate is compared with the bootstrap and parametric approaches on an artificial example, high dimensional multivariate data, and text.
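
A sketch of the spectral idea: estimate the null distribution from the eigenvalues of the centred Gram matrix on the pooled sample, then simulate the weighted sum of $\chi^2$ variables by drawing standard normals. The kernel, the toy data, and the exact centring/scaling convention (here matching the biased statistic) are our assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 3))                        # pooled sample under H0
sq = np.sum((Z[:, None] - Z[None, :]) ** 2, axis=-1)
K = np.exp(-sq / 2.0)                                # Gaussian Gram matrix
n = len(Z)
H = np.eye(n) - np.ones((n, n)) / n
lam = np.clip(np.linalg.eigvalsh(H @ K @ H) / n, 0, None)   # spectrum estimate

null_draws = (lam * rng.normal(size=(5000, n)) ** 2).sum(axis=1)
threshold = np.quantile(null_draws, 0.95)            # 5% significance level
print(threshold)
```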

PDF Web [BibTex]

Graphical models for decoding in BCI visual speller systems

Martens, S., Farquhar, J., Hill, J., Schölkopf, B.

In Proceedings of the 4th International IEEE EMBS Conference on Neural Engineering, pages: 470-473, IEEE, 4th International IEEE EMBS Conference on Neural Engineering (NER), 2009 (inproceedings)

DOI [BibTex]

Augmenting Feature-driven fMRI Analyses: Semi-supervised learning and resting state activity

Blaschko, M., Shelton, J., Bartels, A.

In Advances in Neural Information Processing Systems 22, pages: 126-134, (Editors: Bengio, Y. , D. Schuurmans, J. Lafferty, C. Williams, A. Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

Abstract
Resting state activity is brain activation that arises in the absence of any task, and is usually measured in awake subjects during prolonged fMRI scanning sessions where the only instruction given is to close the eyes and do nothing. It has been recognized in recent years that resting state activity is implicated in a wide variety of brain function. While certain networks of brain areas have different levels of activation at rest and during a task, there is nevertheless significant similarity between activations in the two cases. This suggests that recordings of resting state activity can be used as a source of unlabeled data to augment discriminative regression techniques in a semi-supervised setting. We evaluate this setting empirically yielding three main results: (i) regression tends to be improved by the use of Laplacian regularization even when no additional unlabeled data are available, (ii) resting state data seem to have a similar marginal distribution to that recorded during the execution of a visual processing task implying largely similar types of activation, and (iii) this source of information can be broadly exploited to improve the robustness of empirical inference in fMRI studies, an inherently data poor domain.
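
A sketch of the Laplacian-regularised ridge regression mechanism through which unlabeled resting-state scans can enter a linear model: unlabeled rows contribute only through the graph smoothness penalty. The similarity graph, data, and regularisation weights are synthetic assumptions, not the paper's fMRI pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # 50 labeled + 50 unlabeled "scans"
y = X[:50, 0] + 0.1 * rng.normal(size=50)

W = np.exp(-np.sum((X[:, None] - X[None, :]) ** 2, axis=-1) / 10.0)
L = np.diag(W.sum(1)) - W                 # graph Laplacian over all rows

lam_ridge, lam_lap = 1.0, 0.1
# labeled least-squares fit + ridge + smoothness of w.x over the whole graph
A = X[:50].T @ X[:50] + lam_ridge * np.eye(10) + lam_lap * X.T @ L @ X
w = np.linalg.solve(A, X[:50].T @ y)
print(w[:3])
```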

PDF Web [BibTex]

Fast subtree kernels on graphs

Shervashidze, N., Borgwardt, K.

In Advances in Neural Information Processing Systems 22, pages: 1660-1668, (Editors: Bengio, Y. , D. Schuurmans, J. Lafferty, C. Williams, A. Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS), 2009 (inproceedings)

Abstract
In this article, we propose fast subtree kernels on graphs. On graphs with n nodes and m edges and maximum degree d, these kernels comparing subtrees of height h can be computed in $O(mh)$, whereas the classic subtree kernel by Ramon & Gärtner scales as $O(n^2 4^d h)$. Key to this efficiency is the observation that the Weisfeiler-Lehman test of isomorphism from graph theory elegantly computes a subtree kernel as a byproduct. Our fast subtree kernels can deal with labeled graphs, scale up easily to large graphs and outperform state-of-the-art graph kernels on several classification benchmark datasets in terms of accuracy and runtime.
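
A sketch of one Weisfeiler-Lehman relabelling round, the step that lets the kernel count matching subtree patterns in $O(m)$ per iteration: each node's new label compresses its own label together with the sorted labels of its neighbours. The dictionary-based representation and toy graph are our illustration.

```python
def wl_iteration(adjacency, labels):
    """adjacency: dict node -> neighbours; labels: dict node -> label."""
    signatures = {v: (labels[v], tuple(sorted(labels[u] for u in adjacency[v])))
                  for v in adjacency}
    compressed = {sig: i for i, sig in enumerate(sorted(set(signatures.values())))}
    return {v: compressed[signatures[v]] for v in adjacency}

graph = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}
labels = {v: 0 for v in graph}            # start with uniform labels
for _ in range(2):
    labels = wl_iteration(graph, labels)
print(labels)  # nodes with identical 2-hop neighbourhood trees share labels
```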

PDF Web [BibTex]

1996


Incorporating invariances in support vector learning machines

Schölkopf, B., Burges, C., Vapnik, V.

In Artificial Neural Networks: ICANN 96, LNCS vol. 1112, pages: 47-52, (Editors: C von der Malsburg and W von Seelen and JC Vorbrüggen and B Sendhoff), Springer, Berlin, Germany, 6th International Conference on Artificial Neural Networks, July 1996, volume 1112 of Lecture Notes in Computer Science (inproceedings)

Abstract
Developed only recently, support vector learning machines achieve high generalization ability by minimizing a bound on the expected test error; however, so far there existed no way of adding knowledge about invariances of a classification problem at hand. We present a method of incorporating prior knowledge about transformation invariances by applying transformations to support vectors, the training examples most critical for determining the classification boundary.
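
A sketch of the virtual support vector idea with scikit-learn as a modern stand-in for the original implementation: train an SVM, apply a known invariance transformation to the support vectors only, and retrain on the enlarged set. The one-pixel cyclic shift, kernel parameters, and digits data are illustrative assumptions.

```python
import numpy as np
from sklearn import datasets, svm

digits = datasets.load_digits()
X = digits.images.reshape(len(digits.images), -1) / 16.0
y = digits.target

clf = svm.SVC(kernel="rbf", gamma=0.02).fit(X[:1000], y[:1000])

# shift each support vector one pixel to the right (cyclic, for simplicity)
sv_imgs = clf.support_vectors_.reshape(-1, 8, 8)
shifted = np.roll(sv_imgs, shift=1, axis=2).reshape(len(sv_imgs), -1)
X_aug = np.vstack([X[:1000], shifted])
y_aug = np.concatenate([y[:1000], y[:1000][clf.support_]])

clf_vsv = svm.SVC(kernel="rbf", gamma=0.02).fit(X_aug, y_aug)
print(clf.score(X[1000:], y[1000:]), clf_vsv.score(X[1000:], y[1000:]))
```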

PDF DOI [BibTex]