We develop methods for accelerating metric similarity search that are effective on modern hardware. Our algorithms factor into easily parallelizable components, making them simple to deploy and efficient on multicore CPUs and GPUs. Despite the simple structure of our algorithms, their search performance is provably sublinear in the size of the database, with a factor dependent only on its intrinsic dimensionality. We demonstrate that our methods provide substantial speedups on a range of datasets and hardware platforms. In particular, we present results on a 48-core server machine, on graphics hardware, and on a multicore desktop.
The interplay between optimization and machine learning is one of the most important developments in modern computational science. Optimization formulations and methods are proving to be vital in designing algorithms to extract essential knowledge from huge volumes of data. Machine learning, however, is not simply a consumer of optimization technology but a rapidly evolving field that is itself generating new optimization ideas. This book captures the state of the art of the interaction between optimization and machine learning in a way that is accessible to researchers in both fields.
Optimization approaches have enjoyed prominence in machine learning because of their wide applicability and attractive theoretical properties. The increasing complexity, size, and variety of today's machine learning models call for the reassessment of existing assumptions. This book starts the process of reassessment. It describes the resurgence in novel contexts of established frameworks such as first-order methods, stochastic approximations, convex relaxations, interior-point methods, and proximal methods. It also devotes attention to newer themes such as regularized optimization, robust optimization, gradient and subgradient methods, splitting techniques, and second-order methods. Many of these techniques draw inspiration from other fields, including operations research, theoretical computer science, and subfields of optimization. The book will enrich the ongoing cross-fertilization between the machine learning community and these other fields, and within the broader optimization community.
We consider projected Newton-type methods for solving large-scale optimization problems arising in machine learning and related fields. We first introduce an algorithmic framework for projected Newton-type methods by reviewing a canonical projected (quasi-)Newton method. This method, while conceptually pleasing, has a high computation cost per iteration. Thus, we discuss two variants that are more scalable, namely, two-metric projection and inexact
projection methods. Finally, we show how to apply the Newton-type framework to handle non-smooth objectives. Examples are provided throughout the chapter to illustrate machine learning applications of our framework.
In this paper, we analyze the convergence of two general classes of optimization algorithms for regularized kernel methods with convex loss function and quadratic norm regularization. The first methodology is a new class of algorithms based on fixed-point iterations that are well-suited for a parallel implementation and can be used with any convex loss function. The second methodology is based on coordinate descent, and generalizes some techniques previously proposed for linear support vector machines. It exploits the structure of additively separable loss functions to compute solutions of line searches in closed form. The two methodologies are both very easy to implement. In this paper, we also show how to remove non-differentiability of the objective functional by exactly reformulating a convex regularization problem as an unconstrained differentiable stabilization problem.
We analyze a family of probability distributions that are characterized by an embedded combinatorial structure. This family includes models having arbitrary treewidth and arbitrary sized factors. Unlike general models with such freedom, where the “most probable explanation” (MPE) problem is inapproximable, the combinatorial structure within our model, in particular the indirect use of submodularity, leads to several MPE algorithms that all have approximation guarantees.
Most results for online decision problems with structured concepts, such as trees or cuts, assume linear costs. In many settings, however, nonlinear costs are more realistic. Owing to their non-separability, these lead to much harder optimization problems. Going beyond linearity, we address online approximation algorithms for structured concepts that allow the cost to be submodular, i.e., nonseparable. In particular, we show regret bounds for three Hannan-consistent strategies that capture different settings. Our results also tighten a regret bound for unconstrained online submodular minimization.
We propose a new family of non-submodular global energy functions that still use submodularity internally to couple edges in a graph cut. We show it is possible to develop an efficient approximation algorithm that, thanks to the internal submodularity, can use standard graph cuts as a subroutine. We demonstrate the advantages of edge coupling in a natural setting, namely image segmentation. In particular, for finestructured objects and objects with shading variation, our structured edge coupling leads to significant improvements over standard approaches.
Numerous scientific applications across a variety of fields depend on box-constrained convex optimization. Box-constrained problems therefore continue to attract research interest. We address box-constrained (strictly convex) problems by deriving two new quasi-Newton algorithms. Our algorithms are positioned between the projected-gradient [J. B. Rosen, J. SIAM, 8 (1960), pp. 181–217] and projected-Newton [D. P. Bertsekas, SIAM J. Control Optim., 20 (1982), pp. 221–246] methods. We also prove their convergence under a simple Armijo step-size rule. We provide experimental results for two particular box-constrained problems: nonnegative least squares (NNLS), and nonnegative Kullback–Leibler (NNKL) minimization. For both NNLS and NNKL our algorithms perform competitively as compared to well-established methods on medium-sized problems; for larger problems our approach frequently outperforms the competition.
We present a new algorithm for minimizing a convex loss-function subject to regularization. Our framework applies to numerous problems in machine learning and statistics; notably, for sparsity-promoting regularizers such as ℓ1 or ℓ1, ∞ norms, it enables efficient computation of sparse solutions. Our approach is based on the trust-region framework with nonsmooth objectives, which allows us to build on known results to provide convergence analysis. We avoid the computational overheads associated with the conventional Hessian approximation used by trust-region methods by instead using a simple separable quadratic approximation. This approximation also enables use of proximity operators for tackling nonsmooth regularizers. We illustrate the versatility of our resulting algorithm by specializing it to three mixed-norm regression problems: group lasso [36], group logistic regression [21], and multi-task lasso [19]. We experiment with both synthetic and real-world large-scale data—our method is seen to be competitive, robust, and scalable.
Empirical InferenceConference PaperConvex Perturbations for Scalable Semidefinite Programming
Kulis, B., Sra, S., Dhillon, I.
In Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics (AIStats 2009), JMLR Workshop and Conference Proceedings Volume 5: AISTATS 2009, 296-303, (Editors: van Dyk, D. , M. Welling), MIT Press, Cambridge, MA, USA, Twelfth International Conference on Artificial Intelligence and Statistics, April 2009
Many important machine learning problems are modeled and solved via semidefinite programs; examples include metric learning, nonlinear embedding, and certain clustering problems. Often, off-the-shelf software is invoked for the associated optimization, which can be inappropriate due to excessive computational and storage requirements. In this paper, we introduce the use of convex perturbations for solving semidefinite programs (SDPs), and for a specific perturbation we derive an algorithm that has several advantages over existing techniques: a) it is simple, requiring only a few lines of Matlab, b) it is a first-order method, and thereby scalable, and c) it can easily exploit the structure of a given SDP (e.g., when the constraint matrices are low-rank, a situation common to several machine learning SDPs). A pleasant byproduct of our method is a fast, kernelized version of the large-margin nearest neighbor metric learning algorithm. We demonstrate that our algorithm is effective in finding fast approximations to large-scale SDPs arising in some machine learning applications.
Empirical InferenceConference PaperEfficient Bregman Range SearchCayton, L.
In Advances in Neural Information Processing Systems 22: 23rd Annual Conference on Neural Information Processing Systems 2009, Advances in Neural Information Processing Systems 22, 243-251, (Editors: Bengio, Y. , D. Schuurmans, J. Lafferty, C. Williams, A. Culotta), Curran, Red Hook, NY, USA, 23rd Annual Conference on Neural Information Processing Systems (NIPS 2009), 2009
We develop an algorithm for efficient range search when the notion of dissimilarity is given by a Bregman divergence. The range search task is to return all points in a potentially large database that are within some specified distance of a query. It arises in many learning algorithms such as locally-weighted regression, kernel density estimation, neighborhood graph-based algorithms, and in
tasks like outlier detection and information retrieval. In metric spaces, efficient range search-like algorithms based on spatial data structures have been deployed on a variety of statistical tasks. Here we describe an algorithm for range search for an arbitrary Bregman divergence. This broad class of dissimilarity measures includes the relative entropy, Mahalanobis distance, Itakura-Saito divergence, and
a variety of matrix divergences. Metric methods cannot be directly applied since Bregman divergences do not in general satisfy the triangle inequality. We derive
geometric properties of Bregman divergences that yield an efficient algorithm for range search based on a recently proposed space decomposition for Bregman divergences.