CSAIL Research Abstracts - 2005

Fast Online Active Learning

Claire Monteleoni & Tommi Jaakkola


We seek to design active learning algorithms for the sequential setting that achieve fast convergence rates.  Our fundamental goal is to formalize the complexity tradeoffs inherent in this learning problem.

Online Learning Framework

Sequential or online learning is a framework in which training examples are received one at a time, and the learner must make a prediction at each time-step.  This requirement is well motivated by many practical problems in which we are not granted the luxury of learning a concept from a batch of labeled examples.  Instead, we receive observations sequentially and must update our beliefs online.

This framework also raises interesting computational issues: tradeoffs in complexity and resources for computational learners.  The sequential constraint translates to limits on the learner's memory.  The learner cannot store all previously seen examples and then apply a batch learning algorithm to them, but must instead intelligently summarize its observations.  Additionally, the time complexity of the belief update step should not scale with the number of past examples, in order for the algorithm to remain effective in the online setting.  Algorithms for the sequential framework can be compared by their mistake complexity: the number of prediction mistakes they make before converging on an accurate model.
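As a toy illustration of these constraints (not the algorithm studied in this work), the classical perceptron satisfies both: it stores only a single weight vector, never the past examples, and its mistake-driven update costs O(d) time per step:

```python
import numpy as np

def perceptron_online(stream, dim):
    """Classical online perceptron: memory is a single weight vector
    (no stored examples), and each update costs O(dim) time."""
    w = np.zeros(dim)
    mistakes = 0
    for x, y in stream:                 # labels y are in {-1, +1}
        pred = 1 if w @ x >= 0 else -1  # predict before seeing the label
        if pred != y:                   # mistake-driven update
            mistakes += 1
            w = w + y * x
    return w, mistakes
```

On linearly separable data, the classical perceptron mistake bound caps the number of mistakes independently of the stream length, which is one concrete instance of mistake complexity.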

Active Learning

Active or semi-supervised learning is a setting in which the data are only partially labeled, or the labels are missing but available at some cost.  This framework is also well motivated in practice, and it too raises interesting computational and complexity-theoretic questions for the algorithm designer.  Beyond minimizing the number of examples needed to learn a concept (to a fixed accuracy), in active learning the goal is to minimize label complexity: the number of labels the algorithm needs to query in doing so.  In fact, intelligent choices of which examples to request labels for can even reduce the total number of examples needed for learning.
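A classic illustrative example of this saving (a textbook case, not drawn from this work) is learning a threshold on the unit interval: querying labels by binary search needs only about log2(1/tol) labels for accuracy tol, whereas passive learning from random labeled samples would need on the order of 1/tol labels:

```python
def learn_threshold_active(oracle, tol=1e-3):
    """Learn a 1-D threshold on [0, 1] by binary search over the
    uncertainty region; each queried label halves the region, so
    roughly log2(1/tol) labels suffice."""
    lo, hi = 0.0, 1.0
    queries = 0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        queries += 1
        if oracle(mid) == 1:   # positive: true threshold is at or left of mid
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0, queries

# Hypothetical label oracle with true threshold 0.37:
est, used = learn_threshold_active(lambda x: 1 if x >= 0.37 else 0)
```

Here the active learner chooses exactly the query that is most informative at each step, which is the source of the exponential improvement in label complexity.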

Active Learning in a Sequential Framework

Under a learning framework that is both active and sequential, interesting additional issues arise.  A distinction now exists between mistakes and errors: mistakes are the subset of errors made on examples for which the algorithm requests labels, and hence the only erroneous predictions on which it receives feedback.  It is thus possible to analyze error complexity as a separate quantity from mistake complexity for active sequential algorithms.
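A minimal sketch of this setting makes the bookkeeping concrete (the fixed query threshold and plain perceptron update below are illustrative assumptions, not the rule analyzed in this work): labels are requested only when the current prediction is uncertain, so errors on unqueried points yield no feedback and must be counted separately from mistakes:

```python
import numpy as np

def selective_perceptron(stream, dim, query_margin=0.2):
    """Active sequential sketch: request a label only when the current
    score is small (an uncertain prediction).  Mistakes are errors on
    queried points; errors on unqueried points go unobserved."""
    w = np.zeros(dim)
    labels_used = 0
    mistakes = 0
    for x, y in stream:
        score = w @ x
        pred = 1 if score >= 0 else -1
        if abs(score) <= query_margin:   # uncertain: pay for this label
            labels_used += 1
            if pred != y:
                mistakes += 1
                w = w + y * x            # perceptron-style update
    return w, labels_used, mistakes
```

By construction, mistakes can never exceed labels_used, while the (unobserved) error count can, which is exactly why error complexity merits a separate analysis.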

Progress and Related Work

In recent joint work [1] with Dasgupta (UCSD) and Kalai (TTI-Chicago) we presented an algorithm yielding significantly lower label complexity than previously known -- reducing the (with high probability) dependence on the intended prediction accuracy from polynomial to logarithmic -- for sequential active learning of half-spaces from independent uniform samples.  Such guarantees previously existed only for Query By Committee [3], an algorithm whose update step scales with the number of seen examples, which is inconvenient for sequential use.  Interestingly, our analysis also yielded matching mistake complexity and error complexity (even on unlabeled points).  Additionally, we provided a lower bound on the label complexity required by a class of existing algorithms [1].  Some negative results in similar frameworks have recently been provided by Dasgupta [2].

Ongoing and Future Work

We are working on extending [1] by: 

  1. relaxing the distributional assumption,
  2. analyzing a version in which the data have a margin, and
  3. generalizing to the non-separable case.

To address the distributional issues, we will first retain the i.i.d. assumption but stipulate that the sampling distribution be fixed, arbitrary, and unknown to the algorithm.  Our goal is first to characterize which distributions yield the guarantees of [1], using techniques similar to those in its analysis (for which uniformity is sufficient but not necessary), and then to re-analyze the algorithm under arbitrary distributions, perhaps yielding similar bounds.  A further scenario to consider would involve sampling from an unknown stationary distribution while relaxing the independence assumption.

We would also like to attain similar guarantees for this setting when the data are known to have a margin (a positive minimum distance separating all negative examples from positive ones).  The literature on kernel methods has demonstrated that in many learning problems, while the data do not appear (in their initial embedding) to separate easily into positive and negative classes, the problem can be mapped by a kernel function to a space in which the data are fully separable by a linear threshold, sometimes even with a margin.  The tradeoff, however, is that the dimension of the mapped space need not be finite.  The convergence rate in [1] is a function of the dimension, so it is future work to determine whether the same algorithm would work in this scenario under a different analysis, and if not, to design an algorithm whose convergence rate does not depend on the dimensionality of the space in which the data lie.  At a more general level, we seek to quantify the tradeoff between dimension and margin with respect to this problem framework.
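A standard textbook example of such a mapping (not specific to this work): XOR-style data admit no separating linear threshold in the plane, yet an explicit degree-2 feature map -- the finite-dimensional analogue of a kernel mapping -- renders them separable with a margin:

```python
import numpy as np

# XOR-style labeling: no linear threshold separates these four
# points in the original 2-D space.
X = np.array([[1, 1], [-1, -1], [1, -1], [-1, 1]], dtype=float)
y = np.array([1, 1, -1, -1])

def phi(x):
    """Degree-2 monomial features; a polynomial kernel computes the
    corresponding inner products implicitly."""
    x1, x2 = x
    return np.array([x1, x2, x1 * x1, x2 * x2, x1 * x2])

Phi = np.vstack([phi(x) for x in X])
w = np.array([0.0, 0.0, 0.0, 0.0, 1.0])  # threshold on the x1*x2 feature
# In feature space the data are linearly separable, with margin 1:
separable = bool(np.all(np.sign(Phi @ w) == y))
```

For a genuine kernel such as the RBF kernel, the implicit feature space is infinite-dimensional, which is precisely why a dimension-dependent convergence rate becomes problematic.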

A further goal is to attain fast convergence rates in a similar setting in which the data need not be separable, a scenario that allows for noise in the labels.  The level of separability can also be analyzed as a complexity measure, and potentially traded off against dimension and margin.  For example, distributions under which the distance of a noisily labeled point from the target decision surface is bounded (with high probability) can be considered to have a margin within which errors are tolerated, thereby facilitating the use of margin-based analysis techniques.

Research Support:

This work is supported in part by a grant from British Aerospace Engineering and by the DARPA CALO project.


References:

[1] Sanjoy Dasgupta, Adam Tauman Kalai and Claire Monteleoni. Analysis of perceptron-based active learning. In The 18th Annual Conference on Learning Theory, Bertinoro, Italy, June 2005.

[2] Sanjoy Dasgupta. Analysis of a greedy active learning strategy. In Advances in Neural Information Processing Systems 17. Vancouver, Canada, December 2004.

[3] Yoav Freund, H. Sebastian Seung, Eli Shamir and Naftali Tishby. Selective sampling using the Query by Committee algorithm. Machine Learning, 28(2-3):133--168, August/September 1997.
