CSAIL Research Abstract

An Active Learning Approach for Appropriate Sampling During Time-Series Expression Experiments

Rohit Singh, Nathan Palmer, David Gifford, Bonnie Berger & Ziv Bar-Joseph

Introduction

Time-series expression experiments are becoming an increasingly popular method for studying a wide range of biological systems. One of the most important steps in designing such experiments is to define a sampling strategy. If the system is under-sampled, the results will not accurately represent the genes' activity over the duration of the experiment, and key features of the system's time-dependent response may be missed. On the other hand, over-sampling is expensive and time-consuming, and diverts resources that could otherwise be used for performing complementary studies.

To date, sampling rates for microarray experiments have been chosen mainly by the intuition of biologists. As a result, sampling rates have differed among labs, even when studying the same biological phenomenon. Indeed, multiple experiments from the same lab, reported in the same paper and directed at the same biological system, have used different sampling rates. These inconsistencies can lead to incomplete results and make it hard to compare data from related but independent studies.

Methods

A unique feature of microarray analysis, crucial to our work, is that biological samples may be frozen prior to hybridization, allowing researchers to extract the biological material following treatment at a very high rate, then make decisions about which samples to hybridize at a later time. We note that the expensive part of a microarray experiment is the hybridization step, rather than the act of extracting the sample. Thus, we can iteratively pick samples to hybridize, basing our decisions on data collected from previously sampled time-points.

We present the first online method for efficiently sampling during time-series microarray experiments. Beginning with an initial coarse sampling of the available time-points, our algorithm first estimates a time-dependent expression profile in the form of a set of continuous functions that approximate the available data, as prescribed by Bar-Joseph et al. [1]. To quantify the uncertainty in these estimated profiles, we extend and adapt some recently proposed statistical techniques for estimating error over spline-based smoothing functions [2]. By using local cross validation (LCV), we are able to focus on localized signal variations and appropriately consider the effect of sampling decisions, even for non-uniform response data. We then use active learning to iteratively choose sample points, using the uncertainty in the currently estimated time-dependent profile as the objective function. One of our core contributions is the development of this efficiently computable objective function for measuring the uncertainty in the estimated profiles. Our active learning approach allows us to identify the unsampled location at which the predicted observation will lead to the greatest reduction in overall uncertainty in the derived profile.
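
The iterative selection loop described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions, not the authors' implementation: piecewise-linear interpolation stands in for the spline-based profile, and a leave-one-out error at each sampled point stands in for the LCV-based uncertainty estimate. All names (`active_sampling`, `next_sample`, `hybridize`, and so on) and the gap-scoring heuristic are hypothetical.

```python
import math

def interpolate(ts, ys, t):
    """Piecewise-linear interpolation of sorted points (ts, ys) at time t."""
    if t <= ts[0]:
        return ys[0]
    if t >= ts[-1]:
        return ys[-1]
    for i in range(1, len(ts)):
        if t <= ts[i]:
            w = (t - ts[i - 1]) / (ts[i] - ts[i - 1])
            return (1 - w) * ys[i - 1] + w * ys[i]

def loo_errors(ts, ys):
    """Leave-one-out error at each interior sampled point (endpoints get 0)."""
    errs = [0.0] * len(ts)
    for i in range(1, len(ts) - 1):
        pred = interpolate(ts[:i] + ts[i + 1:], ys[:i] + ys[i + 1:], ts[i])
        errs[i] = abs(pred - ys[i])
    return errs

def next_sample(candidates, sampled):
    """Pick an unsampled candidate from the gap with the largest score.

    Assumed heuristic: a gap's score is the leave-one-out error at its two
    ends multiplied by its width; the candidate nearest the midpoint wins.
    """
    ts = sorted(sampled)
    ys = [sampled[t] for t in ts]
    errs = loo_errors(ts, ys)
    best, best_score = None, -1.0
    for i in range(len(ts) - 1):
        inside = [c for c in candidates if ts[i] < c < ts[i + 1]]
        if not inside:
            continue
        score = (errs[i] + errs[i + 1] + 1e-9) * (ts[i + 1] - ts[i])
        if score > best_score:
            mid = 0.5 * (ts[i] + ts[i + 1])
            best = min(inside, key=lambda c: abs(c - mid))
            best_score = score
    return best

def active_sampling(candidates, hybridize, initial, budget):
    """Hybridize the coarse initial set, then add points until budget is met."""
    sampled = {t: hybridize(t) for t in initial}
    while len(sampled) < budget:
        t = next_sample(candidates, sampled)
        if t is None:  # every frozen sample has already been hybridized
            break
        sampled[t] = hybridize(t)
    return sampled

# Toy example: a sharp early response followed by a flat tail.
def toy_signal(t):
    return math.exp(-((t - 4) ** 2) / 2)

chosen = active_sampling(list(range(21)), toy_signal, [0, 5, 10, 15, 20], 9)
print(sorted(chosen))
```

Here `hybridize` plays the role of the expensive measurement step, and `candidates` are the time-points at which frozen samples exist. In the toy run the added points should concentrate where the signal changes quickly rather than in the flat tail.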

Because expression experiments profile thousands of genes at a time, it is infeasible to select new sample points on a per-gene basis; some genes may simply be irrelevant (e.g., noncycling genes in cell cycle experiments). Instead, our algorithm evaluates expression profiles for clusters of co-expressed genes. In this way, we minimize the effect of noisy single-gene signals and select new sample points that contribute the most information to the expression profile as a whole.
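
To make the cluster-level evaluation concrete, the sketch below averages gene profiles within each cluster and then pools per-cluster uncertainty scores into one score per candidate time-point. It assumes cluster labels are already available, and the size-weighted pooling is an illustrative choice rather than a rule stated in this abstract; the function names and toy data are hypothetical.

```python
def cluster_mean_profiles(expression, cluster_of):
    """Average the expression profiles of genes that share a cluster label.

    expression: dict gene -> list of values at the sampled time-points
    cluster_of: dict gene -> cluster label (assumed to be given)
    Returns (mean profile per cluster, number of genes per cluster).
    """
    sums, counts = {}, {}
    for gene, profile in expression.items():
        label = cluster_of[gene]
        if label not in sums:
            sums[label] = [0.0] * len(profile)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], profile)]
        counts[label] += 1
    means = {c: [s / counts[c] for s in sums[c]] for c in sums}
    return means, counts

def pooled_scores(scores_by_cluster, counts):
    """Size-weighted average of per-cluster scores for each candidate point."""
    total = sum(counts[c] for c in scores_by_cluster)
    n_candidates = len(next(iter(scores_by_cluster.values())))
    return [
        sum(counts[c] * scores_by_cluster[c][j] for c in scores_by_cluster) / total
        for j in range(n_candidates)
    ]

# Toy data: three genes in two clusters, measured at three time-points.
expression = {"g1": [1.0, 2.0, 3.0], "g2": [3.0, 4.0, 5.0], "g3": [0.0, 0.0, 6.0]}
cluster_of = {"g1": "A", "g2": "A", "g3": "B"}
means, counts = cluster_mean_profiles(expression, cluster_of)
print(means, counts)
```

Averaging within a cluster damps noise from any single gene, and pooling across clusters lets one new time-point be chosen for the whole experiment rather than per gene.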

Results

We have applied this method to both simulated and real biological data. Our algorithm performs well for both uniform and non-uniform response data. For biological data, we were able to reduce the number of time-points sampled by as much as 17% without significantly affecting the biological results.

References:

[1] Z. Bar-Joseph, G. Gerber, T.S. Jaakkola, D.K. Gifford, and I. Simon. Continuous representations of time series gene expression data. Journal of Computational Biology, 10(3-4): pp. 341-356, 2003.

[2] D.J. Cummins, T.G. Filloon, and D. Nychka. Confidence intervals for nonparametric curve estimates: Toward more uniform pointwise coverage. Journal of the American Statistical Association, 96(453): pp. 233-246, 2001.

Computer Science and Artificial Intelligence Laboratory (CSAIL)
The Stata Center, Building 32 - 32 Vassar Street - Cambridge, MA 02139 - USA
tel:+1-617-253-0073 - publications@csail.mit.edu
(Note: On July 1, 2003, the AI Lab and LCS merged to form CSAIL.)