From the 21 users, we selected five representative examples. The first three (users 1, 3, and 20) show how the model performs when it can use related users' data to make predictions. With a single labeled data point, the model groups user 1 with two other military personnel (users 5 and 8). While at each step the model makes predictions by averaging over all tree-consistent partitions, the MAP partition listed in the figure makes the largest contribution. For user 1, the MAP partition changes at each step, yielding superior predictive performance. For user 3, in contrast, the model settles on, and thereafter keeps, a MAP partition that groups users 1 and 3. In the third example, user 20 is grouped with user 9 initially and then again at a later stage. Roughly one third of the users saw improved initial performance that tapered off as the number of examples grew.
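To make the averaging over tree-consistent partitions concrete, the following minimal Python sketch enumerates the partitions consistent with a binary BHC tree; in the model, predictions under these partitions are averaged with weights derived from the tree's merge probabilities. The recursion (a subtree is either kept whole or split into the cross product of its children's partitions) follows the standard BHC definition, but the Node class and helper functions are illustrative assumptions, not the authors' code.

    class Node:
        # Hypothetical binary-tree node; leaves carry a user label.
        def __init__(self, label=None, left=None, right=None):
            self.label, self.left, self.right = label, left, right

        def is_leaf(self):
            return self.left is None

        def leaves(self):
            if self.is_leaf():
                return {self.label}
            return self.left.leaves() | self.right.leaves()

    def tree_consistent_partitions(node):
        # A tree-consistent partition either keeps this subtree as a
        # single cluster or combines any consistent partition of the left
        # child with any consistent partition of the right child.
        if node.is_leaf():
            return [[{node.label}]]
        parts = [[node.leaves()]]
        for lp in tree_consistent_partitions(node.left):
            for rp in tree_consistent_partitions(node.right):
                parts.append(lp + rp)
        return parts

    # Example: the tree ((1, 5), 8) has three consistent partitions:
    # {1,5,8};  {1,5} {8};  and {1} {5} {8}.
    tree = Node(left=Node(left=Node(label=1), right=Node(label=5)),
                right=Node(label=8))
    print(tree_consistent_partitions(tree))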
The fourth example (user 17) illustrates that, in some cases, the initial performance for a user with very few samples does not improve because there are no good candidate related users with which to cluster. Finally, the last example shows one of the four cases where the clustered model's predictions lead to worse performance. In this case, the model groups user 7 with user 1, and it is not until 128 samples that it recovers from this mistake and achieves equal performance.
Figure 2: Progression of trees found by BHC for 16, 32, 64, and 128 examples per user. Short vertical edges indicate that two tasks are strongly related; long vertical edges indicate that the tasks are unrelated. Key: (M)ilitary, (P)rofessor, (S)RI researcher.
Figure 2 shows the trees and corresponding partitions recovered by the BHC algorithm as the number of training examples per user increases. The partitions fall along understandable lines: military personnel are most often grouped with other military personnel, and professors and SRI researchers are grouped together until there is enough data to warrant splitting them apart.
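The tree itself can be sketched as a greedy agglomerative procedure in the spirit of BHC (Heller and Ghahramani, 2005): start with each user in its own cluster and repeatedly merge the pair whose union is best supported by the data. In the sketch below, log_ml is a hypothetical stand-in for the log marginal likelihood of a set of users' data under a single shared Naive Bayes parameter vector; the true BHC merge score also includes the Dirichlet Process prior term, which is omitted here for brevity.

    import itertools

    def greedy_merge(users, log_ml):
        # Agglomerative merging: at each step, join the two clusters whose
        # combined data gains the most log marginal likelihood relative to
        # keeping them separate (an approximation to the full BHC score).
        clusters = [frozenset([u]) for u in users]
        merges = []
        while len(clusters) > 1:
            a, b = max(itertools.combinations(clusters, 2),
                       key=lambda p: log_ml(p[0] | p[1])
                                     - log_ml(p[0]) - log_ml(p[1]))
            clusters = [c for c in clusters if c not in (a, b)] + [a | b]
            merges.append((a, b))
        return merges  # the merge order defines the binary tree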
Figure 3 shows the relative performance of the clustered and standard Naive Bayes models. The clustered variant outperforms the standard model when very few examples are available. After 32 examples, the two models perform roughly equivalently, although the standard model enjoys a slight advantage that does not grow with more examples.
Figure 3: The clustered model has more area under the ROC curve than the standard model when less data is available. After 32 training examples, the standard model has enough data to match the performance of the clustered model. Dotted lines show standard error.
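The curves and error bands in Figure 3 summarize area under the ROC curve across users. A minimal sketch of that summary, assuming per-user arrays of labels and scores (the paper does not include its evaluation code), could look like the following.

    import numpy as np
    from sklearn.metrics import roc_auc_score

    def mean_auc_with_se(per_user_results):
        # per_user_results: list of (y_true, y_score) pairs, one per user.
        # Returns the mean AUC across users and its standard error,
        # corresponding to the dotted bands in Figure 3.
        aucs = np.array([roc_auc_score(y, s) for y, s in per_user_results])
        return aucs.mean(), aucs.std(ddof=1) / np.sqrt(len(aucs))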
The central goal of this work was to evaluate the Clustered Naive Bayes model in a transfer-learning setting. We measured its performance on a real-world meeting acceptance task and showed that the clustered model can use related users' data to provide better predictions even with very few examples.
The Clustered Naive Bayes model uses a Dirichlet Process prior to couple the parameters of several models applied to separate tasks. This approach is immediately applicable to any collection of tasks whose data are modelled by the same parameterized family of distributions, whether those models are generative or discriminative. This paper suggests that clustering parameters with the Dirichlet Process is worthwhile and can improve predictive performance when we are presented with multiple related tasks. A theoretical question that deserves attention is whether this technique yields improved generalization bounds. A logical next step is to investigate this model of sharing with more sophisticated base models and to relax the assumption that users in the same cluster are exactly identical.
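As a concrete illustration of the coupling, the sketch below draws a partition of tasks from a Chinese restaurant process (the sequential view of a Dirichlet Process prior) and shares a single Naive Bayes parameter vector within each cluster. The task count matches the paper's 21 users, but the feature dimension, Beta/Bernoulli likelihood, and concentration parameter are illustrative assumptions, not the paper's specification.

    import numpy as np

    rng = np.random.default_rng(0)

    def crp_partition(n_tasks, alpha):
        # Chinese restaurant process: task i joins an existing cluster
        # with probability proportional to its size, or starts a new
        # cluster with probability proportional to alpha.
        assign = []
        for i in range(n_tasks):
            counts = np.bincount(assign) if assign else np.array([], dtype=int)
            probs = np.append(counts, alpha) / (i + alpha)
            assign.append(int(rng.choice(len(probs), p=probs)))
        return assign

    # Tasks in the same cluster share one set of class-conditional
    # feature probabilities (Bernoulli features under a Beta prior --
    # illustrative choices, not the paper's base model).
    n_tasks, n_features, alpha = 21, 10, 1.0
    assign = crp_partition(n_tasks, alpha)
    cluster_params = {k: rng.beta(1.0, 1.0, size=n_features)
                      for k in set(assign)}
    task_params = [cluster_params[k] for k in assign]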
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA), through the Department of the Interior, NBC, Acquisition Services Division, under Contract No. NBCHD030010.