Using Term Informativeness for Named Entity Detection

Jason D. M. Rennie & Tommi Jaakkola

Informal Communication

We are interested in the problem of extracting information from informal, written communication. At the time of this writing, Google.com catalogs eight billion web pages. There are easily that many e-mail, newsgroup and bulletin-board posts every day. The web is filled with information, but even more information is available in the informal communications people send and receive every day. We call this communication informal because its structure is not explicit and the writing is not fully grammatical. Web pages are highly structured: they use links, headers and tags to mark up the text and identify important pieces of information. Newspaper text is harder to deal with; gone is the computer-readable structure. But newspaper articles have proper grammar with correct punctuation and capitalization, so part-of-speech taggers achieve high tagging accuracy on newspaper text. With informal communication, even these basic cues are noisy. In e-mail and posts to bulletin boards, grammar rules are bent, capitalization is ignored or used haphazardly and punctuation use is creative. There is good reason why little work has been done on this topic: the problem is challenging, and data can be difficult to obtain due to privacy issues. Yet the volume of information available in informal communication makes us believe that chipping away at this information extraction problem is a worthwhile endeavor.

Specific Application: Restaurants

Restaurants are one subject where informal communication is highly valuable. Much information about restaurants can be found on the web and in newspaper articles; Zagat's publishes restaurant guides. Restaurants are also discussed on mailing lists and bulletin boards. When a new restaurant opens, it often takes weeks or months before reviews are published on the web or in the newspaper (Zagat's guides take even longer). However, restaurant bulletin boards contain information about new restaurants almost immediately after they open (sometimes even before they open). They are also "up" on major changes: a temporary closure, new management, better service or a drop in food quality. This information is difficult to find elsewhere. As an exploratory task, we consider the problem of extracting restaurant names from bulletin board postings. Systems can accurately extract named entities from newspaper articles, but they rely heavily on capitalization, punctuation and correct part-of-speech information. In informal communication, much of this information is noisy, so other features need to be incorporated.

Term Informativeness

It has been found that named entities, like restaurant names, are highly relevant to the topic of a document [1]. If we had a good measure of how topic-oriented, or "informative," each word is, we would be better able to identify named entities. It is well known that informative words have "peaked" or "heavy-tailed" frequency distributions [2], yet most term frequency models do not take advantage of this fact. Many informativeness scores have been introduced, including Inverse Document Frequency (IDF) [3], Residual IDF [4] and x^I [5]. Only x^I makes direct use of the fit of a word's frequency statistics to a peaked/heavy-tailed distribution, yet it does a poor job of finding informative words. We introduce a new informativeness score based on the fit of a word's frequency statistics to a mixture of two unigram distributions.
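To make these scores concrete, here is a minimal sketch in Python. The IDF and Residual IDF formulas follow [3] and [4]. For the mixture score, the abstract does not spell out the exact two-unigram parameterization, so the sketch stands in a two-component Poisson mixture fit by EM (in the spirit of [2]) and scores a word by the log-likelihood gain of the mixture over a single distribution; all function names are illustrative, not from the original system.

```python
import math

def idf(df, n_docs):
    """Inverse Document Frequency (Sparck Jones [3])."""
    return -math.log2(df / n_docs)

def residual_idf(df, cf, n_docs):
    """Residual IDF (Church & Gale [4]): observed IDF minus the IDF a
    Poisson model predicts from the word's average rate. Informative
    words are more 'peaked' than Poisson, giving a positive residual."""
    lam = cf / n_docs                    # mean occurrences per document
    return idf(df, n_docs) + math.log2(1.0 - math.exp(-lam))

def _poisson_logpmf(k, lam):
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def _logaddexp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))

def mixture_score(counts, n_iter=100):
    """Fit a two-component Poisson mixture to one word's per-document
    counts (zeros included) by EM, and return the log-likelihood gain
    over a single Poisson. Assumes the word occurs at least once."""
    n = len(counts)
    mean = sum(counts) / n               # must be > 0
    single = sum(_poisson_logpmf(k, mean) for k in counts)
    # init: a low-rate 'background' and a high-rate 'topical' component
    lam0, lam1, pi = 0.5 * mean + 1e-9, 2.0 * mean + 1e-3, 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the topical component per document
        resp = []
        for k in counts:
            a = math.log(pi) + _poisson_logpmf(k, lam1)
            b = math.log(1.0 - pi) + _poisson_logpmf(k, lam0)
            resp.append(math.exp(a - _logaddexp(a, b)))
        # M-step: re-estimate the mixing weight and the two rates
        r = sum(resp)
        pi = min(max(r / n, 1e-6), 1.0 - 1e-6)
        lam1 = sum(p * k for p, k in zip(resp, counts)) / max(r, 1e-12) + 1e-9
        lam0 = sum((1 - p) * k for p, k in zip(resp, counts)) / max(n - r, 1e-12) + 1e-9
    mixture = sum(_logaddexp(math.log(pi) + _poisson_logpmf(k, lam1),
                             math.log(1.0 - pi) + _poisson_logpmf(k, lam0))
                  for k in counts)
    return max(mixture - single, 0.0)    # the single Poisson is nested

# toy per-document counts for two words across ten posts
peaked = [5, 0, 0, 0, 0, 0, 0, 0, 0, 0]      # concentrated in one post
spread = [1, 1, 1, 0, 1, 0, 1, 0, 1, 0]      # scattered evenly
for name, c in [("peaked", peaked), ("spread", spread)]:
    df, cf, n = sum(k > 0 for k in c), sum(c), len(c)
    print(name, idf(df, n), residual_idf(df, cf, n), mixture_score(c))
```

On the peaked count vector the mixture gain is large, while on the evenly spread word it is near zero, matching the intuition that informative words concentrate in a few topical documents.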
We find that this mixture-based score is effective at identifying topic-centric words. We also find that it combines well with IDF: our combined IDF/Mixture score is highly effective at identifying informative words. In our restaurant extraction task, only one other informativeness score, Residual IDF, is competitive. Using the combined IDF/Mixture score, our ability to identify restaurant names is significantly better than when using capitalization, punctuation and part-of-speech information alone.
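As a rough illustration of how such a score can supplement the noisy surface cues, the sketch below builds per-token feature dictionaries that combine capitalization and punctuation cues with a binned informativeness score. The feature names, the binning and the score table are assumptions for illustration only; the abstract does not specify the extractor's exact feature set or learner.

```python
import string

def token_features(tokens, i, info_score):
    """Feature dict for token i: noisy surface cues plus a binned
    term-informativeness score. Names and binning are illustrative."""
    w = tokens[i]
    s = info_score.get(w.lower(), 0.0)   # e.g., the combined IDF/Mixture score
    return {
        "lower": w.lower(),
        "is_capitalized": w[:1].isupper(),    # unreliable in informal text
        "all_caps": w.isupper(),
        "has_digit": any(c.isdigit() for c in w),
        "has_punct": any(c in string.punctuation for c in w),
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
        "info_bin": min(int(s), 10),          # coarse bin helps generalization
    }

# a post with haphazard casing: "pepe's" (a hypothetical name) gets no
# capitalization cue, so the informativeness feature must carry the signal
post = "we tried pepe's last night , service was slow".split()
scores = {"pepe's": 6.2}                      # hypothetical score table
features = [token_features(post, i, scores) for i in range(len(post))]
```

Feature dictionaries of this kind can be vectorized and fed to any standard token classifier or sequence model; part-of-speech features, where a tagger is available, would be added in the same way.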
[Table: restaurant name extraction performance (F1) for the Baseline and Baseline+Informativeness extractors.]

The table above shows performance for extracting restaurant names from a collection of bulletin board posts. As can be seen, the task is difficult: performance is much worse than on the task of extracting named entities from newspaper articles, where F1 scores around 90% are common (higher is better). The "Baseline" extractor uses features normally used for named entity extraction, such as punctuation, capitalization and part-of-speech tags. The "Baseline+Informativeness" extractor uses those features in addition to our informativeness score. The F1 score improves by more than four percentage points with the addition of this single term informativeness feature; additional tests we conducted showed that this difference is significant at the p = 0.05 level. Since the features traditionally used for named entity extraction are noisy in informal communication, additional information is necessary for accurate prediction. We find that word statistics can be used to determine the informativeness of a word, which in turn is indicative of whether or not the word appears as part of a restaurant name.

Future Work

We have already achieved a significant gain in performance using the Mixture model. While it models term frequency better than the traditional unigram model, there is room for improvement: early tests with a heavy-tailed, log-log term frequency model indicate a further-improved ability to capture empirical term frequency distributions. Beyond named entity extraction, we are also investigating improved methods for co-reference resolution and collaborative filtering. Even more so than newspaper articles, e-mails and bulletin board posts use context to refer to entities and concepts; methods of context and reference resolution that do not rely on proper syntax and grammar are needed to accomplish complex extraction tasks on informal communication. One such task is collaborative prediction, where preferences are extracted from informal communication. We have developed novel collaborative filtering algorithms [6] to handle preference prediction.

Research Support

This project is supported in part by the DARPA CALO project.

References

[1] Chris Clifton and Robert Cooley. TopCat: Data Mining for Topic Identification in a Text Corpus. In Proceedings of the 3rd European Conference on Principles and Practice of Knowledge Discovery in Databases, 1999.
[2] Kenneth W. Church and William A. Gale. Poisson Mixtures. Natural Language Engineering, 1(2):163--190, 1995.
[3] Karen Sparck Jones. Index Term Weighting. Information Storage and Retrieval, 9:619--633, 1973.
[4] Kenneth W. Church and William A. Gale. Inverse Document Frequency (IDF): A Measure of Deviation from Poisson. In Proceedings of the Third Workshop on Very Large Corpora, pages 121--130, 1995.
[5] Abraham Bookstein and Don R. Swanson. Probabilistic Models for Automatic Indexing. Journal of the American Society for Information Science, 25(5):312--318, 1974.
[6] Nathan Srebro, Jason D. M. Rennie and Tommi S. Jaakkola. Maximum-Margin Matrix Factorization. In Advances in Neural Information Processing Systems 17, 2005.