Automatically Generating Medical Ontologies
Ozlem Uzuner, Tawanda Sibanda & Peter Szolovits
An ontology represents the objects and concepts in a domain as well as the relations between them. For medical texts, most researchers use the UMLS ontology, and particularly the MeSH database of medical concepts. These ontologies, while useful, are incomplete and noisy.
Our research aims to automatically generate an ontology: it uses co-training and bootstrapping to iteratively improve precision and recall of existing concept hierarchies and to identify new concept groups that may be absent from current resources. This approach is domain independent: using only a few seed words and syntactic patterns, our iterative scheme compiles ontological information.
Natural language texts contain ample cues for automatically identifying hyponym-hypernym pairs. For example, the syntactic relation between the words dog and animal in the sentence "dog is an animal" indicates that dog is a hyponym of animal. Snow et al. exploit such syntactic relations between pairs of concepts to learn hyponym-hypernym relations. Their algorithm initially extracts hyponym-hypernym pairs from WordNet. Given these seed relations, it scans a 6-million-sentence corpus for each hyponym-hypernym pair and finds all of the sentences in which both words occur. From these sentences, it extracts the syntactic dependency paths linking the two words and generates an n-dimensional feature vector, where each dimension represents a syntactic dependency. Given a previously unseen pair of words, Snow et al.'s algorithm searches for instances in the corpus in which the words occur together and increments the count of the dimensions that represent the observed syntactic dependencies of the pair. The resulting feature vector is input to a naïve Bayes classifier, which returns a binary value indicating whether or not the pair stands in a hyponym-hypernym relation.
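A minimal sketch of this classification step follows; the path names, probabilities, prior, and smoothing constant are illustrative assumptions, not values from Snow et al.:

```python
# Sketch of a Snow-et-al.-style classification step (simplified).
# Dependency paths, probabilities, and the prior are toy assumptions.
from collections import Counter
import math

def naive_bayes_score(path_counts, pos_path_probs, neg_path_probs, prior_pos):
    """Log-odds that a word pair stands in a hyponym-hypernym relation,
    given counts of dependency paths observed between the two words."""
    score = math.log(prior_pos) - math.log(1.0 - prior_pos)
    for path, n in path_counts.items():
        p_pos = pos_path_probs.get(path, 1e-6)  # crude smoothing for unseen paths
        p_neg = neg_path_probs.get(path, 1e-6)
        score += n * (math.log(p_pos) - math.log(p_neg))
    return score

# Toy distributions: "X such as Y" is far likelier for true hypernym pairs.
pos = {"X such as Y": 0.4, "X and Y": 0.05}
neg = {"X such as Y": 0.01, "X and Y": 0.2}
observed = Counter({"X such as Y": 3})
is_pair = naive_bayes_score(observed, pos, neg, prior_pos=0.1) > 0
```

A pair observed mostly in conjunction contexts ("X and Y") would score below zero under the same toy distributions, so the classifier would reject it.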
Thelen and Riloff propose a bootstrapping approach to learning semantic lexicons. They define semantic categories, such as buildings, people, or cars; collect seed words for each category; and identify context patterns containing the seed words. Using the identified context patterns, they extract more words from the corpus and divide these words into their semantic categories. Next, they use the newly identified words to find additional context patterns in the corpus. These new patterns are combined with the previous collection of context patterns, and the entire process is repeated.
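The bootstrapping loop can be sketched as follows for a single category; the tiny corpus, the seed pattern, and the naive promotion of word contexts to new patterns are toy stand-ins for the statistically scored patterns in Thelen and Riloff's system:

```python
# Minimal bootstrapping sketch for one semantic category ("buildings").
# Corpus, seed pattern, and pattern promotion are illustrative assumptions.
import re

corpus = [
    "buildings such as the library were closed",
    "the library was restored last year",
    "the chapel was restored last year",
]

patterns = [r"buildings such as the (?P<word>\w+)"]  # seed context pattern
lexicon = set()                                      # learned category members

for _ in range(2):  # two bootstrapping iterations
    # Step 1: extract new category members with the current patterns.
    for pat in patterns:
        for sent in corpus:
            for m in re.finditer(pat, sent):
                lexicon.add(m.group("word"))
    # Step 2: promote contexts of known members to candidate patterns.
    for word in sorted(lexicon):
        for sent in corpus:
            marker = f"the {word} was "
            if marker in sent:
                suffix = sent.split(marker, 1)[1]
                cand = rf"the (?P<word>\w+) was {re.escape(suffix)}"
                if cand not in patterns:
                    patterns.append(cand)
```

After the first iteration the lexicon holds only library; the pattern learned from library's context then pulls in chapel on the second pass.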
In the medical domain, Kashyap et al. apply document clustering techniques, e.g., K-means, to iteratively group MEDLINE abstracts into hierarchies. From these hierarchies, they create a taxonomy of terms that best represent the created clusters. Fiszman et al. exploit particular syntactic structures, such as definition constructs and appositives, to extract hypernymy relations from text. They use lexical items such as "including", "such as", "particularly", and "especially" to identify some relations. To identify the hypernym and the hyponym in cases of nominal modification, they use the UMLS Metathesaurus, the Semantic Network, and the SPECIALIST lexicon.
Our approach is a synthesis of the approaches described in related work. Like Thelen and Riloff, we use context patterns to extract hyponyms of a concept. However, our algorithm does not start by explicitly defining the concepts we would like to extract. Instead, we define a few syntactic dependency patterns, e.g., "NP such as NP", as was done by Fiszman et al. Using our seed syntactic patterns, we extract hyponym-hypernym pairs from the corpus. Once we have our initial pairs, we proceed with Thelen and Riloff's bootstrapping algorithm: we determine additional context patterns, extract new hyponyms using these patterns, use the new hyponyms to refine our pattern definitions, extract more hyponyms, and so on.
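As a rough illustration, the seed extraction step might look like the following, with a regular expression standing in for a true syntactic "NP such as NP" pattern; the regex, helper name, and example sentence are our own simplifications:

```python
# Sketch of seed-pattern extraction. A regex approximates the syntactic
# "NP such as NP" pattern; in practice this would operate on parsed text.
import re

SEED = re.compile(r"(?P<hypernym>\w+) such as (?P<h1>\w+)(?: and (?P<h2>\w+))?")

def seed_pairs(sentence):
    """Return (hyponym, hypernym) pairs matched by the seed pattern."""
    pairs = []
    for m in SEED.finditer(sentence):
        for group in ("h1", "h2"):
            if m.group(group):
                pairs.append((m.group(group), m.group("hypernym")))
    return pairs

pairs = seed_pairs("proteins such as HcA4 and Pap1 regulate transcription")
# → [('HcA4', 'proteins'), ('Pap1', 'proteins')]
```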
Initially, we begin by defining syntactic/lexical patterns that relate concepts to instances, e.g., “<concept> such as <instances>”, where the words in angle brackets indicate items that can match words in the corpus. Using these patterns, we extract hyponym-hypernym pairs from the corpus. Suppose the corpus contains a sentence that begins as follows:
“Proteins such as HcA4 and Pap1 ...”
From this sentence we can extract the hypernym, proteins, and the hyponyms, HcA4 and Pap1. The next step is to use the new hyponyms to identify additional patterns. Suppose that the corpus contains the following excerpt:
“Pap1’s tertiary structure is characterized by ...”
This sentence reveals that the pattern “<protein instance>’s tertiary structure” might be useful in extracting proteins from the text. After evaluating the effectiveness of the pattern in recognizing valid pairs, we apply this pattern to the corpus. If, for example, the corpus contains the sentence, “Pfs2’s tertiary structure is characterized by …”, this pattern will identify Pfs2 as a protein. At each stage, only the patterns that have high recall and precision will be added to our collection of patterns.
This project is funded by "National Multi-Protocol Ensemble for Self-Scaling Systems for Health" (NMESH) contract # N01-LM-3-3515.
George Miller, Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. "Introduction to WordNet: An On-line Lexical Database". International Journal of Lexicography, 3(4), 235--312, 1990.
Michael Thelen and Ellen Riloff. "A Bootstrapping Method for Learning Semantic Lexicons Using Extraction Pattern Contexts". Proceedings of the Conference on Empirical Methods in Natural Language Processing, 2002.