Extraction of Diseases from Discharge Summaries
We are constructing software that extracts diseases and procedures from the medical history, hospital course, and discharge diagnosis sections of patient discharge summaries so that they can serve as indices for case searches. The system uses the SNOMED-CT medical dictionary in the UMLS to map the extracted concepts to standard codes. This provides a common terminology for the concepts and ensures that all of the cases with a given disease can be found even if different phrases are used to describe it.
Clinical research of many kinds depends on the ability to find the cases pertinent to the problem. The necessary information is usually contained in the patient discharge summary. The problem is that these summaries are written by hand. They tend to follow a common format with sections covering the present illness, past medical history, hospital course, and so forth, but the layout and wording vary considerably. While the authors attempt to avoid abbreviations and local terms, the same condition is still described in different ways, such as "type-2 diabetes" for "non-insulin dependent diabetes mellitus".
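As an illustration of how the common section format might be exploited, the following is a minimal sketch that splits a summary into named sections. The heading list, the function name split_sections, and the assumption that each heading sits on its own line are all hypothetical and are not taken from the system described here; real summaries would likely need a longer and more forgiving heading list.

```python
import re

# Hypothetical section headings; actual headings vary between hospitals.
SECTION_HEADINGS = [
    "history of present illness",
    "past medical history",
    "hospital course",
    "discharge diagnosis",
]

# A heading is assumed to occupy its own line, optionally followed by a colon.
HEADING_PATTERN = re.compile(
    r"^[ \t]*(" + "|".join(re.escape(h) for h in SECTION_HEADINGS) + r")[ \t]*:?[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)

def split_sections(text):
    """Return a dict mapping each recognized heading (lowercased) to its text."""
    sections = {}
    matches = list(HEADING_PATTERN.finditer(text))
    for i, match in enumerate(matches):
        # A section runs from the end of its heading to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1).lower()] = text[match.end():end].strip()
    return sections

summary = (
    "PAST MEDICAL HISTORY:\n"
    "Type-2 diabetes.\n"
    "\n"
    "HOSPITAL COURSE:\n"
    "Uneventful.\n"
)
sections = split_sections(summary)
```

Only the sections of interest (for example, past medical history and hospital course) would then be passed on for disease and procedure extraction.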
There have been a number of efforts to extract useful data from discharge summaries. Friedman's group at Columbia has been prominent in using natural language processing, with their MedLEE parser, to interpret discharge summaries and extract diseases and other information from them[1,2]. Another effort used triggering words to look for adverse events, with less success because of the variety of ways these events may be expressed.
Program of Development
We have developed a program for extracting the diseases and procedures from patient discharge summaries and coding them, complete with coded and unrecognized modifiers, using SNOMED-CT from the UMLS. The program uses a limited amount of natural language processing, relying only on a small dictionary (fewer than 200 words) to divide disease statements into phrases for coding. With these short phrases, the program is able to quickly find the most specific codes available in SNOMED-CT for the statements in the summary.
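The general idea of splitting a statement on a small word dictionary and then looking up the most specific code can be sketched as follows. This is an illustrative approximation, not the program itself: the divider words, the toy terminology table, the placeholder codes (which are not real SNOMED-CT identifiers), and the longest-match rule used here as a stand-in for "most specific" are all assumptions.

```python
# Hypothetical divider words standing in for the small phrase-splitting dictionary.
DIVIDERS = {"and", "with", "without", "for"}

# Toy terminology: phrase -> placeholder code (NOT real SNOMED-CT identifiers).
TERMINOLOGY = {
    "diabetes mellitus": "T001",
    "non-insulin dependent diabetes mellitus": "T002",
    "type-2 diabetes": "T002",
    "hypertension": "T003",
    "renal failure": "T004",
    "chronic renal failure": "T005",
}

def split_phrases(statement):
    """Split a disease statement into word lists at divider words."""
    phrases, current = [], []
    for raw in statement.lower().split():
        word = raw.strip(".,;:")
        if not word:
            continue
        if word in DIVIDERS:
            if current:
                phrases.append(current)
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(current)
    return phrases

def code_phrase(words):
    """Return (code, matched_term, leftover_words) for the longest matching span.

    Leftover words play the role of unrecognized modifiers.
    """
    for length in range(len(words), 0, -1):
        for start in range(len(words) - length + 1):
            candidate = " ".join(words[start:start + length])
            if candidate in TERMINOLOGY:
                leftover = words[:start] + words[start + length:]
                return TERMINOLOGY[candidate], candidate, leftover
    return None, None, words

def code_statement(statement):
    """Code every phrase in a statement."""
    return [code_phrase(phrase) for phrase in split_phrases(statement)]

results = code_statement(
    "Type-2 diabetes with severe chronic renal failure and hypertension."
)
```

In this sketch "chronic renal failure" is preferred over the shorter "renal failure", "severe" is carried along as an unrecognized modifier, and "type-2 diabetes" receives the same placeholder code as "non-insulin dependent diabetes mellitus", which is what allows a single code to retrieve cases described with different phrases.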
The program has been developed and tested on 23 discharge summaries containing 250 phrases to be coded. The program successfully codes all but 10 of these phrases, with 19 false positives. We expect that the program will be an effective tool for providing a disease and procedure index for a large set of discharge summaries from the same source. Based on our examination of discharge summaries from other hospitals, the technique should be transferable, possibly with changes to the code that finds the appropriate sections.
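These counts can be turned into approximate recall and precision figures. The short calculation below is a sketch under two assumptions that the text does not state explicitly: that the 240 phrases not counted as missed were all coded correctly, and that the 19 false positives are additional codes beyond those 240. The function name is illustrative.

```python
def recall_and_precision(total_phrases, missed, false_positives):
    """Compute recall and precision from the reported counts."""
    true_positives = total_phrases - missed
    recall = true_positives / total_phrases
    precision = true_positives / (true_positives + false_positives)
    return recall, precision

recall, precision = recall_and_precision(total_phrases=250, missed=10, false_positives=19)
print(f"recall={recall:.3f} precision={precision:.3f}")
# prints: recall=0.960 precision=0.927
```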
We are now in the process of refining the program to work on a much larger set of 2000 discharge summaries.