Abstracts - 2006
Hierarchical Recursive Feature Elimination: A Proposed Method for Reducing the Set of Features Used in an EEG-based Epileptic Seizure Detector
This research is concerned with reducing the number of channels and features computed from each channel of an EEG-based, patient-specific epileptic seizure detector , while still maintaining performance. The current Recursive Feature Elimination (RFE)  implementation we employ can significantly reduce the number of channels (and therefore electrodes) within practical time constraints. Yet RFE, an SVM-based greedy feature elimination algorithm that has become an established method for feature selection, can be computationally expensive. Numerous methods have been introduced to make RFE run within practical time constraints, such as E-RFE , SQRT-RFE, and 1-RFE .
The proposed hierarchical RFE method is another approach to speeding up the RFE process, which it is hoped will enable the removal of not only channels but individual features computed from the channels as well. Such a reduction will decrease overall computational complexity and power consumption, which will be especially important for mobile seizure detector systems.
The electroencephalogram (EEG) is an electrical record of brain activity that is collected using an array of electrodes uniformly distributed on a subject’s scalp. The seizure detector  declares onsets based on a collective examination of features extracted from 21 EEG channels. A channel is defined as the difference between a pair of (typically adjacent) electrodes. The detector is trained and tested on the dataset collected by Shoeb.
RFE uses the internal workings of a Support Vector Machine (SVM) to rank features. An SVM is trained on feature vectors derived from examples of two classes. The apparent importance of each feature is derived from the orientation of the class-separating hyperplane. The feature(s) with the least apparent importance are removed. The remaining features are used to retrain the SVM for the next iteration. The process is repeated recursively until some stopping criterion is met .
Motivation and Proposed Method
Since the number of features (4 per channel, 84 total) is large, the existing system does elimination on a channel-by-channel basis . At each iteration, the channel whose features contribute least to discrimination between the seizure and non-seizure classes is removed. If one begins with 21 channels and stops when all but one have been eliminated, the process takes one to three hours for each patient on a state-of-the-art desktop computer but can take as many as seven hours for patients with a relatively large amount of recorded data or whose seizure and normal data are particularly hard to discriminate between. Using this process to eliminate features individually would increase the processing time by a factor of four. Posed as a search space problem, how can this space, which is too large, be effectively narrowed?
Because the patient-specific detector is trained on only one patient’s data, the number of data points is small; the ratio of abnormal (seizure) class data points to features is roughly 1:1. The recommended ratio of data points to features is at least 10:1 . Whittling down the channels to the best subset is, consequently, vulnerable to random variation, possibly causing unimportant channels to be kept while important channels are eliminated. This would be evident if a patient’s seizures were localized to one side of the head, and the final selected subset included an electrode from the opposite hemisphere.
To prevent this and integrate a priori knowledge, one can consider meta-features: groups of spatially similar channels. Constraining the selection of channels by grouping them based on anatomical information could make the process more robust.
This paper proposes an algorithm for finding a robust subset of channels, disregarding the unnecessary features from those channels, and completing the process within practical time constraints. The algorithm takes advantage of the data’s hierarchical structure by first performing recursive group elimination until a stopping criterion (such as a threshold of performance) is met. Then recursive channel elimination is performed on the channels remaining after eliminating unnecessary groups. Finally, recursive feature elimination is performed on the features from the remaining channels. The same stopping criterion is used at each level. This algorithm will be referred to in the rest of this paper as Hierarchical Recursive Feature Elimination or HRFE.
Though a literature search for hierarchical feature selection retrieves many papers using this terminology, it was only after this paper’s submission that a manuscript  was found, still in press, proposing hierarchical feature selection in a manner similar to H-RFE. The main difference is that the method proposed by Dijck et al. bundles correlated features instead of using a priori knowledge to bundle features (which will not necessarily be correlated).
Since H-RFE is a multi-resolution feature selection method by which it is possible to zoom in on a subset of features and potentially narrow the search space rapidly, it is similar to methods that remove an exponentially decreasing number of electrodes at each iteration. Yet H-RFE incorporates scale: it transitions from eliminating groups of channels (low resolution) to individual features (high resolution).
This proposed method is similar to Entropy-based RFE (E-RFE) in its ability to automatically adapt to the data . With E-RFE, the number of features removed in any one iteration is based upon the percentage of features determined to be unimportant. If a data-based stopping criterion is used, H-RFE will adapt to the data as well, determining at each scale when no further elements should be eliminated and it is necessary to recurse down to a finer scale.
Initial Results and Future Work
Though H-RFE has not yet been completely implemented, its operation on a single (channel) scale is functional and has been applied to twenty patients’ data. The stopping criterion is based on the margin, or distance between classes’ support vectors, in the feature space of the SVM used to rank the features, as well as cross-validation accuracy.
For the twenty patients analyzed so far, the number of channels was reduced, on average, from 21 to 8.35. Compared to the original 21-channel seizure detectors, the reduced channel detectors had a 12 percent increase in latency (0.9 seconds), with an 11 percent decrease in false alarms and 12 percent increase in misses, on average (results pooled across all patients). Eventually, using the full multi-scale implementation of H-RFE, it may be possible, as a result of narrowing the search space through the use of multiple scales (such as groups of channels), to eliminate superfluous features from the remaining channels within a practical amount of time. Though the concept of hierarchical feature selection has already been recently proposed, this method of bundling features–not based on statistical correlations, which are vulnerable to random variation when analyzing small datasets, but on a priori knowledge–may be particularly suited to problems in which the data lends itself to hierarchical organization.
I gratefully acknowledge many helpful discussions with Dorothy Curtis, Martin Glassman, Prof. John Guttag, Daniel Leeds, Eugene Shih, Ali Shoeb, and Zeeshan Syed.
This research is supported by CIMIT, the Center for the Integration of Medicine and Innovative Technology.
 Ali H. Shoeb. Patient-specific seizure onset detection. Master’s thesis, Massachusetts Institute of Technology, 2003.
 C. Furlanello, M. Serafini, S. Merler, & G. Jurman. Entropy-Based Gene Ranking without Selection Bias for the Predictive Classification of Microarray Data. BMC Bioinformatics, 54, 2003.
 C. Furlanello, M. Serafini, S. Merler, & G. Jurman. An accelerated procedure for recursive feature ranking on microarray data. Neural Networks, 16, 641–648, 2003.
 T.N. Lal, M. Schroder, T. Hinterberger, J. Weston, M. Bogdan, N. Birbaumer, & B. Scholkopf. Support vector channel selection in BCI. IEEE Transactions on Biomedical Engineering, 51, 1003–1010, 2004.
 A.K. Jain, R.P.W. Duin, & J. Mao. Statistical pattern recognition: A review. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22, 4–37, 2000.
 G. V. Dijck., M. M. V. Hulle,
& M. Wevers. Hierarchical feature subset
selection for features computed from the continuous wavelet transform.
Proc. 2005 IEEE Workshop on Machine Learning for Signal Processing.