Submitted on March 9, 2007
Accepted on February 7, 2008
Affiliation of the authors: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD
* To whom correspondence should be addressed.
Objective The aim is to improve naïve Bayes prediction of MeSH® assignment to documents by using optimal training sets found with a method inspired by active learning.
Design We selected 20 MeSH terms whose occurrences cover a range of frequencies. For each MeSH term we found an optimal training set, which is a subset of the whole training set. An optimal training set consists of all documents assigned the given MeSH term (the C1 class) together with those documents not assigned the term (the C-1 class) that are closest to the C1 class. These small sets were used to predict MeSH assignments in the MEDLINE® database.
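The construction of such a reduced training set can be illustrated with a minimal sketch. The abstract does not say how "closest" is measured, so the sketch assumes one plausible reading: negative documents are ranked by a naïve Bayes log-odds score, and the highest-scoring ones are kept. The function names (`term_weights`, `score`, `select_training_set`), the add-one smoothing, and the fixed cutoff `n_neg` are hypothetical choices made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def term_weights(pos_docs, neg_docs):
    """Per-term log-odds weights from a multinomial naive Bayes model.

    Documents are lists of tokens. Add-one smoothing is an assumption
    of this sketch, not a detail given in the abstract.
    """
    pos_counts = Counter(t for d in pos_docs for t in d)
    neg_counts = Counter(t for d in neg_docs for t in d)
    vocab = set(pos_counts) | set(neg_counts)
    pos_total = sum(pos_counts.values()) + len(vocab)
    neg_total = sum(neg_counts.values()) + len(vocab)
    return {
        t: math.log((pos_counts[t] + 1) / pos_total)
        - math.log((neg_counts[t] + 1) / neg_total)
        for t in vocab
    }


def score(doc, weights):
    """Sum of term weights; higher means more like the positive (C1) class."""
    return sum(weights.get(t, 0.0) for t in doc)


def select_training_set(pos_docs, neg_docs, n_neg):
    """Keep every positive document plus the n_neg negatives that score
    closest to the positive class (assumed meaning of 'closest')."""
    weights = term_weights(pos_docs, neg_docs)
    ranked = sorted(neg_docs, key=lambda d: score(d, weights), reverse=True)
    return pos_docs, ranked[:n_neg]


# Toy usage: one negative document shares vocabulary with the positives.
pos = [["heart", "attack"], ["heart", "failure"]]
neg = [["heart", "surgery"], ["lung", "cancer"], ["skin", "rash"]]
kept_pos, kept_neg = select_training_set(pos, neg, n_neg=1)
```

In this toy example the negative document that shares the term "heart" with the positive class is the one retained. The class prior is left out of the score because it is the same for every document and does not change the ranking.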
Measurements Average precision was used to compare MeSH assignment using the naïve Bayes learner trained on the whole training set, on optimal sets, and on random sets. We compared 95% lower confidence limits of the average precisions of naïve Bayes with upper bounds for the average precisions of a K-nearest-neighbor (KNN) classifier.
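For readers unfamiliar with the metric, the following minimal sketch computes average precision for a single ranked list. It assumes the common ranked-retrieval definition (the mean of the precision values at each rank where a relevant document appears); the abstract does not spell out the exact variant used, and the function name `average_precision` is chosen here for illustration.

```python
def average_precision(ranked_labels):
    """Average precision for one ranked list.

    ranked_labels: booleans ordered from highest to lowest classifier
    score, True where the document truly carries the MeSH term.
    """
    hits = 0
    total = 0.0
    for rank, is_relevant in enumerate(ranked_labels, start=1):
        if is_relevant:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / hits if hits else 0.0


# Relevant documents at ranks 1 and 3: (1/1 + 2/3) / 2
ap = average_precision([True, False, True])
```

A ranking that places every relevant document ahead of every irrelevant one scores 1.0 under this definition, so the measure rewards classifiers that rank true MeSH assignments near the top.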
Results For all 20 MeSH assignments, the optimal training sets produced nearly 200% improvement over use of the whole training sets. In 17 of those MeSH assignments, naïve Bayes using optimal training sets was statistically better than KNN. In 15 of those, optimal training sets performed better than optimized feature selection. Overall, naïve Bayes averaged 14% better than KNN across all 20 MeSH assignments. Using these optimal sets with another classifier, CMLS, produced an additional 6% improvement over naïve Bayes.
Conclusion Using a smaller optimal training set greatly improved learning with naïve Bayes. Its performance is superior to that of KNN. The small training set can also be used with other sophisticated learning methods, such as CMLS, for which using the whole training set would not be feasible.