
First published June 28, 2007 as JAMIA PrePrint; doi:10.1197/jamia.M2392
J Am Med Inform Assoc. 2007;14:641-650. DOI 10.1197/jamia.M2392.
© 2007 American Medical Informatics Association


Methods Paper

Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger

Kaihong Liu, MD, MS (a), Wendy Chapman, PhD (a), Rebecca Hwa, PhD (b), and Rebecca S. Crowley, MD, MS (a, c, *)

a Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh, PA
b Department of Computer Science, University of Pittsburgh School of Medicine, Pittsburgh, PA
c Department of Pathology, University of Pittsburgh School of Medicine, Pittsburgh, PA

* Correspondence and reprints: Rebecca Crowley, MD, MS, Department of Biomedical Informatics, University of Pittsburgh School of Medicine, UPMC Shadyside Cancer Pavilion—Room 307, 5230 Centre Avenue, Pittsburgh, PA 15232 (Email: crowleyrs{at}upmc.edu).

Received for publication: 01/31/07; accepted for publication: 05/21/07.

Part-of-speech (POS) tagging is an important first step for most medical natural language processing (NLP) systems. Most current statistically based POS taggers are trained on a general English corpus and consequently perform poorly on medical text. Annotated medical corpora are difficult to develop because of the time and labor required. We investigated a heuristic-based sample selection method to minimize the annotated corpus size needed to retrain a Maximum Entropy (ME) POS tagger. We developed a manually annotated domain-specific corpus (DSC) of surgical pathology reports and a domain-specific lexicon (DL). We sampled the DSC using two heuristics to produce smaller training sets and compared the performance of the retrained tagger against (1) the original ME tagger trained on general English, (2) the ME tagger retrained on the DL, and (3) the MedPost tagger trained on MEDLINE abstracts. The ME tagger retrained with the DSC was superior to the tagger retrained with the DL, and also superior to MedPost. Heuristic sample selection produced performance equivalent to that obtained with the entire training set, but with far fewer sentences. Learning curve analysis showed that sample selection would enable an 84% decrease in the size of the training set without a decrement in performance. We conclude that heuristic sample selection can be used to markedly reduce human annotation requirements for training medical NLP systems.
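To make the idea of heuristic sample selection concrete, the following is a minimal sketch only. The abstract does not specify the two heuristics used in the study, so the rule shown here (greedily prefer sentences containing the most frequent words missing from a general English vocabulary) is an assumption chosen for illustration. The function name `select_sentences` and the parameters `known_vocab` and `budget` are hypothetical and do not come from the paper.

```python
from collections import Counter


def select_sentences(sentences, known_vocab, budget):
    """Greedily pick up to `budget` sentences for manual annotation.

    ASSUMPTION: this heuristic is illustrative and is not the paper's method.
    `sentences` is a list of token lists; `known_vocab` is a set of lowercase
    words assumed to be covered by the general English training corpus.
    Returns the indices of the selected sentences, in selection order.
    """
    # Count how often each out-of-vocabulary word occurs in the whole pool.
    unknown_counts = Counter(
        w.lower() for s in sentences for w in s if w.lower() not in known_vocab
    )
    covered = set()                      # words already present in selected sentences
    remaining = list(range(len(sentences)))
    selected = []

    def gain(i):
        # Score = total pool frequency of unknown words this sentence newly covers.
        new_words = {w.lower() for w in sentences[i]} - covered
        return sum(unknown_counts[w] for w in new_words)

    while remaining and len(selected) < budget:
        best = max(remaining, key=gain)
        if gain(best) == 0:
            break                        # nothing new left to learn from the pool
        selected.append(best)
        covered |= {w.lower() for w in sentences[best]}
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    pool = [
        ["the", "patient", "is", "well"],
        ["adenocarcinoma", "of", "the", "colon"],
        ["adenocarcinoma", "is", "present"],
        ["margins", "are", "negative"],
    ]
    vocab = {"the", "patient", "is", "well", "of", "are", "present", "negative"}
    print(select_sentences(pool, vocab, budget=3))
```

Under this assumed heuristic, only the selected sentences would be sent for manual annotation and used to retrain the tagger. Sentences that add no new domain vocabulary are skipped, which is one way a training set could shrink without losing coverage.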






