Semantic Software Lab
Concordia University
Montréal, Canada

OrganismTagger: Detection, normalization, and grounding of organism entities in biomedical documents

Title: OrganismTagger: Detection, normalization, and grounding of organism entities in biomedical documents
Publication Type: Journal Article
Year of Publication: 2011
Authors: Naderi, N., T. Kappler, C. J. O. Baker, and R. Witte
Refereed Designation: Refereed
Date Published: August 9, 2011
ISSN: 1460-2059 (online), 1367-4803 (print)

Motivation: Semantic tagging of organism mentions in full-text articles is an important part of literature mining and semantic enrichment solutions. Tagged organism mentions also play a pivotal role in disambiguating other entities in a text, such as proteins. A high-precision organism tagging system must be able to detect the numerous forms of organism mentions, including common names as well as the traditional taxonomic groups: genus, species, and strains. In addition, such a system must resolve abbreviations and acronyms, assign the scientific name, and if possible link the detected mention to the NCBI Taxonomy database for further semantic queries and literature navigation.

Results: We present the OrganismTagger, a hybrid rule-based/machine learning system to extract organism mentions from the literature. It includes tools for automatically generating lexical and ontological resources from a copy of the NCBI Taxonomy database, thereby facilitating system updates by end-users. Its novel ontology-based resources can also be reused in other semantic mining and linked data tasks. Each detected organism mention is normalized to a canonical name through the resolution of acronyms and abbreviations and subsequently grounded with an NCBI Taxonomy database ID. In particular, our system combines a novel machine-learning approach with rule-based and lexical methods for detecting strain mentions in documents. On our manually annotated OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94% and a grounding accuracy of 97.5%. On the manually annotated corpus of Linnaeus-100, the results show a precision of 99%, recall of 97% and grounding accuracy of 97.4%.
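The normalize-then-ground step described above can be illustrated with a minimal sketch. This is not the OrganismTagger's implementation: the tiny hand-written lexicon, the common-name table, and the function names `normalize` and `ground` are all assumptions made for illustration, whereas the actual system generates its resources from a copy of the NCBI Taxonomy database and also uses machine learning for strain mentions. The sketch only shows how an abbreviated or common-name mention might be resolved to a canonical scientific name and then linked to a taxonomy ID.

```python
import re

# Hypothetical toy lexicon: scientific name -> NCBI Taxonomy ID.
# A real system would generate this from the NCBI Taxonomy database.
LEXICON = {
    "Escherichia coli": 562,
    "Homo sapiens": 9606,
    "Mus musculus": 10090,
    "Saccharomyces cerevisiae": 4932,
}

# Hypothetical common-name table (illustrative only).
COMMON_NAMES = {"human": "Homo sapiens", "mouse": "Mus musculus"}

# Matches abbreviated binomials such as "E. coli" or "E.coli".
ABBREVIATION = re.compile(r"([A-Z])\.\s?([a-z]+)")


def normalize(mention):
    """Return the canonical scientific name for a mention, or None."""
    if mention in LEXICON:
        return mention
    common = COMMON_NAMES.get(mention.lower())
    if common is not None:
        return common
    match = ABBREVIATION.fullmatch(mention)
    if match:
        initial, species = match.groups()
        candidates = [
            name for name in LEXICON
            if name.startswith(initial) and name.split()[1] == species
        ]
        # Resolve only when the abbreviation is unambiguous in the lexicon.
        if len(candidates) == 1:
            return candidates[0]
    return None


def ground(mention):
    """Return (canonical name, taxonomy ID) for a mention, or None."""
    name = normalize(mention)
    if name is None:
        return None
    return name, LEXICON[name]


if __name__ == "__main__":
    for text in ["E. coli", "human", "Mus musculus", "X. unknown"]:
        print(text, "->", ground(text))
```

Refusing to resolve an ambiguous abbreviation is a deliberate simplification here; a fuller approach would use document context, such as an earlier full mention of the genus, to choose among candidates.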

Availability: The OrganismTagger, including supporting tools, resources, training data and manual annotations, as well as end-user and developer documentation, is freely available under an open source license at

Impact Factor: 5.468 (2012)


Received on March 7, 2011; revised on July 14, 2011; accepted on July 31, 2011.

Bioinformatics-2011-Naderi-2721-9.pdf (424.64 KB)