Our open source OrganismTagger is a hybrid rule-based/machine-learning system that extracts organism mentions from the biomedical literature, normalizes them to their scientific names, and grounds them in the NCBI Taxonomy database. The pipeline can be adapted to annotate the species of particular interest to bio-engineers on different corpora by optionally detecting common names, acronyms, and strains. OrganismTagger's performance has been evaluated on two manually annotated corpora, OT and Linnaeus. On the OT corpus, it achieves a precision of 95%, a recall of 94%, and a grounding accuracy of 97.5%. On the manually annotated Linnaeus-100 corpus, it achieves a precision of 99%, a recall of 97%, and a grounding accuracy of 97.4%. The system is described in detail in our publication: "OrganismTagger: Detection, normalization, and grounding of organism entities in biomedical documents", Bioinformatics, vol. 27, no. 19, Oxford University Press, pp. 2721--2729, August 9, 2011.
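To give a flavour of what normalization and grounding mean here, consider this toy dictionary-based sketch in Python. It is purely illustrative: the real OrganismTagger is a GATE pipeline combining rules and machine learning, and the mention table below is a hypothetical example (only the NCBI Taxonomy IDs are real).

```python
# Toy sketch of organism normalization and grounding.
# The mention table is hypothetical; the real system uses a GATE pipeline.

# Map surface mentions (common names, acronyms, strains) to scientific names.
NORMALIZATION = {
    "baker's yeast": "Saccharomyces cerevisiae",
    "E. coli": "Escherichia coli",
    "E. coli K-12": "Escherichia coli K-12",
}

# Ground scientific names in the NCBI Taxonomy database (real taxonomy IDs).
NCBI_TAXONOMY = {
    "Saccharomyces cerevisiae": 4932,
    "Escherichia coli": 562,
    "Escherichia coli K-12": 83333,
}

def ground(mention):
    """Normalize a mention to its scientific name and look up its taxonomy ID."""
    scientific = NORMALIZATION.get(mention, mention)
    tax_id = NCBI_TAXONOMY.get(scientific)  # None if the name is unknown
    return scientific, tax_id
```

For example, `ground("baker's yeast")` maps the common name to *Saccharomyces cerevisiae* and grounds it to NCBI Taxonomy ID 4932.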
Springer just published a new book, Language Technology for Cultural Heritage, to which we contributed a chapter: "Integrating Wiki Systems, Natural Language Processing, and Semantic Technologies for Cultural Heritage Data Management". The book collects selected, extended papers from several years of the LaTeCH workshop series, where we presented our work on the Durm Project back in 2008.
In this project, which ran from 2004 to 2006, we analysed the historic Encyclopedia of Architecture, which was written in German between 1880 and 1943. It was one of the largest projects aiming to preserve all architectural knowledge available at the time. Today, its vast content is mostly lost: few complete sets survive, and its complex structure does not lend itself easily to contemporary use. We were able to track down one of the rare complete sets in the Karlsruhe University library, where it fills several meters of shelves in the archives. The goal, then, was to apply "modern" (as of 2005) semantic technologies to make these heritage documents accessible again by transforming them into a semantic knowledge base (due to funding limitations, we only worked with one book in this project, but the system was designed to eventually cover the complete set). Using techniques from Natural Language Processing and Semantic Computing, we automatically populated an ontology that can be used in various application scenarios: building historians can use it to navigate and query the encyclopedia, while architects can integrate it directly into contemporary construction tools. Additionally, we made all content accessible through a user-friendly Wiki interface, which combines the original text with NLP-derived metadata and adds annotation capabilities for collaborative use (note that not all features are enabled in the public demo version).
All data created in the project (scanned book images, generated corpora, etc.) is publicly available under open content licenses. We also still maintain a number of open source tools that were originally developed for this project, such as the Durm German Lemmatizer. A new version of our Wiki/NLP integration, which will allow everyone to easily set up a similar system, is currently under development and will be available in early 2012.
In the spirit of continuous integration, every check-in to the Subversion repository is built automatically and additionally checked with various tools. The latest build is archived and available for browsing and download.
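The core of such a setup can be sketched in a few lines. The function below is a hypothetical driver that a Subversion post-commit hook might call: it runs each build/check command in order over a working copy and reports the first failure (the actual hook, commands, and archive step in our setup are not shown here).

```python
import subprocess

def run_checks(workdir, commands):
    """Run each build/check command in order; report the first failing one.

    `commands` is a list of argument lists, e.g. [["ant", "build"], ["ant", "test"]]
    (hypothetical examples). Returns (True, None) on success, or
    (False, failing_command) as soon as one command exits non-zero.
    """
    for cmd in commands:
        result = subprocess.run(cmd, cwd=workdir, capture_output=True, text=True)
        if result.returncode != 0:
            return False, cmd   # build is broken: report the offending command
    return True, None           # all checks passed; the build can be archived
```

In a real setup, this would be invoked from the repository's `hooks/post-commit` script against a fresh checkout, with the project's actual build and analysis commands.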
Some photos from our GATE meeting in Montreal August/September 2010 @ Concordia U.
TS11 at Canadian AI
Call for papers
Automatic text summarization (TS) has been an area of active research for over a decade now. Doing TS really well would require insights from statistics, machine learning, linguistics, and cognitive science, to name a few. Despite a great deal of research effort, state-of-the-art TS systems still produce summaries of much lower quality than even untrained human summarizers. There is room for improvement and much interesting work to do.
Summarization is the theme of the Text Analysis Conference (TAC), an influential annual shared evaluation exercise. It is not uncommon to plan TS work around these annual events despite their somewhat narrow scope: they focus on summarizing news. While this workshop is open to relevant work already presented at TAC, it is designed as a venue for TS research that does not necessarily fit the TAC format. We welcome articles discussing summarization of other genres (such as blogs, email messages, books, captions, or subtitles), investigations of human recall and summarization of data, and the role of language generation in TS, among others. We will also gladly consider position papers on more fundamental long-term challenges in TS: how to move past heavy reliance on shallow lexical information, how to create summaries of high linguistic quality, and so on.
The noun phrase chunker MuNPEx (Multi-Lingual Noun Phrase Extractor) is now available in a new and improved release, v1.0. MuNPEx is a base NP chunker for the GATE framework, implemented in JAPE. It is fast, robust, customizable, and well-tested, and currently supports English, German, and French (with Spanish in beta).
Major changes in this release:
- Limited the number of pre- and post-head modifiers to make MuNPEx more robust on certain kinds of input (such as long lists of tags or menu entries when processing web pages)
- New optional grammars add a HEAD_LEMMA slot to each NP annotation, with the lemma extracted from the GATE morphological analyser (for English), the Durm Lemmatizer (for German), or the TreeTagger (for German, Spanish, and French)
- DET/MOD/HEAD/MOD2 slots are now stored as strings (rather than Content objects) to make them easier to export and compatible with the new Predicate-Argument Extractor (PAX) component
- Other code cleanup and improvements
- No longer labeled as "beta" -- five years of testing ought to be enough, we're not Google ;-)
For more details and the download, please visit the MuNPEx page.
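To illustrate the slot structure MuNPEx produces, here is a deliberately simplified chunker sketch in Python. This is not MuNPEx's actual grammar (the real chunker is a set of JAPE grammars running inside GATE, with more slots and languages); it only shows the greedy [DET] MOD* HEAD pattern and the string-valued slots mentioned above.

```python
# Toy illustration of MuNPEx-style NP slots over POS-tagged (token, tag) pairs.
# The real MuNPEx is implemented as JAPE grammars in GATE; this is a sketch.

DET_TAGS = {"DT"}                          # determiners
MOD_TAGS = {"JJ", "JJR", "JJS"}            # adjectival pre-head modifiers
HEAD_TAGS = {"NN", "NNS", "NNP", "NNPS"}   # nominal heads

def chunk(tagged_tokens):
    """Greedy left-to-right base-NP chunking: [DET] MOD* HEAD+."""
    nps, i, n = [], 0, len(tagged_tokens)
    while i < n:
        np = {"DET": "", "MOD": "", "HEAD": ""}
        if tagged_tokens[i][1] in DET_TAGS:         # optional determiner
            np["DET"] = tagged_tokens[i][0]
            i += 1
        mods = []
        while i < n and tagged_tokens[i][1] in MOD_TAGS:  # pre-head modifiers
            mods.append(tagged_tokens[i][0])
            i += 1
        np["MOD"] = " ".join(mods)
        heads = []
        while i < n and tagged_tokens[i][1] in HEAD_TAGS:  # noun sequence
            heads.append(tagged_tokens[i][0])
            i += 1
        if heads:
            np["HEAD"] = heads[-1]  # slots are plain strings, as in v1.0
            nps.append(np)
        elif not np["DET"] and not mods:
            i += 1                  # token starts no NP; skip it
    return nps
```

Running `chunk` on a tagged sentence like "the/DT quick/JJ brown/JJ fox/NN ..." yields one dictionary per NP with its DET, MOD, and HEAD strings filled in.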
MutationFinder is a freely available resource for tagging mutations in biomedical texts. However, it cannot be directly integrated into a text mining pipeline built with the General Architecture for Text Engineering (GATE) framework. Here, I show how to make it available to GATE users through the standard TaggerFramework component, which requires some text wrangling.
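As a rough illustration of the kind of mentions involved, the snippet below matches point mutations in "wNm" format (wild-type residue, position, mutant residue, e.g. "A42G") with a simple regular expression. This is not MutationFinder's actual grammar, which is considerably richer (it also handles word-format mentions such as "Ala42Gly"); it just shows the shape of the extraction problem.

```python
import re

# Rough illustration only: MutationFinder's real patterns are much richer.
# "wNm" format: wild-type residue, position, mutant residue, e.g. "A42G".
# The character class lists the 20 standard amino acid one-letter codes.
POINT_MUTATION = re.compile(
    r"\b([ACDEFGHIKLMNPQRSTVWY])(\d+)([ACDEFGHIKLMNPQRSTVWY])\b"
)

def find_mutations(text):
    """Return (mention, start, end) for each wNm-style point mutation."""
    return [(m.group(0), m.start(), m.end())
            for m in POINT_MUTATION.finditer(text)]
```

In the GATE integration, the remaining wrangling is essentially mapping such matches back onto character offsets of the original document text, so that TaggerFramework can turn them into annotations.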
***Early bird registration has been extended until 21 July!***
The third GATE training course will take place at Concordia University
in Montréal, Canada, from August 30th to September 3rd 2010. This event
will follow the format of the earlier May 2010 course, but with the
addition of a new training track covering linked data and ontologies.
Further details on the material to be covered:
Registration, travel and accommodation: