Semantic Software Lab
Concordia University
Montréal, Canada

Tools & Resources

Multi-Lingual Noun Phrase Extractor (MuNPEx) v1.0 for GATE released

The noun phrase chunker MuNPEx (Multi-Lingual Noun Phrase Extractor) is now available in the new and improved release v1.0. MuNPEx is a base NP chunker for the GATE framework, implemented in JAPE. It is fast, robust, customizable, and well tested, and currently supports English, German, and French (with Spanish in beta).

Major changes in this release:

  • The number of pre- and post-head modifiers is now limited, which makes MuNPEx more robust on certain kinds of input (such as long lists of tags or menu entries when processing web pages)
  • New optional grammars to add a HEAD_LEMMA slot to an NP annotation, with the lemma extracted from the GATE morphological analyser (for English), the Durm Lemmatizer (for German), or the TreeTagger (for German, Spanish, French)
  • DET/MOD/HEAD/MOD2 slots are now stored as strings (rather than Content objects) to make them easier to export and compatible with the new Predicate-Argument Extractor (PAX) component
  • Other code cleanup and improvements
  • No longer labeled as "beta" -- five years of testing ought to be enough, we're not Google ;-)

For more details and the download, please visit the MuNPEx page.

New Javadoc Doclet for NLP Analysis on Java Source Code

For those interested in performing NLP on source code, in particular Javadoc comments, we just released a Doclet at the NLP Frameworks workshop last week.

Its main feature is that it creates an XML corpus from Java source code that is optimised for processing in an NLP Framework (GATE in our case, but it should work for any framework that takes XML as input).

The OwlExporter: Flexible Ontology Population from Text

This page describes the OwlExporter, an open source (AGPL3) component that facilitates populating an OWL Ontology from annotations created by an existing GATE application.

The Javadoc NLP Corpus Generation Doclet

This page describes the process of generating a corpus from source code and source code comments using Javadoc. The SSLDoclet is a custom doclet that is passed as a parameter to Javadoc in order to create an Abstract Syntax Tree (AST) that can be used as a corpus within NLP frameworks such as GATE.
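As a rough illustration of what "passed as a parameter to Javadoc" means in practice, the sketch below assembles a javadoc command line using the standard `-doclet`, `-docletpath`, and `-sourcepath` options. The doclet class name `example.SSLDoclet`, the jar name `ssldoclet.jar`, and the source directory and package are placeholders assumed for this example; the actual values are given on the SSLDoclet page.

```java
import java.util.ArrayList;
import java.util.List;

class DocletCommand {

    // Builds the argument list for a javadoc run that uses a custom doclet.
    // All four values are supplied by the caller; nothing here is specific
    // to the SSLDoclet itself.
    static List<String> build(String docletClass, String docletPath,
                              String sourcePath, String packageName) {
        List<String> args = new ArrayList<>();
        args.add("javadoc");
        args.add("-doclet");      // fully qualified class name of the doclet
        args.add(docletClass);
        args.add("-docletpath");  // where javadoc finds the doclet classes
        args.add(docletPath);
        args.add("-sourcepath");  // root directory of the Java sources
        args.add(sourcePath);
        args.add(packageName);    // package whose sources are processed
        return args;
    }

    public static void main(String[] ignored) {
        // Hypothetical placeholder values, not the real SSLDoclet names.
        List<String> cmd = build("example.SSLDoclet", "ssldoclet.jar",
                                 "src", "com.example");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list could be handed to `new ProcessBuilder(cmd)` to launch the run. Any output options the doclet itself accepts would be appended as further arguments and are not shown here.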

The GATE Multi-Parser Predicate-Argument EXtractor Component (MultiPaX)

The GATE Multi-Parser Predicate-Argument EXtractor Component (MultiPaX) can extract predicate-argument structures (PAS) from the output of different parsers.

Reported Speech Tagger

Reported speech, both direct and indirect, is an important indicator of evidentiality in traditional newspaper texts, and increasingly also in new media that rely heavily on citing and quoting previous postings, such as blogs or newsgroups. We developed an NLP component in the form of a GATE resource that can automatically detect and tag reported speech constructs, in particular the source, the reporting verb, and the content. It is intended as a first module for more sophisticated representation of and reasoning with attributed information, such as belief reasoning based on nested belief structures.

