Semantic Software Lab
Concordia University
Montréal, Canada

The GATE Predicate-Argument EXtractor Component (PAX)


1. Overview

This page describes a GATE component for extracting predicate-argument structures (PAS). PASs are used in various contexts to represent relations within a sentence. Several "semantic" parsers extract relational information from sentences, but there is no common format for storing this information. Our predicate-argument extractor component (PAX) takes the annotations generated by selected parsers and transforms the parsers' results into predicate-argument structures represented as triples (subject-verb-object).

A variety of parsers offer support for the syntactic analysis of sentences. PAX extracts predicate-argument structures from the output of several such parsers. Currently we support MiniPar, RASP, SUPPLE, and the Stanford Parser. In addition, we can extract PASs from noun phrases, using the output of a noun phrase chunker such as MuNPEx.

Our resource is implemented as a component for the General Architecture for Text Engineering (GATE). For a more detailed background and motivation, please read our research paper.

Figure: PAX pipeline

2. Prerequisites

As our Predicate-Argument EXtractor comes in the form of a GATE component, you will need GATE itself. Most of the required pre-processing components are included in the GATE distribution. In particular, you will need the following (see the pipeline configuration details below):

  1. Tokenizer
  2. Sentence Splitter
  3. POS-Tagger
  4. Morphological Analyser (optional; provides root forms of triples for the Stanford Parser and MuNPEx)
  5. One parser: RASP-3, MiniPar, Stanford Parser, SUPPLE, or a noun phrase (NP) extractor (e.g., MuNPEx)

3. Documentation

The PAX component is designed to be embedded in more complex pipelines. Here, we describe the minimum requirements for obtaining predicate-argument structure annotations. Note that the following discussion assumes you know how to work with GATE, including tasks such as adding a new CREOLE repository or loading new components into a processing pipeline. If you haven't done this before, please read the GATE user's guide first!

4. Pipeline Configuration

The minimal pipeline for obtaining predicate-argument structures from all parsers is shown in the figure above. Since there is no wrapper in GATE for RASP-3, we included one in the PAX component. To use RASP-3, you have to specify the location of the RASP script as a parameter when you load the PAX component.

5. PAX Configuration

The PAX component has four runtime parameters. The most important one is "parserASName". It tells the PAX component which parser's output to use for extracting the triples. Valid entries are: "rasp", "supple", "minipar", "stanford", or "np". If you use SUPPLE, make sure the output annotation set name of the SUPPLE parser wrapper is set accordingly (semanticsSetName="supple"). The same holds if you want to use MiniPar (annotationSetOutputName="minipar").
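
The list of valid "parserASName" entries can be checked before a pipeline is run. The following is a minimal sketch in plain Java, not part of PAX: the class name ParserASNameCheck and the method isValid are hypothetical, and the sketch assumes that the value must match one of the listed entries exactly, including case.

```java
import java.util.Set;

// Hypothetical helper, not part of the PAX distribution.
// It only mirrors the list of valid "parserASName" entries given in the text.
final class ParserASNameCheck {

    // Valid entries as listed in the documentation.
    static final Set<String> VALID =
            Set.of("rasp", "supple", "minipar", "stanford", "np");

    // Returns true if the value is one of the documented entries.
    // Assumption: matching is exact and case-sensitive.
    static boolean isValid(String parserASName) {
        return parserASName != null && VALID.contains(parserASName);
    }
}
```

For example, ParserASNameCheck.isValid("stanford") returns true, while a misspelled or differently capitalized value is rejected under the exact-match assumption.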

Figure: Screenshot of GATE with PAX annotations

6. Result Annotations

A new annotation set of type MultiPaX is added to the document. If the component detects a predicate-argument structure in the output of the selected parser, the sentence containing the PAS is annotated, and the properties "sub", "obj", "verb", "neg", and "parserName" are added to the annotation.
An example of the result annotations can be seen in the screenshot above.
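
To illustrate how the five properties might be consumed downstream, here is a minimal sketch in plain Java that formats them into a readable triple. It is not part of PAX: the class PaxFeatureSketch and the method describe are hypothetical, the properties are modelled as a plain java.util.Map, and the example values (including the "false" string for "neg") are invented for illustration, since the value formats are not specified here.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch, not part of the PAX distribution.
// Models the properties of one result annotation as a plain map.
final class PaxFeatureSketch {

    // Formats the documented properties as: parserName(sub, verb, obj) neg=...
    // A missing property is rendered as "null".
    static String describe(Map<?, ?> features) {
        return String.format("%s(%s, %s, %s) neg=%s",
                features.get("parserName"),
                features.get("sub"),
                features.get("verb"),
                features.get("obj"),
                features.get("neg"));
    }

    public static void main(String[] args) {
        // Illustrative values only; real PAX output may differ in format.
        Map<String, Object> features = new LinkedHashMap<>();
        features.put("sub", "cat");
        features.put("verb", "chase");
        features.put("obj", "mouse");
        features.put("neg", "false");
        features.put("parserName", "stanford");
        System.out.println(describe(features));
    }
}
```

The method accepts any Map, so it should also work with feature maps that implement java.util.Map, which is our understanding of how GATE represents annotation features.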


7. Downloads

If you use our component, please cite our paper: Krestel, R., R. Witte, and S. Bergler, "Predicate-Argument EXtractor (PAX)", New Challenges for NLP Frameworks, Valletta, Malta: ELRA, pp. 51-54, May 22, 2010.

8. Feedback

For questions, comments, etc., please use the Forum.

9. Version history

  • 1.3: 02.12.2010. OutputASName can now be set to arbitrary names.
  • 1.2: 10.08.2010. Morphological Analyzer is now an optional requirement.
  • 1.1: 10.05.2010. Improved LREC release.
  • 1.0: 01.03.2010. Initial public release.