The GATE Predicate-Argument EXtractor Component (MultiPaX)
This page describes a GATE component for extracting predicate-argument structures (PAS). PASs are used in various contexts to represent relations within a sentence. Different "semantic" parsers extract relational information from sentences, but there is no common format for storing this information. Our predicate-argument extractor component MultiPaX (formerly named PAX) takes the annotations generated by a selected parser and transforms the parser's results into predicate-argument structures represented as triples (subject-verb-object).
A variety of parsers offer support for syntactic analysis of sentences. With this resource we try to extract predicate-argument structures from the output of several such parsers. Currently we support MiniPar, RASP, SUPPLE, and the Stanford Parser. In addition, we can extract PASs from noun phrases, using the output of a noun phrase chunker like MuNPEx.
Because our Predicate-Argument EXtractor comes in the form of a GATE component, you will need GATE itself. Most of the required pre-processing components are included in the GATE distribution. In particular, you will need (see the pipeline configuration details below): 1. Tokenizer, 2. Sentence Splitter, 3. POS-Tagger, 4. Morphological Analyser (optional, to get root forms of triples for the Stanford Parser and MuNPEx), 5. One parser: RASP-3, MiniPar, Stanford Parser, SUPPLE, or a noun phrase (NP) extractor (e.g., MuNPEx).
The PAX component is designed to be embedded in more complex pipelines. Here, we describe the minimum requirements for obtaining predicate-argument structure annotations. Note that in the following discussion we assume you know how to work with GATE, for tasks like adding a new CREOLE repository or loading new components into a processing pipeline. If you haven't done this before, please read the GATE user's guide first!
4. Pipeline Configuration
The minimal pipeline for obtaining the predicate-argument structures for all parsers is shown in the figure above. Since there is no wrapper in GATE for RASP-3, we included one in the PAX component. To use RASP-3, you have to specify the location of the RASP script as a parameter when you load the PAX component.
5. MultiPaX Configuration
The MultiPaX component has four runtime parameters. The most important one is "parserASName". It tells the MultiPaX component which parser output is used to extract the triples. Valid entries are: "rasp", "supple", "minipar", "stanford", or "np". Make sure the output annotation set name of the SUPPLE parser wrapper is set accordingly (semanticsSetName="supple"). The same holds if you want to use MiniPar (annotationSetOutputName="minipar").
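As a rough illustration of the name matching described above, the following Java sketch checks a chosen "parserASName" value against the five listed entries and against the output set name configured on the parser wrapper. It is not part of MultiPaX and does not use the GATE API: the class `ParserSetCheck` and its methods are hypothetical, and the case-sensitive comparison is an assumption.

```java
import java.util.Set;

// Hypothetical helper, not part of MultiPaX or GATE. It only mirrors the
// naming rule described in the text above.
public class ParserSetCheck {

    // The five values the text lists as valid for "parserASName".
    public static final Set<String> VALID_NAMES =
            Set.of("rasp", "supple", "minipar", "stanford", "np");

    // True if the given value is one of the listed entries.
    public static boolean isValidParserASName(String name) {
        return name != null && VALID_NAMES.contains(name);
    }

    // True if the parser wrapper writes to the set that MultiPaX is told to
    // read from. wrapperOutputSet stands for, e.g., the value of SUPPLE's
    // semanticsSetName or MiniPar's annotationSetOutputName.
    // Assumption: names are compared case-sensitively.
    public static boolean namesMatch(String parserASName, String wrapperOutputSet) {
        return isValidParserASName(parserASName)
                && parserASName.equals(wrapperOutputSet);
    }

    public static void main(String[] args) {
        System.out.println(namesMatch("supple", "supple"));
        System.out.println(namesMatch("minipar", "MiniPar"));
    }
}
```

A check like this could be run before starting a pipeline to catch a mismatch between the wrapper's output set name and the "parserASName" value early.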
6. Result Annotations
A new annotation set of type MultiPaX is added to the document. If the component detects a predicate-argument structure in the output of the selected parser, the sentence containing the PAS is annotated and "sub", "obj", "verb", "neg", and "parserName" properties are added to the annotation.
An example of the result annotations can be seen in the figure on the left.
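To make the shape of a result annotation more concrete, here is a minimal Java sketch that models the five properties named above ("sub", "obj", "verb", "neg", "parserName") as a plain map. It does not use the GATE API; the class `PasFeaturesSketch`, the sample values, and the choice to store "neg" as the strings "true"/"false" are all assumptions made for illustration, not the component's documented behaviour.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the properties of one MultiPaX result annotation.
// A plain map plays the role of the annotation's properties.
public class PasFeaturesSketch {

    // Builds a map with the five property names listed in the text.
    // Assumption: "neg" is stored as "true" or "false".
    public static Map<String, String> build(String sub, String verb, String obj,
                                            boolean negated, String parserName) {
        Map<String, String> features = new LinkedHashMap<>();
        features.put("sub", sub);
        features.put("verb", verb);
        features.put("obj", obj);
        features.put("neg", Boolean.toString(negated));
        features.put("parserName", parserName);
        return features;
    }

    // Renders the subject-verb-object triple, marking negation if present.
    public static String asTriple(Map<String, String> f) {
        String verb = "true".equals(f.get("neg"))
                ? "not " + f.get("verb")
                : f.get("verb");
        return "(" + f.get("sub") + ", " + verb + ", " + f.get("obj") + ")";
    }

    public static void main(String[] args) {
        // Made-up sample values, not output of the real component.
        Map<String, String> f = build("component", "extract", "triple", false, "stanford");
        System.out.println(asTriple(f));
    }
}
```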
- the GATE MultiPaX component Version 1.3
- our research paper about the Predicate-Argument-EXtractor (PAX)
- the GNU GPL license, under which you can use the component
If you use our component, please cite our paper: "Predicate-Argument EXtractor (PAX)", New Challenges for NLP Frameworks, Valletta, Malta: ELRA, pp. 51--54, May 22, 2010.
For questions, comments, etc., please use the Forum.
9. Version history
- 1.3: 02.12.2010. OutputASName can now be set to arbitrary names.
- 1.2: 10.08.2010. Morphological Analyzer is now an optional requirement.
- 1.1: 10.05.2010. Improved LREC release.
- 1.0: 01.03.2010. Initial public release.