The Javadoc NLP Corpus Generation Doclet
This page describes the process of generating a corpus from source code and source code comments using Javadoc. The SSLDoclet is a custom doclet that is passed as a parameter to Javadoc in order to create an Abstract Syntax Tree (AST) that can be used as a corpus within NLP frameworks such as GATE.
The semantics found in source code and source code comments are modelled using an AST schema. The schema is structured so that the tags, attributes and elements of the generated XML translate into the annotations, features and entities of a corpus when the XML is loaded into an NLP framework.
The SSLDoclet writes the corpus as eXtensible Markup Language (XML) documents that follow this schema, which makes them easy for an NLP framework to process: XML tags are interpreted as annotations, attributes as features of those annotations, and elements as entities of those annotations.
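As a rough illustration of this mapping, the sketch below reads a small XML fragment with the JDK's DOM parser and turns each tag into an annotation type and each attribute into a feature of that annotation. The element and attribute names (Class, Method, Comment, name, line) and the Annotation class are hypothetical; they are not taken from the actual SSLDoclet schema, and a framework such as GATE performs its own import rather than using code like this.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

class CorpusMappingSketch {

    /** Hypothetical annotation: type = tag name, features = attributes, text = direct text content. */
    static final class Annotation {
        final String type;
        final Map<String, String> features = new TreeMap<>();
        final String text;

        Annotation(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Parses the XML string and returns one annotation per element, in document order. */
    static List<Annotation> toAnnotations(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
            List<Annotation> out = new ArrayList<>();
            collect(root, out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void collect(Element element, List<Annotation> out) {
        NodeList children = element.getChildNodes();

        // Gather only the text that sits directly inside this element.
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child.getNodeType() == Node.TEXT_NODE) {
                text.append(child.getNodeValue());
            }
        }

        // Tag name becomes the annotation type; attributes become features.
        Annotation annotation = new Annotation(element.getTagName(), text.toString().trim());
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            annotation.features.put(attribute.getNodeName(), attribute.getNodeValue());
        }
        out.add(annotation);

        // Recurse into child elements so nested tags become annotations too.
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element) {
                collect((Element) child, out);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment; the real SSLDoclet output uses its own schema.
        String xml = "<Class name=\"Stack\" line=\"12\">"
                + "<Comment>A simple stack.</Comment>"
                + "<Method name=\"push\" line=\"20\"><Comment>Adds an item.</Comment></Method>"
                + "</Class>";
        for (Annotation a : toAnnotations(xml)) {
            System.out.println(a.type + " " + a.features + " \"" + a.text + "\"");
        }
    }
}
```

Keeping the features in a sorted map makes the printed output independent of the order in which the parser reports attributes.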
Our doclet is implemented using the Javadoc Doclet API, and is passed as a parameter to Javadoc when generating a corpus from a source directory. The figure below shows the "docs" task that is part of the Ant build included with the SSLDoclet source code.
The parameters needed to generate a corpus from source code are:
- doclet: The name of the doclet
- docletpath: The directory containing the doclet
- sourcepath: The directory containing the source code that needs to be processed
- packagenames: The names of the packages that need to be processed
- destdir: The directory where the doclet will place the generated XML documents
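These parameters correspond closely to options of the javadoc command-line tool, which can help when checking what the Ant task will run. The sketch below assembles such a command line in Java. The class and method names are illustrative, all values in main are hypothetical placeholders, and mapping destdir to -d follows the standard javadoc convention; whether the SSLDoclet accepts that option when run outside Ant is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

class JavadocCommandSketch {

    /**
     * Builds a javadoc command line from the five parameters listed above.
     * packageNames is a comma-separated list, as in the Ant task.
     */
    static List<String> buildCommand(String doclet, String docletPath, String sourcePath,
                                     String packageNames, String destDir) {
        List<String> command = new ArrayList<>();
        command.add("javadoc");
        command.add("-doclet");
        command.add(doclet);
        command.add("-docletpath");
        command.add(docletPath);
        command.add("-sourcepath");
        command.add(sourcePath);
        // Assumption: the output directory is passed with -d, as for the standard doclet.
        command.add("-d");
        command.add(destDir);
        // On the command line, each package name is a separate argument.
        for (String packageName : packageNames.split(",")) {
            String trimmed = packageName.trim();
            if (!trimmed.isEmpty()) {
                command.add(trimmed);
            }
        }
        return command;
    }

    public static void main(String[] args) {
        // Hypothetical values; substitute the real doclet class name and directories.
        List<String> command = buildCommand(
                "SSLDoclet", "lib", "src", "com.example.app, com.example.util", "corpus");
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list can be inspected or handed to java.lang.ProcessBuilder, although running the "docs" task through Ant remains the route described on this page.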
To run the SSLDoclet, both Ant and Java must be available through your system's environment variables (for example, on the PATH). Then:
- Set up the docs task with valid parameter values.
- Run the following command from the command line:
  ant docs
- The SSL Javadoc Doclet Version 1.2
- Our research paper about the Semantic Software Lab Javadoc Doclet (SSL Javadoc Doclet)
- The GNU GPL license under which you can use this tool.
If you use our component, please cite our paper: "Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet", New Challenges for NLP Frameworks, Valletta, Malta: ELRA, pp. 41–45, May 22, 2010.
For questions, comments, etc., please use the Forum.
5. Version history
- 1.2: 13.04.2010. Added line number information.
- 1.1: 01.03.2010. Initial public release.