Semantic Software Lab
Concordia University
Montréal, Canada

The Javadoc NLP Corpus Generation Doclet

1. Overview

This page describes the process of generating a corpus from source code and source code comments using Javadoc. The SSLDoclet is a custom doclet that is passed as a parameter to Javadoc in order to create an Abstract Syntax Tree (AST) that can be used as a corpus within NLP frameworks such as GATE.

Figure: A sample of Java code and comments taken from ArgoUML

The semantics found in source code and source code comments are modelled using an AST schema. The SSLDoclet generates an AST that is designed to be used as a corpus. The schema of the AST uses a structure that enables the tags, attributes and elements of the generated XML to be translated as annotations, features and entities of a corpus when loaded within an NLP framework.

Figure: Screenshot of a corpus generated using the SSLDoclet

The SSLDoclet writes the corpus as eXtensible Markup Language (XML), using a schema that an NLP framework can process easily. The XML nodes are interpreted as annotations, their attributes as features of those annotations, and their elements as entities of the annotations.
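The mapping from tags to annotations and from attributes to features can be illustrated with a minimal sketch. It uses only the standard Java DOM parser; the class name `XmlAnnotationSketch` and the sample fragment with its `class`, `method` and `comment` tags are hypothetical and are not taken from the actual SSLDoclet schema, which may look different.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class XmlAnnotationSketch {

    /** Lists every tag as an "annotation" with its attributes as "features", in document order. */
    public static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            visit(doc.getDocumentElement(), out);
            return out;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }

    private static void visit(Element element, List<String> out) {
        // Sort attributes by name so the output does not depend on parser ordering.
        Map<String, String> features = new TreeMap<>();
        NamedNodeMap attributes = element.getAttributes();
        for (int i = 0; i < attributes.getLength(); i++) {
            Node attribute = attributes.item(i);
            features.put(attribute.getNodeName(), attribute.getNodeValue());
        }

        StringBuilder line = new StringBuilder("annotation=" + element.getTagName());
        for (Map.Entry<String, String> feature : features.entrySet()) {
            line.append(" feature:").append(feature.getKey())
                .append('=').append(feature.getValue());
        }
        out.add(line.toString());

        // Recurse into child elements only; text nodes are skipped in this sketch.
        NodeList children = element.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            if (children.item(i) instanceof Element) {
                visit((Element) children.item(i), out);
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical fragment for illustration; not the real SSLDoclet output.
        String xml = "<class name=\"Foo\"><method name=\"bar\" line=\"12\">"
                + "<comment>Returns the bar.</comment></method></class>";
        describe(xml).forEach(System.out::println);
    }
}
```

An NLP framework such as GATE performs this kind of interpretation itself when it loads an XML document, so the sketch is only meant to make the correspondence concrete.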

Figure: The generated corpus loaded within GATE

2. Documentation

Our doclet is implemented using the Javadoc doclet API library, and is passed as a parameter to Javadoc when generating a corpus from a source directory. The figure below shows the "docs" task that is part of the Ant build included with the SSLDoclet source code.

Figure: Docs task included in the Ant build

The parameters needed to generate a corpus from source code are:

  • doclet: The name of the doclet
  • docletpath: The directory containing the doclet
  • sourcepath: The directory containing the source code that needs to be processed
  • packagenames: The names of the packages that need to be processed
  • destdir: The directory where the doclet will place the generated XML documents

To run the SSLDoclet, both Ant and Java must be reachable through your system's environment variables (typically the PATH). Then:

  • Set up the docs task with valid parameter values.
  • Issue the following command from the command line:
    “ant docs”
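For orientation, the parameters listed above correspond roughly to a direct `javadoc` invocation. The sketch below only assembles such a command line and does not run it. The class name `JavadocCommandSketch` and all values in `main` are placeholders, not the actual SSLDoclet names. It also assumes that the doclet accepts the `-d` option for the output directory, which is how Ant passes `destdir` but which ultimately depends on the doclet implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class JavadocCommandSketch {

    /** Builds a javadoc command line from the five parameters of the docs task. */
    public static List<String> buildCommand(String doclet, String docletPath,
                                            String sourcePath, String destDir,
                                            List<String> packageNames) {
        List<String> command = new ArrayList<>();
        command.add("javadoc");
        command.add("-doclet");      // name of the doclet class
        command.add(doclet);
        command.add("-docletpath");  // where the doclet classes are found
        command.add(docletPath);
        command.add("-sourcepath");  // root of the source code to process
        command.add(sourcePath);
        command.add("-d");           // output directory (assumed to be supported by the doclet)
        command.add(destDir);
        command.addAll(packageNames); // packages to process, as trailing arguments
        return command;
    }

    public static void main(String[] args) {
        // All values below are placeholders for illustration.
        List<String> command = buildCommand(
                "example.HypotheticalDoclet", "build/classes",
                "src", "corpus", Arrays.asList("org.example.model"));
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list could be handed to `ProcessBuilder` to launch Javadoc, although running the provided Ant task remains the documented route.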

3. Download

If you use our component, please cite our paper: Khamis, N., R. Witte, and J. Rilling, "Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet", New Challenges for NLP Frameworks, Valletta, Malta: ELRA, pp. 41–45, May 22, 2010.

4. Feedback

For questions, comments, etc., please use the Forum.

5. Version history

  • 1.2: 13.04.2010. Added line number information.
  • 1.1: 01.03.2010. Initial public release.