New Javadoc Doclet for NLP Analysis on Java Source Code

For those interested in performing NLP on source code, in particular Javadoc comments, we just released a Doclet [5] at the NLP Frameworks workshop [6] last week.

Its main feature is that it creates an XML corpus from Java source code, optimised for processing in an NLP framework (GATE [7] in our case, though it should work with any framework that accepts XML as input).
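To give a rough idea of what "takes XML as input" means on the consuming side, here is a minimal sketch in Java that pulls comment text out of a small XML document using the standard JAXP DOM parser. Note that the sample document, the element names (corpus, class, method, comment) and the class name CorpusReader are all invented for illustration; they are assumptions and not the Doclet's actual output schema, which is described on the Web page [5].

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class CorpusReader {

    // Hypothetical sample input: these element and attribute names are
    // made up for this sketch and are NOT the Doclet's real schema.
    static final String SAMPLE =
          "<corpus>"
        + "<class name=\"Stack\">"
        + "<comment>A simple LIFO container.</comment>"
        + "<method name=\"push\">"
        + "<comment>Adds an item to the top.</comment>"
        + "</method>"
        + "</class>"
        + "</corpus>";

    /** Returns the trimmed text of every comment element, in document order. */
    static List<String> extractComments(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // getElementsByTagName walks the tree in document (preorder) order.
            NodeList nodes = doc.getElementsByTagName("comment");
            List<String> comments = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                comments.add(nodes.item(i).getTextContent().trim());
            }
            return comments;
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Could not parse corpus XML", e);
        }
    }

    public static void main(String[] args) {
        // Print each extracted comment on its own line.
        for (String comment : extractComments(SAMPLE)) {
            System.out.println(comment);
        }
    }
}
```

In a real pipeline the extracted text would be handed to the NLP framework rather than printed; the point is only that a corpus in plain XML can be read with off-the-shelf tooling.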

For more information and the download, have a look at the Web page [5]. For details, background, and an application example, see our paper [1].

We currently use it for automatic quality assessment of source code comments, but there are many other possible use cases as well.

References

Khamis, N., R. Witte [8], and J. Rilling [9], "Generating an NLP Corpus from Java Source Code: The SSL Javadoc Doclet [10]", New Challenges for NLP Frameworks, Valletta, Malta: ELRA, pp. 41–45, May 22, 2010.

Links:
[1] https://www.semanticsoftware.info/category/project/tools-resources
[2] https://www.semanticsoftware.info/category/topic/nlp/corpora
[3] https://www.semanticsoftware.info/category/topic/semantic-computing
[4] https://www.semanticsoftware.info/category/topic/software-engineering
[5] https://www.semanticsoftware.info/javadoclet
[6] http://nlpframeworks2010.semanticsoftware.info
[7] http://gate.ac.uk
[8] https://www.semanticsoftware.info/biblio/author/1
[9] https://www.semanticsoftware.info/biblio/author/10
[10] https://www.semanticsoftware.info/biblio/generating-nlp-corpus-java-source-code-ssl-javadoc-doclet