Semantic Software Lab
Concordia University
Montréal, Canada

The Semantic Software Lab


The Semantic Software Lab was founded in 2008 by René Witte at Concordia University in Montréal, Québec, Canada. Our lab focuses on research and applications of Semantic Computing, Text Mining, Linked Data, Natural Language Processing (NLP), Intelligent Information Systems, and related technologies. We are committed to providing free, open source software and open research data to the community.

This website provides information about the lab's research activities and our published tools and resources. It also provides information for students interested in course or research work, as well as career opportunities for researchers. In addition, it aims to serve as a community portal for selected topics and events in the area of semantic systems in (north-eastern) North America in general and Montréal in particular. You can also follow us on Twitter @SemSoft, on LinkedIn or connect with us on Google+.

Semantic Computing Course

The Semantic Computing course (SOEN 691B) is offered at Concordia University, providing graduate students with a unique opportunity to study research and development of novel semantic software systems. The course is taught by Prof. René Witte and supported by team members from the Semantic Software Lab. Students from other universities in Québec can register for this course through CREPUQ.

This course provides an introduction to selected topics from Semantic Computing, including text mining, tagging and tag analysis, recommender systems, RDF and linked data, semantic desktops and semantic wikis.

Semantic Assistants: Eclipse Plug-In

Natural Language Processing (NLP) for Software Engineering: Our Eclipse plug-in integrates the Eclipse development environment into the Semantic Assistants architecture. It provides a user interface for offering various Natural Language Processing services to users. In particular, when using Eclipse as a software development environment, you can now offer novel semantic analysis services, such as named entity detection or quality analysis of source code comments, to software developers.

Semantic Assistants at IBM CASCON 2011


Last week I presented our Semantic Assistants Eclipse plug-in at the IBM CASCON 2011 conference in Markham, Ontario. CASCON is a conference hosted by IBM's Centre for Advanced Studies (CAS) in partnership with NSERC, with the goal of showcasing research projects in progress by individuals in academia, industry and the general public.

OwlExporter v3.0 Released


We just released a new version of the OwlExporter ontology population plugin for GATE. The OwlExporter processing resource (PR) can be added to any NLP pipeline to populate an existing OWL ontology with entities detected in the corpus. It supports the population of separate NLP and domain ontologies and offers some advanced features, such as the export of coreference chains.

In this release, we included a pre-compiled binary and a complete example pipeline that transforms GATE's ANNIE information extraction example into an ontology population system. We also completely revamped the documentation and website to make it more accessible to ontology population novices.
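
To give ontology population novices a feel for the idea, here is a minimal Python sketch of what "populating an ontology with detected entities" means: each entity found in a text becomes an individual of an ontology class. This is an illustration only and does not use GATE or reflect how the OwlExporter is implemented. The function name `to_turtle`, the `ex:` prefix, and the `http://example.org/onto#` namespace are assumptions made for the example; the class names follow the Person, Location, and Organization annotation types produced by ANNIE.

```python
def to_turtle(entities, base="http://example.org/onto#"):
    """Serialize (text, class_name) pairs as OWL individuals in Turtle.

    `base` is a placeholder namespace assumed for this sketch.
    """
    lines = [
        f"@prefix ex: <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
    ]
    for index, (text, class_name) in enumerate(entities, start=1):
        # Escape backslashes and double quotes so the label stays a valid literal.
        label = text.replace("\\", "\\\\").replace('"', '\\"')
        individual = f"ex:{class_name.lower()}_{index}"
        lines.append(f"{individual} a owl:NamedIndividual, ex:{class_name} ;")
        lines.append(f'    rdfs:label "{label}" .')
    return "\n".join(lines)


if __name__ == "__main__":
    # Entities as an information extraction pipeline might report them.
    detected = [("Concordia University", "Organization"), ("Montréal", "Location")]
    print(to_turtle(detected))
```

A real ontology population system additionally has to map annotations to classes of an existing ontology and handle relations between individuals, such as the coreference chains mentioned above.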

The Organism Tagger System


Our open source OrganismTagger is a hybrid rule-based/machine-learning system that extracts organism mentions from the biomedical literature, normalizes them to their scientific names, and grounds them to the NCBI Taxonomy database. The pipeline can be configured to annotate the species of particular interest to bio-engineers on different corpora by optionally including detection of common names, acronyms, and strains. The OrganismTagger's performance has been evaluated on two manually annotated corpora, OT and Linnaeus. On the OT corpus, the OrganismTagger achieves a precision of 95%, a recall of 94%, and a grounding accuracy of 97.5%. On the manually annotated Linnaeus-100 corpus, it achieves a precision of 99%, a recall of 97%, and a grounding accuracy of 97.4%. The system is described in detail in our publication: Naderi, N., T. Kappler, C. J. O. Baker, and R. Witte, "OrganismTagger: Detection, normalization, and grounding of organism entities in biomedical documents", Bioinformatics: Oxford University Press, August 9, 2011.
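
As a rough illustration of the terms used above, the following Python sketch shows normalization (mention to scientific name), grounding (scientific name to NCBI Taxonomy ID), and how precision and recall are computed from counts. It is a toy dictionary lookup, not the OrganismTagger's rule-based/machine-learning pipeline; the names `ground` and `precision_recall` and the small lookup tables are assumptions made for this example.

```python
# Toy lookup tables, assumed for illustration only. A real system works
# against the full NCBI Taxonomy and combines rules with machine learning.
SCIENTIFIC_NAME = {
    "e. coli": "Escherichia coli",
    "escherichia coli": "Escherichia coli",
    "human": "Homo sapiens",
    "homo sapiens": "Homo sapiens",
    "baker's yeast": "Saccharomyces cerevisiae",
}

NCBI_TAXONOMY_ID = {
    "Escherichia coli": 562,
    "Homo sapiens": 9606,
    "Saccharomyces cerevisiae": 4932,
}


def ground(mention):
    """Return (scientific name, NCBI Taxonomy ID) for a mention, or None."""
    key = " ".join(mention.lower().split())  # lowercase, collapse whitespace
    name = SCIENTIFIC_NAME.get(key)
    if name is None:
        return None
    return name, NCBI_TAXONOMY_ID[name]


def precision_recall(true_pos, false_pos, false_neg):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    predicted = true_pos + false_pos
    actual = true_pos + false_neg
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall


if __name__ == "__main__":
    print(ground("E. coli"))
    print(precision_recall(3, 1, 1))
```

Grounding accuracy, as reported above, is a separate measure: the share of detected mentions that were linked to the correct taxonomy entry.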

New Book Chapter on Semantic Wikis and Natural Language Processing for Cultural Heritage Data


Springer just published a new book, Language Technology for Cultural Heritage, where we also contributed a chapter: "Integrating Wiki Systems, Natural Language Processing, and Semantic Technologies for Cultural Heritage Data Management". The book collects selected, extended papers from several years of the LaTeCH workshop series, where we presented our work on the Durm Project back in 2008.

In this project, which ran from 2004 to 2006, we analysed the historic Encyclopedia of Architecture, which was written in German between 1880 and 1943. It was one of the largest projects aiming at conserving all architectural knowledge available at that time. Today, its vast amount of content is mostly lost: few complete sets are available and its complex structure does not lend itself easily to contemporary application. We were able to track down one of the rare complete sets in Karlsruhe University's library, where it fills several meters of shelves in the archives. The goal, then, was to apply "modern" (as of 2005) semantic technologies to make these heritage documents accessible again by transforming them into a semantic knowledge base. Due to funding limitations, we only worked with one book in this project, but the system was developed so that it can eventually cover the complete set. Using techniques from Natural Language Processing and Semantic Computing, we automatically populate an ontology that can be used in various application scenarios: building historians can use it to navigate and query the encyclopedia, while architects can directly integrate it into contemporary construction tools. Additionally, we made all content accessible through a user-friendly Wiki interface, which combines the original text with NLP-derived metadata and adds annotation capabilities for collaborative use (note that not all features are enabled in the public demo version).

All data created in the project (scanned book images, generated corpora, etc.) is publicly available under open content licenses. We also still maintain a number of open source tools that were originally developed for this project, such as the Durm German Lemmatizer. A new version of our Wiki/NLP integration, which will allow everyone to easily set up a similar system, is currently under development and will be available in early 2012.

New Jenkins Server for Semantic Assistants Project

We now have a public Jenkins server (formerly known as Hudson) available for our Semantic Assistants project that supports the SourceForge code repository.

In the spirit of continuous integration, every check-in to the Subversion repository is built automatically and additionally checked with various tools. The latest build is archived and available for browsing and download.

Some Photos from the FIG3 GATE Meeting in Montréal

Some photos from our GATE meeting at Concordia University in Montréal, August/September 2010.

CfP: Text Summarization Workshop at Canadian AI (TS11)

2011-03-11
America/Montreal

TS11 at Canadian AI
https://sites.google.com/site/ts11canai/

Call for papers

Automatic text summarization (TS) has been a matter of active research for over a decade now. Doing TS really well would require insights from statistics, machine learning, linguistics and cognitive science, to name a few. Despite a great deal of research effort, state-of-the-art TS systems achieve summary quality much lower than even untrained human summarizers. There is room for improvement and much interesting work to do.

Summarization is the theme of Text Analysis Conferences (TAC), an influential annual shared evaluation exercise. It is not uncommon to plan TS work around those annual events, regardless of their somewhat narrow range: they focus on summarizing news. While this workshop is open to relevant work already presented at TAC, it is designed as a venue for research on TS which does not necessarily fit the TAC format. We will welcome articles which discuss summarization of other genres (such as blogs, email messages, books, captions or subtitles), investigation of human recall and summarization of data, and the role of language generation in TS, among others. We will also gladly consider position papers on more fundamental long-term challenges in TS: how to move past heavy reliance on shallow lexical information, how to create summaries of high linguistic quality, and so on.
