Home
Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

EWRL and NIPS 2016

Machine Learning Blog - Wed, 2017-01-04 16:24

I went to the European Workshop on Reinforcement Learning and NIPS last month and saw several interesting things.

At EWRL, I particularly liked the talks from:

  1. Remi Munos on off-policy evaluation
  2. Mohammad Ghavamzadeh on learning safe policies
  3. Emma Brunskill on optimizing biased-but-safe estimators (sense a theme?)
  4. Sergey Levine on low sample complexity applications of RL in robotics.

My talk is here. Overall, this was a well-organized workshop with diverse and interesting subjects, with the only caveat being that they had to limit registration.

At NIPS itself, I found the poster sessions fairly interesting.

  1. Allen-Zhu and Hazan had a new notion of a reduction (video).
  2. Zhao, Poupart, and Gordon had a new way to learn Sum-Product Networks
  3. Ho, Littman, MacGlashan, Cushman, and Austerweil had a paper on how “Showing” is different from “Doing”.
  4. Toulis and Parkes had a paper on estimation of long term causal effects.
  5. Rae, Hunt, Danihelka, Harley, Senior, Wayne, Graves, and Lillicrap had a paper on large memories with neural networks.
  6. Hardt, Price, and Srebro had a paper on Equal Opportunity in ML.

Format-wise, I thought having 2 poster sessions was better than 1, but I really would have preferred more. The recorded spotlights are also pretty cool.

The NIPS workshops were great, although I was somewhat reminded of kindergarten soccer in terms of lopsided attendance. This may be inevitable given how hot the field is, but I think it’s important for individual researchers to remember that:

  1. There are many important directions of research.
  2. You personally have a much higher chance of doing something interesting if everyone else is not doing it also.

During the workshops, I learned about ADAM (a momentum form of Adagrad), testing ML systems, and that even TensorFlow is finally looking into synchronous updates for parallel learning (allreduce is the way).

(edit: added one)

Categories: Blogroll

Ethics, logistic regression, and 0-1 loss

LingPipe Blog - Tue, 2016-12-27 18:51
Andrew Gelman and David Madigan wrote a paper on why 0-1 loss is so problematic: Gelman and Madigan. 2015. How is Ethics Like Logistic Regression? Chance 28(12). This is related to the issue of whether one should be training on an artificial gold standard. Suppose we have a bunch of annotators and we don’t have […]
Categories: Blogroll

Vowpal Wabbit version 8.3 and tutorial

Machine Learning Blog - Thu, 2016-12-08 05:02

I just released Vowpal Wabbit 8.3 and we are planning a tutorial at NIPS Saturday over the lunch break in the ML systems workshop. Please join us if interested.

8.3 should be backwards compatible with all 8.x series. There have been big changes since the last version related to

  1. Contextual bandits, particularly w.r.t. the decision service.
  2. Learning to search for which we have a paper at NIPS.
  3. Logarithmic time multiclass classification.
Categories: Blogroll

Triplifying a real dictionary

Semantic Web Company - Wed, 2016-11-16 08:07

The Linked Data Lexicography for High-End Language Technology (LDL4HELTA) project was started in cooperation between Semantic Web Company (SWC) and K Dictionaries. LDL4HELTA combines lexicography and Language Technology with semantic technologies and Linked (Open) Data mechanisms and technologies. One of the implementation steps of the project is to create a language graph from the dictionary data.

The input data, described further, is a Spanish dictionary core translated into multiple languages and available in XML format. This data should be triplified (which means to be converted to RDF – Resource Description Framework) for several purposes, including to enrich it with external resources. The triplified data needs to comply with Semantic Web principles.

To get from a dictionary’s XML format to its triples, I learned that you must have a model. One piece of the sketched model, representing two Spanish words which have senses that relate to each other, is presented in Figure 1.

Figure 1: Language model example (click to enlarge)

This sketched model first needs to be created by a linguist who understands both the language complexity and Semantic Web principles. The extensive model [1] was developed at the Ontology Engineering Group of the Universidad Politécnica de Madrid (UPM).

Language is very complex; on this we all agree. Just how complex it really is, though, is probably often underestimated, especially when you need to model all its details and triplify it.

So why is the task so complex?

To start with, the XML structure is complex in itself, as it contains nested structures. Each word constitutes an entry. One single entry can contain information about:

  • Pronunciation
  • Inflection
  • Range Of Application
  • Sense Indicator
  • Compositional Phrase
  • Translations
  • Translation Example
  • Alternative Scripting
  • Register
  • Geographical Usage
  • Sense Qualifier
  • Provenance
  • Version
  • Synonyms
  • Lexical sense
  • Usage Examples
  • Homograph information
  • Language information
  • Specific display information
  • Identifiers
  • and more…

Entries can have predefined values, which can recur, but their fields can also hold so-called free values, which can vary. Such fields are:

  • Aspect
  • Tense
  • Subcategorization
  • Subject Field
  • Mood
  • Grammatical Gender
  • Geographical Usage
  • Case
  • and more…

As mentioned above, in order to triplify a dictionary one needs to have a clearly defined model. Usually, when modelling linked data or just RDF, it is important to make use of existing models and schemas to enable easier and more efficient use and integration. One well-known lexicon model is Lemon. Lemon covers many of our dictionary needs, but not all of them. We also started using the Ontolex model, which is much more complex and is considered to be the evolution of Lemon. However, some pieces of information were still missing, so we created an additional ontology to cover all missing corners and catch the specific details that did not overlap with the Ontolex model (such as the free values).

An additional level of complexity was the need to identify exactly which pieces were missing in the Ontolex model and its modules, and to create the part for the missing information. This was part of creating the dictionary’s model, which we called ontolexKD.

As a developer you never sit down to think about all the senses or meanings or translations of a word (except if you specialize in linguistics), so just to understand the complexity was a revelation for me. And still, each dictionary contains information that is specific to it and which needs to be identified and understood.

The process used in order to do the mapping consists of several steps. Imagine this as a processing pipeline which manipulates the XML data. UnifiedViews is an ETL tool, specialized in the management of RDF data, in which you can configure your own processing pipeline. One of its use cases is to triplify different data formats. I used it to map XML to RDF and upload it into a triple store. Of course this particular task can also be achieved with other such tools or methods for that matter. In UnifiedViews the processing pipeline resembles what appears in Figure 2.

Figure 2: UnifiedViews pipeline used to triplify XML (click to enlarge)

 

The pipeline is composed of data processing units (DPUs) which communicate iteratively. In left-to-right order, the process in Figure 2 represents:

  • A DPU used to upload the XML files into UnifiedViews for further processing;
  • A DPU which transforms XML data to RDF using XSLT. The style sheet is part of the configuration of the unit;
  • The .rdf generated files are stored on the filesystem;
  • And, finally, the .rdf generated files are uploaded into a triple store, such as Virtuoso Universal Server.

Basically the XML is transformed using XSLT.
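To make the XML-to-RDF step a bit more concrete, here is a minimal Python sketch. It deliberately swaps the XSLT style sheet for the standard library's xml.etree.ElementTree so that it runs without extra dependencies; it is not the style sheet used in the pipeline. The element names (Entry, Headword, PartOfSpeech), the example.org namespaces and the property URIs are all hypothetical and are not taken from the K Dictionaries schema or from ontolexKD.

```python
import xml.etree.ElementTree as ET

# Hypothetical input: the real dictionary XML is far more deeply nested.
SAMPLE_XML = """
<Dictionary>
  <Entry id="EN0001">
    <Headword>entendedor</Headword>
    <PartOfSpeech>adjective</PartOfSpeech>
  </Entry>
</Dictionary>
"""

BASE = "http://example.org/lexicon/"    # assumed resource namespace
PROP = "http://example.org/property/"   # assumed property namespace


def escape_literal(text):
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def entries_to_ntriples(xml_text):
    """Emit one N-Triples line per non-empty child element of each <Entry>."""
    root = ET.fromstring(xml_text)
    lines = []
    for entry in root.iter("Entry"):
        subject = f"<{BASE}{entry.get('id')}>"
        for child in entry:
            value = (child.text or "").strip()
            if value:
                literal = escape_literal(value)
                lines.append(f'{subject} <{PROP}{child.tag}> "{literal}" .')
    return lines


triples = entries_to_ntriples(SAMPLE_XML)
print("\n".join(triples))
```

A flat mapping like this only works for the simplest fields; the nested structures and free values described earlier are exactly what make a real style sheet much longer.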

Complexity also increases through the URIs (Uniform Resource Identifiers) that are needed for mapping the information in the dictionary, because with Linked Data any resource should have a clearly identified and persistent identifier! The starting point was to represent a single word (headword) under a desired namespace and build on it to associate it with its part of speech, grammatical number, grammatical gender, definition and translation, just to begin with.

The base URIs follow the best practices recommended in the ISA study on persistent URIs, following the pattern: http://{domain}/{type}/{concept}/{reference}.
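As a small illustration of that pattern, the following Python sketch fills in the four slots and percent-encodes each path segment. The domain and the segment values used here are placeholders chosen for the example, not the project's actual namespace or identifiers.

```python
from urllib.parse import quote


def build_uri(domain, type_, concept, reference):
    """Fill the http://{domain}/{type}/{concept}/{reference} pattern."""
    # Percent-encode every path segment so spaces or slashes in a value
    # cannot change the structure of the URI.
    parts = [quote(str(p), safe="") for p in (type_, concept, reference)]
    return f"http://{domain}/" + "/".join(parts)


# Placeholder values, for illustration only.
uri = build_uri("example.org", "id", "form", "entendedor-1")
print(uri)
```

Encoding each segment separately keeps the URI stable even when a headword contains characters that are not safe in a path.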

An example of such URIs for the forms of a headword is:

These two URIs represent the singular masculine and singular feminine forms of the Spanish word entendedor.

If the dictionary contains two different adjectival endings, as with entendedor, which has different endings for the feminine and masculine forms (entendedora and entendedor), and they are not explicitly mentioned in the dictionary, then we use numbers in the URI to describe them. If the gender were explicitly mentioned, the URIs would be:

In addition, we should consider that the aim of triplifying the XML was for all these headwords, with their senses, forms and translations, to connect and be identified and linked following Semantic Web principles. The actual overlap and linking of the dictionary resources remains open. A second step for improving the triplification and mapping similar entries, if possible at all, still needs to be carried out. As an example, let’s take two dictionaries: a German dictionary which contains translations into English, and an English dictionary which contains translations into German. We get the following translations:

Bank – bank – German to English

bank – Bank – English to German

The URI of the translation from German to English was designed to look like:

And the translation from English to German would be:

In this case both represent the same translation but have different URIs because they were generated from different dictionaries (mind the translation order). These should be mapped so as to represent the same concept, theoretically, or should they not?

The word Bank in German can mean either a bench or a bank in English. When I translate both English senses back into German I get again the word Bank, but I cannot be sure which sense I translate unless the sense id is in the URI, hence the SE00006110 and SE00006116. It is important to keep the order of translation (target-source) but later map the fact that both translations refer to the same sense, same concept. This is difficult to establish automatically. It is hard even for a human sometimes.
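The mapping problem can be sketched in a few lines of Python. The idea, which is only one possible approach and not something the project has implemented, is to group translation records under a key that ignores direction, so that Bank→bank and bank→Bank end up together as candidates for linking. The record layout and the sense ids (sense-a, sense-b, sense-c) are made up for this example.

```python
from collections import defaultdict

# Hypothetical records:
# (source language, source word, sense id, target language, target word)
translations = [
    ("de", "Bank", "sense-a", "en", "bank"),
    ("de", "Bank", "sense-b", "en", "bench"),
    ("en", "bank", "sense-c", "de", "Bank"),
]


def pair_key(record):
    """Direction-independent key: the set of (language, word) pairs."""
    src_lang, src_word, _sense, tgt_lang, tgt_word = record
    return frozenset({(src_lang, src_word), (tgt_lang, tgt_word)})


# Group records that translate between the same two words.
candidates = defaultdict(list)
for rec in translations:
    candidates[pair_key(rec)].append(rec)

# Only groups with more than one record are candidates for linking.
linkable = [recs for recs in candidates.values() if len(recs) > 1]
for recs in linkable:
    print([r[2] for r in recs])
```

Note what the grouping does not tell us: whether sense-a and sense-c really denote the same concept. The key only narrows down the candidates; deciding that two senses match still needs the sense information, and often a human.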

One of the last steps of complexity was to develop a generic XSLT which can triplify all the different languages of this dictionary series and store the complete data in a triple store. The question remains: is the design of such a universal XSLT possible while taking into account the differences in languages or the differences in dictionaries?

From the point of view of enabling the dictionary to benefit from Semantic Web principles, the task at hand is not yet complete. The linguist is probably the first one who can conceptualize “the how to do this”.

As a next step we will improve the Linked Data created so far and bring it to the status of a good linked language graph by enriching the RDF data with additional information, such as the history of a term or additional grammatical information etc.

References:

[1] J. Bosque-Gil, J. Gracia, E. Montiel-Ponsoda, and G. Aguado-de Cea, “Modelling multilingual lexicographic resources for the web of data: the K Dictionaries case,” in Proc. of GLOBALEX’16 workshop at LREC’16, Portoroz, Slovenia, May 2016.

Categories: Blogroll

Visualize PoolParty project data with SKOS Play! in four steps

Semantic Web Company - Tue, 2016-10-04 03:16

There is a new functionality in PoolParty 5.5 that allows users to manage the skos:inScheme relationship of their concepts.

When you activate the skos:inScheme functionality for your PoolParty project, you can create input data for SKOS Play! very easily. SKOS Play! is a free application that lets you render and visualize SKOS taxonomies in different formats (html, pdf) and different graphical representations (tree, tabular, etc.).

With four steps you can generate such a representation based on PoolParty data: 

1) Activate skos:inScheme in your PoolParty project:

2) Apply skos:inScheme settings for concepts in your taxonomy.

For existing concepts, you can select the subtree in which the skos:inScheme setting should be applied. For new concepts, you can define a behavior to automatically apply the inScheme setting on the active subtree.

This is a screenshot of a small PoolParty subtree, showing beverages that are used for cocktail creation:

As usual, you can see the skos:ConceptScheme in purple. The narrower nodes in green represent skos:Concepts. All skos:Concepts in this subtree have a skos:inScheme relation to the skos:ConceptScheme with the title “Beverages”.

 

 

3) SKOS Play!

When your PoolParty project is publicly available (help page explaining user groups in PoolParty), you can simply copy the URL of the corresponding SPARQL endpoint and paste it into the SKOS Play! input field during the upload process: http://labs.sparna.fr/skos-play/upload. In this example I simply used the SPARQL endpoint of the Cocktails thesaurus: http://vocabulary.semantic-web.at/PoolParty/sparql/cocktails. As an alternative, you could also export your PoolParty project and import the resulting file into SKOS Play! You could retrieve a corresponding file from http://vocabulary.semantic-web.at/cocktails/export/cocktails.trig
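If you want to check which skos:inScheme relations an endpoint would expose before handing it to SKOS Play!, a query URL can be put together as in the Python sketch below. It only builds the URL and makes no network call. It assumes the endpoint accepts a SPARQL query in a `query` GET parameter, as the SPARQL protocol describes; whether this particular endpoint is still reachable, and the query itself, are assumptions made for the example.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Endpoint URL as quoted in the post; its availability is not checked here.
ENDPOINT = "http://vocabulary.semantic-web.at/PoolParty/sparql/cocktails"

# Example query (an assumption): list concepts with their concept scheme.
QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?concept ?scheme
WHERE { ?concept skos:inScheme ?scheme }
LIMIT 10
""".strip()


def build_query_url(endpoint, query):
    """Encode the query as the 'query' GET parameter of the endpoint."""
    return endpoint + "?" + urlencode({"query": query})


url = build_query_url(ENDPOINT, QUERY)

# Round-trip check: decoding the URL gives back the original query text.
decoded = parse_qs(urlsplit(url).query)["query"][0]
assert decoded == QUERY
print(url)
```

Pasting the printed URL into a browser is usually enough to see whether the concepts carry the expected skos:inScheme relation.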

For simplicity you can skip the advanced options.

4) Get results

After you hit the Next button, you receive feedback at the top of the page that the concept data was processed successfully. When you scroll down, you have options to select the skos:ConceptScheme and language that should be further processed. In addition, you have the option to print and to visualize your data. Printing lets you select between alphabetical index and tree. Both versions are clickable and can be created in html or pdf format. Visualization offers different types like a collapsible tree, zoomable square or circle representations and also an autocomplete form.

I chose the tree visualization which results in a nice interactive tree. Users can click circles to unfold the tree. When a label is clicked, the user is directed to this concept URI. In this use case the user is directed to PoolParty Linked Data Frontend.

And the cool thing is that you can simply download the generated tree by right-clicking and choosing Save as…

You simply have to edit the downloaded raw html file to have a fully working visualization: delete the svg element completely to generate an empty div element (id="body").
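If you would rather not do that edit by hand, the clean-up can be sketched in Python as follows. This is a rough sketch using a regular expression, and it assumes the page contains a single, non-nested svg element; the sample markup is invented and much smaller than a real SKOS Play! download.

```python
import re

# Invented sample standing in for the downloaded page.
downloaded = '<div id="body"><svg width="10"><g><circle r="1"/></g></svg></div>'


def strip_svg(html):
    """Remove every <svg>...</svg> block (assumes svg elements are not nested)."""
    return re.sub(r"<svg\b.*?</svg>", "", html, flags=re.DOTALL | re.IGNORECASE)


cleaned = strip_svg(downloaded)
print(cleaned)
```

For anything more complicated than this, a proper HTML parser would be the safer choice.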

The generated html code can be downloaded here: SKOSPlay_blogpost.zip

By the way, you can also see a PoolParty thesaurus visualization, powered with SKOS Play! on this page: http://www.reegle.info/glossary

 

 

 

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 13:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 12:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 13:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Ross Levinsohn, EVP, Yahoo (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:46
Perhaps no man is braver than Ross Levinsohn, at least at Web 2. First of all, he's the top North American executive at a long-besieged and currently leaderless company, and second because he has not backed out of our conversation on Day One (this coming Monday). I spoke to Ross... (Go to Searchblog Main)
Categories: Blogroll

I Just Made a City...

Searchblog - Mon, 2011-10-10 14:41
...on the Web 2 Summit "Data Frame" map. It's kind of fun to think about your company (or any company) as a compendium of various data assets. We've added a "build your own city" feature to the map, and while there are a couple bugs to fix (I'd like... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Vic Gundotra, SVP, Google (And Win Free Tix to Web 2)

Searchblog - Mon, 2011-10-10 14:03
Next up on Day 3 of Web 2 is Vic Gundotra, the man responsible for what Google CEO Larry Page calls the most exciting and important project at this company: Google+. It's been a long, long time since I've heard as varied a set of responses to any Google project... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview James Gleick, Author, The Information (And Win Free Tix to Web 2)

Searchblog - Sat, 2011-10-08 21:16
Day Three kicks off with James Gleick, the man who has written the book of the year, at least if you are a fan of our conference theme. As I wrote in my review of "The Information," Gleick's book tells the story of how, over the past five thousand or... (Go to Searchblog Main)
Categories: Blogroll

I Wish "Tapestry" Existed

Searchblog - Fri, 2011-10-07 15:34
(image) Early this year I wrote File Under: Metaservices, The Rise Of, in which I described a problem that has burdened the web forever, but to my mind is getting worse and worse. The crux: "...heavy users of the web depend on scores - sometimes hundreds - of services,... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Steve Ballmer, CEO of Microsoft (And Win Free Tix to Web 2)

Searchblog - Fri, 2011-10-07 13:17
Day Two at Web 2 Summit ends with my interview of Steve Ballmer. Now, the last one, some four years ago, had quite a funny moment. I asked Steve about how he intends to compete with Google on search. It's worth watching. He kind of turns purple. And not... (Go to Searchblog Main)
Categories: Blogroll

Me, On The Book And More

Searchblog - Thu, 2011-10-06 13:05
Thanks to Brian Solis for taking the time to sit down with me and talk both specifically about my upcoming book, as well as many general topics.... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Michael Roth, CEO of Interpublic Group (And Win Free Tix to Web 2)

Searchblog - Thu, 2011-10-06 12:37
What's the CEO of a major advertising holding company doing at Web 2 Summit? Well, come on down and find out. Marketing dollars are the oxygen in the Internet's bloodstream - the majority of our most celebrated startups got that way by providing marketing solutions to advertisers of all stripes.... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Mary Meeker of KPCB (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-05 15:00
For the first time in eight years, Mary Meeker will let me ask her a few questions after she does her famous market overview. Each year, Mary pushes the boundaries of how many slides she can cram into one High Order Bit, topping out at 70+ slides in ten or... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Dennis Crowley, CEO, Foursquare (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-05 13:06
Foursquare co-founder and CEO Dennis Crowley will give his first 1-1 interview on the Web 2 stage on the conference's second day, following a morning of High Order Bits and a conversation on privacy policy with leaders from government in both the US and Canada. After Crowley will be a... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Michael Dell, CEO, Dell (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-04 12:35
Not unlike Steve Jobs back in the 1990s, Michael Dell returned to the helm of his company at a crucial moment, when his namesake was seemingly rudderless. Back in 2007, Dell was losing marketshare to HP, Apple had not yet proven the monster it has since become in mobile, and... (Go to Searchblog Main)
Categories: Blogroll

FM Welcomes Lijit to the Family

Searchblog - Tue, 2011-10-04 11:04
Today Federated Media Publishing announced it has acquired Lijit Networks, a world-class business partner to online publishers based in Boulder, Colorado. This combination is the result of literally months of work, including a ton of strategic thinking that dates back to Federated's acquisitions of Foodbuzz, Big Tent, and TextDigger... (Go to Searchblog Main)
Categories: Blogroll