Semantic Software Lab
Concordia University
Montréal, Canada


How the PoolParty Semantic Suite is learning to speak 40+ languages

Semantic Web Company - Mon, 2015-08-03 05:49

Business is becoming more and more globalised: enterprises and organisations act in several regions and thus face cultural differences as well as language barriers. The European market alone has 24 working languages in the EU28, which makes cross-border services considerably more complicated. As a result, powerful language technology is needed, and intense efforts have already been made in the EU to deal with this situation and enable the vision of a multilingual digital single market (a priority area of the European Commission this year).

Here at the Semantic Web Company we also witness fast-growing demand for language-independent, language-specific and cross-language solutions to enable business cases like cross-lingual search or multilingual data management. To provide such solutions, a multilingual metadata and data management approach is needed, and this is where PoolParty Semantic Suite comes into play: as PoolParty follows W3C Semantic Web standards like SKOS, we have language-independent technologies in place and our customers already benefit from them. However, for text analysis and text extraction, the ability to process multilingual information and data is key to success, which means that the systems need to speak as many languages as possible.

Our new cooperation with K Dictionaries (KD) is enabling the PoolParty Semantic Suite to continuously “learn to speak” more and more languages, by making use of KD’s rich monolingual, bilingual and multilingual content and its long-time experience in lexicography as a base for improved multi-language text analysis and processing.

KD is a technology-oriented content and data creator that is based in Tel Aviv and cooperates with publishing partners, ICT firms, academia and professional associations worldwide. It deals with nearly 50 languages, offering quality monolingual, bilingual and multilingual lexical datasets, morphological word forms, phonetic transcription, etc.

As a result of this cooperation, PoolParty now provides language bundles in the following languages, which can be licensed together with all types of PoolParty servers:

  • English
  • French
  • German
  • Italian
  • Japanese
  • Korean
  • Russian
  • Slovak
  • Spanish

Additional language bundles are in preparation and will be in place soon!

Furthermore, SWC and KD are partners in a brand new EUREKA project that is supported by a bilateral technology/innovation program between Austria and Israel. The project is called LDL4HELTA (Linked Data Lexicography for High-End Language Technology Application) and combines lexicography and Language Technology with Semantic Web and Linked (Open) Data mechanisms and technologies to improve existing products and services and develop new ones. It integrates the products of both partners to better serve existing and new customers, and to enter new markets together in the field of Linked Data lexicography-based Language Technology solutions. The project was successfully kicked off in early July and has a duration of 24 months, with the first concrete results due early in 2016.

The LDL4HELTA project is supported by a research partner (Austrian Academy of Sciences) and an expert Advisory Board including Prof Christian Chiarcos (Goethe University, Frankfurt), Mr Orri Erling (OpenLink Software), Dr Sebastian Hellmann (Leipzig University), Prof Alon Itai (Technion, Haifa), and Ms Eveline Wandl-Vogt (Austrian Academy of Sciences).

So stay tuned: we will keep posting news and activities from this cooperation here on the blog!

Categories: Blogroll

Semantic Web Company with LOD2 project top listed at the first EC Innovation Radar

Semantic Web Company - Tue, 2015-07-14 06:15

The Innovation Radar is a DG Connect support initiative which focuses on the identification of high potential innovations and the key innovators behind them in FP7, CIP and H2020 projects. The Radar supports the innovators by suggesting a range of targeted actions that can assist them in fulfilling their potential in the market place. The first Innovation Radar Report reviews the innovation potential of ICT projects funded under the 7th Framework Programme and the Competitiveness and Innovation Framework Programme. Between May 2014 and January 2015, the Commission reviewed 279 ICT projects, which had resulted in a total of 517 innovations, delivered by 544 organisations in 291 European cities.

The core of the analysis is the Innovation Capacity Indicator (ICI), which measures both the ability of the innovator company and the quality of the environment in which it operates. Among these results, SWC has received two top rankings: one for the recently concluded LOD2 project (LOD2 – Creating Knowledge out of Interlinked Data) and another as one of the key organisations, and thereby innovating SMEs, within this project. Also listed are our partners OpenLink Software and Wolters Kluwer Germany.

Ranking of the top 10 innovations and key organisations behind them (Innovation Radar 2015)

We are happy and proud that the report identifies Semantic Web Company as one of those players (10 %) where commercial exploitation of innovations is already ongoing. That strengthens and confirms our approach of interconnecting FP7 and H2020 research and innovation activities with real-world business use cases coming from our customers and partners. Our core product PoolParty Semantic Suite can thus be taken as a best practice example of embedding collaborative research into an innovation-driven commercial product. For Semantic Web Company, the report is particularly encouraging because of its emphasis on the positive role of SMEs, where the report sees 41% of high-potential innovation coming from…


Blogpost by Martin Kaltenböck and Thomas Thurner

Categories: Blogroll

Improved Customer Experience by use of Semantic Web and Linked Data technologies

Semantic Web Company - Sun, 2015-06-28 05:08

With the rise of Linked Data technologies, several new approaches come into play for improving customer experience across all digital channels of a company. All of these methodologies can be subsumed under the term “the connected customer”.

These are interesting not only for retailers operating a web shop, but also for enterprises seeking new ways to develop tailor-made customer services and to increase customer retention.

Linked Data methodologies can help to improve several metrics along a typical customer experience lifecycle.

  1. Personalized access to information, e.g. to technical documentation
  2. Cross-selling through a better contextualization of product information
  3. Semantically enhanced help desk, user forums and self service platforms
  4. Better ways to understand and interpret a customer intention by use of enterprise vocabularies
  5. More dynamic and more cost-effective management of complex multi-channel websites
  6. More precise methods for data analytics, e.g. to allow marketers to better target campaigns and content to the user’s preferences
  7. Enhanced search experience at aggregators like Google through the use of microdata and

At the center of this approach, knowledge graphs work like a ‘linking machine’. Based on standards-based semantic models, business entities are linked in a highly dynamic way. Those graphs go beyond the power of social graphs: while social graphs are focused on people only, knowledge graphs connect all kinds of relevant business objects to each other.

When customers and their behaviours are represented in a knowledge model, Linked Data technologies try to preserve as much semantics as possible. By these means they are able to complement other approaches for big data analytics, which rather tend to flatten out the data model behind business entities.

Categories: Blogroll

Using SPARQL clause VALUES in PoolParty

Semantic Web Company - Fri, 2015-06-26 09:15

Since PoolParty fully supports SPARQL 1.1, you can use clauses like VALUES. The VALUES clause can be used to provide an unordered solution sequence that is joined with the results of the query evaluation. From my perspective it is a convenient way to filter variables and it makes queries more readable.

E.g. when you want to know which cocktails you can create with Gin and a highball glass you can fire this query:

PREFIX skos:<>
PREFIX co: <>

SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
  FILTER (?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en )
}

When you want to filter on additional pairs of ingredients and drinkware in combination, the query gets quite clumsy. Wrongly placed parentheses can break the syntax. In addition, when writing complicated queries you easily introduce errors, e.g. by mixing up boolean operators, which leads to wrong results…

...
  FILTER ((?ingredientLabel = "Gin"@en && ?drinkwareLabel = "Highball glass"@en ) ||
          (?ingredientLabel = "Vodka"@en && ?drinkwareLabel = "Old Fashioned glass"@en ))
}

Using VALUES can help in this situation. For example this query shows you how to filter both pairs Gin+Highball glass and Vodka+Old Fashioned glass in a neat way:

PREFIX skos:<>
PREFIX co: <>

SELECT ?cocktailLabel
WHERE {
  ?cocktail co:consists-of ?ingredient ;
    co:uses ?drinkware ;
    skos:prefLabel ?cocktailLabel .
  ?ingredient skos:prefLabel ?ingredientLabel .
  ?drinkware skos:prefLabel ?drinkwareLabel .
}
VALUES ( ?ingredientLabel ?drinkwareLabel ) {
  ("Gin"@en "Highball glass"@en)
  ("Vodka"@en "Old Fashioned glass"@en)
}

Especially when you create SPARQL code automatically, e.g. generated by a form, this clause can be very useful.
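
To illustrate the point about generated queries, here is a minimal Python sketch of how a form handler might build the VALUES block from a list of label pairs. The helper names `sparql_literal` and `values_clause` are hypothetical and not part of PoolParty; the sketch assumes all labels are plain strings with one shared language tag.

```python
def sparql_literal(text, lang="en"):
    """Render a language-tagged SPARQL string literal, escaping \\ and "."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"@{lang}'


def values_clause(variables, rows, lang="en"):
    """Build a VALUES block binding the given variables to each row of labels."""
    header = " ".join(f"?{name}" for name in variables)
    lines = [f"VALUES ( {header} ) {{"]
    for row in rows:
        if len(row) != len(variables):
            raise ValueError("each row needs one value per variable")
        cells = " ".join(sparql_literal(cell, lang) for cell in row)
        lines.append(f"  ({cells})")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical form input: the two pairs used in the query above.
pairs = [("Gin", "Highball glass"), ("Vodka", "Old Fashioned glass")]
clause = values_clause(["ingredientLabel", "drinkwareLabel"], pairs)
print(clause)
```

Adding another pair only means appending one more tuple to `pairs`; no boolean operators or nested parentheses have to be balanced, which is what makes the clause convenient for generated code.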


Categories: Blogroll


Code from an English Coffee Drinker - Wed, 2015-06-17 06:13
Over on one of my other blogs I recently wrote a post about a Hummingbird Hawk-Moth which I'd seen in the garden. This post included an animated GIF of the moth in flight. I included an animated GIF rather than a normal video because I'd had problems with the camera and the longest video I managed to shoot before the moth flew away contained just 14 frames.

As you can see it finishes almost before you've had chance to realise it's started to play and isn't very helpful at showing why the moth is named after the Hummingbird.

Now I'm sure there are many ways in which you could turn the video into a looping GIF but I'm going to detail what I did, partly so I don't forget, and partly as I wrote some software to deal with one particular issue.

My approach to almost any video related task usually starts with FFmpeg and this was no different, with a simple command to scale the video down and produce a GIF as output making sure to keep the right frame rate.
ffmpeg -i 00002.MTS -vf "fps=25,scale=400:225" animated.gif
As you can see this works to produce an animated GIF although there are a number of problems with it. Firstly it's very grainy and secondly you can't really see the moth now we've scaled the image down. We'll deal with the second problem first (as the first problem mostly goes away by the end). Again a simple FFmpeg command allows us to crop the video:

ffmpeg -i 00002.MTS -vf "fps=25,scale=1920:1080,crop=800:450:400:225,scale=400:225" animated.gif
This gives us a much better view of the moth but that jump as the animation loops around is very very annoying. The problem is that even with just 14 frames there is enough camera movement between the first and last frame for the join to be really obvious. This is something you often see with animated GIFs produced from video and you can clearly see why in this image of the first and last frames superimposed on one another.

At this point I tried a number of filters in FFmpeg that are supposed to help remove camera shake etc. but none of them helped as the camera movement is fairly smooth and what I want is to just remove the movement altogether so that the plant stays still between the frames. While I couldn't find any way of doing this in FFmpeg I did realise that some of the code I used in 3DAssembler might help.

I've never really described how 3DAssembler works, but essentially it aligns images. In fact it uses SURF to detect identical features in a pair of images and then determines the horizontal and vertical shift required to try and overlap as many of the features it found as possible. In 3DAssembler this allows you to automatically align photos to produce a usable stereo pair. Here though we can use the same approach to align the frames of the video.

The code I wrote (which is currently a mess of hard coded values, so I'll release it once I've had time to clean it up) calculates the horizontal and vertical shifts of each frame against the first frame and then crops each frame appropriately. If we superimpose the first and last of these corrected frames we can see how things have improved.
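
The alignment code itself isn't released yet, so the following is only a minimal sketch of the cropping step, assuming the per-frame shifts against the first frame have already been measured. The function name `crop_boxes` and the sample shift values are hypothetical. The idea is that every frame is cropped to the largest window that all frames share once their shifts are taken into account.

```python
def crop_boxes(shifts, width, height):
    """Return one (left, top, w, h) crop box per frame.

    shifts[i] = (dx, dy) means a feature at (x, y) in the first frame
    appears at (x + dx, y + dy) in frame i (so shifts[0] is (0, 0)).
    """
    xs = [dx for dx, _ in shifts]
    ys = [dy for _, dy in shifts]
    min_dx, max_dx = min(xs), max(xs)
    min_dy, max_dy = min(ys), max(ys)

    # The shared window shrinks by the total spread of the shifts.
    w = width - (max_dx - min_dx)
    h = height - (max_dy - min_dy)
    if w <= 0 or h <= 0:
        raise ValueError("shifts are larger than the frame")

    # Offsetting each box by the frame's own shift keeps the scene fixed.
    return [(dx - min_dx, dy - min_dy, w, h) for dx, dy in shifts]


# Hypothetical shifts for three 100x50 frames.
boxes = crop_boxes([(0, 0), (3, -2), (5, 1)], width=100, height=50)
print(boxes)
```

Each box stays inside its frame, and all boxes have the same size, so the cropped frames can be scaled and assembled without any further padding.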

Producing the final animated GIF is then a multistage process. Firstly we use FFmpeg to turn the video into a series of still images, taking care to deinterlace the original video:
ffmpeg -i 00002.MTS -vf "scale=1920:1080,yadif=1:0,hqdn3d,fps=25" frame%03d.png
My code then aligns, crops, and scales these frames down to the same size we were using before. The set of frames is then reassembled to produce the animated GIF:
convert -delay 4 frame*.png animated.gif
While there is still a jump of the background as it loops around it is a lot less obvious than in the original. You'll also notice that the grain present in the original has disappeared. The grain is actually dithering introduced to try and improve the image due to the fact that a GIF image is limited to just 256 colours. I didn't apply a dither filter when assembling the GIF which does mean you can see problems with the colour palette especially in the grass at the bottom left.

I'm not sure why GIF has such a limited colour palette but I'm guessing it relates to keeping the filesize down when storage was more expensive and bandwidth was a lot lower. Now that most people's internet connections can handle full resolution HD video we should probably move beyond GIF images. For single images, especially hand drawn or those that use transparency, GIF has been replaced by PNG. The PNG format also supports animation. Unfortunately only Firefox currently has support for showing animated PNG images; in all the other browsers all you see is the first frame.

Fortunately it is possible to get animated PNGs to play in most modern browsers with a little bit of JavaScript trickery using the apng-canvas library. Unfortunately the way this library works means that you need to host both the image and the JavaScript in the same place which makes it difficult to use with Blogger, not helped by the fact that you can't currently upload animated PNG files to Blogger either. Anyway after a little bit of work hopefully the following should be animated for everyone.

As you can see this is much better than the GIF version as we aren't limited to just 256 colours. This was produced in exactly the same way as the GIF version though, apart from the final command to assemble the file which now looks like:
apngasm animated.png frame*.png 1 25
I'll clean up the code I used and make it available in case anyone else fancies producing weird little animated videos.
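
For anyone wondering where the timing values in the two assembly commands come from, here is a small Python sketch of the arithmetic. It assumes that `convert -delay` counts in ticks of 1/100 of a second and that the trailing `1 25` given to apngasm is a delay numerator and denominator in seconds; the function names are made up for illustration.

```python
from fractions import Fraction


def gif_delay_ticks(fps):
    """Nearest per-frame delay in 1/100 s ticks for a target frame rate."""
    return max(1, round(100 / fps))


def apng_delay(fps):
    """Per-frame delay in seconds as a (numerator, denominator) pair."""
    delay = Fraction(1, fps)
    return delay.numerator, delay.denominator


def effective_fps(ticks):
    """Frame rate a GIF actually plays at for a given tick delay."""
    return 100 / ticks


print(gif_delay_ticks(25))   # delay for the 25 fps frames used here
print(apng_delay(25))        # numerator and denominator for 25 fps
print(effective_fps(gif_delay_ticks(30)))  # 30 fps does not divide evenly
```

At 25 frames per second both formats can represent the delay exactly, which is why the GIF and the animated PNG play at the same speed. Frame rates that don't divide 100 evenly get rounded in the GIF case.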

Categories: Blogroll

Java and IDEs for the R/Python world

(Some tips on how to use Java if you’re from R or Python; some thoughts on software platforms and programming for data-science-or-whatever-we-call-it-now.)

Most of my research these days uses Python, R, or Java. It’s terrific that so many people are using Python and R as their primary languages now; this is way better than the bad old days when people overused Java just because that’s what they learned in their intro CS course. Python/R are better for many things. But fast, compiled, static languages are still important[1], and Java still seems to be a very good cross-platform approach for this[2], or at the very least, it’s helpful to know how to muck around with CoreNLP or Mallet. I think in undergrad I kept annoying my CS professors that we needed to stop using Java and do everything in Python, but I honestly think we now have the opposite problem: I’ve met many people recently who do lots of programming without traditional CS training (e.g. from the natural sciences, social sciences, statistics, humanities, etc.), who need to pick up some Java but find it fairly different from the lightweight languages they first learned. I don’t know of good overall introductions to the language for this audience, but here’s a little bit of information about development tools which make it easier.

Unlike R or Python, Java is really hard to program with just a text editor. You have to import tons of packages to do anything basic, and the names for everything are long and easy to misspell, which is extra bad because it takes more lines of code to do anything. While it’s important to learn the bare basics of compiling and running Java from the commandline (at the very least because you need to understand it to run Java on a server), the only good way to write Java for real is with an IDE. This is more complicated than a text editor, but once you get the basics down it is much more productive for most things. In many ways Java is an outdated, frustrating language, but Java plus an IDE is actually pretty good.

The two most popular IDEs seem to be Eclipse and IntelliJ. I’ve noticed really good Java programmers often prefer IntelliJ. It’s probably better. I use Eclipse only because I learned it a long time ago. They’re both free.

The obvious things an IDE gives you include things like autosuggestion to tell you method names for a particular object, or instantly flagging misspelled variable names. But the most useful and underappreciated features, in my opinion, are for code navigation and refactoring. I feel like I became many times more productive when I learned how to use them.

For example:

  1. Go to a definition (Eclipse name: “Open Declaration”). Hold “Command” and all the function names, class names, and variable names will get underlined. You can click one to navigate to where it’s declared. This is really helpful for following method calls. You are basically following the path your program would take at runtime. You can even navigate into the code for any library or the standard library.
  2. Back: this is a button on the toolbar. After you have navigated to a declaration, use this to go back to where you were before. This lets you do things like go to a method just to quickly refresh your memory about what’s going on, or maybe go to a class to remember what things are in it, then after a second go right back to what you were working on. This lets you effectively deal with a lot more complexity without holding it all in your head at once.

(The “Command” key is for Mac with Eclipse; there are equivalents for Linux and Windows and other IDEs too.)

With these two commands, you can move through your code, and other people’s code, like it’s a web browser. Enabling keyboard shortcuts makes it work even better. Then you can press a keyboard shortcut to navigate to the function currently under your cursor, and press another to go back to where you were. I think that by default these two commands don’t both have shortcuts; it’s worth adding them yourself (in Preferences). I actually mapped them to be like Chrome or Safari, using Command-[ and Command-] for Back and Open Declaration, respectively. I use them constantly when writing Java code.

But that’s just one navigational direction. You can also traverse in other directions with:

  • See all references (Eclipse: right-click, “References”; or, Cmd-Shift-G). You invoke this on a function name in the code. Then you’ll get a listing on the sidebar of all places that call that function, and you can click on them to go to them. As opposed to going to a declaration, this lets you go backwards in a hypothetical call stack. It’s like being able to navigate to all inbound links, like all “cited by” in Google Scholar. And it’s useful for variables and classes, too. By invoking this on different things in your code, you quickly get little ego-network snapshots of your codebase’s dependency graph. This not only helps you track down bugs, but helps you figure out how to refactor or restructure your code.

There are many other useful navigational features as well, such as navigating to a class by typing a prefix of its name; and many other IDE features too. Different people tend to use different ones so it’s worth looking at what different people use.

Finally, besides navigation, a very useful feature is rename refactoring: any variable or function or class can be renamed, and all references to it get renamed too. Since names are pretty important for comprehension, this actually makes it much easier to write the first draft of code, because you don’t have to worry about getting the name right on the first try. When I write large Python programs, I find I have to spend lots of time thinking through the structure and naming so I don’t hopelessly confuse myself later. There’s also move refactoring, where you can move functions between different files.

Navigation and refactoring aren’t just things for Java; they’re important things you want to do in any language. There are certainly IDEs and editor plugins for lightweight languages as well which support these things to greater or lesser degrees (e.g. RStudio, PyCharm, Syntastic…). And without IDE support, there are unix-y alternatives like CTags, perl -pi, grep, etc. These are good, but their accuracy relative to the semantics you care about often is less than 100%, which changes how you use them.
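
To make the accuracy point concrete, here is a small Python sketch that contrasts a grep-style text search with a syntax-aware “find references” lookup. It is only an illustration using Python’s standard `ast` module, not how Eclipse or any particular IDE implements the feature; the sample source and the function names `text_matches` and `call_sites` are made up.

```python
import ast

# A made-up snippet: "load" is defined, called once, and mentioned in a string.
SOURCE = '''
def load(path):
    return path

def main():
    data = load("a.txt")
    print("load failed")
    return data
'''


def text_matches(source, name):
    """grep-style: every line number whose text contains the name."""
    return [number
            for number, line in enumerate(source.splitlines(), start=1)
            if name in line]


def call_sites(source, name):
    """Syntax-aware: only line numbers where the name is actually called."""
    tree = ast.parse(source)
    return sorted(node.lineno
                  for node in ast.walk(tree)
                  if isinstance(node, ast.Call)
                  and isinstance(node.func, ast.Name)
                  and node.func.id == name)


print(text_matches(SOURCE, "load"))
print(call_sites(SOURCE, "load"))
```

The text search also reports the definition and the string literal, while the syntax-aware version reports only the real call. That gap is small here, but it is the reason a text-based rename across a large codebase needs more manual checking than an IDE refactoring.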

Java and IDE-style development feel almost retrospective in some ways. To me at least, they’re associated with a big-organization, top-heavy, bureaucratic software engineering approach to programming, which feels distant from the needs of computational research or startup-style lightweight development. And they certainly don’t address some of the major development challenges facing scientific programming, like dependency management for interwoven code/data pipelines, or data/algorithm visualization done concurrently with code development. But these tools are still the most effective ones for a large class of problems, so they are worth doing well if you’re going to do them at all.

[1]: An incredibly long and complicated discussion addressed in many other places, but in my own work, static languages are necessary over lightweight ones for (1) algorithms that need more speed, especially ones that involve graphs or linguistic structure, or sample/optimize over millions of datapoints; (2) larger programs, say more than a few thousand lines of code, which is when dynamic typing starts to turn into a mess while static typing and abstractions start to pay off; (3) code with multiple authors, or that develops or uses libraries with nontrivial APIs; in theory dynamic types are fine if everyone is super good at communication and documentation, but in practice explicit interfaces make things much easier. If none of (1-3) are true, I think Python or R is preferable.

[2]: Long story here and depends on your criteria. Scala is similar to Java in this regard. The main comparison is to C and C++, which have a slight speed edge over Java (about the same for straightforward numeric loops, but gains in BLAS/LAPACK and other low-level support), are way better for memory usage, and can more directly integrate with your favorite more-productive high-level language (e.g. Python, R, or Matlab). But the interface between C/C++ and the lightweight language you care about is cumbersome. Cython and Rcpp do this better, and are especially good if you’re willing to be tied to either Python or R, but they’re still awkward enough that they slow you down and introduce new bugs. (Julia is a better approach since it just eliminates this dichotomy, but is still maturing.) C/C++’s weaknesses compared to Java include nondeterministic crashing bugs (due to the memory model), high conceptual complexity to use the better C++ features, time-wasting build issues, and no good IDEs. At the end of the day I find that I’m usually more productive in Java than C/C++, though the Cython or Rcpp hybrids can get to similar outcomes. These main criteria somewhat assume a Linux or Mac platform; people on Microsoft Windows are in a different world where there’s a great C++ IDE and C# is available, which is (mostly?) better than Java. But very few people in my work/research world use Windows and it’s been like this for many years, for better or worse.

Categories: Blogroll

The Economist gets in on the AI Fluff

Data Mining Blog - Sun, 2015-05-10 22:54

The Economist leads with an editorial and an article on The Dawn of Artificial Intelligence.

The editorial starts off with:

“THE development of full artificial intelligence could spell the end of the human race,” Stephen Hawking warns. Elon Musk fears that the development of artificial intelligence, or AI, may be the biggest existential threat humanity faces. Bill Gates urges people to beware of it.

Dread that the abominations people create will become their masters, or their executioners, is hardly new. But voiced by a renowned cosmologist, a Silicon Valley entrepreneur and the founder of Microsoft—hardly Luddites—and set against the vast investment in AI by big firms like Google and Microsoft, such fears have taken on new weight. With supercomputers in every pocket and robots looking down on every battlefield, just dismissing them as science fiction seems like self-deception. The question is how to worry wisely.

To my knowledge, while the three titans mentioned here are of undeniable intellect, it is not clear why two of them are relevant at all. I consider myself reasonably smart, but should I be quoted on my opinion of current treatments of toxoplasmosis? The argument "a smart person says X, therefore we should attend" is known as the appeal to false authority (i.e. a fallacious argument). Such arguments introduce another argumentative crime: the exclusion of actual authority. Eric Horvitz comes to mind:

The head of Microsoft’s main research lab has dismissed fears that artificial intelligence could pose a threat to the survival of the human race.

Eric Horvitz believed that humans would not “lose control of certain kinds of intelligences”, adding: “In the end we’ll be able to get incredible benefits from machine intelligence in all realms of life, from science to education to economics to daily life.”

The article then stumbles along confusing advances in machine perception and machine learning with intelligence. Predictably, the Kasparov vs Deep Blue achievement is trundled out:

Yet AI is already powerful enough to make a dramatic difference to human life. It can already enhance human endeavour by complementing what people can do. Think of chess, which computers now play better than any person.

This suggests that any computer can now beat all humans. Deep Blue was a singular machine which was dismantled after the match. And how does chess playing make a dramatic difference to human life?

A well informed article - and one that is actually interesting to read - should lead with a concrete set of definitions regarding the components of artificial intelligence (perception, reasoning, etc.) and discuss the discrete advances in those areas. Such an article should discuss the challenges of bringing those things together in anything other than the most rudimentary ways they are combined currently. An interesting article on AI would discuss the economics that drive investments and how they dictate what areas are worked on and which aren't.

In all these articles, there is a vague notion of inevitability, as if the machines themselves were selectively investing in research areas, or as if we could stand back and do nothing and this scary version of AI would emerge. It's as if the article were suggesting that landing on the moon was inevitable. The fact that humans did this and then never returned indicates that it was an incredibly intentional endeavor.


Categories: Blogroll

AI, Artificial Birds and Aeroplanes

Data Mining Blog - Sat, 2015-05-09 14:50

The Turing Test for artificial intelligence is a reasonably well understood idea: if, through a written form of communication, a machine can convince a human that it too is a human, then it passes the test. The elegance of this approach (which I believe is its primary attraction) is that it avoids any troublesome definition of intelligence and appeals to an innate ability in humans to detect entities which are not 'one of us'.

This form of AI is the one that is generally presented in entertainment (films, novels, etc.).

However, to an engineer, there are some problems with this as the accepted popular idea of artificial intelligence.

I believe that software engineering can be evaluated with a simple measure of productivity. We either create things that make the impossible possible - going from 0 to 1, or we create things that amplify some value, generally a human's ability to do something - going from X to nX. In other words, we enable a new thing, or we multiply our ability to do something.

Turing AI, while clearly an interesting intellectual concept, is like building an artificial bird instead of building an aeroplane:

  • A Turing AI can converse in natural language, but humans can't speak charts or holograms as a means to explain something.
  • A Turing AI can read a book and appear to understand it, but it can't read a thousand research articles on cancer and find a connection between discoveries that results in a breakthrough.
  • A Turing AI doesn't even have to be particularly intelligent (though it probably ought to at least appear self aware, reflective, etc) and so would be potentially like making a very poor hire for your team.

I believe that if we consider opportunities for applying 'AI methods' to the vast corpus of data (both in natural language and in various structured forms) on the web, we will realize that there is an economic motivation (i.e. build value for users that can build a user base) that will require all the generally accepted facets of an AI (reasoning, perception, communication with humans, theory of mind, etc.) but will be nothing like a Turing AI.

Rather than think of search engines - the fundamental agents that mediate the web corpus - as mechanisms to help humans find the 'right' document, I believe it is time to change our intentions to: optimize the value of the data on the internet for all mankind.

When we achieve this, we will have built an AI, but it won't be a Turing AI and it may not even pass the Turing Test.

Categories: Blogroll

How to Understand Computers in Film

Data Mining Blog - Wed, 2015-05-06 19:21

When we see an act of programming, screeds of code or other interactions with computers in movies, software engineers are likely to roll their eyes.

  • When Chappie's coder has to write 'terabytes of code'
  • When Ford's computer guy has to 'write a special program' to crack a password in one of the Jack Ryan movies
  • etc.

I rolled my eyes at these.

But I also realize that these interactions are just symbols. They are placeholders for 'someone doing some coding'. If we actually saw someone doing some coding, I think we'd roll our eyes for another reason.

This model probably works for any exposure of any technical area in a movie from coding to cooking to sailing to farming. Rather than criticizing the creators of the movie for lack of research, it is so much easier to recognize these moments as symbolic fillers with a dramatic twist.

Categories: Blogroll

Thoughts on KOS (Part 3): Trends in knowledge organization

Semantic Web Company - Tue, 2015-05-05 11:13

The accelerating pace of change in the economic, legal and social environment combined with tendencies towards increased decentralization of organizational structures have had a profound impact on the way we organize and utilize knowledge. The internet as we know it today and especially the World Wide Web as the multimodal interface for the presentation and consumption of multimedia information are the most prominent examples of these developments. To illustrate the impact of new communication technologies on information practices, Saumure & Shiri (2008) conducted a survey on knowledge organization trends in the Library and Information Sciences before and after the emergence of the World Wide Web. Table 1 shows their results.








The survey illustrates three major trends: 1) the spectrum of research areas has broadened significantly from originally complex and expert-driven methodologies and systems to more light-weight, application-oriented approaches; 2) while certain research areas have kept their status over the years (e.g. Cataloguing & Classification or Machine Assisted Knowledge Organization), new areas of research have gained importance (e.g. Metadata Applications & Uses, Classifying Web Information, Interoperability Issues) while formerly prevalent topics like Cognitive Models or Indexing have declined in importance or dissolved into other areas; and 3) the quantity of papers that explicitly or implicitly deal with metadata issues has significantly increased.

These insights coincide with a survey conducted by The Economist (2010) that comes to the conclusion that metadata has become a key enabler in the creation of controllable and exploitable information ecosystems under highly networked circumstances. Metadata provides information about data, objects and concepts. This information can be descriptive, structural or administrative. Metadata adds value to data sets by providing structure (e.g. schemas) and increasing the expressivity (e.g. controlled vocabularies) of a dataset.

According to Weibel & Lagoze (1997, p. 177):

“[the] association of standardized descriptive metadata with networked objects has the potential for substantially improving resource discovery capabilities by enabling field-based (e.g., author, title) searches, permitting indexing of non-textual objects, and allowing access to the surrogate content that is distinct from access to the content of the resource itself.”

These trends influence the functional requirements of the next generation’s Knowledge Organization Systems (KOSs) as a support infrastructure for knowledge sharing and knowledge creation under conditions of distributed intelligence and competence.

Go to previous posts in this series:
Thoughts on KOS (Part 1): Getting to grips with “semantic” interoperability or
Thoughts on KOS (Part 2): Classifying Knowledge Organisation Systems



Saumure, Kristie; Shiri, Ali (2008). Knowledge organization trends in library and information studies: a preliminary comparison of pre- and post-web eras. In: Journal of Information Science, 34/5, 2008, pp. 651–666

The Economist (2010). Data, data everywhere. A special report on managing information., accessed 2013-03-10

Weibel, S. L., & Lagoze, C. (1997). An element set to support resource discovery. In: International Journal on Digital Libraries, 1/2, pp. 176-187

Categories: Blogroll

How the Tech Media Keeps Artificial Intelligence at a Distance

Data Mining Blog - Mon, 2015-05-04 16:30

In sympathy with yesterday's post about AI as presented in films, consider this recent article from the Wall Street Journal: Artificial Intelligence Experts are in High Demand. A list of mostly machine learning experts is produced as evidence for the topic of the article. There is an unfortunate trend being presented to the public in this space in which the term 'artificial intelligence' is being used to draw readers with stories of real technical achievements in the space of machine learning and machine perception (recognizing a cat in an image is not an act of artificial intelligence), movies are being produced that romanticize a form of unobtainable AI, and the two are being tied together with stories of impending doom (Musk, Hawking).

All this is done with little or no investment in helping us establish what we really mean - and need - in an artificial intelligence.

If artificial intelligence experts were in high demand, then linguists, philosophers, sociologists, etc. should be very happy - not just ML peeps.

Categories: Blogroll

How Hollywood Keeps Artificial Intelligence at a Distance

Data Mining Blog - Sun, 2015-05-03 23:40

When something doesn't exist (like artificial intelligence) it's easy to think that there is some missing piece of magic required to bring it into existence. There has been a growing interest in movie depictions of AI of late, and these all seem to require some sort of non-linear step to realize this technology.

  • Ex Machina (which I really enjoyed) required a new sort of hard/software in the form of a jelly-like substance.
  • Chappie (which I also liked, though I generally prefer cheese and ham combined in a sandwich) required 'terabytes of coding' and a good amount of luck to produce its AI.
  • Age of Ultron (a film about one-liners and explosions) required a magic jewel from Loki's staff, no less, to create its AI.
  • Transcendence (Kurzweil summarized) gives up on AI and simply loads a human brain into the ether.

The message in all of these movies is - the reason we don't have AI is that we haven't taken some non-linear step.

Categories: Blogroll

Artificial Intelligence and Economics

Data Mining Blog - Sat, 2015-05-02 17:59

There are lots of articles online of the form - humanity has built some amazing things, so why haven't we produced an artificial intelligence? These articles (here's an example) often include some discussion of a task that even a very young human can perform - e.g. look at an image and describe what objects are in it (though personally I don't believe this is an indication of intelligence).

I believe that one of the key reasons that we can produce algorithms that can design super efficient jet engines, but not systems with the rudimentary common sense reasoning capabilities of a 3 year old is the economic context that drives innovation. The airline industry has an annual revenue of 700 billion dollars. No one has (yet) articulated the revenue potential of a technology that can look at a photograph and tell you what is in it.

The most important (and possibly last) step in the development of artificial intelligence is to create the economic engine that will require and motivate it.

Categories: Blogroll

SWC’s Semantic Event Recommendations

Semantic Web Company - Mon, 2015-04-27 09:51

Just a couple of years ago critics argued that the semantic approach in IT wouldn’t make the transformation from an inspiring academic discipline to a relevant business application. They were wrong! With the digitalization of business, the power of semantic solutions to handle Big Data became obvious.

Thanks to a dedicated global community of semantic technology experts, we can observe a rapid development of software solutions in this field. The progress is coupled to a fast growing number of corporations that are implementing semantic solutions to win insights from existing but unused data.

Knowledge transfer is extremely important in semantics. Let's have a look at the community calendar for the upcoming months. We are looking forward to sharing our experiences and learning. Join us!


>> Semantics technology event calendar


Categories: Blogroll

Randomized experimentation

Machine Learning Blog - Wed, 2015-04-22 10:39

One good thing about doing machine learning at present is that people actually use it! The back-ends of many systems we interact with on a daily basis are driven by machine learning. In most such systems, as users interact with the system, it is natural for the system designer to wish to optimize the models under the hood over time, in a way that improves the user experience. To ground the discussion a bit, let us consider the example of an online portal that is trying to present interesting news stories to its users. A user comes to the portal and, based on whatever information the portal has on the user, it recommends one (or more) news stories. The user chooses to read the story or not and life goes on. Naturally the portal wants to better tailor the stories it displays to the users' taste over time, and success can be observed if users start to click on the displayed story more often.

A natural idea would be to use the past logs and train a machine learning model which prefers the stories that users click on and discourages the stories which are avoided by the users. This sounds like a simple classification problem, for which we might use an off-the-shelf algorithm. This is indeed done reasonably often, and the offline logs suggest that the newly trained model will result in a lot more clicks than the old one. The new model is deployed, only to find out its performance is not as good as hoped, or even poorer than what was happening before! What went wrong? The natural reaction is typically that (a) the machine learning algorithm needs to be improved, or (b) we need better features, or (c) we need more data. Alas, in most of these cases, the right answer is (d) none of the above. Let us see why this is true through a simple example.

Imagine a simple world where some of our users are from New York and others are from Seattle. Some of our news stories pertain to finance, and others pertain to technology. Let us further imagine that the probability of a click (henceforth CTR for clickthrough rate) on a news article based on city and subject has the following distribution:

City      Finance CTR  Tech CTR
New York  1            0.6
Seattle   0.4          0.79

Table 1: True (unobserved) CTRs

Of course, we do not have this information ahead of time while designing the system, so our starting system recommends articles according to some heuristic rule. Imagine that we use the rule:

  • New York users get Tech stories, Seattle users get Finance stories.

Now we collect the click data according to this system for a while. As we obtain more and more data, we obtain increasingly accurate estimates of the CTR for Tech stories and NY users, as well as Finance stories and Seattle users (0.6 and 0.4 resp.). However, we have no information on the other two combinations. So if we train a machine learning algorithm to minimize the squared loss between predicted CTR on an article and observed CTR, it is likely to predict the average of observed CTRs (that is 0.5) in the other two blocks. At this point, our guess looks like:


City      Finance CTR      Tech CTR
New York  1 / ? / 0.5      0.6 / 0.6 / 0.6
Seattle   0.4 / 0.4 / 0.4  0.79 / ? / 0.5

Table 2: True / observed / estimated CTRs
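Why the average? A minimal sketch, assuming the learner has nothing cell-specific to go on for the unobserved cells and so predicts a single constant there: the constant that minimizes squared loss against the observed CTRs is their mean. The grid search and the names `observed`, `squared_loss` and `best_constant` are illustrative choices, not something from the post.

```python
# Observed CTRs under the initial rule: Tech in New York, Finance in Seattle.
observed = [0.6, 0.4]

def squared_loss(constant):
    """Total squared error of predicting one constant for every observed cell."""
    return sum((constant - ctr) ** 2 for ctr in observed)

# Try every constant from 0.00 to 1.00 in steps of 0.01 and keep the best one.
candidates = [i / 100 for i in range(101)]
best_constant = min(candidates, key=squared_loss)
mean_observed = sum(observed) / len(observed)

print(best_constant, mean_observed)
```

Both values agree, which is where the 0.5 entries in the estimated column come from.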

Note that this would be the case even with infinite data and an all powerful learner, so machine learning is not to be faulted in any way here. Given these estimates, we naturally realize that showing Finance articles to Seattle users was a mistake, and switch to Tech. But Tech is also looking pretty good in NY, and we stick with it. Our new policy is:

  • Both NY and Seattle users get Tech articles.

Running the new system for a while, we will fix the erroneous estimate for the Tech CTR in Seattle (that is, up from 0.5 to 0.79). But we still have no signal that makes us prefer Finance over Tech in NY. Indeed even with infinite data, the system will be stuck with this suboptimal choice at this point, and our CTR estimates will look something like:

City      Finance CTR      Tech CTR
New York  1 / ? / 0.59     0.6 / 0.6 / 0.6
Seattle   0.4 / 0.4 / 0.4  0.79 / 0.79 / 0.79

Table 3: True / observed / estimated CTRs

We can now assess the earlier claims:

  1. More data does not help: Since Observed and True CTRs match wherever we are collecting data
  2. Better learning algorithm does not help: Since Predicted and Observed CTRs coincide wherever we are collecting data
  3. Better data does help! We should not have a blank cell in the observed column.
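The rounds walked through above can be replayed in a short sketch. It assumes the infinite-data limit (an observed CTR equals the true CTR), fills unobserved cells with the mean of all observed CTRs, and always switches to the topic with the highest estimate. All names are hypothetical, and the mean-fill rule is a simplification of whatever learner would really be used.

```python
# True CTRs from Table 1; the system only sees the cells it actually shows.
TRUE_CTR = {
    ("New York", "Finance"): 1.0,
    ("New York", "Tech"): 0.6,
    ("Seattle", "Finance"): 0.4,
    ("Seattle", "Tech"): 0.79,
}
CITIES = ("New York", "Seattle")
TOPICS = ("Finance", "Tech")

def estimate(observed):
    """Observed CTR where we have one, otherwise the mean of all observed CTRs."""
    fallback = sum(observed.values()) / len(observed)
    return {cell: observed.get(cell, fallback) for cell in TRUE_CTR}

def greedy(estimates):
    """For each city, pick the topic with the highest estimated CTR."""
    return {c: max(TOPICS, key=lambda t: estimates[(c, t)]) for c in CITIES}

policy = {"New York": "Tech", "Seattle": "Finance"}  # initial heuristic rule
observed = {}
history = []
for round_number in range(1, 4):
    # Deploy the current policy and log the cells it visits.
    observed.update({(c, t): TRUE_CTR[(c, t)] for c, t in policy.items()})
    estimates = estimate(observed)
    policy = greedy(estimates)
    history.append(dict(policy))
    print(round_number, policy)
```

After the first round the policy becomes Tech for both cities and never changes again: the New York Finance cell stays unobserved, and its filled-in estimate (about 0.597 under these assumptions, listed as 0.59 in Table 3) remains just below the 0.6 observed for Tech.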

This seems simple enough to fix though. We should have really known better than to completely omit observations in one cell of our table. With good intentions, we decide to collect data in all cells. We choose to use the following rule:

  • Seattle users get Tech stories during day and finance stories during night
  • Similarly, NY users get Tech stories during day and finance stories during night

We are now collecting data on each cell, but we find that our estimates still lead us to a suboptimal policy. Further investigation might reveal that users are more likely to read finance stories during the day when the markets are open. So when we only display finance stories during night, we underestimate the finance CTR and end up with wrong estimates. Realizing the error of our ways, we might try to fix this again and then run into another problem and so on.
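A small worked example of this effect, using made-up numbers: the post gives no day and night CTRs, so every value in `CTR` below is hypothetical, as is the assumption of equal traffic by day and by night.

```python
# Hypothetical CTRs for a single city, split by time of day (NOT from the post).
CTR = {
    ("Finance", "day"): 0.9,
    ("Finance", "night"): 0.3,
    ("Tech", "day"): 0.5,
    ("Tech", "night"): 0.5,
}

# Deterministic logging rule from the text: Tech by day, Finance by night.
LOGGING_RULE = {"day": "Tech", "night": "Finance"}

def logged_estimate(topic):
    """Average CTR over only the time slots in which the rule shows this topic."""
    slots = [slot for slot, shown in LOGGING_RULE.items() if shown == topic]
    return sum(CTR[(topic, slot)] for slot in slots) / len(slots)

def true_average(topic):
    """CTR averaged over day and night, assuming equal traffic in both."""
    return (CTR[(topic, "day")] + CTR[(topic, "night")]) / 2

for topic in ("Finance", "Tech"):
    print(topic, logged_estimate(topic), true_average(topic))
```

With these numbers the logs rank Tech above Finance even though Finance is better on average, purely because Finance was only ever shown at the time of day when it performs worst.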

The issue we have discovered above is that of confounding variables. There is a lot of wonderful work and many techniques that can be used to circumvent confounding variables in experimentation. Here, I mention the simplest and perhaps the most versatile of them: randomization. The idea is that instead of recommending stories to users according to a fixed deterministic rule, we allow for different articles to be presented to the user according to some distribution. This distribution does not have to be uniform. In fact, good randomization would likely focus on plausibly good articles so as to not degrade the user experience. However, as long as we add sufficient randomization, we can then obtain consistent counterfactual estimates of quantities from our experimental data. There is growing literature on how to do this well. A nice paper which covers some of these techniques and provides an empirical evaluation is A more involved example in the context of computational advertising at Microsoft is discussed in
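To make the idea of counterfactual estimates from randomized logs concrete, here is a minimal simulation sketch. It uses inverse propensity scoring, one common estimator that the post does not name explicitly, and it assumes a logging policy that follows the original heuristic 90% of the time and shows the other topic 10% of the time. The exploration rate, the sample size and all function names are illustrative assumptions.

```python
import random

# True CTRs from Table 1, used here only to simulate clicks.
TRUE_CTR = {
    ("New York", "Finance"): 1.0,
    ("New York", "Tech"): 0.6,
    ("Seattle", "Finance"): 0.4,
    ("Seattle", "Tech"): 0.79,
}
CITIES = ("New York", "Seattle")
TOPICS = ("Finance", "Tech")

def logging_probs(city):
    """Assumed randomized logging policy: 90% old heuristic, 10% the other topic."""
    preferred = "Tech" if city == "New York" else "Finance"
    return {topic: (0.9 if topic == preferred else 0.1) for topic in TOPICS}

def collect_logs(n, rng):
    """Simulate n visits, recording the probability with which each topic was shown."""
    logs = []
    for _ in range(n):
        city = rng.choice(CITIES)
        probs = logging_probs(city)
        topic = rng.choices(TOPICS, weights=[probs[t] for t in TOPICS])[0]
        click = 1 if rng.random() < TRUE_CTR[(city, topic)] else 0
        logs.append((city, topic, probs[topic], click))
    return logs

def ips_value(logs, target_policy):
    """Estimate the CTR of target_policy by reweighting matching logged clicks by 1/p."""
    total = 0.0
    for city, topic, propensity, click in logs:
        if target_policy[city] == topic:
            total += click / propensity
    return total / len(logs)

rng = random.Random(0)
logs = collect_logs(100_000, rng)
all_tech = {"New York": "Tech", "Seattle": "Tech"}
finance_in_ny = {"New York": "Finance", "Seattle": "Tech"}
print(ips_value(logs, all_tech), ips_value(logs, finance_in_ny))
```

With equal traffic from both cities, the true values of the two policies are 0.695 and 0.895, and the estimates should land close to them. The point of the sketch is that the logs now carry signal about Finance in New York even though the logging policy rarely shows it. A smaller exploration rate would degrade the user experience less, at the cost of noisier estimates.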



Categories: Blogroll

Thoughts on KOS (Part 2): Classifying Knowledge Organisation Systems

Semantic Web Company - Tue, 2015-04-21 11:06

Traditional KOSs include a broad range of system types from term lists to classification systems and thesauri. These organization systems vary in functional purpose and semantic expressivity. Most of these traditional KOSs were developed in a print and library environment. They have been used to control the vocabulary used when indexing and searching a specific product, such as a bibliographic database, or when organizing a physical collection such as a library (Hodge et al. 2000).

KOS in the era of the Web

With the proliferation of the World Wide Web, new forms of knowledge organization principles emerged based on hypertextuality, modularity, decentralisation and protocol-based machine communication (Berners-Lee 1998). New forms of KOSs emerged like folksonomies, topic maps and knowledge graphs, also commonly and broadly referred to as ontologies[1].

With reference to Gruber’s (1993/1993a) classic definition:

“a common ontology defines the vocabulary with which queries and assertions are exchanged among agents” based on “ontological commitments to use the shared vocabulary in a coherent and consistent manner.”

From a technological perspective ontologies function as an integration layer for semantically interlinked concepts with the purpose of improving the machine-readability of the underlying knowledge model. Ontologies leverage interoperability from a syntactic to a semantic level for the purpose of knowledge sharing. According to Hodge et al. (2003)

“semantic tools emphasize the ability of the computer to process the KOS against a body of text, rather than support the human indexer or trained searcher. These tools are intended for use in the broader, more uncontrolled context of the Web to support information discovery by a larger community of interest or by Web users in general.” (Hodge et al. 2003)

In other words ontologies are being considered valuable to classifying web information in that they aid in enhancing interoperability – bringing together resources from multiple sources (Saumure & Shiri 2008, p. 657).

Which KOS serves your needs?

Schaffert et al. (2005) introduce a model to classify ontologies along their scope, acceptance and expressivity, as can be seen in the figure below.


According to this model the design of KOSs has to take account of the user group (acceptance model), the nature and abstraction level of knowledge to be represented (model scope) and the adequate formalism to represent knowledge for specific intellectual purposes (level of expressiveness). Although the proposed classification leaves room for discussion, it can help to distinguish various KOSs from each other and gain a better insight into the architecture of functionally and semantically intertwined KOSs. This is especially important under conditions of interoperability.

[1] It must be critically noted that the inflationary usage of the term “ontology” often in neglect of its philosophical roots has not necessarily contributed to a clarification of the concept itself. A detailed discussion of this matter is beyond the scope of this post. In this paper the author refers to Gruber’s (1993a) definition of ontology as “an explicit specification of a conceptualization”, which is commonly being referred to in artificial intelligence research.

The next post will look at trends in knowledge organization before and after the emergence of the World Wide Web.

Go to the previous post: Thoughts on KOS (Part 1): Getting to grips with “semantic” interoperability


Gruber, Thomas R. (1993). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In International Journal Human-Computer Studies 43, pp. 907-928.

Gruber, Thomas R. (1993a). A translation approach to portable ontologies. In: Knowledge Acquisition, 5/2, pp. 199-220

Hodge, Gail (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. In: First Digital Library Federation electronic edition, September 2008. Originally published in trade paperback in the United States by the Digital Library Federation and the Council on Library and Information Resources, Washington, D.C., 2000

Hodge, Gail M.; Zeng, Marcia Lei; Soergel, Dagobert (2003). Building a Meaningful Web: From Traditional Knowledge Organization Systems to New Semantic Tools. In: Proceedings of the 2003 Joint Conference on Digital Libraries (JCDL’03), IEEE

Saumure, Kristie; Shiri, Ali (2008). Knowledge organization trends in library and information studies: a preliminary comparison of pre- and post-web eras. In: Journal of Information Science, 34/5, 2008, pp. 651–666

Schaffert, Sebastian; Gruber, Andreas; Westenthaler, Rupert (2005). A Semantic Wiki for Collaborative Knowledge Formation. In: Reich, Siegfried; Güntner, Georg; Pellegrini, Tassilo; Wahler, Alexander (Eds.). Semantic Content Engineering. Linz: Trauner, pp. 188-202

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 13:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 12:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 13:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Ross Levinsohn, EVP, Yahoo (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:46
Perhaps no man is braver than Ross Levinsohn, at least at Web 2. First of all, he's the top North American executive at a long-besieged and currently leaderless company, and second because he has not backed out of our conversation on Day One (this coming Monday). I spoke to Ross... (Go to Searchblog Main)
Categories: Blogroll