Semantic Software Lab
Concordia University
Montréal, Canada


The Economist gets in on the AI Fluff

Data Mining Blog - Sun, 2015-05-10 22:54

The Economist leads with an editorial and an article on The Dawn of Artificial Intelligence.

The editorial starts off with:

“THE development of full artificial intelligence could spell the end of the human race,” Stephen Hawking warns. Elon Musk fears that the development of artificial intelligence, or AI, may be the biggest existential threat humanity faces. Bill Gates urges people to beware of it.

Dread that the abominations people create will become their masters, or their executioners, is hardly new. But voiced by a renowned cosmologist, a Silicon Valley entrepreneur and the founder of Microsoft—hardly Luddites—and set against the vast investment in AI by big firms like Google and Microsoft, such fears have taken on new weight. With supercomputers in every pocket and robots looking down on every battlefield, just dismissing them as science fiction seems like self-deception. The question is how to worry wisely.

While the three titans mentioned here are of undeniable intellect, it is not clear to me why two of them are relevant at all. I consider myself reasonably smart, but should I be quoted on my opinion of current treatments of toxoplasmosis? The argument "a smart person says X, therefore we should attend" is known as the appeal to false authority (i.e. a fallacious argument). Such arguments introduce another argumentative crime: the exclusion of actual authority. Eric Horvitz comes to mind:

The head of Microsoft’s main research lab has dismissed fears that artificial intelligence could pose a threat to the survival of the human race.

Eric Horvitz believed that humans would not “lose control of certain kinds of intelligences”, adding: “In the end we’ll be able to get incredible benefits from machine intelligence in all realms of life, from science to education to economics to daily life.”

The article then stumbles along confusing advances in machine perception and machine learning with intelligence. Predictably, the Kasparov vs Deep Blue achievement is trundled out:

Yet AI is already powerful enough to make a dramatic difference to human life. It can already enhance human endeavour by complementing what people can do. Think of chess, which computers now play better than any person.

This suggests that any computer can now beat all humans. Deep Blue was a singular machine which was dismantled after the match. And how does chess playing make a dramatic difference to human life?

A well informed article - and one that is actually interesting to read - should lead with a concrete set of definitions regarding the components of artificial intelligence (perception, reasoning, etc.) and discuss the discrete advances in those areas. Such an article should discuss the challenges of bringing those things together in anything other than the most rudimentary ways they are combined currently. An interesting article on AI would discuss the economics that drive investments and how they dictate what areas are worked on and which aren't.

In all these articles, there is a vague notion of inevitability, as if the machines themselves were selectively investing in research areas, or as if we could stand back and do nothing and this scary version of AI would emerge. It's as if the article were suggesting that landing on the moon was inevitable. The fact that humans did this and then never returned indicates that it was an incredibly intentional endeavor.


Categories: Blogroll

AI, Artificial Birds and Aeroplanes

Data Mining Blog - Sat, 2015-05-09 14:50

The Turing Test for artificial intelligence is a reasonably well understood idea: if, through a written form of communication, a machine can convince a human that it too is a human, then it passes the test. The elegance of this approach (which I believe is its primary attraction) is that it avoids any troublesome definition of intelligence and appeals to an innate ability in humans to detect entities which are not 'one of us'.

This form of AI is the one that is generally presented in entertainment (films, novels, etc.).

However, to an engineer, there are some problems with this as the accepted popular idea of artificial intelligence.

I believe that software engineering can be evaluated with a simple measure of productivity. We either create things that make the impossible possible (going from 0 to 1), or we create things that amplify some value, generally a human's ability to do something (going from X to nX). In other words, we enable a new thing, or we multiply our ability to do something.

Turing AI, while clearly an interesting intellectual concept, is like building an artificial bird instead of building an aeroplane:

  • A Turing AI can converse in natural language, but humans can't speak charts or holograms as a means to explain something.
  • A Turing AI can read a book and appear to understand it, but it can't read a thousand research articles on cancer and find a connection between discoveries that results in a breakthrough.
  • A Turing AI doesn't even have to be particularly intelligent (though it probably ought to at least appear self aware, reflective, etc.) and so would potentially be like making a very poor hire for your team.

I believe that if we consider opportunities for applying 'AI methods' to the vast corpus of data (both in natural language and in various structured forms) on the web, we will realize that there is an economic motivation (i.e. build value for users that can build a user base) that will require all the generally accepted facets of an AI (reasoning, perception, communication with humans, theory of mind, etc.) but will be nothing like a Turing AI.

Rather than think of search engines - the fundamental agents that mediate the web corpus - as mechanisms to help humans find the 'right' document, I believe it is time to change our intentions to: optimize the value of the data on the internet for all mankind.

When we achieve this, we will have built an AI, but it won't be a Turing AI and it may not even pass the Turing Test.

Categories: Blogroll

How to Understand Computers in Film

Data Mining Blog - Wed, 2015-05-06 19:21

When we see an act of programming, screeds of code or other interactions with computers in movies, software engineers are likely to roll their eyes.

  • When Chappie's coder has to write 'terabytes of code'
  • When Ford's computer guy has to 'write a special program' to crack a password in one of the Jack Ryan movies
  • etc.

I rolled my eyes at these.

But I also realize that these interactions are just symbols. They are place holders for 'someone doing some coding'. If we actually saw someone doing some coding, I think we'd roll our eyes for another reason.

This model probably works for any exposure of any technical area in a movie from coding to cooking to sailing to farming. Rather than criticizing the creators of the movie for lack of research, it is so much easier to recognize these moments as symbolic fillers with a dramatic twist.

Categories: Blogroll

Thoughts on KOS (Part 3): Trends in knowledge organization

Semantic Web Company - Tue, 2015-05-05 11:13

The accelerating pace of change in the economic, legal and social environment, combined with tendencies towards increased decentralization of organizational structures, has had a profound impact on the way we organize and utilize knowledge. The internet as we know it today, and especially the World Wide Web as the multimodal interface for the presentation and consumption of multimedia information, are the most prominent examples of these developments. To illustrate the impact of new communication technologies on information practices, Saumure & Shiri (2008) conducted a survey on knowledge organization trends in the Library and Information Sciences before and after the emergence of the World Wide Web. Table 1 shows their results.








The survey illustrates three major trends: 1) the spectrum of research areas has broadened significantly from originally complex and expert-driven methodologies and systems to more light-weight, application-oriented approaches; 2) while certain research areas have kept their status over the years (e.g. Cataloguing & Classification or Machine Assisted Knowledge Organization), new areas of research have gained importance (e.g. Metadata Applications & Uses, Classifying Web Information, Interoperability Issues) while formerly prevalent topics like Cognitive Models or Indexing have declined in importance or dissolved into other areas; and 3) the quantity of papers that explicitly or implicitly deal with metadata issues has significantly increased.

These insights coincide with a survey conducted by The Economist (2010) that comes to the conclusion that metadata has become a key enabler in the creation of controllable and exploitable information ecosystems under highly networked circumstances. Metadata provide information about data, objects and concepts. This information can be descriptive, structural or administrative. Metadata adds value to data sets by providing structure (i.e. schemas) and increasing the expressivity (i.e. controlled vocabularies) of a dataset.

According to Weibel & Lagoze (1997, p. 177):

“[the] association of standardized descriptive metadata with networked objects has the potential for substantially improving resource discovery capabilities by enabling field-based (e.g., author, title) searches, permitting indexing of non-textual objects, and allowing access to the surrogate content that is distinct from access to the content of the resource itself.”

These trends influence the functional requirements of the next generation’s Knowledge Organization Systems (KOSs) as a support infrastructure for knowledge sharing and knowledge creation under conditions of distributed intelligence and competence.

Go to previous posts in this series:
Thoughts on KOS (Part1): Getting to grips with “semantic” interoperability or
Thoughts on KOS (Part 2): Classifying Knowledge Organisation Systems



Saumure, Kristie; Shiri, Ali (2008). Knowledge organization trends in library and information studies: a preliminary comparison of pre- and post-web eras. In: Journal of Information Science, 34/5, 2008, pp. 651–666

The Economist (2010). Data, data everywhere. A special report on managing information., accessed 2013-03-10

Weibel, S. L., & Lagoze, C. (1997). An element set to support resource discovery. In: International Journal on Digital Libraries, 1/2, pp. 176-187

Categories: Blogroll

How the Tech Media Keeps Artificial Intelligence at a Distance

Data Mining Blog - Mon, 2015-05-04 16:30

In sympathy with yesterday's post about AI as presented in films, consider this recent article from the Wall Street Journal: Artificial Intelligence Experts are in High Demand. A list of mostly machine learning experts is produced as evidence for the topic of the article. An unfortunate trend is being presented to the public in this space: the term 'artificial intelligence' is used to draw readers to stories of real technical achievements in machine learning and machine perception (recognizing a cat in an image is not an act of artificial intelligence), movies are produced that romanticize a form of unobtainable AI, and the two are tied together with stories of impending doom (Musk, Hawking).

All this is done with little or no investment in helping us establish what we really mean - and need - in an artificial intelligence.

If artificial intelligence experts were in high demand, then linguists, philosophers, sociologists, etc. should be very happy - not just ML peeps.

Categories: Blogroll

How Hollywood Keeps Artificial Intelligence at a Distance

Data Mining Blog - Sun, 2015-05-03 23:40

When something doesn't exist (like artificial intelligence) it's easy to think that there is some missing piece of magic required to bring it into existence. There has been a growing interest in movie depictions of AI of late, and these all seem to require some sort of non-linear step to realize this technology.

  • Ex Machina (which I really enjoyed) required a new sort of hard/software in the form of a jelly like substance.
  • Chappie (which I also liked, though I generally prefer cheese and ham combined in a sandwich) required 'terabytes of coding' and a good amount of luck to produce its AI.
  • Age of Ultron (a film about one liners and explosions) required a magic jewel from Loki's staff no less to create its AI.
  • Transcendence (Kurzweil summarized) gives up on AI and simply loads a human brain into the ether.

The message in all of these movies is - the reason we don't have AI is that we haven't taken some non-linear step.

Categories: Blogroll

Artificial Intelligence and Economics

Data Mining Blog - Sat, 2015-05-02 17:59

There are lots of articles online of the form - humanity has built some amazing things, so why haven't we produced an artificial intelligence? These articles (here's an example) often include some discussion of a task that even a very young human can perform - e.g. look at an image and describe what objects are in it (though personally I don't believe this is an indication of intelligence).

I believe that one of the key reasons that we can produce algorithms that can design super efficient jet engines, but not systems with the rudimentary common sense reasoning capabilities of a 3 year old is the economic context that drives innovation. The airline industry has an annual revenue of 700 billion dollars. No one has (yet) articulated the revenue potential of a technology that can look at a photograph and tell you what is in it.

The most important (and possibly last) step in the development of artificial intelligence is to create the economic engine that will require and motivate it.

Categories: Blogroll

SWC’s Semantic Event Recommendations

Semantic Web Company - Mon, 2015-04-27 09:51

Just a couple of years ago critics argued that the semantic approach in IT wouldn’t make the transformation from an inspiring academic discipline to a relevant business application. They were wrong! With the digitalization of business, the power of semantic solutions to handle Big Data became obvious.

Thanks to a dedicated global community of semantic technology experts, we can observe a rapid development of software solutions in this field. The progress is coupled to a fast growing number of corporations that are implementing semantic solutions to win insights from existing but unused data.

Knowledge transfer is extremely important in semantics. Let's have a look at the community calendar for the upcoming months. We are looking forward to sharing our experiences and learning. Join us!


>> Semantics technology event calendar


Categories: Blogroll

Randomized experimentation

Machine Learning Blog - Wed, 2015-04-22 10:39

One good thing about doing machine learning at present is that people actually use it! The back-ends of many systems we interact with on a daily basis are driven by machine learning. In most such systems, as users interact with the system, it is natural for the system designer to wish to optimize the models under the hood over time, in a way that improves the user experience. To ground the discussion a bit, let us consider the example of an online portal that is trying to present interesting news stories to its users. A user comes to the portal and, based on whatever information the portal has on the user, it recommends one (or more) news stories. The user chooses to read the story or not and life goes on. Naturally the portal wants to better tailor the stories it displays to the users' taste over time, which can be observed if users start to click on the displayed story more often.

A natural idea would be to use the past logs and train a machine learning model which prefers the stories that users click on and discourages the stories which are avoided by the users. This sounds like a simple classification problem, for which we might use an off-the-shelf algorithm. This is indeed done reasonably often, and the offline logs suggest that the newly trained model will result in a lot more clicks than the old one. The new model is deployed, only to find out its performance is not as good as hoped, or even poorer than what was happening before! What went wrong? The natural reaction is typically that (a) the machine learning algorithm needs to be improved, or (b) we need better features, or (c) we need more data. Alas, in most of these cases, the right answer is (d) none of the above. Let us see why this is true through a simple example.

Imagine a simple world where some of our users are from New York and others are from Seattle. Some of our news stories pertain to finance, and others pertain to technology. Let us further imagine that the probability of a click (henceforth CTR for clickthrough rate) on a news article based on city and subject has the following distribution:

City        Finance CTR    Tech CTR
New York    1              0.6
Seattle     0.4            0.79

Table 1: True (unobserved) CTRs

Of course, we do not have this information ahead of time while designing the system, so our starting system recommends articles according to some heuristic rule. Imagine that we use the rule:

  • New York users get Tech stories, Seattle users get Finance stories.

Now we collect the click data according to this system for a while. As we obtain more and more data, we obtain increasingly accurate estimates of the CTR for Tech stories and NY users, as well as Finance stories and Seattle users (0.6 and 0.4 resp.). However, we have no information on the other two combinations. So if we train a machine learning algorithm to minimize the squared loss between predicted CTR on an article and observed CTR, it is likely to predict the average of observed CTRs (that is 0.5) in the other two blocks. At this point, our guess looks like:


City        Finance CTR        Tech CTR
New York    1 / ? / 0.5        0.6 / 0.6 / 0.6
Seattle     0.4 / 0.4 / 0.4    0.79 / ? / 0.5

Table 2: True / observed / estimated CTRs

Note that this would be the case even with infinite data and an all powerful learner, so machine learning is not to be faulted in any way here. Given these estimates, we naturally realize that showing Finance articles to Seattle users was a mistake, and switch to Tech. But Tech is also looking pretty good in NY, and we stick with it. Our new policy is:

  • Both NY and Seattle users get Tech articles.

Running the new system for a while, we will fix the erroneous estimate for the Tech CTR in Seattle (that is, up from 0.5 to 0.79). But we still have no signal that makes us prefer Finance over Tech in NY. Indeed even with infinite data, the system will be stuck with this suboptimal choice at this point, and our CTR estimates will look something like:

City        Finance CTR        Tech CTR
New York    1 / ? / 0.59       0.6 / 0.6 / 0.6
Seattle     0.4 / 0.4 / 0.4    0.79 / 0.79 / 0.79

Table 3: True / observed / estimated CTRs

We can now assess the earlier claims:

  1. More data does not help: Since Observed and True CTRs match wherever we are collecting data
  2. Better learning algorithm does not help: Since Predicted and Observed CTRs coincide wherever we are collecting data
  3. Better data does help!! We should not have a blank cell in the observed column.
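The dynamics above can be replayed in a few lines of Python. This is only a minimal sketch under two stated assumptions: infinite data (so the observed CTR equals the true CTR in every cell the policy has visited) and a learner that fills each unobserved cell with the mean of the observed ones. The function names are illustrative and do not come from any library.

```python
# True CTRs from Table 1; the learner never sees these directly.
TRUE_CTR = {
    ("New York", "Finance"): 1.0,
    ("New York", "Tech"): 0.6,
    ("Seattle", "Finance"): 0.4,
    ("Seattle", "Tech"): 0.79,
}
CITIES = ("New York", "Seattle")
TOPICS = ("Finance", "Tech")


def estimate(observed):
    """Fill unobserved cells with the mean of the observed CTRs (assumption)."""
    fill = sum(observed.values()) / len(observed)
    return {cell: observed.get(cell, fill) for cell in TRUE_CTR}


def greedy_policy(estimates):
    """For each city, pick the topic with the highest estimated CTR."""
    return {
        city: max(TOPICS, key=lambda topic: estimates[(city, topic)])
        for city in CITIES
    }


def run(policy, rounds=5):
    """Alternate between deterministic data collection and greedy retraining."""
    observed = {}
    for _ in range(rounds):
        # Deterministic logging: only cells chosen by the policy get data.
        for city, topic in policy.items():
            observed[(city, topic)] = TRUE_CTR[(city, topic)]
        policy = greedy_policy(estimate(observed))
    return policy, estimate(observed)


final_policy, final_estimates = run({"New York": "Tech", "Seattle": "Finance"})
print(final_policy)
print(round(final_estimates[("New York", "Finance")], 3))
```

No matter how many rounds are run, the New York / Finance cell is never visited, so its estimate stays just below the Tech estimate of 0.6 and the policy never changes.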

This seems simple enough to fix though. We should have really known better than to completely omit observations in one cell of our table. With good intentions, we decide to collect data in all cells. We choose to use the following rule:

  • Seattle users get Tech stories during day and finance stories during night
  • Similarly, NY users get Tech stories during day and finance stories during night

We are now collecting data on each cell, but we find that our estimates still lead us to a suboptimal policy. Further investigation might reveal that users are more likely to read finance stories during the day when the markets are open. So when we only display finance stories during night, we underestimate the finance CTR and end up with wrong estimates. Realizing the error of our ways, we might try to fix this again and then run into another problem and so on.

The issue we have discovered above is that of confounding variables. There is a lot of wonderful work and many techniques that can be used to circumvent confounding variables in experimentation. Here, I mention the simplest and perhaps the most versatile one of them: randomization. The idea is that instead of recommending stories to users according to a fixed deterministic rule, we allow for different articles to be presented to the user according to some distribution. This distribution does not have to be uniform. In fact, good randomization would likely focus on plausibly good articles so as to not degrade the user experience. However, as long as we add sufficient randomization, we can then obtain consistent counterfactual estimates of quantities from our experimental data. There is a growing literature on how to do this well, including work that covers some of these techniques with an empirical evaluation, and a more involved example in the context of computational advertising at Microsoft.
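To make the idea of counterfactual estimation from randomized logs concrete, here is a small simulation. The post does not say which estimator it has in mind; this sketch assumes inverse propensity scoring (IPS), one standard choice, and an illustrative non-uniform logging distribution (Tech shown 80% of the time) that is not taken from the post. Users are assumed to come from the two cities with equal probability.

```python
import random

TRUE_CTR = {
    ("New York", "Finance"): 1.0,
    ("New York", "Tech"): 0.6,
    ("Seattle", "Finance"): 0.4,
    ("Seattle", "Tech"): 0.79,
}
# Assumed logging distribution: mostly Tech, sometimes Finance, in both cities.
LOGGING_PROBS = {"Finance": 0.2, "Tech": 0.8}


def log_interactions(n, rng):
    """Simulate n visits under the randomized logging policy."""
    logs = []
    for _ in range(n):
        city = rng.choice(("New York", "Seattle"))
        topic = "Finance" if rng.random() < LOGGING_PROBS["Finance"] else "Tech"
        click = 1 if rng.random() < TRUE_CTR[(city, topic)] else 0
        # Record the probability with which the shown topic was chosen.
        logs.append((city, topic, LOGGING_PROBS[topic], click))
    return logs


def ips_value(logs, target_policy):
    """IPS estimate of the CTR a deterministic target policy would achieve."""
    total = 0.0
    for city, topic, propensity, click in logs:
        if target_policy[city] == topic:
            total += click / propensity
    return total / len(logs)


rng = random.Random(0)
logs = log_interactions(100_000, rng)

stuck = {"New York": "Tech", "Seattle": "Tech"}
candidate = {"New York": "Finance", "Seattle": "Tech"}
value_stuck = ips_value(logs, stuck)
value_candidate = ips_value(logs, candidate)
print(f"stuck policy:     {value_stuck:.3f}")
print(f"candidate policy: {value_candidate:.3f}")
```

Because every city and topic combination is shown with known, non-zero probability, the same log can score policies that were never deployed, which is exactly what the deterministic rules above could not do.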



Categories: Blogroll

Thoughts on KOS (Part 2): Classifying Knowledge Organisation Systems

Semantic Web Company - Tue, 2015-04-21 11:06

Traditional KOSs include a broad range of system types from term lists to classification systems and thesauri. These organization systems vary in functional purpose and semantic expressivity. Most of these traditional KOSs were developed in a print and library environment. They have been used to control the vocabulary used when indexing and searching a specific product, such as a bibliographic database, or when organizing a physical collection such as a library (Hodge et al. 2000).

KOS in the era of the Web

With the proliferation of the World Wide Web, new forms of knowledge organization principles emerged based on hypertextuality, modularity, decentralisation and protocol-based machine communication (Berners-Lee 1998). New forms of KOSs emerged like folksonomies, topic maps and knowledge graphs, also commonly and broadly referred to as ontologies[1].

With reference to Gruber’s (1993/1993a) classic definition:

“a common ontology defines the vocabulary with which queries and assertions are exchanged among agents” based on “ontological commitments to use the shared vocabulary in a coherent and consistent manner.”

From a technological perspective, ontologies function as an integration layer for semantically interlinked concepts, with the purpose of improving the machine-readability of the underlying knowledge model. Ontologies lift interoperability from a syntactic to a semantic level for the purpose of knowledge sharing. According to Hodge et al. (2003)

“semantic tools emphasize the ability of the computer to process the KOS against a body of text, rather than support the human indexer or trained searcher. These tools are intended for use in the broader, more uncontrolled context of the Web to support information discovery by a larger community of interest or by Web users in general.” (Hodge et al. 2003)

In other words ontologies are being considered valuable to classifying web information in that they aid in enhancing interoperability – bringing together resources from multiple sources (Saumure & Shiri 2008, p. 657).

Which KOS serves your needs?

Schaffert et al. (2005) introduce a model to classify ontologies along their scope, acceptance and expressivity, as can be seen in the figure below.


According to this model the design of KOSs has to take account of the user group (acceptance model), the nature and abstraction level of knowledge to be represented (model scope) and the adequate formalism to represent knowledge for specific intellectual purposes (level of expressiveness). Although the proposed classification leaves room for discussion, it can help to distinguish various KOSs from each other and gain a better insight into the architecture of functionally and semantically intertwined KOSs. This is especially important under conditions of interoperability.

[1] It must be critically noted that the inflationary usage of the term “ontology” often in neglect of its philosophical roots has not necessarily contributed to a clarification of the concept itself. A detailed discussion of this matter is beyond the scope of this post. In this paper the author refers to Gruber’s (1993a) definition of ontology as “an explicit specification of a conceptualization”, which is commonly being referred to in artificial intelligence research.

The next post will look at trends in knowledge organization before and after the emergence of the World Wide Web.

Go to the previous post: Thoughts on KOS (Part1): Getting to grips with “semantic” interoperability


Gruber, Thomas R. (1993). Toward Principles for the Design of Ontologies Used for Knowledge Sharing. In International Journal Human-Computer Studies 43, pp. 907-928.

Gruber, Thomas R. (1993a). A translation approach to portable ontologies. In: Knowledge Acquisition, 5/2, pp. 199-220

Hodge, Gail (2000). Systems of Knowledge Organization for Digital Libraries: Beyond Traditional Authority Files. In: First Digital Library Federation electronic edition, September 2008. Originally published in trade paperback in the United States by the Digital Library Federation and the Council on Library and Information Resources, Washington, D.C., 2000

Hodge, Gail M.; Zeng, Marcia Lei; Soergel, Dagobert (2003). Building a Meaningful Web: From Traditional Knowledge Organization Systems to New Semantic Tools. In: Proceedings of the 2003 Joint Conference on Digital Libraries (JCDL’03), IEEE

Saumure, Kristie; Shiri, Ali (2008). Knowledge organization trends in library and information studies: a preliminary comparison of pre- and post-web eras. In: Journal of Information Science, 34/5, 2008, pp. 651–666

Schaffert, Sebastian; Gruber, Andreas; Westenthaler, Rupert (2005). A Semantic Wiki for Collaborative Knowledge Formation. In: Reich, Siegfried; Güntner, Georg; Pellegrini, Tassilo; Wahler, Alexander (Eds.). Semantic Content Engineering. Linz: Trauner, pp. 188-202

Categories: Blogroll

Thoughts on KOS (Part1): Getting to grips with “semantic” interoperability

Semantic Web Company - Fri, 2015-04-10 08:44

Enabling and managing interoperability at the data and the service level is one of the strategic key issues in networked knowledge organization systems (KOSs) and a growing issue in effective data management. But why do we need “semantic” interoperability and how can we achieve it?

Interoperability vs. Integration

The concept of (data) interoperability can best be understood in contrast to (data) integration. Integration refers to a process in which formerly distinct data sources and their representation models are merged into one newly consolidated data source. Interoperability, by contrast, keeps knowledge sources and their representation models structurally separate, but allows connectivity and interactivity between them through deliberately defined overlaps in the representation model. Under conditions of interoperability, data sources are designed to provide interfaces for connectivity, so that data can be shared and integrated on top of a common data model while the original principles of data and knowledge representation are left intact. Thus, interoperability is an efficient means to improve and ease the integration of data and knowledge sources.

Three levels of interoperability

When designing interoperable KOSs it is important to distinguish between structural, syntactic and semantic interoperability (Galinski 2006):

  • Structural interoperability is achieved by representing metadata using a shared data model like the Dublin Core Abstraction Model or RDF (Resource Description Framework).
  • Syntactic interoperability is achieved by serializing data in a shared mark-up language like XML, Turtle or N3.
  • Semantic interoperability is achieved by using a shared terminology or controlled vocabulary to label and classify metadata terms and relations.
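The three levels can be illustrated with a toy example. This is a deliberately simplified sketch, not a real RDF toolkit: triples are plain Python tuples, the serializer emits N-Triples-style lines for simple cases only, and the `example.org` resources and literal values are hypothetical. The Dublin Core terms namespace is the real one.

```python
# Semantic level: both sources label their properties with a shared
# vocabulary (Dublin Core terms) instead of private field names.
DCTERMS = "http://purl.org/dc/terms/"

# Structural level: both sources use the same data model, here
# subject-predicate-object triples as in RDF. The data is hypothetical.
library_source = {
    ("http://example.org/book/1", DCTERMS + "title", "An Example Title"),
}
archive_source = {
    ("http://example.org/book/1", DCTERMS + "creator", "A. N. Author"),
}


def to_ntriples(triples):
    """Syntactic level: serialize triples in one shared, line-based syntax.

    Simplification: any object starting with "http://" is treated as a
    resource, everything else as a plain string literal.
    """
    lines = []
    for subject, predicate, obj in sorted(triples):
        if obj.startswith("http://"):
            rendered = f"<{obj}>"
        else:
            escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
            rendered = f'"{escaped}"'
        lines.append(f"<{subject}> <{predicate}> {rendered} .")
    return "\n".join(lines)


# Because all three levels line up, combining the sources is a set union;
# neither source has to give up its own storage or representation.
merged = library_source | archive_source
serialized = to_ntriples(merged)
print(serialized)
```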

Given that metadata standards carry a lot of intrinsic legacy, it is sometimes very difficult to achieve interoperability at all three levels mentioned above. Metadata formats and models have grown historically: most of the time they are the result of community decision processes, they are often highly formalized for specific functional purposes, and they are deliberately rigid and difficult to change. Hence it is important to have a clear understanding and documentation of the application profile of a metadata format as a precondition for enabling interoperability at all three levels. Semantic Web standards do a really good job in this respect!!

In the next post, we will take a look at various KOSs and how they differ with respect to expressivity, scope and target group.

Categories: Blogroll

Transforming music data into a PoolParty project

Semantic Web Company - Thu, 2015-04-09 04:37

For the Nolde project it was requested to build a knowledge graph containing detailed information about the Austrian music scene: artists, bands and their music releases. We decided to use PoolParty, since these entities should be accessible in an editorial workflow. More details about the implementation will be provided in a later blog post.

In the first round I want to share my experiences with the mapping of music data into SKOS. Obviously, LinkedBrainz was the perfect source to collect and transform such data, since it is available as RDF/N-Triples dumps and even provides a SPARQL endpoint! LinkedBrainz data is modeled using the Music Ontology.

E.g. you can select all mo:MusicArtists with relation to Austria.

I loaded the LinkedBrainz dump files into a triple store, together with DBpedia dumps.

With two CONSTRUCT queries, I was able to collect the required data and transform it into SKOS, into a PoolParty compatible format:

Construct Artists

Every matching MusicArtist results in a SKOS concept. The foaf:name is mapped to skos:prefLabel (in German).

As you can see, I used Custom Schema features to provide self-describing metadata on top of pure SKOS features: a MusicBrainz link, a MusicBrainz Id, DBpedia link, homepage…

In addition you can see in the query that data from DBpedia was also collected. In case an owl:sameAs relationship to DBpedia exists, a possible abstract is retrieved. When a DBpedia abstract is available it is mapped to skos:definition.

Construct Releases (mo:SignalGroups) with relations to Artists

Similar to the Artists, a matching SignalGroup results in a SKOS Concept. A skos:related relationship is defined between an Artist and his Releases.
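The mapping rules described above can be sketched outside of SPARQL as well. The following Python function is not the CONSTRUCT query used in the project; it is a hypothetical illustration of the same rules on an in-memory record, with triples as plain tuples. The input dictionary layout, its keys, and the `example.org` URIs are assumptions made for this sketch; the SKOS and RDF namespaces are the real ones.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def artist_to_skos(artist):
    """Map one artist record (hypothetical dict layout) to SKOS triples."""
    uri = artist["uri"]
    triples = [
        # Every matching artist becomes a SKOS concept.
        (uri, RDF_TYPE, SKOS + "Concept"),
        # foaf:name is mapped to a German skos:prefLabel.
        (uri, SKOS + "prefLabel", (artist["name"], "de")),
    ]
    # A DBpedia abstract, if one was found via owl:sameAs, becomes the definition.
    abstract = artist.get("dbpedia_abstract")
    if abstract:
        triples.append((uri, SKOS + "definition", abstract))
    # Each release is a concept of its own, linked to the artist via skos:related.
    for release_uri in artist.get("releases", []):
        triples.append((release_uri, RDF_TYPE, SKOS + "Concept"))
        triples.append((uri, SKOS + "related", release_uri))
    return triples


example = {
    "uri": "http://example.org/artist/1",
    "name": "Example Artist",
    "dbpedia_abstract": "A placeholder abstract.",
    "releases": ["http://example.org/release/1"],
}
triples = artist_to_skos(example)
for triple in triples:
    print(triple)
```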


The SPARQL CONSTRUCT queries produced ttl files that could be imported directly into PoolParty, resulting in a project containing nearly 1,000 Artists and 10,000 Releases:


You can reach the knowledge graph by visiting the publicly available Linked Data Frontend of PoolParty:

E.g. you can find out details and links about Peter Alexander or Conchita Wurst.

Categories: Blogroll

Extracting Insights from Consumer Reviews

Life Analytics Blog - Mon, 2015-02-23 06:53
Here is one more example on how we can extract Insights from Consumer Reviews. This time we will use Reviews that were given for several Supplement Brands of Omega-3 Fish Oil.
For this example we analyze 4018 Reviews of Consumers who bought Omega-3 Supplements. Keep in mind that in most cases each Product Review has an associated Rating (usually given as 1-5 stars) which signifies the overall satisfaction of each Consumer. Therefore, after data collection of the Reviews and Ratings we have a file with the following entries per row:
[Text of Review,Rating]

The fact that a Customer also gives a Score can be especially helpful because we can identify the words and Phrases that differentiate Positive experiences (i.e. those having 5 Star Ratings) from the Negative Ones (we assume that any Review having a Rating of 4 stars or less is Negative). So for example, Positive Reviews may contain mostly words and phrases such as "Great", "Happy" and "Will buy again" whereas Negative Reviews may contain words and phrases such as "Never buying again", "not happy" or "damaged".
The tools used for this example are NLTK and Python. The code simply reads the reviews and their associated ratings and creates a Matrix with the same representation as the file it read.
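The loading step described above might look roughly like the following. This is a minimal sketch, not the original code: it uses only the Python standard library, and the sample rows, `load_reviews` and `label` are hypothetical names introduced for illustration.

```python
import csv
import io

# Hypothetical sample standing in for the real review file.
# Each row has the form [Text of Review, Rating].
SAMPLE = '''"Great product, will buy again",5
"Fishy after-taste, not happy",2
"Good value, however the capsules are large",4
'''

def load_reviews(handle):
    """Read [text, rating] rows into a list of (text, rating) tuples."""
    rows = []
    for row in csv.reader(handle):
        if not row:  # skip blank lines
            continue
        text, rating = row
        rows.append((text, int(rating)))
    return rows

def label(rating):
    """Rule used in this post: 5 stars is Positive, 4 or fewer is Negative."""
    return "positive" if rating == 5 else "negative"

reviews = load_reviews(io.StringIO(SAMPLE))
labels = [label(rating) for _, rating in reviews]
print(labels)
```

With a real file, `io.StringIO(SAMPLE)` would be replaced by an opened file handle.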

Next, we want to identify which Insights we can extract from this representation. For example:

-Identify which words commonly occur in 5-star reviews
-Identify which words commonly occur in Reviews with a rating of 4 Stars or Lower.
-Identify potentially Interesting Phrases and Words
-Extract term Co-Occurrences
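The first two items can be sketched as a simple comparison of term counts between the two groups of reviews. This is an illustration under assumptions, not the analysis actually run for this post: the four sample reviews are made up, the regular-expression tokenizer is a stand-in for an NLTK tokenizer, and `term_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Tiny made-up sample; the real analysis used 4018 reviews.
reviews = [
    ("Great product, will buy again", 5),
    ("Excellent quality, great price", 5),
    ("Fishy taste, not sure it works", 3),
    ("Bad taste and fishy odor", 2),
]

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())

def term_counts(rows, keep):
    """Count, for each term, how many of the selected reviews mention it."""
    counts = Counter()
    for text, rating in rows:
        if keep(rating):
            counts.update(set(tokenize(text)))
    return counts

positive = term_counts(reviews, lambda r: r == 5)
negative = term_counts(reviews, lambda r: r <= 4)

# Terms mentioned in more Negative than Positive reviews, most frequent first.
negative_terms = sorted(
    (t for t in negative if negative[t] > positive[t]),
    key=lambda t: (-negative[t], t),
)
print(negative_terms[:2])
```

Counting each term once per review (rather than once per occurrence) keeps a single long review from dominating the ranking; that is a design choice of this sketch.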

We start with terms occurring more frequently in Negative Reviews for Omega-3 Supplements. Here is what we've found:

So it appears that people tend to give negative Reviews when the Taste (and possibly After-Taste) is not quite right. A lot of people complain about a Fishy odor. Notice also that the 3rd term is sure, which we can assume originates from customers saying that they are not sure whether the Product works or not (notice also that the 4th term is yet). Some more terms to consider:

krill (a type of Oil which is an alternative Product to Omega-3 Supplementation)

Now let's look at the Terms associated with Positive Reviews:

great and excellent are terms that were expected to be found in Positive Reviews. Some terms to consider are:


We move on to identifying potentially interesting terms and Phrases. Here is a Screenshot from the Software that I used:

I added a Red Rectangle wherever sensitive information (such as Company Names) appears. That information is not relevant for the purpose of this post (but it certainly is relevant in a different setting).

We immediately see some interesting mentions, for example: Heavy Metal poisoning, Upset Stomach incidences, Cognitive Function, Joint Pains, Panic Attacks, Reasonably Priced Items, Postpartum Depression, Allergic Reactions, Speedy Delivery and Soft Gels that Stick together.

Recall that in a previous example we found that the term however occurs frequently within Negative Reviews. Some analysts may have chosen to treat this term as a stopword, which in this case would be a serious mistake. The reason is that the term however very often shows us why a product or service is not receiving a perfect rating, and vice-versa. Therefore, if a Data Scientist had chosen to exclude this term from the Analysis (stopwords are typically removed from the text), potentially interesting insights would never have surfaced.

Ideally, we would like to know what context occurs after the term however whenever this term occurs within a negative review. That will help us to focus on all occurrences of however with negative sentiment. To do this, we only take into account reviews containing the term however and having a Rating of 3 stars or less. It appears that the most common terms occurring after the term however were Fishy odor and After-taste. In other words, fishy odor is the cause that keeps Customers from giving a 5-star Rating.
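One way to collect the context that follows however could look like this. It is a sketch under assumptions: the sample reviews are invented, the window of two following words is an arbitrary choice, and `words_after` is a hypothetical helper rather than an NLTK function. Note that however is deliberately not removed as a stopword.

```python
import re
from collections import Counter

# Made-up sample reviews as (text, rating) pairs.
reviews = [
    ("Good capsules, however fishy odor is strong", 3),
    ("Works fine, however fishy burps afterwards", 2),
    ("Nice price. However, aftertaste is bad", 3),
    ("Great, however shipping was slow", 5),  # rating above 3, so ignored
]

def words_after(term, text, width=2):
    """Return up to `width` tokens that follow each occurrence of `term`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    found = []
    for i, token in enumerate(tokens):
        if token == term:
            found.extend(tokens[i + 1 : i + 1 + width])
    return found

following = Counter()
for text, rating in reviews:
    if rating <= 3:  # only reviews with 3 stars or less, as in the post
        following.update(words_after("however", text))

print(following.most_common(1))
```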

On the other hand, phrases such as highly recommend are interesting because we may use co-occurrence analysis to see which terms co-occur with a highly recommended product.
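A co-occurrence count for such a phrase might be sketched as follows. The reviews are invented and `cooccurring_terms` is a hypothetical helper; it simply counts, per term, how many reviews containing the phrase also contain that term, with no stopword removal.

```python
import re
from collections import Counter

# Made-up sample reviews.
reviews = [
    "Highly recommend this brand, no fishy taste",
    "No burps at all, highly recommend",
    "Too expensive for what you get",
]

def cooccurring_terms(texts, phrase):
    """Count terms appearing in the same review as `phrase`."""
    counts = Counter()
    for text in texts:
        lowered = text.lower()
        if phrase not in lowered:
            continue
        # Drop the phrase itself so its own words are not counted.
        remainder = lowered.replace(phrase, " ")
        counts.update(set(re.findall(r"[a-z']+", remainder)))
    return counts

co = cooccurring_terms(reviews, "highly recommend")
print(co.most_common(3))
```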

Of course, this is by no means the end of what we can do. To extract even better insights we have to spend significantly more time on proper Pre-processing, use Information Extraction, and use several other techniques to analyze Text Data in novel and potentially interesting ways.

Categories: Blogroll

SEMANTiCS2015: Calls for Research & Innovation Papers, Industry Presentations and Poster/Demos are now open!

Semantic Web Company - Fri, 2015-02-20 06:00

The SEMANTiCS2015 conference returns this year, in its 11th edition, to where it all started in 2005: Vienna, Austria!

The conference takes place from 15-17 September 2015 at the University of Economics (the main conference will be on 16-17 September, with several back-to-back workshops and events on 15 September). See all information:

We are happy to announce the SEMANTiCS Open Calls as follows. All information on the Calls can also be found on the SEMANTiCS2015 website here:

Call for Research & Innovation Papers

The Research & Innovation track at SEMANTiCS welcomes the submission of papers on novel scientific research and/or innovations relevant to the topics of the conference. Submissions must be original and must not have been submitted for publication elsewhere. Papers should follow the ACM ICPS guidelines for formatting and must not exceed 8 pages in length for full papers and 4 pages for short papers, including references and optional appendices.

Abstract Submission Deadline: May 22, 2015
Paper Submission Deadline: May 29, 2015
Notification of Acceptance: July 10, 2015
Camera-Ready Paper: July 24, 2015

Call for Industry & Use Case Presentations

To address the needs and interests of industry, SEMANTiCS presents enterprise solutions that deal with semantic processing of data and/or information in areas like Linked Data, Data Publishing, Semantic Search, Recommendation Services, Sentiment Detection, Search Engine Add-Ons, Thesaurus and/or Ontology Management, Text Mining, Data Mining and any related fields. All submissions should have a strong focus on real world applications beyond the prototypical status and demonstrate the power of semantic systems!

Submission Deadline: July 1, 2015
Notification of Acceptance: July 20, 2015
Presentation Ready: August 15, 2015

Call for Posters and Demos

The Posters & Demonstrations Track invites innovative work in progress, late-breaking research and innovation results, and smaller contributions (including pieces of code) in all fields related to the broadly understood Semantic Web. The informal setting of the Posters & Demonstrations Track encourages participants to present innovations to business users and find new partners or clients. In addition to the business stream, SEMANTiCS 2015 welcomes developer-oriented posters and demos to the new technical stream.

Submission Deadline: June 17, 2015
Notification of Acceptance: July 10, 2015
Camera-Ready Paper: August 01, 2015

We are looking forward to receiving your submissions for SEMANTiCS2015 and to seeing you in Vienna in autumn!

Categories: Blogroll

Data to Value & Semantic Web Company agree partnership to bring cutting edge Semantic Management to Financial Services clients

Semantic Web Company - Wed, 2015-02-18 09:04

The partnership aims to change the way organisations, particularly within Financial Services, manage the semantics embedded in their data landscapes. This will offer several core benefits to existing and prospective clients including locating, contextualising and understanding the meaning and content of Information faster and at a considerably lower cost. The partnership will achieve this through combining the latest Information Management and Semantic techniques including:

  • Text Mining, Tagging, Entity Definition & Extraction.
  • Business Glossary, Data Dictionary & Data Governance techniques.
  • Taxonomy, Data Model and Ontology development.
  • Linked Data & Semantic Web analyses.
  • Data Profiling, Mining & Discovery.

This includes improving regulatory compliance in areas such as BCBS, enabling new investment research and client reporting techniques as well as general efficiency drivers such as faster integration of mergers and acquisitions. As part of the partnership, Data to Value Ltd. will offer solution services and training in PoolParty product offerings, including ontology development and data modeling services.

Nigel Higgs, Managing Director of Data to Value, notes: “This is an exciting collaboration between two firms which are pushing the boundaries in the way Data, Information and Semantics are managed by business stakeholders. We spend a great deal of time helping organisations at a grass roots level pragmatically adopt the latest Information Management techniques. We see this partnership as an excellent way for us to help organisations take realistic steps to adopting the latest semantic techniques.”

Andreas Blumauer, CEO of Semantic Web Company, adds: “The consortium of our two companies offers a unique bundle, which consists of a world-class semantic platform and a team of experts who know exactly how Semantics can help to increase the efficiency and reliability of knowledge intensive business processes in the financial industry.”

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 13:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 12:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 13:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Ross Levinsohn, EVP, Yahoo (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:46
Perhaps no man is braver than Ross Levinsohn, at least at Web 2. First of all, he's the top North American executive at a long-besieged and currently leaderless company, and second because he has not backed out of our conversation on Day One (this coming Monday). I spoke to Ross... (Go to Searchblog Main)
Categories: Blogroll

I Just Made a City...

Searchblog - Mon, 2011-10-10 14:41
...on the Web 2 Summit "Data Frame" map. It's kind of fun to think about your company (or any company) as a compendium of various data assets. We've added a "build your own city" feature to the map, and while there are a couple bugs to fix (I'd like... (Go to Searchblog Main)
Categories: Blogroll