Skip navigation.
Semantic Software Lab
Concordia University
Montréal, Canada


#InOrOut: Today's #EURef Debate on Twitter

So what did the #EUReferendum debate look like today? Is Twitter still voting #Leave as it did back in May? What were the main hashtags and user mentions in today's #Leave and #Remain tweet samples?

Tweet VolumesRecord breaking 1.9 million tweets were posted today on the #InOrOut #EUReferendum, which is between three and six times the daily volumes observed earlier in June. On average, this is 21 tweets per second over the day, although, the peaks of activity occurred after 9am (see graphs below). 1.5 million of those tweets were posted during poll opening times. In that period, only 3,300 posts were inaccessible to us due to Twitter rate limits. 

Since the polls closed at 10pm tonight, there was a huge surge in Twitter activity with over 60,000 posts between 10pm and 11pm alone.  Twitter rate limits meant that we could not access another 6,000 posts from that period. Since this is only 10% of the overall data in this hour, we still have a representative sample for our analyses. 

Amongst the 1.9 million posts, over 1 million (57%) were retweets and 94 thousand (5%) - replies. These proportions of retweets and replies are consistent with patterns observed earlier in June.   
Tweets, Re-tweets, and Replies: #Leave or #Remain
Let's start by looking at original tweets, i.e. tweets which have been posted by their authors and are not a reply to another tweet or a retweet. I refer to the authors of those tweets as the OPs (Original Posters), following terminology adopted from online forums.

In retweets, the #Leave proponents are more vocal in comparison to the #Remain.   

The difference is particularly pronounced for replies,  where #Leave proponents are engaging in more debates than #Remain ones. Nevertheless, with replies constituting only 5% of all tweets today, the echo chamber effect observed earlier in June still remains unchanged. 
#InOrOut, #Leave, #Remain and Other Popular HashtagsInterestingly, 75% of all tweets today (1.4 million) contained at least one hashtag. This is a very significant increase on the 56.5% observed several days ago. 

Some of the most popular hashtags  remain unchanged from earlier in June. These refer to the leave and remain campaigns, immigration, NHS, parties, media, and politicians. Interestingly, there is now increased interest in #forex and #stocks, as predictors of the likely outcome. 

Most Mentioned Users Today: What is @Brndstr
Last for tonight, I compared the most frequently mentioned Twitter users in original tweets from today (see above) against those most mentioned earlier in June. The majority of popular mentioned users remains unchanged, with a mix of campaign Twitter accounts, media, and key political leaders.

The most prominent difference is that @Brndstr (Bots for Brands) came top (mentioned in over 14 thousand tweets), followed by @YouTube with 3 thousand mentions. Other new, frequently mentioned accounts today were Avaaz, DanHannanMEP,BuzzFeedUK, and realDonaldTrump.

So What Does This Tell Us?
The #InOrOut #EUReferendum has attracted unprecedented tweet volumes on poll day, with a significantly higher proportion of hashtags than previously. This seems to suggest that Twitter users are trying to get their voices heard and spread the word far and wide, well beyond the bounds of their normal follower  network. 

There are some exciting new entrants in the top 30 most mentioned Twitter accounts in today's referendum posts. I will analyse these in more depth tomorrow. For now, good night!  

Thanks to:Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.
Categories: Blogroll

Identifying A Reliable Sample of Leave/Remain Tweets

This post is the second in the series on the Brexit Tweet Analyser.

Having looked at tweet volumes and basic characteristics of the Twitter discourse around the EU referendum, we now turn to the method we chose for identify a reliable, even if incomplete, sample of leave and remain tweets.

No ground truth; not trying to predict if leave or remain are leading, but instead interested in identifying a reliable, if incomplete subset, so we can analyse topics discussed and active users within.

Are Hashtags A Reliable Predictor of Leave/Remain Support?As discussed in our earlier post, over 56% of all tweets on the referendum contain at least one hashtag. Some of these are actually indicative of support for the leave/remain campaigns, e.g. #votetoleave, #voteout, #saferin, #strongertogether. Then there are also hashtags which try to address undecided voters, e.g. #InOrOut, #undecided, while promoting either a remain or leave vote but not through explicit hashtags.

A recent study of EU referendum tweets by Ontotext, carried out over tweets in May 2016,  classified tweets as leave or remain on the basis of approximately 30 hashtags. Some of those were associated with leave, the rest -- with remain, and each  tweet was classified as leave or remain based on whether it contains predominantly leave or predominantly remain hashtags. 

Based on analysing manually a sample of random tweets with those hashtags, we found that this strategy does not always deliver a reliable assessment, since in many cases leave hashtags are used as a reference to the leave campaign, while the tweet itself is supportive of remain or neutral. The converse is also true, i.e. remain hashtags are used to refer to the remain stance/campaign. We have included some examples below. 

A more reliable, even if somewhat more restrictive, approach is to consider the last hashtag in the tweet as the most indicative of its intended stance (pro-leave or pro-remain). This results in a higher precision sample of remain/leave tweets, which we can then analyse in more depth in terms of topics discussed and opinions expressed. 

Using this approach, amongst the 1.9 million tweets between June 13th and 19th, 5.5% (106 thousand) were identified as supporting the Leave campaign, while 4% (80 thousand) - as supporting the Remain campaign. Taken together, this constitutes just under a 10% sample, which we consider sufficient for the purposes of our analysis. 

These results, albeit drawn from a smaller, high-precision sample, seem to indicate that the Leave campaign is receiving more coverage and support on Twitter, when compared to Remain. This is consistent also with the findings of the Ontotext study .

In subsequent posts we will look into the most frequently mentioned hashtags, the most active Twitter users, and the topics discussed in the Remain and Leave samples separately. 

What about #Brexit in particular?   The recent Ontotext study on May 2016 data used #Brexit as one of the key hashtags indicative of leave. Others have also used #Brexit in the same fashion.

In our more recent 6.5 million tweets (dated between 1 June and 19 June 2016), just under 1.7 million contain the #Brexit hashtag (26%). However, having examined a random sample of those manually (see examples below), we established that while many tweets did use #Brexit to indicate support for leave, there were also many cases where #Brexit referred to the referendum, or the leave/remain question, or the Brexit campaign as a whole. We have provided some such examples at the end of this blog post. We also found a sufficient number of examples where #Brexit appears at the end of tweets while still not indicating support for voting leave. 

Therefore, we chose to distinguish the #Brexit hashtag from all other leave hashtags and tagged tweets with a final #Brexit tag separately. This enables us, in subsequent analyses, to compare findings with and without considering #Brexit.  

Example Remain/Leave Hashtag Use

It doesnt matter who some of the dodgy leaders of #Remain and #Brexit are, they each only have ONE VOTE, like all of us public #EURef— Marcus Storm (@MarcsandSparks) 20 June 2016

Perfect question! "Why is #brexit ahead, despite all the experts supporting #remain?" #questiontime— Steve Parrott (@steveparrott50) 19 June 2016

Could the last decent politician (of any party) to leave the #Leave camp please turn off the lights.....#Bremain— Dr Hamed Khan (@drhamedkhan) 19 June 2016

Today's @thesundaytimes #focus articles on #brexit say it all. #remain is forward-looking, #leave backward— Patrick White (@pbpwhite) 20 June 2016

Example Brexit Tweets
#Brexit probability declines as campaigns remain quiet via @RJ_FXandRates— Bloomberg London (@LondonBC) 17 June 2016
#VoteRemain #VoteLeave #InOrOut #EURef #StrongerIn -- Is #Brexit The End Of The World As We Know It? via @forbes— Jolly Roger (@EUGrassroots) 17 June 2016
Remaining #Brexit Polls scheduled releases— Nicola Duke (@NicTrades) 17 June 2016
Blame austerity—not immigration—for bringing Britain to ‘breaking point’ #EUref— The Conversation (@ConversationUK) June 20, 2016
BREAK World's biggest carmaker #Ford tells staff of "deep concerns abt "uncertainty/potential downsides" of #Brexit— Beth Rigby (@BethRigby) June 20, 2016

Thanks to:Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Any mistakes are my own.
Categories: Blogroll

An ICML unworkshop

Machine Learning Blog - Mon, 2016-06-20 22:29

Following up on an interesting suggestion, we are creating a “Birds of a Feather Unworkshop” with a leftover room (Duffy/Columbia) on Thursday and Friday during the workshops. People interested in ad-hoc topics can post a time and place to meet and discuss. Details are here a little ways down.

Categories: Blogroll

Introducing the Brexit Analyser: real-time Twitter analysis with GATE

The GATE team has been busy lately with building the real-time Brexit Analyser.  It analyses tweets related to the forthcoming EU referendum, as they come in, in order to track the referendum debate unfolding on Twitter. This research is being carried out as part of the SoBigData project

The work follows on from our successful collaboration with Nesta on the Political Futures Tracker, which analysed tweets in real-time in the run up to the UK General Election in 2015. 

Unlike others, we do not try to predict the outcome of the referendum or answer the question of whether Twitter can be used as a substitute for opinion polls. Instead, our focus is on a more in-depth analysis of the referendum debate; the people and organisations who engage in those debates; what topics are discussed and opinion expressed, and who the top influencers are.

What does it do?It analyses and indexes tweets as they come in (i.e. in real time), in order to identify commonly discussed topics, opinions expressed, and whether a tweet is expressing support for remaining or leaving the EU. It must be noted that not all tweets have a clear stance and also that not all tweets express a clear voting intention (e.g. "Brexit & Bremain"). More on this in subsequent posts! 

In more detail, the Brexit Analyser uses text analytics and opinion mining techniques from GATE, in order to identify tweets expressing voting intentions, the topics discussed within, and the sentiment expressed towards these topics. Watch this space! 
The Data  (So Far)We are collecting tweets based on a number of referendum related hashtags and keywords, such as #voteremain, #voteleave, #brexit, #eureferendum. 

The volume of original tweets, replies, and re-tweets per day collected so far is shown below. On average, this is close to half a million tweets per day (480 thousand), which is 1.6 times the tweets on 26 March 2015 (300,000), when the Battle For Number 10 interviews took place, in the run up to the May 2015 General Elections. 

In total, we have analysed just over 1.9 million tweets in the past 4 days, with 60% of those being re-tweets. On average, a tweet is re-tweeted 1.65 times. 

Subsequent posts will examine the distribution of original tweets, re-tweets, and replies specifically in tweets expressing a remain/leave voting intention.  

Hashtags: 1 million of those 1.9 million tweets contain at least one hashtag  (i.e. 56.5% of all tweets have hashtags). If only original tweets are considered (i.e. all replies and retweets are excluded), then there are 319 thousand tweets with hashtags amongst the original 678 thousand tweets (i.e. 47% of original tweets are hashtag bearing).

Analysing hashtags used in a Twitter debate is interesting, because they indicate commonly discussed topics, stance taken towards the referendum, and also key influencers. As they are easy to search for, hashtags help Twitter users participate in online debates, including other users they are not directly connected to.

Below we show some common hashtags on June 16, 2016. As can be seen, most are associated directly with the referendum and voting intentions, while others refer to politicians, parties, media, places, and events:

URLs:  Interestingly, amongst the 1.9 million tweets only 134 thousand contain a URL (i.e. only 7%).  Amongst the 1.1 million re-tweets, 11% contain a URL, which indicates that tweets with URLs tend to be retweet more.  

These low percentages suggest that the majority of tweets on the EU referendum are expressing opinions or addressing another user, rather than sharing information or providing external evidence. 

@Mentions: Indeed, 90 thousand (13%) of the original 678 thousand tweets contain an username mention. The 50 most mentioned users in those tweets are shown below. The size of the user name indicates frequency, i.e. the larger the text the more frequently has this username been mentioned in tweets. 

In subsequent posts we will provide information on the most frequently re-tweeted users and the most prolific Twitter users in the dataset. 

So What Does This Tell Us?
Without a doubt, there is a heavy volume of tweets on the EU referendum, published daily. However, with only 6.8% of all tweets being replies and over 58% -- re-tweets, this resembles more an echo chamber, rather than a debate.  

Pointers to external evidence/sources via URLs are scarce, as are user mentions. The most frequently mentioned users are predominantly media (e.g., BBC, Reuters, FT, the Sun, Huffington Post);  politicians playing a prominent role in the campaign (e.g. David Cameron,  Boris Johnson, Nigel Farage, Jeremy Corbyn); and campaign accounts created especially for the referendum (e.g. @StrongerIn, @Vote_Leave).    

Thanks to:Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team 

Categories: Blogroll

The ICML 2016 Space Fight

Machine Learning Blog - Sat, 2016-06-04 17:29

The space problem started long ago.

At ICML last year and the year before the amount of capacity that needed to fit everyone on any single day was about 1500. My advice was to expect 2000 and have capacity for 2500 because “New York” and “Machine Learning”. Was history right? Or New York and buzz?

I was not involved in the venue negotiations, but my understanding is that they were difficult, with liabilities over $1M for IMLS the nonprofit which oversees ICML year to year. The result was a conference plan with a maximum capacity of 1800 for the main conference, a bit less for workshops, and perhaps 1000 for tutorials.

Then the NIPS registration numbers came in: 3900 last winter. It’s important to understand here that a registration is not a person since not everyone registers for the entire event. Nevertheless, NIPS was very large with perhaps 3K people attending at any one time. Historically, NIPS is the conference most similar to ICML with a history of NIPS being a bit larger. Most people I know treat these conferences as indistinguishable other than timing: ICML in the summer and NIPS in the winter.

Given this, I had to revise my estimate up: We should really have capacity for 3000, not 2500. It also convinced everyone that we needed to negotiate for more space with the Marriott. This again took quite awhile with the result being a modest increase in capacity for the conference (to 2100) and the workshops, but nothing for the tutorials.

The situation with tutorials looked terrible while the situation with workshops looked poor. Acquiring more space at the Marriott looked near impossible. Tutorials require a large room, so we looked into the Kimmel Center at NYU acquiring a large room and increasing capacity to 1450 for the tutorials. We also looked into additional rooms for workshops finding one at Columbia and another at the Microsoft Technology Center which has a large public use room 2 blocks from the Marriott. Other leads did not pan out.

This allowed us to cover capacity through early registration (May 7th). Based on typical early vs. late registration distributions I was expecting registrations might need to close a bit early similar to what happened with KDD in 2014.

Then things blew up. Tutorial registration reached capacity the week of May 23rd, and then all registration stopped May 28th, 3 weeks before the conference. Aside from simply failing to meet demand this also creates lots of problems. What do you do with authors? And when I looked into things in detail for workshops I realized we were badly oversubscribed for some workshops. It’s always difficult to guess which distribution of room sizes is needed to support the spectrum of workshop interests in advance so there were serious problems. What could we do?

The first step was tutorial and main conference registration which reopened last Tuesday using some format changes which allowed us to increase capacity further. We will use simulcast to extra rooms to support larger audiences for tutorials and plenary talks allowing us to up the limit for tutorials to 1590 and for the main conference to 2400. We’ve also shifted the poster session to run in parallel with main tracks rather than in the evening. Now, every paper will have 3-4 designated hours during the day (ending at 7pm) for authors to talk to people individually. As a side benefit, this will also avoid the competition between posters and company-sponsored parties which have become common. We’ll see how this works as a format, but it was unavoidable here: even without increasing registration the existing evening poster session plan was a space disaster.

The workshop situation was much more difficult. I walked all over the nearby area on Wednesday, finding various spaces and getting quotes. I also realized that the largest room at the Crown Plaza could help with our tutorials: it was both bigger and much closer than NYU. On Thursday, we got contract offers from the promising venues and debated into the evening. On Friday morning at 6am the Marriott suddenly gave us a bunch of additional space for the workshops. Looking through things, it was enough to shift us from ‘oversubscribed’ to ‘crowded’ with little capacity to register more given natural interests. We developed a new plan on the fly, changed contracts, negotiated prices down, and signed Friday afternoon.

The local chairs (Marek Petrik and Peder Olsen) and Mary Ellen were working hard with me through this process. Disruptive venue changes 3 weeks before the conference are obviously not the recommended way of doing things:-) And yet it seems to be working out now, much better than I expected last weekend. Here’s the situation:

  1. Tutorials ~1600 registered with capacity for 1850. I expect this to run out of capacity, but it will take a little while. I don’t see a good way to increase capacity further.
  2. The main conference has ~2200 registered with capacity for 2400. Maybe this can be increased a little bit, but it is quite possible the main conference will run out of capacity as well. If it does, only authors will be allowed to register.
  3. Workshops ~1900 registered with capacity for 3000. Only the Deep Learning workshop requires a simulcast. It seems very unlikely that we’ll run out of capacity so this should be the least crowded part of the conference. We even have some left-over little rooms (capacity for 125 or less) that are looking for a creative use if you have one.

In this particular case, “New York” was both part of the problem and much of the solution. Where else can you walk around and find large rooms on short notice within 3 short blocks? That won’t generally be true in the future, so we need to think carefully about how to estimate attendance.

Categories: Blogroll

Room Sharing for ICML (and COLT, and ACL, and IJCAI)

Machine Learning Blog - Mon, 2016-05-02 13:32

My greatest concern with the many machine learning conferences in New York this year was the relatively high cost that implied, particularly for hotel rooms in Manhattan. Keeping the conference affordable for graduate students seems critical to what ICML is really about.

The price becomes much more reasonable if you can find roommates to share the price. For example, the conference hotel can have 3 beds in a room.

This still leaves a coordination problem: How do you find plausible roommates? If only there was a website where the participants in a conference could look for roommates. Oh wait, there is. is something new which might measurably address the cost problem. Obviously, you’ll want to consider roommate possibilities carefully, but now at least there is a place to meet.

Note that the early registration deadline for ICML is May 7th.

Categories: Blogroll

Quora session

Machine Learning Blog - Tue, 2016-04-19 08:46

I’m doing a Quora Session today that may be of interest. I’m impressed with both the quality and quantity of questions.

Categories: Blogroll

ICML registration is live

Machine Learning Blog - Fri, 2016-04-08 15:30

Here. I would recommend registering early because there is a difficult to estimate(*) chance you will not be able to register later.

The program is shaping up and should be of interest. The 9 Tutorials(**), 4 Invited Speakers, and 23 Workshops are all chosen, with paper decisions due out in a couple weeks.

Early Full (after May 7) Student 510 640 Regular 840 1050

These numbers are as aggressively low as the local chairs and I can sleep with at night. The prices are higher than I’d like (New York is expensive), but a bit lower than last year, particularly for students(***).

(*) Relevant facts:

  1. ICML 2016: submissions up 30% to 1300.
  2. NIPS 2015 in Montreal: 3900 registrations (way up from last year).
  3. NIPS 2016 is in Barcelona.
  4. ICML 2015 in Lille: 1670 registrations.
  5. KDD 2014 in NYC: closed@3000 registrations 1 week before the conference.

I tried to figure out how to setup a prediction market to estimate what will happen this year, but didn’t find an easy-enough way to do that.

(**) I kind of wish we could make up the titles. How about: “Go is Too Easy” and “My Neural Network is Deeper than Yours”?

(***) Sponsors are very generous and are mostly giving to defray student costs. Approximately every dollar of the difference between Regular and Student registration is due to company donations. For students, also note that there will be some scholarship opportunities to defray costs coming out soon.

Categories: Blogroll

Insights into Nature’s Data Publishing Portal

Semantic Web Company - Wed, 2016-03-30 05:05

In recent years, Nature has adopted linked data technologies on a broader scale. Andreas Blumauer was intrigued to discover more about the strategy and technologies behind. He had the opportunity to talk with Michele Pasin and Tony Hammond who are the architects of Nature’s data publishing portal.


Semantic Puzzle: Nature’s data publishing portal is one of the most renowned ones in the linked data community. Could you talk a bit about its history? Why was this project initiated and who have been the brains behind it since then?

Michele Pasin: We have been involved with semantic technologies at Macmillan since 2010. At the time it was primarily my colleague Tony Hammond who saw the potential of these technologies for metadata management and data sharing. Tony set up the portal in April 2012 (and expanded in July 2012), in the context of a broader company initiative aimed at moving towards a ‘digital first’ publication workflow.

The platform was essentially a public RDF output of some of the metadata embedded in our XML articles archive. This included a SPARQL endpoint for data about articles published by NPG from 1845 through to the present day. Additionally the datasets include NPG product and subject ontologies. These datasets are available under a Creative Commons Zero waiver.

The platform was only for external use though, so it was essentially detached from the products end users would see on Still, it allowed us to mature a better understanding of how to make use of these tools within our existing technology stack. It is important to remember that in the years the company has been investing a considerable amount of resources on an XML-centered architecture, so finding a solution that could leverage the legacy infrastructure with these new technologies has always been a fundamental requirement for us.

More recently, in 2013 we started working on a new hybrid linked data platform, this time with a much stronger focus on supporting our internal applications. That’s pretty much around the time I joined the company. In essence, we made the point that in order to achieve stronger interoperability levels within our systems we had to create an architecture where RDF is core to the publishing workflow as much as XML is. (By the way if you are interested in the details of this, we presented a paper about this at ISWC 2014.) As part of this phase, we also built a more sophisticated set of ontologies used for encoding the semantics of our data, together with improved versions of the datasets previously released.

The ontologies portal came out in early 2015 as the result of this second phase of work. On the portal one can find extensive documentation about all of our models, as well as periodical downloads in various RDF formats. The idea is to make it easier for people – both within the enterprise and externally – to access, understand and reuse our linked data.

At the same time, since user engagement level on was not as good as expected, we decided to terminate that service. In the future, we plan to keep releasing periodic snapshots of the datasets and the ontologies we are using, but not a public endpoint in the immediate future.

Semantic Puzzle: As one of your visions you’re stating that your “primary reason for adopting linked data technologies is quite simply better metadata management”. How did you deal with metadata before you started with this transition? What has changed since then, also from a business point of view?

Michele Pasin:  Our pre–linked data approach to dealing with metadata and enterprise taxonomies is probably not unheard of, especially within similar sized companies: a vast array of custom-made solutions, varying from simple word documents sitting in someone’s computer, to Excel spreadsheets or, in the best of cases, database tables in one of our production systems. Of course, there were also a number of ad-hoc applications/scripts responsible for the reading/updating of these metadata sources, as often they would be critical to one or more system in the publishing workflow (e.g. think of the journal’s master list, or the list of approved article-types).

It is worth stressing that the lack of a unified technical infrastructure aspect was a key problem, of course, but not the only one. In fact I would argue that addressing the lack of a centralized data governance approach was even more crucial. For example, most often you would not know who/which department was in charge of a particular controlled vocabulary or metadata specification. In some cases, no single source of truth was actually available, because different people/groups were in charge of specific aspects of a single specification (due to their differing interests).  

Hence you need a certain amount of management buy-in to implement such a wide-ranging approach to metadata; moving to a single platform and technical solution based on linked data was fundamental, but an equally fundamental organizational change was also needed. Even more so, if one considers that this is not a time-boxed project but rather an ongoing process, an approach which pays off only as much as you can guarantee that as new products and services get launched, they all subscribe to the same metadata management ‘philosophy’.

Semantic Puzzle: One of the promises of Linked Data is that by “using a common data model and a common naming architecture, users can begin to realize the benefits and efficiencies of web scaling.” Could you describe a bit more in detail into which eco-system your content workflows and publishing processes are embedded (internally and externally) and why the use of standards is important for this?

Tony Hammond: We operate with an XML-based workflow for documents where we receive XML from our suppliers and store that within an XML database (MarkLogic). Increasingly we are beginning to move towards a dynamic publishing solution from that database. We are also using the database to provide a full-text search across all our content. In the past we had various workflows and a small number of different DTDs to reconcile, although we are currently converging on a single DTD. To facilitate search across this mixed XML content we abstracted certain key metadata elements into a common header. This was managed organically and was somewhat unpredictable both in terms of content model and naming.

By moving to a linked data solution for managing our metadata which is based on a single, core ontology we bypass our normalized metadata header and start to build on a new simpler data model (triples) with a common naming architecture. In effect, we have moved from a nominally normalized metadata to a super-normalized metadata which uses web standards for data (URI, RDF, OWL).

Semantic Puzzle: Your contents are also multimedia (image, video, …). How do you embed this non-textual contents into your linked data ecosystem? Which gateways, tools and connectors are used to bridge your linked data environment with multimedia?

Tony Hammond: Some years ago we embarked on a new initiative internally to streamline our production workflows. Our brief was to support a distributed content warehouse where digital assets would be stored in various locations. The idea was to abstract out our storage concerns and to maintain pointers to the various storage subsystems along with other physical characteristics required for accessing that storage.

In practice our main content was housed as XML documents within a MarkLogic XML database and associated media assets (e.g. images) were primarily stored on the filesystem with some secondary asset types (e.g. videos) being sourced from cloud services.

To relate a physical asset (e.g. an XML document, or a JEPG file) to the underlying concept (e.g. an article, or an image) we made use of XMP packets (a technology developed by Adobe Systems and standardized through ISO) which as simple RDF/XML descriptions allowed us to capture metadata about physical characteristics and to relate those properties to our data model. An XMP packet is a description of one physical resource and could be simply linked to the related conceptual resource.

We started this project with an RDF triplestore for maintaining and querying our metadata, but over time we moved towards a hybrid technology where our semantic descriptions were buried within XML documents as RDF/XML descriptions and could be queried within an XML context using XQuery to deliver a highly performant JSON API. These semantic descriptions enclosed minimal XMP documents which described the storage entities.

Semantic Puzzle: Nature links its datasets to external ones, e.g. to DBpedia or MeSH. Who exactly is benefiting from this and how?

Michele Pasin:  I would say that there are at least two reasons why we did this. First, we wanted to maximize the potential reuse of our datasets and models within the semantic web. Building owl:sameAs relationships to other vocabularies, or marking up our ontology classes and properties with subclass/subproperty relationships pointing to external vocabularies is a way to be good ‘linked data citizens’. Moreover, this is a deliberate attempt to counterbalance one of our key design principles: minimal commitment to external vocabularies. This approach to data modeling means that we tend to create our own models and define them within our own namespaces, rather than building production-level software against third party ontologies. It is worth pointing out that this is not because we think our ontologies are better – but because we want our data architecture to reflect as closely as possible the ontological commitment of a publishing enterprise with decades of established business practices, naming conventions etc. In other words, we aimed at creating a very cohesive and robust domain model, one which is resilient to external changes but that also supports semantic interoperability by providing a number of links and mappings to other semantic web standards.

Pointing to external vocabularies is a way to be good ‘linked data citizens’

The second reason for creating these links is to enable more innovative discovery services. For example, a subject page about photosyntesis could surface encyclopedic materials automatically retrieved from DBpedia; or it could provide links to highly cited articles retrieved from PubMed using MeSH mappings. This just scratches the surface of what one could do. The real difficulty is, how to do it in such a way that the overall user experience improves, rather than adding up to the information overload the majority of internet users already have to deal with. So at the moment, while the data people (us) are focusing on building a rich network of entities for our knowledge graph, the UX and front end teams are exploring design and interaction models that truly take advantage of these functionalities. Hopefully we see these activities continue to converge!

Semantic Puzzle: How do you deal with data quality management in general, and how can linked data technologies help to improve it?

Tony Hammond: We can distinguish between two main types of data: documents and ontologies. (And by ontologies we also comprehend thesauri and taxonomies.) Our documents are created by our suppliers using XML and are amenable to some data validations. We use automated DTD validation in our new workflow and by hand DTD validation in the older workflows. We also use Schematron rulesets to validate certain data points but these address only certain elements. We have a couple hundred Schematron rules which implement various business rules and are also synchronized with our ontologies.

Our ontologies, on the other hand, are by their nature more curated datasets. These are mastered as RDF Turtle files and stored within GitHub. These are currently maintained by hand, although we are beginning now to transition some of our taxonomies to the PoolParty taxonomy manager. We have a build process for deploying the ontologies to our XML database where they are combined with our XML documents. During this build process we both validate the RDF as well as running SPIN rules over the datasets which can validate data elements as well as expanding the dataset with new triples from rules-based inferencing.

Semantic Puzzle: For a publisher like Nature it is somehow “natural” that Linked Data is used. How could other industries make use of these principles for information management?

Tony Hammond: The main reason for using linked data is not to do with publishing the data (and indeed many other data models are generally used for data publishing), but with the desire to join one dataset with other datasets – or rather, the data within a dataset to the data within other datasets. It is for this reason that we make use of URIs as common (global) names for data points. Linking data is not just a goal in publishing data but applies equally when consuming data from various sources and integrating over those data sources within an internal environment. Indeed, arguably, the biggest use case for linked data is within private enterprises rather than surfaced on the open web. Once that point is appreciated there is no restriction on any industry in being more disposed to using linked data than any other, and it is used as a means to maximize the data surface that a company operates over.

The biggest use case for linked data is within private enterprises rather than surfaced on the open web

Semantic Puzzle: Where are the limits of Linked Data from your perspective, and do you believe they will ever be exceeded?

Tony Hammond: The limits to using linked data are more to do with top-down vs bottom-up approaches in dealing with data, i.e. linked data vs big data, or data curation vs data crunching. Linked data makes use of global names (URIs), schemas, ontologies. It is highly structured, organized data.

Now, whether it is feasible to bring this level of organization to data at large or whether data crunching will provide the appropriate insights over the data is an open question. Our expectation is that we will still need to use ontologies – and hence linked data – as an organizing principle, or reference, to guide us in processing large datasets and for sharing those data organizations. The question may be how much human curation is required in assembling these ontologies.

Michele Pasin: On a more practical level, I’d say that the biggest problem with linked data is still its rather limited adoption on a large scale. I’m referring in particular to the data publishing and reuse aspect. On this front, we really struggled to get the levels of uptake the business was expecting from us. Consider this: we have been publishing metadata for our entire archive since 2012 (approx. 1.2m documents, resulting in almost half a billion triples). However very few people made use of these data, either in the form of bulk downloads or via the SPARQL API we once hosted (and that was then retired due to low usage). This is in stark contrast with other – arguably less flexible – services we make available, e.g. the OpenSearch APIs, or a JSON REST service, which often see significant traffic.

Last year we gave a paper at the Linked Science workshop (affiliated with ISWC 2015) with the specific intent to address the problem within that community. What seemed to emerge is that possibly this has to do with the same reason why this technology has been so useful to us. RDF is an extremely flexible and powerful model, however, when it comes to data consumption and access, the average user cares more about simplicity than flexibility. Also, outside linked data circles we all know that the standard tech for APIs is JSON and REST, rather than RDF and SPARQL.

Lowering the bar to the adoption of semantic tech

The good news though is that we are seeing more initiatives aimed at bridging these two worlds. One that we are keeping an eye on, for example, is JSON-LD. The way this format hides various RDF complexities behind a familiar JSON structure makes it an ideal candidate for a linked data publishing product with a much wider user base. Which is exactly what we are looking for: lowering the bar to the adoption of semantic tech.


About Michele Pasin

Michele Pasin is an information architect and product manager with a focus on enterprise metadata management and semantic technologies.

Michele currently works for Springer Nature, a publishing company resulting from the May 2015 merger of Springer Science+Business Media and Holtzbrinck Publishing Group’s Nature Publishing Group, Palgrave Macmillan, and Macmillan Education.

He has recently taken up the role of product manager for the knowledge graph project, an initiative whose goal is to bring together various preexisting linked data repositories, plus a number of other structured and unstructured data sources, into a unified, highly integrated knowledge discovery platform. Before that, he worked on projects like’s subject pages (a dynamic section of the website that allow users to navigate content by topic) and the ontologies portal (a public repository of linked open data).

He holds a PhD in semantic web technologies from the Knowledge Media Institute (The Open University, UK) and advanced degrees in logic and philosophy of language from the University of Venice (Italy). Previously, he was a research associate at King’s College Department of Digital Humanities (London), where he developed on a number of cultural informatics projects such as the People of Medieval Scotland and the Art of Making in Antiquity. Online Portfolio:

Michele Pasin will give a keynote at this year’s SEMANTiCS conference.

About Tony Hammond

Tony Hammond is a data architect with a primary focus in the general area of machine-readable description technologies. He has been actively involved in developing industry standards for network identifiers and metadata frameworks. He has had experience working on both sides of the scientific publishing information chain, from international research centres to leading publishing houses. His background is in physics with astrophysics.

Tony currently works for Springer Nature, a publishing company resulting from the May 2015 merger of Springer Science+Business Media and Holtzbrinck Publishing Group’s Nature Publishing Group, Palgrave Macmillan, and Macmillan Education.

Categories: Blogroll

AI's not AI

Data Mining Blog - Sun, 2016-03-27 14:04

There has been a lot of commentary recently on issues relating to an experimental chat bot that Microsoft has (or had) launched named (after, perhaps, a river in Scotland) Tay. After a brief existence online, the bot was removed due to behaviours perceived as offensive which it was persuaded to engage in. Peter Lee of MSR has this to say about it. While there is much to learn from what transpired, the thing that irks me the most is the continued use of the term Artificial Intelligence to describe these systems - Lee actually calls it an 'artificial intelligence application'. Experimenting with these interactive agents is, no doubt, a useful activity that will teach us much about how humans will interact with actual AI entities in the future, but calling a chat bot of this nature an artificial intelligence application is like calling the icing on a cake, a cake. Communicating with humans is essential to artificial intelligence; communicating as a peer in human language with not much else going on 'upstairs' is not, however, a demonstration of artificial intelligence.

Related articles How the Tech Media Keeps Artificial Intelligence at a Distance The Economist gets in on the AI Fluff AI, Artificial Birds and Aeroplanes
Categories: Blogroll

International Semantic Web Community meets in Leipzig, Sept. 12-15, 2016

Semantic Web Company - Wed, 2016-03-23 05:27

At the annual SEMANTiCS Conference, experts from academia and industry meet to discuss semantic computing, its benefits and future business implications. Since 2005, SEMANTiCS has been attracting the opinion leaders in semantic web and big data technology, ranging from information managers and software engineers, to commerce experts and business developers as well as researchers and IT architects, when it comes to defining the future of information technology.

The SEMANTiCS 2016 takes place from September 12th to 15th at the second oldest university of Germany – the Leipzig University. Leipzig University hosts several departments in particular AKSW focused on Linked Data and Semantic Web and is therefore THE European hotspot, when it comes to graph-based technologies and knowledge engineering.

You want to be a part of the SEMANTiCS Conference and are interested to get in touch with the following audiences?

  • IT professionals & IT architects
  • Software developers
  • Knowledge Management Executives
  • Innovation Executives
  • R&D Executives

Calls are open now. Industrial presentation offer a platform to reach a huge network of practicioners and users to get feedback and academic submission are published in the well-known ACM-ICPS series (deadline 21st April, 23% acceptance rate). To submit your contribution, please visit the section calls on our website. To attend the workshops, the tutorials or to enjoy the talks in one of the offered sessions, please visit our registration site.

You want to partner with SEMANTiCS 2016? Then get a sponsor package or become an exhibitor! For more details, please click here.

To be up-to-date, stay tuned and follow us on facebook, twitter (@SemanticsConf) or visit our website for the latest news.

Categories: Blogroll

AlphaGo is not the solution to AI

Machine Learning Blog - Sun, 2016-03-13 18:46

Congratulations are in order for the folks at Google Deepmind who have mastered Go.

However, some of the discussion around this seems like giddy overstatement. Wired says Machines have conquered the last games and Slashdot says We know now that we don’t need any big new breakthroughs to get to true AI. The truth is nowhere close.

For Go itself, it’s been well-known for a decade that Monte Carlo tree search (i.e. valuation by assuming randomized playout) is unusually effective in Go. Given this, it’s unclear that the AlphaGo algorithm extends to other board games where MCTS does not work so well. Maybe? It will be interesting to see.

Delving into existing computer games, the Atari results (see figure 3) are very fun but obviously unimpressive on about ¼ of the games. My hypothesis for why is that their solution does only local (epsilon-greedy style) exploration rather than global exploration so they can only learn policies addressing either very short credit assignment problems or with greedily accessible polices. Global exploration strategies are known to result in exponentially more efficient strategies in general for deterministic decision process(1993), Markov Decision Processes (1998), and for MDPs without modeling (2006).

The reason these strategies are not used is because they are based on tabular learning rather than function fitting. That’s why I shifted to Contextual Bandit research after the 2006 paper. We’ve learned quite a bit there, enough to start tackling a Contextual Deterministic Decision Process, but that solution is still far from practical. Addressing global exploration effectively is only one of the significant challenges between what is well known now and what needs to be addressed for what I would consider a real AI.

This is generally understood by people working on these techniques but seems to be getting lost in translation to public news reports. That’s dangerous because it leads to disappointment. The field will be better off without an overpromise/bust cycle so I would encourage people to keep and inform a balanced view of successes and their extent. Mastering Go is a great accomplishment, but it is quite far from everything.

Edit: Further discussion here, CACM, here, and KDNuggets.

Categories: Blogroll

How PoolParty and ISO 25964 fit together

Semantic Web Company - Fri, 2016-03-04 04:36

The release of the ISO standard for thesauri “ISO 25964 Part 1: Thesauri for information retrieval” in 2011 was a huge step, as it replaced standards that dated back to 1986 (ISO 2788) and 1985 (ISO 5964). By that, methodologies from a pre-Web era, when thesauri where rather developed to be published on paper have been further developed. The new standard also brought a shift from a term-based model to a concept-based model stating: “Each term included in a thesaurus should represent a single concept (or unit of thought)” – from: ISO 25964 Part 1, page 15. That brings it close to Semantic Web based data models like SKOS and also shows that formerly disconnected communities are now working together.

Term vs. Concept based

We are frequently asked whether PoolParty is compatible with ISO 25964. Our basic answer always is “Yes, of course” as the data model defined in the standard can be mapped to SKOS + SKOS-XL (see: On the other hand we also have to point out that the ISO standard defines a very comprehensive model for managing all sorts of thesauri. In contrast, SKOS focuses rather on a simple data model that allows to manage all kinds of KOS (incl. classification schemes) that can be extended if more complexity is needed. In my view, this difference also reflects the two principal ways of approaching thesaurus projects:  “top down” approach (ISO 25964) vs. “bottom up” approach (SKOS). Since we at SWC have always been following the principle “start simple and add compexity as you go/need it”, it’s quite clear where we reside: With PoolParty’s ontology management and custom schema management, taxonomists can go far beyond SKOS’s expressivity.

ISO 25964 also includes a chapter about “Guidelines for thesaurus management software” – so I tried to figure out to which degree this is covered by PoolParty. The results can now be found in PoolParty documentation.

So if you’re asked the next time “Is PoolParty compatible with ISO 25964?”, you will answer hopefully “Yes, of course – just take a look at the documentation”:


Categories: Blogroll

Semantic Web Company Named to KMWorld’s 2016 ‘100 Companies That Matter in Knowledge Management’

Semantic Web Company - Tue, 2016-03-01 13:13

Semantic Web Company, the leading provider of graph-based metadata, search, and analytic solutions, today announced that it has been named to KMWorld’s 2016 list of the ‘100 Companies That Matter in Knowledge Management’. This award is another important milestone for the broad acceptance of Semantic Web standards in enterprises.

“Only last year, our standards-based platform PoolParty Semantic Suite got acknowledged as Trend-Setting Product by KMWorld. We are delighted to be recognized now as an industry leader in innovation and service from KMWorld. Semantic Web technologies have an ever-increasing impact on the management of data and information of many knowledge-intensive organizations,” says Andreas Blumauer, co-founder and CEO of the Semantic Web Company.


KMWorld Editor-in-Chief Sandra Haimila agrees, “Being named to our list of 100 Companies That Matter in Knowledge Management is a prestigious designation because it represents the best in innovation, creativity and functionality. The 100 Companies offer solutions designed to help users and customers find what they need whenever and wherever they need it … and what they need is the ability to access, analyze and share crucial knowledge.”

More information can be found in the March print issue of KMWorld Magazine and online at

About KMWorld

KMWorld is the leading information provider serving the Knowledge Management systems market and covers the latest in content, document and knowledge management, informing more than 30,000 subscribers about the components and processes – and subsequent success stories – that together offer solutions for improving business performance. KMWorld is a publishing unit of Information Today, Inc.

About Semantic Web Company

The Semantic Web Company was founded in 2004 and is acknowledged as a global leader in Semantic Web technologies. The company is the vendor of PoolParty Semantic Suite and is involved in R&D projects with a volume of more than 16 million EUR.

A team of Linked Data experts provides consulting and integration services for semantic data and knowledge portals. Boehringer Ingelheim, Credit Suisse, European Commission, Roche, Red Bull, and The World Bank are among many other customers, which have successfully adopted Semantic Web solutions.

More information can be found online at

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 13:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 12:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 13:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Ross Levinsohn, EVP, Yahoo (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:46
Perhaps no man is braver than Ross Levinsohn, at least at Web 2. First of all, he's the top North American executive at a long-besieged and currently leaderless company, and second because he has not backed out of our conversation on Day One (this coming Monday). I spoke to Ross... (Go to Searchblog Main)
Categories: Blogroll

I Just Made a City...

Searchblog - Mon, 2011-10-10 14:41
...on the Web 2 Summit "Data Frame" map. It's kind of fun to think about your company (or any company) as a compendium of various data assets. We've added a "build your own city" feature to the map, and while there are a couple bugs to fix (I'd like... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Vic Gundotra, SVP, Google (And Win Free Tix to Web 2)

Searchblog - Mon, 2011-10-10 14:03
Next up on Day 3 of Web 2 is Vic Gundotra, the man responsible for what Google CEO Larry Page calls the most exciting and important project at this company: Google+. It's been a long, long time since I've heard as varied a set of responses to any Google project... (Go to Searchblog Main)
Categories: Blogroll
Syndicate content