Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.
Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.
Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
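To make the logging requirement concrete, here is a minimal sketch of an exploring decision point. It is an illustration only, not the code of the system described here: the function names, the record fields, and the use of plain epsilon-greedy are all assumptions made for the example. The point it shows is that the probability of the chosen action is computed and stored alongside the features, action, and outcome.

```python
import random

def epsilon_greedy(scores, epsilon, rng):
    """Pick an action; return (action, probability that it was picked)."""
    n = len(scores)
    greedy = max(range(n), key=lambda a: scores[a])
    if rng.random() < epsilon:
        action = rng.randrange(n)   # explore: uniform over all actions
    else:
        action = greedy             # exploit: follow the current policy
    # The greedy action can be reached by either branch; others only by exploring.
    prob = epsilon / n + ((1.0 - epsilon) if action == greedy else 0.0)
    return action, prob

def log_decision(log, features, scores, epsilon, rng, reward_fn):
    """Make one decision and append a (features, action, prob, reward) record."""
    action, prob = epsilon_greedy(scores, epsilon, rng)
    reward = reward_fn(action)      # hypothetical callback observing the outcome
    log.append({"features": features, "action": action,
                "prob": prob, "reward": reward})
    return action

if __name__ == "__main__":
    rng = random.Random(0)
    log = []
    # Made-up scores for three candidate actions in one context.
    log_decision(log, {"hour": 9}, [0.1, 0.9, 0.3], 0.3, rng,
                 lambda a: 1.0 if a == 1 else 0.0)
    print(log)
```

Without the stored probability, later learning and evaluation cannot correct for the fact that the deployed policy chose which outcomes were observed.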
A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to action in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:
- Contextual Bandits. Many people have tried to create online learning systems that do not take into account the biasing effects of decisions. These fail near-universally. For example, they might be very good at predicting what was shown (and hence clicked on) rather than what should be shown to generate the most interest.
- Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
- Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen too many cases where the pieces are useful but the system is not.
- Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.
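The policy evaluation idea behind Multiworld Testing can be sketched with the standard inverse propensity score estimator. This is a minimal illustration under stated assumptions, not the system's implementation: the record layout, the function name `ips_value`, and the four-record log are invented for the example.

```python
def ips_value(log, policy):
    """Inverse propensity score estimate of a policy's average reward.

    Each record needs the features seen, the action taken, the probability
    with which that action was taken, and the observed reward.
    """
    total = 0.0
    for rec in log:
        # Only records where the candidate policy agrees with the logged
        # action contribute, reweighted by 1 / probability of that action.
        if policy(rec["features"]) == rec["action"]:
            total += rec["reward"] / rec["prob"]
    return total / len(log)

# A made-up exploration log, for illustration only.
log = [
    {"features": {"hour": 9},  "action": 0, "prob": 0.5,  "reward": 1.0},
    {"features": {"hour": 21}, "action": 1, "prob": 0.5,  "reward": 0.0},
    {"features": {"hour": 10}, "action": 1, "prob": 0.25, "reward": 1.0},
    {"features": {"hour": 22}, "action": 0, "prob": 0.75, "reward": 0.0},
]

def always_0(features):
    return 0

def always_1(features):
    return 1

if __name__ == "__main__":
    print(ips_value(log, always_0), ips_value(log, always_1))
```

The same log is reused for every candidate policy, which is where the efficiency gain over running one A/B test per policy comes from.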
The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything, it enables a fully automatic, causally sound learning loop for contextual control of a small number of actions. This is strongly scalable: for example, a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.
This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.
More and more Linked Data applications seem to emerge in the business world and software companies make it part of their business plan to integrate Graph Data in their data stories or in their features.
MarkLogic is promoting a new way of using enterprise databases, one that pushes past the limits of closed, rigid structures to integrate more data. Neo4j explains how you can enrich existing data and follow new connections and leads, as in the investigations of the Panama Papers.
No wonder the communities in different locations gather to share, exchange and network around topics like Linked Data. In London, a new conference is emerging exactly for this purpose: Connected Data London. The conference sets the stage for industry leaders and early adopters as well as researchers to present their use cases and stories. You can hear talks from multiple domains about how they put Linked Data to a good use: space exploration, financial crime, bioinformatics, publishing and more.
The conference will close with an interesting panel discussion about “How to build a Connected Data capability in your organization.” You can hear from the specialists how this task is approached. And immediately after acquiring the know-how you will need an easy-to-use and easy-to-integrate software to help with your Knowledge Model creation and maintenance as well as Text Mining and Concept Annotating.
In our dedicated slot we present how a Connected Data Application is born from a Knowledge Model and which are the steps to get there.
Connected Data London – London, 12th July, Holiday Inn Mayfair
Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse in real-time well over 2 million tweets. Or on average, just over 23 tweets per second! It wasn't quite so simple though, as tweet volume picked up dramatically as soon as the polls closed at 10pm and we were consistently getting around 50 tweets per second and were also being rate-limited by the Twitter API.
These are some pretty serious data volumes, as well as veracity. So how did we build the Brexit Analyser to cope?
For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance. Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.
We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.
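The rates quoted above follow from simple arithmetic, which the short sketch below reproduces. The variable names are ours, and the daily total is rounded to 2 million as in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def per_second(tweets_per_day):
    """Average tweets per second for a given daily volume."""
    return tweets_per_day / SECONDS_PER_DAY

typical_rate = per_second(500_000)      # a typical day earlier in the campaign
referendum_rate = per_second(2_000_000) # referendum day: just over 23 per second

peak_rate = 50       # tweets per second after the polls closed
pipeline_rate = 100  # tweets per second the pipeline can sustain

# How many times the observed peak fits into the pipeline's capacity.
headroom = pipeline_rate / peak_rate

if __name__ == "__main__":
    print(round(typical_rate, 1), round(referendum_rate, 1), headroom)
```

So even the post-poll peak used only about half of the pipeline's stated capacity.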
The analysis results are fed into GATE Mimir, which efficiently indexes tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. most popular hashtags on 23 Jun; most talked about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.
All my blog posts so far have been using screenshots of such interactively generated visualisations.
Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screen shot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day).
So how do I do my analyses?
First I decide what subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalized to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), mentioning a specific user/hashtag/topic, written by a specific user, containing a given hashtag or a given topic (e.g. all tweets discussing taxes).
Then, once I identify this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API. These include:
- Top X most frequently mentioned words, nouns, verbs, or noun phrases
- Top X most frequent posters/frequently mentioned tweeters
- Top X most frequent Locations, Organizations, or Persons within those tweets
- Top X themes / sub-themes according to our topic classifier
- Frequent URLs, language of the tweets, and sentiment
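The two steps above (select a subset, then count the top X items within it) can be sketched in a few lines of Python. This is not the Mimir API: the dictionary fields, the helper names `select` and `top_x`, and the sample tweets are all hypothetical and stand in for what a Mimir query and its annotations would provide.

```python
from collections import Counter

def select(tweets, **criteria):
    """Keep tweets whose fields equal all of the given criteria."""
    return [t for t in tweets
            if all(t.get(key) == value for key, value in criteria.items())]

def top_x(tweets, field, x):
    """Return the x most frequent items found in a list-valued field."""
    counts = Counter(item for t in tweets for item in t.get(field, []))
    return counts.most_common(x)

# Made-up records, for illustration only.
tweets = [
    {"kind": "original", "stance": "leave",
     "mentions": ["@vote_leave"], "hashtags": ["#voteleave"]},
    {"kind": "retweet", "stance": "leave",
     "mentions": ["@vote_leave"], "hashtags": ["#brexit"]},
    {"kind": "original", "stance": "remain",
     "mentions": ["@StrongerIn"], "hashtags": ["#strongerin"]},
    {"kind": "original", "stance": "leave",
     "mentions": ["@vote_leave", "@BBCNews"], "hashtags": ["#voteleave", "#euref"]},
]

if __name__ == "__main__":
    subset = select(tweets, kind="original", stance="leave")
    print(top_x(subset, "mentions", 2))
    print(top_x(subset, "hashtags", 2))
```

In the real system the subset is computed by the index rather than by scanning a list, which is what makes the visualisations interactive at this data volume.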
It's built using GATE Cloud Paralleliser and some clever queueing, but the takeaway message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000.
The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard.
Thanks to: Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team
Any mistakes are my own.
The Press Association, in cooperation with Twitter and Blurrt, have created the #EURef Data Hub. It shows live statistics of which of the two campaigns is more talked about, tweet volumes over time, who are the most talked about campaigners (a fixed set, separated into Leave and Remain), and what are the most popular topics in the discussion (four categories: foreign relations, economy, immigration, security).
Post-Referendum Sample Studies
MonkeyLearn posted on June 24th, a post-referendum analysis of 450,000 tweets with the #Brexit hashtag. After filtering for language (English), 250,000 were retained for analysis of the sentiment expressed within, as well as some prominent keywords. They also analysed the difference in keywords between positive and negative tweets.
Referendum Day Studies
The S-Six Social Sentiment Indexes on referendum day showed Remain tweets dominating Leave ones, with volumes and opinion staying stable.
Pre-Referendum Sample Studies
The Sensei project analysed tweets in June on the referendum and made predictions on voting outcome, based on a sample of those. They also plotted the hottest topics on June 21st (e.g. immigration, leave, remain, Scotland), a word cloud of frequent words/topics on June 21st, active authors, influential conversations, and adjectives associated with leave and remain.
Expert Systems and the University of Aberdeen analysed a sample of 55,000 tweets pre-referendum in June. They identified the key topics as jobs, immigration, security, NHS, taxes and government issues. Their sample indicated that Leave voters were more active on Twitter, and that those from England and Scotland in particular were very strongly in favour of Leave. In total, 64.75% of tweets from Britain were pro-Leave.
This is not the only study to analyse referendum day tweets, but here I present a more in-depth analysis, also based on a sample of tweets selected specifically as advocating #Leave/#Remain respectively.
#Leave / #Remain Trend Based on @Brndstr
Our real-time analysis uncovered the most popular user mentioned in posts on referendum day: @Brndstr. @Brndstr are building bots to help brands engage with their customers and also for users to turn into social ambassadors of brands they endorse.
On referendum day, they ran a campaign which encouraged people to tweet how they voted and, in return, their profile picture will change accordingly. This was not uncontroversial to some Twitter users, who took issue with the choice of the Union Jack (for Out voters) vs the EU flag (for In voters), but nevertheless, many people declared their votes in this way.
Show your support with a custom Profile Flag Filter for the #EUref - what will you vote for? #iVoted
So what did the #EUReferendum debate look like today? Is Twitter still voting #Leave as it did back in May? What were the main hashtags and user mentions in today's tweets?
Tweet Volumes
A record-breaking 1.9 million tweets were posted today on the #InOrOut #EUReferendum, which is between three and six times the daily volumes observed earlier in June. On average, this is 21 tweets per second over the day, although the peaks of activity occurred after 9am (see graphs below). 1.5 million of those tweets were posted during poll opening times. In that period, only 3,300 posts were inaccessible to us due to Twitter rate limits.
Since the polls closed at 10pm tonight, there was a huge surge in Twitter activity with over 60,000 posts between 10pm and 11pm alone. Twitter rate limits meant that we could not access another 6,000 posts from that period. Since this is only 10% of the overall data in this hour, we still have a representative sample for our analyses.
Amongst the 1.9 million posts, over 1 million (57%) were retweets and 94 thousand (5%) were replies. These proportions of retweets and replies are consistent with patterns observed earlier in June.
Tweets, Re-tweets, and Replies: #Leave or #Remain
Let's start by looking at original tweets, i.e. tweets which have been posted by their authors and are not a reply to another tweet or a retweet. I refer to the authors of those tweets as the OPs (Original Posters), following terminology adopted from online forums.
My analysis of voting intentions showed some conflicting findings, depending on the way used to sample tweets (details and trend graphs here).
The gist is that, using @brndstr and “I voted XX” patterns both gave Remain a majority over Leave, but using our voting intention classification heuristic, the opposite was true (i.e. Leave was the more likely winner).
In retweets, the #Leave proponents were more vocal in comparison to the #Remain.
The difference is particularly pronounced for replies, where #Leave proponents are engaging in more debates than #Remain ones. Nevertheless, with replies constituting only 5% of all tweets today, the echo chamber effect observed earlier in June still remains unchanged.
#InOrOut, #Leave, #Remain and Other Popular Hashtags
Interestingly, 75% of all tweets today (1.4 million) contained at least one hashtag. This is a very significant increase on the 56.5% observed several days ago.
Some of the most popular hashtags remain unchanged from earlier in June. These refer to the leave and remain campaigns, immigration, NHS, parties, media, and politicians. Interestingly, there is now increased interest in #forex and #stocks, as predictors of the likely outcome.
Most Mentioned Users Today: What is @Brndstr?
Last for tonight, I compared the most frequently mentioned Twitter users in original tweets from today (see above) against those most mentioned earlier in June. The majority of popular mentioned users remains unchanged, with a mix of campaign Twitter accounts, media, and key political leaders.
The most prominent difference is that @Brndstr (Bots for Brands) came top (mentioned in over 14 thousand tweets), followed by @YouTube with 3 thousand mentions. Other new, frequently mentioned accounts today were Avaaz, DanHannanMEP, BuzzFeedUK, and realDonaldTrump.
So What Does This Tell Us?
The #InOrOut #EUReferendum has attracted unprecedented tweet volumes on poll day, with a significantly higher proportion of hashtags than previously. This seems to suggest that Twitter users are trying to get their voices heard and spread the word far and wide, well beyond the bounds of their normal follower network.
There are some exciting new entrants in the top 30 most mentioned Twitter accounts in today's referendum posts. I will analyse these in more depth tomorrow. For now, good night!
Having looked at tweet volumes and basic characteristics of the Twitter discourse around the EU referendum, we now turn to the method we chose to identify a reliable, even if incomplete, sample of leave and remain tweets.
We have no ground truth, and we are not trying to predict whether leave or remain is leading. Instead, we are interested in identifying a reliable, if incomplete, subset of tweets, so that we can analyse the topics discussed and the active users within it.
Are Hashtags A Reliable Predictor of Leave/Remain Support?
As discussed in our earlier post, over 56% of all tweets on the referendum contain at least one hashtag. Some of these are actually indicative of support for the leave/remain campaigns, e.g. #votetoleave, #voteout, #saferin, #strongertogether. Then there are also hashtags which try to address undecided voters, e.g. #InOrOut, #undecided, while promoting either a remain or leave vote but not through explicit hashtags.
A recent study of EU referendum tweets by Ontotext, carried out over tweets in May 2016, classified tweets as leave or remain on the basis of approximately 30 hashtags. Some of those were associated with leave, the rest -- with remain, and each tweet was classified as leave or remain based on whether it contains predominantly leave or predominantly remain hashtags.
Based on analysing manually a sample of random tweets with those hashtags, we found that this strategy does not always deliver a reliable assessment, since in many cases leave hashtags are used as a reference to the leave campaign, while the tweet itself is supportive of remain or neutral. The converse is also true, i.e. remain hashtags are used to refer to the remain stance/campaign. We have included some examples below.
A more reliable, even if somewhat more restrictive, approach is to consider the last hashtag in the tweet as the most indicative of its intended stance (pro-leave or pro-remain). This results in a higher precision sample of remain/leave tweets, which we can then analyse in more depth in terms of topics discussed and opinions expressed.
Using this approach, amongst the 1.9 million tweets between June 13th and 19th, 5.5% (106 thousand) were identified as supporting the Leave campaign, while 4% (80 thousand) were identified as supporting the Remain campaign. Taken together, this constitutes just under a 10% sample, which we consider sufficient for the purposes of our analysis.
These results, albeit drawn from a smaller, high-precision sample, seem to indicate that the Leave campaign is receiving more coverage and support on Twitter, when compared to Remain. This is consistent also with the findings of the Ontotext study.
In subsequent posts we will look into the most frequently mentioned hashtags, the most active Twitter users, and the topics discussed in the Remain and Leave samples separately.
What about #Brexit in particular? The recent Ontotext study on May 2016 data used #Brexit as one of the key hashtags indicative of leave. Others have also used #Brexit in the same fashion.
In our more recent 6.5 million tweets (dated between 1 June and 19 June 2016), just under 1.7 million contain the #Brexit hashtag (26%). However, having examined a random sample of those manually (see examples below), we established that while many tweets did use #Brexit to indicate support for leave, there were also many cases where #Brexit referred to the referendum, or the leave/remain question, or the Brexit campaign as a whole. We have provided some such examples at the end of this blog post. We also found a sufficient number of examples where #Brexit appears at the end of tweets while still not indicating support for voting leave.
Therefore, we chose to distinguish the #Brexit hashtag from all other leave hashtags and tagged tweets with a final #Brexit tag separately. This enables us, in subsequent analyses, to compare findings with and without considering #Brexit.
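The last-hashtag heuristic, with a final #Brexit kept as its own label, can be sketched as follows. This is an illustration rather than our production classifier: the two hashtag sets below contain only a few tags named in these posts and are not the full lists, and the function name and the label strings are hypothetical.

```python
import re

# Illustrative subsets only; the real lists are longer.
LEAVE_TAGS = {"#votetoleave", "#voteout", "#voteleave"}
REMAIN_TAGS = {"#saferin", "#strongertogether", "#voteremain"}

HASHTAG = re.compile(r"#\w+")

def stance_from_last_hashtag(text):
    """Label a tweet by its last hashtag: leave, remain, brexit, or unknown."""
    tags = [tag.lower() for tag in HASHTAG.findall(text)]
    if not tags:
        return "unknown"
    last = tags[-1]
    if last == "#brexit":
        return "brexit"      # kept separate, since #Brexit is often neutral
    if last in LEAVE_TAGS:
        return "leave"
    if last in REMAIN_TAGS:
        return "remain"
    return "unknown"         # e.g. #EURef or #InOrOut at the end

if __name__ == "__main__":
    # Made-up example tweets.
    for text in ["Time to take back control #EUref #VoteLeave",
                 "The #VoteLeave claims do not add up #StrongerTogether",
                 "Remaining polls scheduled this week #Brexit",
                 "Who is watching the debate tonight? #InOrOut"]:
        print(stance_from_last_hashtag(text), "|", text)
```

The second example shows why position matters: a leave hashtag appears in the tweet, but only as a reference to the campaign, and the final hashtag carries the stance.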
Example Remain/Leave Hashtag Use
It doesnt matter who some of the dodgy leaders of #Remain and #Brexit are, they each only have ONE VOTE, like all of us public #EURef— Marcus Storm (@MarcsandSparks) 20 June 2016
Perfect question! "Why is #brexit ahead, despite all the experts supporting #remain?" #questiontime— Steve Parrott (@steveparrott50) 19 June 2016
Could the last decent politician (of any party) to leave the #Leave camp please turn off the lights.....#Bremain pic.twitter.com/zQjjoIXcyO— Dr Hamed Khan (@drhamedkhan) 19 June 2016
Today's @thesundaytimes #focus articles on #brexit say it all. #remain is forward-looking, #leave backward— Patrick White (@pbpwhite) 20 June 2016
Example Brexit Tweets
#Brexit probability declines as campaigns remain quiet https://t.co/qrAhURvRDk via @RJ_FXandRates pic.twitter.com/UnNV1NDnZv— Bloomberg London (@LondonBC) 17 June 2016
#VoteRemain #VoteLeave #InOrOut #EURef #StrongerIn -- Is #Brexit The End Of The World As We Know It? via @forbes https://t.co/lQ6Xgf0oEW— Jolly Roger (@EUGrassroots) 17 June 2016
Remaining #Brexit Polls scheduled releases pic.twitter.com/DKzBqjoGcs— Nicola Duke (@NicTrades) 17 June 2016
Blame austerity—not immigration—for bringing Britain to ‘breaking point’https://t.co/f3oKODbLSe#Brexit #EUref pic.twitter.com/lLJHOsUO7J— The Conversation (@ConversationUK) June 20, 2016
BREAK World's biggest carmaker #Ford tells staff of "deep concerns abt "uncertainty/potential downsides" of #Brexit pic.twitter.com/bYQ3LyIA6i— Beth Rigby (@BethRigby) June 20, 2016
Following up on an interesting suggestion, we are creating a “Birds of a Feather Unworkshop” with a leftover room (Duffy/Columbia) on Thursday and Friday during the workshops. People interested in ad-hoc topics can post a time and place to meet and discuss. Details are here a little ways down.
The work follows on from our successful collaboration with Nesta on the Political Futures Tracker, which analysed tweets in real-time in the run up to the UK General Election in 2015.
Unlike others, we do not try to predict the outcome of the referendum or answer the question of whether Twitter can be used as a substitute for opinion polls. Instead, our focus is on a more in-depth analysis of the referendum debate; the people and organisations who engage in those debates; what topics are discussed and opinion expressed, and who the top influencers are.
What does it do?
It analyses and indexes tweets as they come in (i.e. in real time), in order to identify commonly discussed topics, opinions expressed, and whether a tweet is expressing support for remaining or leaving the EU. It must be noted that not all tweets have a clear stance and also that not all tweets express a clear voting intention (e.g. "Brexit & Bremain"). More on this in subsequent posts!
In more detail, the Brexit Analyser uses text analytics and opinion mining techniques from GATE, in order to identify tweets expressing voting intentions, the topics discussed within, and the sentiment expressed towards these topics. Watch this space!
The Data (So Far)
We are collecting tweets based on a number of referendum related hashtags and keywords, such as #voteremain, #voteleave, #brexit, #eureferendum.
The volume of original tweets, replies, and re-tweets per day collected so far is shown below. On average, this is close to half a million tweets per day (480 thousand), which is 1.6 times the tweets on 26 March 2015 (300,000), when the Battle For Number 10 interviews took place, in the run up to the May 2015 General Elections.
In total, we have analysed just over 1.9 million tweets in the past 4 days, with 60% of those being re-tweets. On average, a tweet is re-tweeted 1.65 times.
Subsequent posts will examine the distribution of original tweets, re-tweets, and replies specifically in tweets expressing a remain/leave voting intention.
Hashtags: 1 million of those 1.9 million tweets contain at least one hashtag (i.e. 56.5% of all tweets have hashtags). If only original tweets are considered (i.e. all replies and retweets are excluded), then there are 319 thousand tweets with hashtags amongst the original 678 thousand tweets (i.e. 47% of original tweets are hashtag bearing).
Analysing hashtags used in a Twitter debate is interesting, because they indicate commonly discussed topics, stance taken towards the referendum, and also key influencers. As they are easy to search for, hashtags help Twitter users participate in online debates, including other users they are not directly connected to.
Below we show some common hashtags on June 16, 2016. As can be seen, most are associated directly with the referendum and voting intentions, while others refer to politicians, parties, media, places, and events:
URLs: Interestingly, amongst the 1.9 million tweets only 134 thousand contain a URL (i.e. only 7%). Amongst the 1.1 million re-tweets, 11% contain a URL, which indicates that tweets with URLs tend to be retweeted more.
These low percentages suggest that the majority of tweets on the EU referendum are expressing opinions or addressing another user, rather than sharing information or providing external evidence.
@Mentions: Indeed, 90 thousand (13%) of the original 678 thousand tweets contain a username mention. The 50 most mentioned users in those tweets are shown below. The size of the user name indicates frequency, i.e. the larger the text, the more frequently this username has been mentioned in tweets.
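The percentages for original tweets with hashtags, tweets with URLs, and original tweets with mentions can be re-derived from the rounded counts given in the text. The sketch below does only that; the helper name `share` is ours.

```python
def share(part, whole):
    """Percentage of `whole` represented by `part`."""
    return 100.0 * part / whole

# Rounded counts as reported in the text.
ALL_TWEETS = 1_900_000
ORIGINAL_TWEETS = 678_000

original_with_hashtags = share(319_000, ORIGINAL_TWEETS)  # about 47%
tweets_with_url = share(134_000, ALL_TWEETS)              # about 7%
original_with_mention = share(90_000, ORIGINAL_TWEETS)    # about 13%

if __name__ == "__main__":
    for label, value in [("original tweets with hashtags", original_with_hashtags),
                         ("tweets with a URL", tweets_with_url),
                         ("original tweets with a mention", original_with_mention)]:
        print(f"{label}: {value:.1f}%")
```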
In subsequent posts we will provide information on the most frequently re-tweeted users and the most prolific Twitter users in the dataset.
So What Does This Tell Us?
Without a doubt, there is a heavy volume of tweets on the EU referendum, published daily. However, with only 6.8% of all tweets being replies and over 58% being re-tweets, this resembles an echo chamber more than a debate.
Pointers to external evidence/sources via URLs are scarce, as are user mentions. The most frequently mentioned users are predominantly media (e.g., BBC, Reuters, FT, the Sun, Huffington Post); politicians playing a prominent role in the campaign (e.g. David Cameron, Boris Johnson, Nigel Farage, Jeremy Corbyn); and campaign accounts created especially for the referendum (e.g. @StrongerIn, @Vote_Leave).
The space problem started long ago.
At ICML last year and the year before, the capacity needed to fit everyone on any single day was about 1500. My advice was to expect 2000 and have capacity for 2500 because “New York” and “Machine Learning”. Was history right? Or New York and buzz?
I was not involved in the venue negotiations, but my understanding is that they were difficult, with liabilities over $1M for IMLS the nonprofit which oversees ICML year to year. The result was a conference plan with a maximum capacity of 1800 for the main conference, a bit less for workshops, and perhaps 1000 for tutorials.
Then the NIPS registration numbers came in: 3900 last winter. It’s important to understand here that a registration is not a person since not everyone registers for the entire event. Nevertheless, NIPS was very large with perhaps 3K people attending at any one time. Historically, NIPS is the conference most similar to ICML with a history of NIPS being a bit larger. Most people I know treat these conferences as indistinguishable other than timing: ICML in the summer and NIPS in the winter.
Given this, I had to revise my estimate up: We should really have capacity for 3000, not 2500. It also convinced everyone that we needed to negotiate for more space with the Marriott. This again took quite a while, with the result being a modest increase in capacity for the conference (to 2100) and the workshops, but nothing for the tutorials.
The situation with tutorials looked terrible while the situation with workshops looked poor. Acquiring more space at the Marriott looked near impossible. Tutorials require a large room, so we looked into the Kimmel Center at NYU, acquiring a large room and increasing capacity to 1450 for the tutorials. We also looked into additional rooms for workshops, finding one at Columbia and another at the Microsoft Technology Center, which has a large public use room 2 blocks from the Marriott. Other leads did not pan out.
This allowed us to cover capacity through early registration (May 7th). Based on typical early vs. late registration distributions I was expecting registrations might need to close a bit early similar to what happened with KDD in 2014.
Then things blew up. Tutorial registration reached capacity the week of May 23rd, and then all registration stopped May 28th, 3 weeks before the conference. Aside from simply failing to meet demand this also creates lots of problems. What do you do with authors? And when I looked into things in detail for workshops I realized we were badly oversubscribed for some workshops. It’s always difficult to guess which distribution of room sizes is needed to support the spectrum of workshop interests in advance so there were serious problems. What could we do?
The first step was tutorial and main conference registration which reopened last Tuesday using some format changes which allowed us to increase capacity further. We will use simulcast to extra rooms to support larger audiences for tutorials and plenary talks allowing us to up the limit for tutorials to 1590 and for the main conference to 2400. We’ve also shifted the poster session to run in parallel with main tracks rather than in the evening. Now, every paper will have 3-4 designated hours during the day (ending at 7pm) for authors to talk to people individually. As a side benefit, this will also avoid the competition between posters and company-sponsored parties which have become common. We’ll see how this works as a format, but it was unavoidable here: even without increasing registration the existing evening poster session plan was a space disaster.
The workshop situation was much more difficult. I walked all over the nearby area on Wednesday, finding various spaces and getting quotes. I also realized that the largest room at the Crowne Plaza could help with our tutorials: it was both bigger and much closer than NYU. On Thursday, we got contract offers from the promising venues and debated into the evening. On Friday morning at 6am the Marriott suddenly gave us a bunch of additional space for the workshops. Looking through things, it was enough to shift us from ‘oversubscribed’ to ‘crowded’ with little capacity to register more given natural interests. We developed a new plan on the fly, changed contracts, negotiated prices down, and signed Friday afternoon.
The local chairs (Marek Petrik and Peder Olsen) and Mary Ellen were working hard with me through this process. Disruptive venue changes 3 weeks before the conference are obviously not the recommended way of doing things:-) And yet it seems to be working out now, much better than I expected last weekend. Here’s the situation:
- Tutorials ~1600 registered with capacity for 1850. I expect this to run out of capacity, but it will take a little while. I don’t see a good way to increase capacity further.
- The main conference has ~2200 registered with capacity for 2400. Maybe this can be increased a little bit, but it is quite possible the main conference will run out of capacity as well. If it does, only authors will be allowed to register.
- Workshops ~1900 registered with capacity for 3000. Only the Deep Learning workshop requires a simulcast. It seems very unlikely that we’ll run out of capacity so this should be the least crowded part of the conference. We even have some left-over little rooms (capacity for 125 or less) that are looking for a creative use if you have one.
In this particular case, “New York” was both part of the problem and much of the solution. Where else can you walk around and find large rooms on short notice within 3 short blocks? That won’t generally be true in the future, so we need to think carefully about how to estimate attendance.
My greatest concern with the many machine learning conferences in New York this year was the relatively high cost that implied, particularly for hotel rooms in Manhattan. Keeping the conference affordable for graduate students seems critical to what ICML is really about.
The price becomes much more reasonable if you can find roommates to share the cost. For example, the conference hotel can have 3 beds in a room.
This still leaves a coordination problem: How do you find plausible roommates? If only there was a website where the participants in a conference could look for roommates. Oh wait, there is. Conferenceshare.co is something new which might measurably address the cost problem. Obviously, you’ll want to consider roommate possibilities carefully, but now at least there is a place to meet.
Note that the early registration deadline for ICML is May 7th.
I’m doing a Quora Session today that may be of interest. I’m impressed with both the quality and quantity of questions.
Here. I would recommend registering early because there is a difficult to estimate(*) chance you will not be able to register later.
These numbers are as aggressively low as the local chairs and I can sleep with at night. The prices are higher than I’d like (New York is expensive), but a bit lower than last year, particularly for students(***).
(*) Relevant facts:
- ICML 2016: submissions up 30% to 1300.
- NIPS 2015 in Montreal: 3900 registrations (way up from last year).
- NIPS 2016 is in Barcelona.
- ICML 2015 in Lille: 1670 registrations.
- KDD 2014 in NYC: closed at 3000 registrations 1 week before the conference.
I tried to figure out how to set up a prediction market to estimate what will happen this year, but didn’t find an easy-enough way to do that.
(**) I kind of wish we could make up the titles. How about: “Go is Too Easy” and “My Neural Network is Deeper than Yours”?
(***) Sponsors are very generous and are mostly giving to defray student costs. Approximately every dollar of the difference between Regular and Student registration is due to company donations. For students, also note that there will be some scholarship opportunities to defray costs coming out soon.
In recent years, Nature has adopted linked data technologies on a broader scale. Andreas Blumauer was intrigued to discover more about the strategy and technologies behind it. He had the opportunity to talk with Michele Pasin and Tony Hammond, who are the architects of Nature’s data publishing portal.
Semantic Puzzle: Nature’s data publishing portal is one of the most renowned ones in the linked data community. Could you talk a bit about its history? Why was this project initiated and who have been the brains behind it since then?
Michele Pasin: We have been involved with semantic technologies at Macmillan since 2010. At the time it was primarily my colleague Tony Hammond who saw the potential of these technologies for metadata management and data sharing. Tony set up the data.nature.com portal in April 2012 (and expanded in July 2012), in the context of a broader company initiative aimed at moving towards a ‘digital first’ publication workflow.
The data.nature.com platform was essentially a public RDF output of some of the metadata embedded in our XML articles archive. This included a SPARQL endpoint for data about articles published by NPG from 1845 through to the present day. Additionally the datasets include NPG product and subject ontologies. These datasets are available under a Creative Commons Zero waiver.
The data.nature.com platform was only for external use though, so it was essentially detached from the products end users would see on nature.com. Still, it allowed us to develop a better understanding of how to make use of these tools within our existing technology stack. It is important to remember that over the years the company has invested a considerable amount of resources in an XML-centered architecture, so finding a solution that could leverage the legacy infrastructure alongside these new technologies has always been a fundamental requirement for us.
More recently, in 2013 we started working on a new hybrid linked data platform, this time with a much stronger focus on supporting our internal applications. That’s pretty much around the time I joined the company. In essence, we made the point that in order to achieve stronger interoperability levels within our systems we had to create an architecture where RDF is core to the publishing workflow as much as XML is. (By the way if you are interested in the details of this, we presented a paper about this at ISWC 2014.) As part of this phase, we also built a more sophisticated set of ontologies used for encoding the semantics of our data, together with improved versions of the datasets previously released.
The nature.com ontologies portal came out in early 2015 as the result of this second phase of work. On the portal one can find extensive documentation about all of our models, as well as periodical downloads in various RDF formats. The idea is to make it easier for people – both within the enterprise and externally – to access, understand and reuse our linked data.
At the same time, since the level of user engagement on data.nature.com was not as good as expected, we decided to terminate that service. We plan to keep releasing periodic snapshots of the datasets and the ontologies we are using, but not to offer a public endpoint in the immediate future.
Semantic Puzzle: As one of your visions you’re stating that your “primary reason for adopting linked data technologies is quite simply better metadata management”. How did you deal with metadata before you started with this transition? What has changed since then, also from a business point of view?
Michele Pasin: Our pre–linked data approach to dealing with metadata and enterprise taxonomies is probably not unheard of, especially within similar sized companies: a vast array of custom-made solutions, varying from simple Word documents sitting on someone’s computer, to Excel spreadsheets or, in the best of cases, database tables in one of our production systems. Of course, there were also a number of ad-hoc applications/scripts responsible for reading and updating these metadata sources, as often they would be critical to one or more systems in the publishing workflow (e.g. think of the journals master list, or the list of approved article-types).
It is worth stressing that the lack of a unified technical infrastructure was a key problem, of course, but not the only one. In fact I would argue that addressing the lack of a centralized data governance approach was even more crucial. For example, most often you would not know who/which department was in charge of a particular controlled vocabulary or metadata specification. In some cases, no single source of truth was actually available, because different people/groups were in charge of specific aspects of a single specification (due to their differing interests).
Hence you need a certain amount of management buy-in to implement such a wide-ranging approach to metadata; moving to a single platform and technical solution based on linked data was fundamental, but an equally fundamental organizational change was also needed. Even more so, if one considers that this is not a time-boxed project but rather an ongoing process, an approach which pays off only as much as you can guarantee that as new products and services get launched, they all subscribe to the same metadata management ‘philosophy’.
Semantic Puzzle: One of the promises of Linked Data is that by “using a common data model and a common naming architecture, users can begin to realize the benefits and efficiencies of web scaling.” Could you describe a bit more in detail into which eco-system your content workflows and publishing processes are embedded (internally and externally) and why the use of standards is important for this?
Tony Hammond: We operate with an XML-based workflow for documents where we receive XML from our suppliers and store that within an XML database (MarkLogic). Increasingly we are beginning to move towards a dynamic publishing solution from that database. We are also using the database to provide a full-text search across all our content. In the past we had various workflows and a small number of different DTDs to reconcile, although we are currently converging on a single DTD. To facilitate search across this mixed XML content we abstracted certain key metadata elements into a common header. This was managed organically and was somewhat unpredictable both in terms of content model and naming.
By moving to a linked data solution for managing our metadata which is based on a single, core ontology we bypass our normalized metadata header and start to build on a new simpler data model (triples) with a common naming architecture. In effect, we have moved from a nominally normalized metadata to a super-normalized metadata which uses web standards for data (URI, RDF, OWL).
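As a rough illustration of what moving from an organically grown header to triples with a common naming architecture can look like, here is a minimal Python sketch. The namespaces, field names and the `CANONICAL` mapping are all hypothetical and stand in for a real core ontology; this is not Nature's actual model.

```python
# Hypothetical namespaces, for illustration only (not Nature's real URIs).
TERMS = "http://example.org/terms/"
ARTICLES = "http://example.org/articles/"

# A shared naming map plays the role of the single core ontology:
# every local field name resolves to one canonical term.
CANONICAL = {"jtitle": "journalTitle", "pubyear": "publicationYear"}


def normalize(fields):
    """Rename locally named header fields to their canonical names."""
    return {CANONICAL.get(name, name): value for name, value in fields.items()}


def record_to_triples(record_id, fields):
    """Turn one flat metadata record into (subject, predicate, object) triples."""
    subject = ARTICLES + record_id
    return {(subject, TERMS + name, value) for name, value in fields.items()}


# Two headers describing the same article with different local naming.
legacy_header = {"jtitle": "Nature", "pubyear": "2015"}
modern_header = {"journalTitle": "Nature", "publicationYear": "2015"}

triples_a = record_to_triples("a1", normalize(legacy_header))
triples_b = record_to_triples("a1", normalize(modern_header))

# Once names are global, the two descriptions are literally the same data.
print(triples_a == triples_b)
```

The point of the sketch is only that, once subjects and predicates are global names, records from different workflows stop needing a reconciliation step.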
Semantic Puzzle: Your content also includes multimedia (image, video, …). How do you embed this non-textual content into your linked data ecosystem? Which gateways, tools and connectors are used to bridge your linked data environment with multimedia?
Tony Hammond: Some years ago we embarked on a new initiative internally to streamline our production workflows. Our brief was to support a distributed content warehouse where digital assets would be stored in various locations. The idea was to abstract out our storage concerns and to maintain pointers to the various storage subsystems along with other physical characteristics required for accessing that storage.
In practice our main content was housed as XML documents within a MarkLogic XML database and associated media assets (e.g. images) were primarily stored on the filesystem with some secondary asset types (e.g. videos) being sourced from cloud services.
To relate a physical asset (e.g. an XML document, or a JPEG file) to the underlying concept (e.g. an article, or an image) we made use of XMP packets (a technology developed by Adobe Systems and standardized through ISO) which, as simple RDF/XML descriptions, allowed us to capture metadata about physical characteristics and to relate those properties to our data model. An XMP packet is a description of one physical resource and could be simply linked to the related conceptual resource.
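To make the idea of a packet describing one physical resource and pointing at its conceptual resource more concrete, here is a small Python sketch that reads a minimal XMP-style RDF/XML packet with the standard library. The `ex:represents` property, the `example.org` namespace and the file path are assumptions made up for this example; only the `x:xmpmeta` wrapper, the RDF namespace and `dc:format` are standard.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
EX = "http://example.org/terms/"  # hypothetical namespace for the link property

# A minimal packet: one rdf:Description about one physical file.
PACKET = """<x:xmpmeta xmlns:x="adobe:ns:meta/">
  <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
           xmlns:dc="http://purl.org/dc/elements/1.1/"
           xmlns:ex="http://example.org/terms/">
    <rdf:Description rdf:about="file:///assets/fig1.jpg">
      <dc:format>image/jpeg</dc:format>
      <ex:represents rdf:resource="http://example.org/images/fig1"/>
    </rdf:Description>
  </rdf:RDF>
</x:xmpmeta>"""


def read_packet(xml_text):
    """Extract the physical asset, its format, and the concept it represents."""
    root = ET.fromstring(xml_text)
    description = root.find(f".//{{{RDF}}}Description")
    link = description.find(f"{{{EX}}}represents")
    return {
        "asset": description.get(f"{{{RDF}}}about"),
        "format": description.findtext(f"{{{DC}}}format"),
        "concept": link.get(f"{{{RDF}}}resource"),
    }


info = read_packet(PACKET)
print(info["asset"], "represents", info["concept"])
```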
We started this project with an RDF triplestore for maintaining and querying our metadata, but over time we moved towards a hybrid technology where our semantic descriptions were buried within XML documents as RDF/XML descriptions and could be queried within an XML context using XQuery to deliver a highly performant JSON API. These semantic descriptions enclosed minimal XMP documents which described the storage entities.
Semantic Puzzle: Nature links its datasets to external ones, e.g. to DBpedia or MeSH. Who exactly is benefiting from this and how?
Michele Pasin: I would say that there are at least two reasons why we did this. First, we wanted to maximize the potential reuse of our datasets and models within the semantic web. Building owl:sameAs relationships to other vocabularies, or marking up our ontology classes and properties with subclass/subproperty relationships pointing to external vocabularies is a way to be good ‘linked data citizens’. Moreover, this is a deliberate attempt to counterbalance one of our key design principles: minimal commitment to external vocabularies. This approach to data modeling means that we tend to create our own models and define them within our own namespaces, rather than building production-level software against third party ontologies. It is worth pointing out that this is not because we think our ontologies are better – but because we want our data architecture to reflect as closely as possible the ontological commitment of a publishing enterprise with decades of established business practices, naming conventions etc. In other words, we aimed at creating a very cohesive and robust domain model, one which is resilient to external changes but that also supports semantic interoperability by providing a number of links and mappings to other semantic web standards.
The second reason for creating these links is to enable more innovative discovery services. For example, a nature.com subject page about photosynthesis could surface encyclopedic materials automatically retrieved from DBpedia; or it could provide links to highly cited articles retrieved from PubMed using MeSH mappings. This just scratches the surface of what one could do. The real difficulty is how to do it in such a way that the overall user experience improves, rather than adding to the information overload the majority of internet users already have to deal with. So at the moment, while the data people (us) are focusing on building a rich network of entities for our knowledge graph, the UX and front end teams are exploring design and interaction models that truly take advantage of these functionalities. Hopefully we see these activities continue to converge!
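The owl:sameAs links described in this answer can be pictured with a few lines of Python. This is only a sketch: the local subject URI is hypothetical, and the lookup simply lists equivalent external resources that a discovery service could then dereference.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
SKOS_PREF_LABEL = "http://www.w3.org/2004/02/skos/core#prefLabel"

# Hypothetical local subject URI; the DBpedia URI follows DBpedia's
# usual http://dbpedia.org/resource/<Name> pattern.
LOCAL_SUBJECT = "http://example.org/subjects/photosynthesis"

triples = [
    (LOCAL_SUBJECT, SKOS_PREF_LABEL, "Photosynthesis"),
    (LOCAL_SUBJECT, OWL_SAME_AS, "http://dbpedia.org/resource/Photosynthesis"),
]


def external_equivalents(subject, graph):
    """Return the external URIs declared equivalent to a local subject."""
    return sorted(o for s, p, o in graph if s == subject and p == OWL_SAME_AS)


# A subject page could use these URIs to fetch supplementary material.
print(external_equivalents(LOCAL_SUBJECT, triples))
```

Keeping the local model in its own namespace and adding the mapping as a separate triple is what lets the domain model stay stable even if the external vocabulary changes.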
Semantic Puzzle: How do you deal with data quality management in general, and how can linked data technologies help to improve it?
Tony Hammond: We can distinguish between two main types of data: documents and ontologies. (And by ontologies we also mean thesauri and taxonomies.) Our documents are created by our suppliers using XML and are amenable to some data validations. We use automated DTD validation in our new workflow and manual DTD validation in the older workflows. We also use Schematron rulesets to validate certain data points but these address only certain elements. We have a couple hundred Schematron rules which implement various business rules and are also synchronized with our ontologies.
Our ontologies, on the other hand, are by their nature more curated datasets. These are mastered as RDF Turtle files and stored within GitHub. These are currently maintained by hand, although we are beginning now to transition some of our taxonomies to the PoolParty taxonomy manager. We have a build process for deploying the ontologies to our XML database where they are combined with our XML documents. During this build process we both validate the RDF and run SPIN rules over the datasets, which can validate data elements as well as expand the dataset with new triples from rules-based inferencing.
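The combination of rules-based inferencing and validation can be illustrated with a toy forward-chaining pass in Python. This is not SPIN and not the build process described above; it is a stand-in using one subclass rule and one made-up constraint (every `ex:Document` needs an `ex:title`), with hypothetical `ex:` names throughout.

```python
RDF_TYPE = "rdf:type"
SUBCLASS_OF = "rdfs:subClassOf"


def infer_types(triples):
    """Add (x rdf:type D) whenever x has type C and C is a subclass of D."""
    graph = set(triples)
    while True:
        subclass = {(s, o) for s, p, o in graph if p == SUBCLASS_OF}
        new = {
            (s, RDF_TYPE, parent)
            for s, p, o in graph if p == RDF_TYPE
            for child, parent in subclass if o == child
        } - graph
        if not new:  # fixpoint reached: nothing left to infer
            return graph
        graph |= new


def missing_titles(graph, cls="ex:Document"):
    """Validation rule: list instances of cls that have no ex:title."""
    typed = {s for s, p, o in graph if p == RDF_TYPE and o == cls}
    titled = {s for s, p, o in graph if p == "ex:title"}
    return sorted(typed - titled)


data = {
    ("ex:Review", SUBCLASS_OF, "ex:Article"),
    ("ex:Article", SUBCLASS_OF, "ex:Document"),
    ("ex:a1", RDF_TYPE, "ex:Article"),
    ("ex:a1", "ex:title", "An example article"),
    ("ex:r1", RDF_TYPE, "ex:Review"),
}

expanded = infer_types(data)
print(missing_titles(data))      # nothing is explicitly typed ex:Document yet
print(missing_titles(expanded))  # after inference the untitled review is caught
```

Running validation after inference matters here: the untitled review only becomes visible to the rule once the inferred type triples have been added.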
Semantic Puzzle: For a publisher like Nature it is somehow “natural” that Linked Data is used. How could other industries make use of these principles for information management?
Tony Hammond: The main reason for using linked data is not to do with publishing the data (and indeed many other data models are generally used for data publishing), but with the desire to join one dataset with other datasets – or rather, the data within a dataset to the data within other datasets. It is for this reason that we make use of URIs as common (global) names for data points. Linking data is not just a goal in publishing data but applies equally when consuming data from various sources and integrating over those data sources within an internal environment. Indeed, arguably, the biggest use case for linked data is within private enterprises rather than surfaced on the open web. Once that point is appreciated there is no restriction on any industry in being more disposed to using linked data than any other, and it is used as a means to maximize the data surface that a company operates over.
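The role of URIs as global names when joining datasets can be sketched in a few lines of Python. The two datasets, their property names and their values below are invented purely for illustration; the one real fact relied on is that merging RDF graphs is a set union of their triples.

```python
# Hypothetical URI shared by two independently produced datasets.
ARTICLE = "http://example.org/articles/a1"

# Dataset from one internal system (made-up properties and values).
editorial = {
    (ARTICLE, "http://example.org/terms/title", "An example article"),
}

# Dataset from another system that happens to use the same global name.
subjects = {
    (ARTICLE, "http://example.org/terms/hasSubject",
     "http://example.org/subjects/photosynthesis"),
}


def describe(uri, graph):
    """Collect every property known about one resource."""
    return {p: o for s, p, o in graph if s == uri}


# Because both sides name the article identically, joining is a set union.
merged = editorial | subjects
print(describe(ARTICLE, merged))
```

No key mapping or schema alignment is needed for the join itself, which is the practical payoff of agreeing on names up front.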
Semantic Puzzle: Where are the limits of Linked Data from your perspective, and do you believe they will ever be exceeded?
Tony Hammond: The limits to using linked data are more to do with top-down vs bottom-up approaches in dealing with data, i.e. linked data vs big data, or data curation vs data crunching. Linked data makes use of global names (URIs), schemas, ontologies. It is highly structured, organized data.
Now, whether it is feasible to bring this level of organization to data at large or whether data crunching will provide the appropriate insights over the data is an open question. Our expectation is that we will still need to use ontologies – and hence linked data – as an organizing principle, or reference, to guide us in processing large datasets and for sharing those data organizations. The question may be how much human curation is required in assembling these ontologies.
Michele Pasin: On a more practical level, I’d say that the biggest problem with linked data is still its rather limited adoption on a large scale. I’m referring in particular to the data publishing and reuse aspect. On this front, we really struggled to get the levels of uptake the business was expecting from us. Consider this: we have been publishing metadata for our entire archive since 2012 (approx. 1.2m documents, resulting in almost half a billion triples). However very few people made use of these data, either in the form of bulk downloads or via the SPARQL API we once hosted (and that was then retired due to low usage). This is in stark contrast with other – arguably less flexible – services we make available, e.g. the OpenSearch APIs, or a JSON REST service, which often see significant traffic.
Last year we gave a paper at the Linked Science workshop (affiliated with ISWC 2015) with the specific intent to address the problem within that community. What seemed to emerge is that possibly this has to do with the same reason why this technology has been so useful to us. RDF is an extremely flexible and powerful model, however, when it comes to data consumption and access, the average user cares more about simplicity than flexibility. Also, outside linked data circles we all know that the standard tech for APIs is JSON and REST, rather than RDF and SPARQL.
The good news though is that we are seeing more initiatives aimed at bridging these two worlds. One that we are keeping an eye on, for example, is JSON-LD. The way this format hides various RDF complexities behind a familiar JSON structure makes it an ideal candidate for a linked data publishing product with a much wider user base. Which is exactly what we are looking for: lowering the bar to the adoption of semantic tech.
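To show how JSON-LD keeps a familiar JSON shape while still carrying RDF semantics, here is a deliberately simplified Python sketch. It handles only a flat document whose `@context` maps each key directly to an IRI; a real JSON-LD processor implements a much fuller expansion algorithm. The article URI and values are hypothetical, while the two Dublin Core term IRIs are real.

```python
import json

# To a JSON consumer this is an ordinary object with "title" and "creator".
doc = json.loads("""{
  "@context": {
    "title": "http://purl.org/dc/terms/title",
    "creator": "http://purl.org/dc/terms/creator"
  },
  "@id": "http://example.org/articles/a1",
  "title": "An example article",
  "creator": "A. Author"
}""")


def to_triples(document):
    """Expand a flat JSON-LD document into triples using its @context."""
    context = document["@context"]
    subject = document["@id"]
    return sorted(
        (subject, context[key], value)
        for key, value in document.items()
        if not key.startswith("@")  # skip JSON-LD keywords
    )


# A plain JSON client can read doc["title"]; a linked data client gets triples.
print(doc["title"])
for triple in to_triples(doc):
    print(triple)
```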
About Michele Pasin
Michele Pasin is an information architect and product manager with a focus on enterprise metadata management and semantic technologies.
Michele currently works for Springer Nature, a publishing company resulting from the May 2015 merger of Springer Science+Business Media and Holtzbrinck Publishing Group’s Nature Publishing Group, Palgrave Macmillan, and Macmillan Education.
He has recently taken up the role of product manager for the knowledge graph project, an initiative whose goal is to bring together various preexisting linked data repositories, plus a number of other structured and unstructured data sources, into a unified, highly integrated knowledge discovery platform. Before that, he worked on projects like nature.com’s subject pages (a dynamic section of the website that allows users to navigate content by topic) and the nature.com ontologies portal (a public repository of linked open data).
He holds a PhD in semantic web technologies from the Knowledge Media Institute (The Open University, UK) and advanced degrees in logic and philosophy of language from the University of Venice (Italy). Previously, he was a research associate at King’s College Department of Digital Humanities (London), where he worked on a number of cultural informatics projects such as the People of Medieval Scotland and the Art of Making in Antiquity. Online Portfolio: http://www.michelepasin.org/projects/.
Michele Pasin will give a keynote at this year’s SEMANTiCS conference.
About Tony Hammond
Tony Hammond is a data architect with a primary focus in the general area of machine-readable description technologies. He has been actively involved in developing industry standards for network identifiers and metadata frameworks. He has had experience working on both sides of the scientific publishing information chain, from international research centres to leading publishing houses. His background is in physics with astrophysics.
Tony currently works for Springer Nature, a publishing company resulting from the May 2015 merger of Springer Science+Business Media and Holtzbrinck Publishing Group’s Nature Publishing Group, Palgrave Macmillan, and Macmillan Education.
At the annual SEMANTiCS Conference, experts from academia and industry meet to discuss semantic computing, its benefits and future business implications. Since 2005, SEMANTiCS has been attracting the opinion leaders in semantic web and big data technology, ranging from information managers and software engineers, to commerce experts and business developers as well as researchers and IT architects, when it comes to defining the future of information technology.
SEMANTiCS 2016 takes place from September 12th to 15th at the second oldest university in Germany, Leipzig University. Leipzig University hosts several departments focused on Linked Data and the Semantic Web, AKSW in particular, and is therefore THE European hotspot when it comes to graph-based technologies and knowledge engineering.
Do you want to be part of the SEMANTiCS Conference and get in touch with the following audiences?
- IT professionals & IT architects
- Software developers
- Knowledge Management Executives
- Innovation Executives
- R&D Executives
Calls are open now. Industrial presentations offer a platform to reach a huge network of practitioners and users and to get feedback, and academic submissions are published in the well-known ACM-ICPS series (deadline 21st April, 23% acceptance rate). To submit your contribution, please visit the calls section on our website. To attend the workshops, the tutorials or to enjoy the talks in one of the offered sessions, please visit our registration site.
You want to partner with SEMANTiCS 2016? Then get a sponsor package or become an exhibitor! For more details, please click here.