Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

ICML 2016 videos and statistics

Machine Learning Blog - Fri, 2016-08-26 16:04

The ICML 2016 videos are out.

I also wanted to share some statistics from registration that might be of general interest.

The total number of people attending: 3103.

Industry: 47% University: 46%

Male: 83% Female: 14%

Local (NY, NJ, or CT): 27%

North America: 70% Europe: 18% Asia: 9% Middle East: 2% Remainder: <1% including 2 from Antarctica

Categories: Blogroll

Attend and contribute to the SEMANTiCS 2016 in Leipzig

Semantic Web Company - Tue, 2016-08-16 11:58

The 12th edition of SEMANTiCS, a well-known platform for professionals and researchers who make semantic computing work, will be held in the city of Leipzig from September 12th to 15th. We are proud to announce the final program of the SEMANTiCS conference. The program will cover 6 keynote speakers, 40 industry presentations, 30 scientific paper presentations, 40 poster & demo presentations and a huge number of satellite events. Special talks will be given by Thomas Vavra from IDC and Sören Auer, who will feature the LEDS track. On top of that there will be a fishbowl session, ‘Knowledge Graphs – A Status Update’, with lightning talks from Hans Uszkoreit (DFKI) and Andreas Blumenauer (SWC).

This week, the set of our distinguished keynote speakers has been fixed and we are quite excited to have them at this year's edition of SEMANTiCS. Please join us to listen to talks from representatives of IBM, Siemens, Springer Nature, Wikidata, International Data Corporation (IDC), Fraunhofer IAIS, Oxford University Press and the Hasso-Plattner-Institut, who will share their latest insights on applications of Semantic technologies with us.

To register and be part of SEMANTiCS 2016 in Leipzig, please go to: http://2016.semantics.cc/registration.

Share your ideas, tools and ontologies, last minute submissions
Meetup: Big Data & Linked Data – The Best of Both Worlds  

On the first eve of the SEMANTiCS conference we will discuss how Big Data & Linked Data technologies could become a perfect match. This meetup gathers experts on Big and Linked Data to discuss the future agenda for research and implementation of joint technology development.

  • Register (free)

  • If you are interested in presenting your idea, approach or project linking Semantic technologies with Big Data in an ad-hoc lightning talk, please get in touch with Thomas Thurner (t.thurner@semantic-web.at).

WORKSHOPS/TUTORIALS

This year’s SEMANTiCS is starting on September 12th with a full day of exciting and interesting satellite events. In 6 parallel tracks scientific and industrial workshops and tutorials are scheduled to provide a forum for groups of researchers and practitioners to discuss and learn about hot topics in Semantic Web research.

How to find users and feedback for your vocabulary or ontology?

The Vocabulary Carnival is a unique opportunity for vocabulary publishers to showcase and share their work in the form of a poster and a short presentation, and to meet the growing community of vocabulary publishers and users to build useful semantic, technical and social links. You can join the Carnival Minute Madness on the 13th of September.

How to submit to ELDC?

The European Linked Data Contest awards prizes to stories, products, projects or persons presenting novel and innovative projects, products and industry implementations involving linked data. The ELDC is more than yet another competition: we aim to build a directory of the best European projects in the domain of Linked Data and the Semantic Web. This year the ELDC will be awarded in the categories Linked Enterprise Data and Linked Open Data, with €1,500 for each of the winners. The submission deadline is August 31, 2016.

7th DBpedia Community Meeting in Leipzig 2016

Co-located with SEMANTiCS, the next DBpedia meeting will be held in Leipzig on September 15th. Experts will speak about topics such as Wikidata: bringing structured data to Wikipedia with 16,000 volunteers. The 7th edition of this event covers a DBpedia showcase session, breakout sessions and a DBpedia Association meeting where we will discuss new strategies and which directions are important for DBpedia. If you would like to become part of the DBpedia community and present your ideas, please submit your proposal or check our meeting website: http://wiki.dbpedia.org/meetings/Leipzig2016

Sponsorship opportunities

We would be delighted to welcome new sponsors for SEMANTiCS 2016. You will find a number of sponsorship packages with an indication of benefits and prices here: http://semantics.cc/sponsorship-packages.

Special offer: You can buy a special SEMANTiCS industry ticket for €400 which includes a poster presentation at our marketplace. So take the opportunity to increase the visibility of your company, organisation or project among an international and high impact community. If you are interested, please contact us via email to semantics2016@fu-confirm.de.  

Categories: Blogroll

Introducing a Graph-based Semantic Layer in Enterprises

Semantic Web Company - Mon, 2016-08-15 09:34

Things, not Strings
Entity-centric views on enterprise information and all kinds of data sources provide a means to get a more meaningful picture of all sorts of business objects. This method of information processing is as relevant to customers, citizens, or patients as it is to knowledge workers like lawyers, doctors, or researchers. People do not actually search for documents, but rather for facts and other chunks of information, which they bundle up to provide answers to concrete questions.

Strings, or names for things, are not the same as the things they refer to. Still, those two aspects of an entity regularly get mixed up, nurturing a Babylonian language confusion. Any search term can refer to different things, which is why Google has rolled out its own knowledge graph to help organize information on the web at large scale.

Semantic graphs can build the backbone of any information architecture, not only on the web. They can enable entity-centric views also on enterprise information and data. Such graphs of things contain information about business objects (such as products, suppliers, employees, locations, research topics, …), their different names, and relations to each other. Information about entities can be found in structured (relational databases), semi-structured (XML), and unstructured (text) data objects. Nevertheless, people are not interested in containers but in the entities themselves, so these need to be extracted and organized in a reasonable way.

Machines and algorithms make use of semantic graphs to retrieve not only the objects themselves but also the relations that can be found between the business objects, even if they are not explicitly stated. As a result, ‘knowledge lenses’ are delivered that help users better understand the underlying meaning of business objects when put into a specific context.

Personalization of information
The ability to view entities or business objects in different ways when put into various contexts is key for many knowledge workers. For example, drugs have regulatory aspects, a therapeutic character, and yet another meaning to product managers or sales people. One benefits quickly when confronted only with those aspects of an entity that are really relevant in a given situation. This rather personalized information processing places heavy demands on a semantic layer on top of the data layer, especially when information is stored in various forms and scattered across different repositories.

Understanding and modelling the meaning of content assets and of interest profiles of users are based on the very same methodology. In both cases, semantic graphs are used, and also the linking of various types of business objects works the same way.

Recommender engines based on semantic graphs can link similar contents or documents that are related to each other in a highly precise manner. The same algorithms help to link users to content assets or products. This approach is the basis for ‘push-services’ that try to ‘understand’ users’ needs in a highly sophisticated way.

‘Not only MetaData’ Architecture
Together with the data and content layer and its corresponding metadata, this approach unfolds into a four-layered information architecture as depicted here.

Following the NoSQL paradigm, which is about ‘Not only SQL’, one could call this content architecture ‘Not only Metadata’, thus ‘NoMeDa’ architecture. It stresses the importance of the semantic layer on top of all kinds of data. Semantics is no longer buried in data silos but rather linked to the metadata of the underlying data assets. Therefore it helps to ‘harmonize’ different metadata schemes and various vocabularies. It makes the semantics of metadata, and of data in general, explicitly available. While metadata most often is stored per data source, and therefore not linked to each other, the semantic layer is no longer embedded in databases. It reflects the common sense of a certain domain and through its graph-like structure it can serve directly to fulfill several complex tasks in information management:

  • Knowledge discovery, search and analytics
  • Information and data linking
  • Recommendation and personalization of information
  • Data visualization

Graph-based Data Modelling
Graph-based semantic models resemble the way human beings tend to build their own models of the world. Any person, not only subject matter experts, organizes information by at least the following six principles:

  1. Draw a distinction between all kinds of things: ‘This thing is not that thing’
  2. Give things names: ‘This thing is my dog Goofy’ (some might call it Dippy Dawg, but it’s still the same thing)
  3. Categorize things: ‘This thing is a dog but not a cat’
  4. Create general facts and relate categories to each other: ‘Dogs don’t like cats’
  5. Create specific facts and relate things to each other: ‘Goofy is a friend of Donald’, ‘Donald is the uncle of Huey, Dewey, and Louie’, etc.
  6. Use various languages for this; e.g. the above mentioned fact in German is ‘Donald ist der Onkel von Tick, Trick und Track’ (remember: the thing called ‘Huey’ is the same thing as the thing called ‘Tick’ – it’s just that the name or label for this thing that is different in different languages).
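
To make these principles concrete, here is a minimal sketch in Python using the rdflib library (the namespace, entities and labels are invented for illustration, not taken from any particular product) that encodes a few of the facts above as a small graph, including two names in different languages for the very same thing:

    from rdflib import Graph, Namespace, Literal
    from rdflib.namespace import RDF, RDFS

    EX = Namespace("http://example.org/things/")   # hypothetical namespace
    g = Graph()

    # 1 + 3: distinct things, assigned to categories
    g.add((EX.Goofy, RDF.type, EX.Dog))
    g.add((EX.Donald, RDF.type, EX.Duck))

    # 2 + 6: names in different languages attached to the very same thing
    g.add((EX.Huey, RDFS.label, Literal("Huey", lang="en")))
    g.add((EX.Huey, RDFS.label, Literal("Tick", lang="de")))

    # 4: a general fact relating categories to each other
    g.add((EX.Dog, EX.dislikes, EX.Cat))

    # 5: specific facts relating things to each other
    g.add((EX.Goofy, EX.friendOf, EX.Donald))
    g.add((EX.Donald, EX.uncleOf, EX.Huey))

    print(g.serialize(format="turtle"))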

These fundamental principles for the organization of information are well reflected by semantic knowledge graphs. The same information could be stored as XML, or in a relational database, but it’s more efficient to use graph databases instead for the following reasons:

  • The way people think fits well with information that is modelled and stored when using graphs; little or no translation is necessary.
  • Graphs serve as a universal meta-language to link information from structured and unstructured data.
  • Graphs open up doors to a better aligned data management throughout larger organizations.
  • Graph-based semantic models can also be understood by subject matter experts, who are actually the experts in a certain domain.
  • The search capabilities provided by graphs let you find out unknown linkages or even non-obvious patterns to give you new insights into your data.
  • For semantic graph databases, there is a standardized query language called SPARQL that allows you to explore data.
  • In contrast to traditional ways of querying databases, where knowledge about the database schema/content is necessary, SPARQL allows you to ask “tell me what is there”.

Standards-based Semantics
Making the semantics of data and metadata explicit is even more powerful when based on standards. A framework for this purpose has evolved over the past 15 years at W3C, the World Wide Web Consortium. Initially designed to be used on the World Wide Web, many enterprises have been adopting this stack of standards for Enterprise Information Management. They now benefit from being able to integrate and link data from internal and external sources with relatively low costs.

At the base of all those standards, the Resource Description Framework (RDF) serves as a ‘lingua franca’ to express all kinds of facts that can involve virtually any kind of category or entity, and also all kinds of relations. RDF can be used to describe the semantics of unstructured text, XML documents, or even relational databases. The Simple Knowledge Organization System (SKOS) is based on RDF. SKOS is widely used to describe taxonomies and other types of controlled vocabularies. SPARQL can be used to traverse and make queries over graphs based on RDF or standard schemes like SKOS.

With SPARQL, far more complex queries can be executed than with most other database query languages. For instance, hierarchies can be traversed and aggregated recursively: a geographical taxonomy can then be used to find all documents containing places in a certain region although the region itself is not mentioned explicitly.
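
As a minimal sketch of this kind of recursive traversal (again in Python with rdflib; the place names, documents and the ex:mentions property are invented for illustration), a SPARQL property path over skos:broader finds every document tagged with a place that lies anywhere below a given region, even though the region itself is never mentioned:

    from rdflib import Graph, Namespace
    from rdflib.namespace import SKOS

    EX = Namespace("http://example.org/")
    g = Graph()

    # A tiny geographical taxonomy: Leipzig and Dresden sit below Saxony, Saxony below Germany
    g.add((EX.Leipzig, SKOS.broader, EX.Saxony))
    g.add((EX.Dresden, SKOS.broader, EX.Saxony))
    g.add((EX.Saxony, SKOS.broader, EX.Germany))

    # Documents annotated with the places they mention
    g.add((EX.doc1, EX.mentions, EX.Leipzig))
    g.add((EX.doc2, EX.mentions, EX.Dresden))
    g.add((EX.doc3, EX.mentions, EX.Munich))   # not in Saxony

    # All documents about places anywhere below Saxony
    query = """
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX ex:   <http://example.org/>
    SELECT ?doc WHERE {
      ?doc ex:mentions ?place .
      ?place skos:broader* ex:Saxony .
    }
    """
    for row in g.query(query):
        print(row.doc)    # prints doc1 and doc2, but not doc3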

Standards-based semantics also helps to make use of already existing knowledge graphs. Many government organisations have made high-quality taxonomies and semantic graphs available by using semantic web standards. These can easily be picked up and extended with your own data and specific knowledge.

Semantic Knowledge Graphs will grow with your needs!
Standards-based semantics provides yet another advantage: it is becoming increasingly simple to hire skilled people who have worked with standards like RDF, SKOS or SPARQL before. Even so, experienced knowledge engineers and data scientists are a comparatively rare species. Therefore it's crucial to grow graphs and modelling skills over time. Starting with SKOS and extending an enterprise knowledge graph by introducing more schemes and by mapping to other vocabularies and datasets over time is a well-established agile procedure model.

A graph-based semantic layer in enterprises can be expanded step by step, just like any other network. Analogous to a street network: start with the main roads, introduce more and more connecting roads, and classify streets, places, and intersections with an increasingly refined classification system. It all comes down to an evolving semantic graph that will serve more and more as a map of your data, content and knowledge assets.

Semantic Knowledge Graphs and your Content Architecture
It’s a matter of fact that semantics serves as a kind of glue between unstructured and structured information and as a foundation layer for data integration efforts. But even for enterprises dealing mainly with documents and text-based assets, semantic knowledge graphs will do a great job.

Semantic graphs extend the functionality of a traditional search index. They don't simply annotate documents and store occurrences of terms and phrases; they introduce concept-based indexing, in contrast to term-based approaches. Remember: semantics helps to identify the things behind the strings. The same applies to concept-based search over content repositories: documents get linked to the semantic layer, and therefore the knowledge graph can be used not only for typical retrieval but also to classify, aggregate, filter, and traverse the content of documents.
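
The following toy sketch (plain Python; the concept identifier and labels are made up for illustration) contrasts this with term-based indexing: once documents are annotated with concept URIs from the knowledge graph, a query for any label of a concept retrieves every document about that thing, regardless of which string each document happened to use:

    from collections import defaultdict

    # Hypothetical fragment of a knowledge graph: one concept with several labels
    concept_labels = {
        "ex:ASA": {"aspirin", "acetylsalicylic acid", "ass"},   # 'ASS' is the German abbreviation
    }

    # Documents annotated (e.g. by an entity linker) with the concepts they mention
    doc_concepts = {
        "doc1": {"ex:ASA"},
        "doc2": {"ex:ASA"},
        "doc3": set(),
    }

    # Concept-based inverted index: concept URI -> documents
    index = defaultdict(set)
    for doc, concepts in doc_concepts.items():
        for c in concepts:
            index[c].add(doc)

    def search(term):
        """Map the query string to concepts via their labels, then look up documents."""
        hits = set()
        for concept, labels in concept_labels.items():
            if term.lower() in labels:
                hits |= index[concept]
        return hits

    print(search("aspirin"))                # {'doc1', 'doc2'}
    print(search("Acetylsalicylic acid"))   # the same documents, found via a different string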

PoolParty combines Machine Learning with Human Intelligence

Semantic knowledge graphs have the potential to innovate data and information management in any organisation. Besides questions around integrability, it is crucial to develop strategies to create and sustain the semantic layer efficiently.

Looking at the broad spectrum of semantic technologies that can be used for this endeavour, they range from manual to fully automated approaches. The promise of deriving high-quality semantic graphs from documents fully automatically has not been fulfilled to date. On the other hand, handcrafted semantics is error-prone, incomplete, and too expensive. The best solution often lies in a combination of different approaches. PoolParty combines Machine Learning with Human Intelligence: extensive corpus analysis and corpus learning support taxonomists, knowledge engineers and subject matter experts with the maintenance and quality assurance of semantic knowledge graphs and controlled vocabularies. As a result, enterprise knowledge graphs are more complete, up to date, and consistently used.

“An Enterprise without a Semantic Layer is like a Country without a Map.”

Categories: Blogroll

PoolParty Academy is opening in September 2016

Semantic Web Company - Thu, 2016-08-04 03:16

PoolParty Academy offers three E-Learning tracks that enable customers, partners and individual professionals to learn Semantic Web technologies and PoolParty Semantic Suite in particular.

You can pre-register for the PoolParty Academy training tracks at the academy’s website or join our live class-room at the biggest European industrial Semantic Web conference – SEMANTiCS 2016.


Categories: Blogroll

ICML 2016 was awesome

Machine Learning Blog - Tue, 2016-07-26 17:32
I had a fantastic time at ICML 2016— I learned a great deal. There was far more good stuff than I could see, and it was exciting to catch up on recent advances.

David Silver gave one of the best tutorials I’ve seen on his group’s recent work in “deep” reinforcement learning. I learned about a few new techniques, including the benefits of asynchronous updates in distributed Q-learning https://arxiv.org/abs/1602.01783, which was presented in more detail at the main conference. The new domains being explored were exciting, as were the improvements made on the computational side. I would love to have seen more pointers to some of the related work from the tutorial, particularly given there was such an exciting mix of new techniques and old staples (e.g. experience replay http://www.dtic.mil/dtic/tr/fulltext/u2/a261434.pdf), but the talk was so information-packed it would have been difficult.

Pieter Abbeel gave an outstanding talk in the Abstraction in RL workshop http://rlabstraction2016.wix.com/icml#!schedule/bx34m, and (I heard) another excellent one during the deep learning workshop. It was rumored that Aviv Tamar gave an exciting talk (I believe on this http://arxiv.org/abs/1602.02867), but I was forced to miss it to see Rong Ge’s https://users.cs.duke.edu/~rongge/ outstanding talk on a new-ish geometric tool for understanding non-convex optimization, the strict saddle. I first read about the approach here http://arxiv.org/abs/1503.02101, but at ICML he and other authors demonstrated a remarkable number of problems that have this property, which enables efficient optimization via stochastic gradient descent (and other) procedures.

This was a theme of ICML— an incredible amount of good material, so much that I barely saw the posters at all because there was nearly always a talk I wanted to see!

Rocky Duan surveyed some benchmark RL continuous control problems http://jmlr.org/proceedings/papers/v48/duan16.pdf. An interesting theme of the conference, which also came up in conversation with John Schulman and Yann LeCun, was really old methods working well. In fact, this group demonstrated that variants of the natural/covariant policy gradient proposed originally by Sham Kakade (with a derivation here: http://repository.cmu.edu/cgi/viewcontent.cgi?article=1080&context=robotics) are largely at the state-of-the-art on many benchmark problems. There are some clever tricks necessary for large policy classes like neural networks (like using a partial-least-squares-style truncated conjugate gradient to solve for the change in policy in the usual F \delta = \nabla one solves in the natural gradient procedure) that dramatically improve performance (https://arxiv.org/abs/1502.05477). I had begun to view these methods as doing little better (or worse) than black-box search, so it’s exciting to see them make a comeback.
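
For readers unfamiliar with that trick, here is a rough, generic sketch (plain Python/NumPy, not the authors' code) of conjugate gradient solving F \delta = g using only Fisher-vector products, which is what makes natural-gradient updates feasible for neural network policies where forming F explicitly would be far too expensive:

    import numpy as np

    def conjugate_gradient(fvp, g, iters=10, tol=1e-10):
        """Approximately solve F x = g, given only a function fvp(v) that returns F v."""
        x = np.zeros_like(g)
        r = g.copy()          # residual g - F x (x starts at zero)
        p = r.copy()          # search direction
        rdotr = r @ r
        for _ in range(iters):
            Fp = fvp(p)
            alpha = rdotr / (p @ Fp)
            x += alpha * p
            r -= alpha * Fp
            new_rdotr = r @ r
            if new_rdotr < tol:
                break
            p = r + (new_rdotr / rdotr) * p
            rdotr = new_rdotr
        return x

    # Toy usage with an explicit (tiny) Fisher matrix; in practice fvp would be
    # computed from samples without ever materializing F.
    F = np.array([[2.0, 0.3], [0.3, 1.0]])
    g = np.array([1.0, -1.0])
    delta = conjugate_gradient(lambda v: F @ v, g)
    print(delta, F @ delta)   # F @ delta should be close to g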

Chelsea Finn http://people.eecs.berkeley.edu/~cbfinn/ gave an outstanding talk on this work https://arxiv.org/abs/1603.00448. She and co-authors (Sergey Levine and Pieter) effectively came up with a technique that lets one apply Maximum Entropy Inverse Optimal Control without the double-loop procedure and using policy gradient techniques.  Jonathan Ho described a related algorithm http://jmlr.org/proceedings/papers/v48/ho16.pdf that also appeared to mix policy gradient and an optimization over cost functions. Both are definitely on my reading list, and I want to understand the trade-offs of the techniques.

Both presentations were informative, and both made the interesting connection to Generative Adversarial Nets (GANS) http://arxiv.org/abs/1406.2661 . These were also a theme of the conference in both talks and during discussions. A very cool idea getting more traction, and being embraced by the neural net pioneers.

David Belanger https://people.cs.umass.edu/~belanger/belanger_spen_icml.pdf gave an interesting talk on using backprop to optimize a structured output relative to a learned cost function. I left thinking the technique was closely related to inverse optimal control methods and the GANs, and wanting to understand why implicit differentiation wasn’t being used to optimize the energy function parameters.

Speaking of neural net pioneers: there were lots of good talks during both the main conference and workshops on what’s new, and what’s old (https://sites.google.com/site/nnb2tf/), in neural network architectures and algorithms.

I was intrigued by http://jmlr.org/proceedings/papers/v48/balduzzi16.pdf and particularly by the well-written blog post it mentions, http://colah.github.io/posts/2015-09-NN-Types-FP/ by Christopher Olah. The notion that we need language tools to structure the design of learning programs (e.g. http://www.umiacs.umd.edu/~hal/docs/daume14lts.pdf) and to reason about them seems to be gaining currency. After reading these, I began to view some of the recent work of Wen, Arun, Byron, and myself (including http://jmlr.org/proceedings/papers/v48/sun16.pdf at ICML) in this light: generative RNNs “should” have a well-defined hidden state whose “type” is effectively (moments of) future observations. I wonder now if there is a larger lesson here in the design of learning programs.

Nando de Freitas and colleagues’ approach of separating value and advantage function predictions in one network http://jmlr.org/proceedings/papers/v48/wangf16.pdf was quite interesting and had a lot of buzz. Ian Osband gave an amazing talk on another topic that previously made me despair: exploration in RL http://jmlr.org/proceedings/papers/v48/osband16.pdf. This is one of the few approaches that combines the ability to use function approximation with rigorous exploration guarantees/sample complexity in the tabular case (and, amazingly, *better* sample complexity than previous papers that work only in the tabular case). Super cool and also very high on my reading list.

Boaz Barak http://www.boazbarak.org/ gave a truly inspired talk that mixed a kind of coherent computationally-bounded Bayesian-ism (Slogan: ”Compute like a frequentist, think like a Bayesian.”) with demonstrating a lower bound for SoS procedures. Well outside of my expertise, but delivered in a way that made you feel like you understood all of it.

Honglak Lee gave an exciting talk on the benefits of semi-supervision in CNNs http://web.eecs.umich.edu/~honglak/icml2016-CNNdec.pdf. The authors demonstrated that a remarkable amount of information needed to reproduce an input image was preserved quite deep in CNNs, and further that encouraging the ability to reconstruct could significantly enhance discriminative performance on real benchmarks.

The problem with this ICML is that I think it would take literally weeks of reading/watching talks to really absorb the high quality work that was presented. I’m *very* grateful to the organizing committee http://icml.cc/2016/?page_id=39 for making it so valuable.

Categories: Blogroll

The Multiworld Testing Decision Service

Machine Learning Blog - Mon, 2016-07-11 09:14

We made a tool that you can use. It is the first general purpose reinforcement-based learning system.

Reinforcement learning is much discussed these days with successes like AlphaGo. Wouldn’t it be great if Reinforcement Learning algorithms could easily be used to solve all reinforcement learning problems? But there is a well-known problem: It’s very easy to create natural RL problems for which all standard RL algorithms (epsilon-greedy Q-learning, SARSA, etc…) fail catastrophically. That’s a serious limitation which both inspires research and which I suspect many people need to learn the hard way.

Removing the credit assignment problem from reinforcement learning yields the Contextual Bandit setting which we know is generically solvable in the same manner as common supervised learning problems. I know of about a half-dozen real-world successful contextual bandit applications typically requiring the cooperation of engineers and deeply knowledgeable data scientists.

Can we make this dramatically easier? We need a system that explores over appropriate choices with logging of features, actions, probabilities of actions, and outcomes. These must then be fed into an appropriate learning algorithm which trains a policy and then deploys the policy at the point of decision. Naturally, this is what we’ve done and now it can be used by anyone. This drops the barrier to use down to: “Do you have permissions? And do you have a reasonable idea of what a good feature is?”
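
As a rough sketch of the kind of loop this implies (plain Python, not the actual Decision Service code; the actions, features and reward model below are invented), the point of decision explores with a known probability, logs features, action, probability and outcome, and the logged data can then be used to evaluate or train any policy offline via inverse propensity scoring:

    import random

    ACTIONS = ["article_A", "article_B", "article_C"]
    EPSILON = 0.2

    def explore(policy, context):
        """Epsilon-greedy exploration; returns the chosen action and its probability."""
        greedy = policy(context)
        action = random.choice(ACTIONS) if random.random() < EPSILON else greedy
        prob = (1 - EPSILON) + EPSILON / len(ACTIONS) if action == greedy else EPSILON / len(ACTIONS)
        return action, prob

    def ips_value(log, policy):
        """Unbiased offline estimate of a policy's value from logged exploration data."""
        total = 0.0
        for ex in log:
            if policy(ex["context"]) == ex["action"]:
                total += ex["reward"] / ex["prob"]
        return total / len(log)

    # A logging policy and a candidate policy we would like to evaluate offline.
    logging_policy = lambda ctx: "article_A"
    candidate      = lambda ctx: "article_B" if ctx["sports_fan"] else "article_A"

    log = []
    for _ in range(10000):
        ctx = {"sports_fan": random.random() < 0.5}
        action, prob = explore(logging_policy, ctx)
        # Hypothetical outcome: sports fans click article_B, everyone else clicks article_A.
        clicked = (ctx["sports_fan"] and action == "article_B") or \
                  (not ctx["sports_fan"] and action == "article_A")
        log.append({"context": ctx, "action": action, "prob": prob, "reward": 1.0 if clicked else 0.0})

    print("Estimated value of candidate policy:", ips_value(log, candidate))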

A key foundational idea is Multiworld Testing: the capability to evaluate large numbers of policies mapping features to actions in a manner exponentially more efficient than standard A/B testing. This is used pervasively in the Contextual Bandit literature and you can see it in action for the system we’ve made at Microsoft Research. The key design principles are:

  1. Contextual Bandits. Many people have tried to create online learning systems that do not take into account the biasing effects of decisions. These fail near-universally. For example, they might be very good at predicting what was shown (and hence clicked on) rather than what should be shown to generate the most interest.
  2. Data Lifecycle support. This system supports the entire process of data collection, joining, learning, and deployment. Doing this eliminates many stupid-but-killer bugs that I’ve seen in practice.
  3. Modularity. The system decomposes into pieces: exploration library, client library, online learner, join server, etc… because I’ve seen too many cases where the pieces are useful but the system is not.
  4. Reproducibility. Everything is logged in a fashion which makes online behavior offline reproducible. Consequently, the system is debuggable and hence improvable.

The system we’ve created is open source with system components in mwt-ds and the core learning algorithms in Vowpal Wabbit. If you use everything it enables a fully automatic causally sound learning loop for contextual control of a small number of actions. This is strongly scalable, for example a version of this is in use for personalized news on MSN. It can be either low-latency (with a client side library) or cross platform (with a JSON REST web interface). Advanced exploration algorithms are available to enable better exploration strategies than simple epsilon-greedy baselines. The system autodeploys into a chosen Azure account with a baseline cost of about $0.20/hour. The autodeployment takes a few minutes after which you can test or use the system as desired.

This system is open source and there are many ways for people to help if they are interested. For example, support for the client-side library in more languages, support of other learning algorithms & systems, better documentation, etc… are all obviously useful.

Have fun.

Categories: Blogroll

PoolParty at Connected Data London

Semantic Web Company - Fri, 2016-07-01 09:33

Connected Data London “brings together the Linked and Graph Data communities”.

More and more Linked Data applications seem to emerge in the business world and software companies make it part of their business plan to integrate Graph Data in their data stories or in their features.

MarkLogic is opening a new chapter in how enterprise databases should be used, pushing past the limits of closed, rigid structures to integrate more data. Neo4j explains how you can enrich existing data and follow new connections and leads for investigations such as the Panama Papers.

No wonder the communities in different locations gather to share, exchange and network around topics like Linked Data. In London, a new conference is emerging exactly for this purpose: Connected Data London. The conference sets the stage for industry leaders and early adopters as well as researchers to present their use cases and stories. You can hear talks from multiple domains about how they put Linked Data to a good use: space exploration, financial crime, bioinformatics, publishing and more.

The conference will close with an interesting panel discussion about “How to build a Connected Data capability in your organization.” You can hear from the specialists how this task is approached. And immediately after acquiring the know-how, you will need an easy-to-use and easy-to-integrate software tool to help with your Knowledge Model creation and maintenance as well as Text Mining and Concept Annotating.

Semantic Web Company has an experienced team of professional consultants who can help you in all the steps of implementing the acquired know-how together with PoolParty.

In our dedicated slot we present how a Connected Data Application is born from a Knowledge Model and what the steps are to get there.

Watch the video!

Connected Data London – London, 12th July, Holiday Inn Mayfair

Categories: Blogroll

The Tools Behind Our Brexit Analyser

It will be two weeks tomorrow since we launched the Brexit Analyser -- our real-time tweet analysis system, based on our GATE text analytics and semantic annotation tools.

Back then, we were analysing on average 500,000 (yes, half a million!) tweets a day. Then, on referendum day alone, we had to analyse in real-time well over 2 million tweets. Or on average, just over 23 tweets per second! It wasn't quite so simple though, as tweet volume picked up dramatically as soon as the polls closed at 10pm and we were consistently getting around 50 tweets per second and were also being rate-limited by the Twitter API. 

These are some pretty serious data volumes and velocity. So how did we build the Brexit Analyser to cope?

For analysis, we are using GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. After that, we added our Leave/Remain classifier, which helps us identify a reliable sample of tweets with unambiguous stance.  Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet), followed by topic-centric sentiment analysis.  



We kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results are fed into GATE Mimir, which efficiently indexes tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive different web pages with interactive visualisations. The user can choose what they want to see, based on time (e.g. most popular hashtags on 23 Jun; most talked about topics in Leave/Remain tweets on 23 Jun). Clicking on these infographics shows the actual matching tweets.

All my blog posts so far have been using screenshots of such interactively generated visualisations. 

Mimir also has a more specialised graphical interface (Prospector), which I use for formulating semantic search queries and inspecting the matching data, coupled with some pre-set types of visualisations. The screen shot below shows my Mimir query for all original tweets on 23 Jun which advocate Leave. I can then inspect the most mentioned twitter users within those. (I used Prospector for my analysis of Leave/Remain voting trends on referendum day). 


So how do I do my analyses?
First I decide what subset of tweets I want to analyse. This is typically a Mimir query restricting by timestamp (normalized to GMT), tweet kind (original, reply, or retweet), voting intention (Leave/Remain), mentioning a specific user/hashtag/topic, written by a specific user, containing a given hashtag or a given topic (e.g. all tweets discussing taxes).  

Then, once I identify this dynamically generated subset of tweets, I can analyse it with Prospector or use the visualisations which we generate via the Mimir API. These include:

  • Top X most frequently mentioned words, nouns, verbs, or noun phrases
  • Top X most frequent posters/frequently mentioned tweeterers
  • Top X most frequent Locations, Organizations, or Persons within those tweets
  • Top X themes / sub-themes according to our topic classifier
  • Frequent URLs, language of the tweets, and sentiment
How do we scale it up?
It's built using GATE Cloud Paralleliser and some clever queueing, but the takeaway message is: we can process and index over 100 tweets per second, which allows us to cope in real time with the tweet stream we receive via the Twitter Search API, even at peak times. All of this runs on a server which cost us under £10,000.
The architecture can be scaled up further, if needed, should we get access to a Twitter feed with higher API rate limits than the standard. 


Thanks to: Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team

Any mistakes are my own.

Categories: Blogroll

Other Studies of #EURef Tweets

Continuous Studies
The Press Association, in cooperation with Twitter and Blurrt, have created the #EURef Data Hub. It shows live statistics of which of the two campaigns is more talked about, tweet volumes over time, who are the most talked about campaigners (a fixed set, separated into Leave and Remain), and what are the most popular topics in the discussion (four categories: foreign relations, economy, immigration, security).

Post-Referendum Sample Studies
MonkeyLearn posted on June 24th, a post-referendum analysis of 450,000 tweets with the #Brexit hashtag. After filtering for language (English), 250,000 were retained for analysis of the sentiment expressed within, as well as some prominent keywords. They also analysed the difference in keywords between positive and negative tweets.

Referendum Day Studies
The S-Six Social Sentiment Indexes  on referendum day showed Remain tweets dominating Leave ones, with volumes and opinion staying stable.

Pre-Referendum Sample Studies
The Sensei project analysed tweets in June on the referendum and made predictions on voting outcome, based on a sample of those. They also plotted the hottest topics on June 21st (e.g. immigration, leave, remain, Scotland), a word cloud of frequent words/topics on June 21st, active authors, influential conversations, and adjectives associated with leave and remain.

Expert Systems and the University of Aberdeen analysed a sample of 55,000 tweets pre-referendum in June. They identified the key topics as jobs, immigration, security, NHS, taxes and government issues. Their sample indicated that Leave voters are more active on Twitter, and that those from England and Scotland were very strongly in favour of Leave. In total, 64.75% of tweets from Britain were pro-Leave.

Categories: Blogroll

#InOrOut: Analysing Voting Trends in Tweets on #EURef Day

In this post I examine the question: could we have predicted the #EUReferendum outcome based on #Leave and #Remain tweets posted on polling day? This follows up on my post about the #InOrOut debate on Twitter on Jun 23rd, where I analysed tweet volumes, popular hashtags, and most mentioned users.

This is not the only study to analyse referendum day tweets, but here I present a more in-depth analysis, also based on a sample of tweets selected specifically as  advocating #Leave/#Remain respectively. 


#Leave / #Remain Trend Based on @Brndstr
Our real-time analysis uncovered the most popular user mentioned in posts on referendum day: @Brndstr. @Brndstr are building bots to help brands engage with their customers, and also to let users turn into social ambassadors of brands they endorse.

On referendum day, they ran a campaign which encouraged people to tweet how they voted and, in return, have their profile picture change accordingly. This was not uncontroversial with some Twitter users, who took issue with the choice of the Union Jack (for Out voters) vs the EU flag (for In voters), but nevertheless, many people declared their votes in this way.


Show your support with a custom Profile Flag Filter for the #EUref - what will you vote for? #iVoted
Categories: Blogroll

#InOrOut: Today's #EURef Debate on Twitter


So what did the #EUReferendum debate look like today? Is Twitter still voting #Leave as it did back in May? What were the main hashtags and user mentions in today's tweets?


Tweet Volumes
A record-breaking 1.9 million tweets were posted today on the #InOrOut #EUReferendum, which is between three and six times the daily volumes observed earlier in June. On average, this is 21 tweets per second over the day, although the peaks of activity occurred after 9am (see graphs below). 1.5 million of those tweets were posted during poll opening times. In that period, only 3,300 posts were inaccessible to us due to Twitter rate limits.

Since the polls closed at 10pm tonight, there has been a huge surge in Twitter activity, with over 60,000 posts between 10pm and 11pm alone. Twitter rate limits meant that we could not access another 6,000 posts from that period. Since this is only 10% of the overall data in this hour, we still have a representative sample for our analyses.

Amongst the 1.9 million posts, over 1 million (57%) were retweets and 94 thousand (5%) were replies. These proportions of retweets and replies are consistent with patterns observed earlier in June.
Tweets, Re-tweets, and Replies: #Leave or #Remain
Let's start by looking at original tweets, i.e. tweets which have been posted by their authors and are not a reply to another tweet or a retweet. I refer to the authors of those tweets as the OPs (Original Posters), following terminology adopted from online forums.

My analysis of voting intentions showed some conflicting findings, depending on the way tweets were sampled (details and trend graphs here).

The gist is that using @brndstr and “I voted XX” patterns both gave Remain a majority over Leave, but using our voting intention classification heuristic, the opposite was true (i.e. Leave was the more likely winner).

In retweets, the #Leave proponents were more vocal than the #Remain ones.

The difference is particularly pronounced for replies, where #Leave proponents are engaging in more debates than #Remain ones. Nevertheless, with replies constituting only 5% of all tweets today, the echo chamber effect observed earlier in June remains unchanged.
#InOrOut, #Leave, #Remain and Other Popular Hashtags
Interestingly, 75% of all tweets today (1.4 million) contained at least one hashtag. This is a very significant increase on the 56.5% observed several days ago.


Some of the most popular hashtags  remain unchanged from earlier in June. These refer to the leave and remain campaigns, immigration, NHS, parties, media, and politicians. Interestingly, there is now increased interest in #forex and #stocks, as predictors of the likely outcome. 


Most Mentioned Users Today: What is @Brndstr?
Last for tonight, I compared the most frequently mentioned Twitter users in original tweets from today (see above) against those most mentioned earlier in June. The majority of popular mentioned users remains unchanged, with a mix of campaign Twitter accounts, media, and key political leaders.

The most prominent difference is that @Brndstr (Bots for Brands) came top (mentioned in over 14 thousand tweets), followed by @YouTube with 3 thousand mentions. Other new, frequently mentioned accounts today were Avaaz, DanHannanMEP, BuzzFeedUK, and realDonaldTrump.


So What Does This Tell Us?
The #InOrOut #EUReferendum has attracted unprecedented tweet volumes on poll day, with a significantly higher proportion of hashtags than previously. This seems to suggest that Twitter users are trying to get their voices heard and spread the word far and wide, well beyond the bounds of their normal follower  network. 

There are some exciting new entrants in the top 30 most mentioned Twitter accounts in today's referendum posts. I will analyse these in more depth tomorrow. For now, good night!  

Thanks to: Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team

Any mistakes are my own.
Categories: Blogroll

Identifying A Reliable Sample of Leave/Remain Tweets

This post is the second in the series on the Brexit Tweet Analyser.

Having looked at tweet volumes and basic characteristics of the Twitter discourse around the EU referendum, we now turn to the method we chose for identifying a reliable, even if incomplete, sample of leave and remain tweets.

We have no ground truth and are not trying to predict whether leave or remain is leading; instead, we are interested in identifying a reliable, if incomplete, subset, so we can analyse the topics discussed and the active users within it.



Are Hashtags A Reliable Predictor of Leave/Remain Support?
As discussed in our earlier post, over 56% of all tweets on the referendum contain at least one hashtag. Some of these are actually indicative of support for the leave/remain campaigns, e.g. #votetoleave, #voteout, #saferin, #strongertogether. Then there are also hashtags which try to address undecided voters, e.g. #InOrOut, #undecided, while promoting either a remain or leave vote but not through explicit hashtags.

A recent study of EU referendum tweets by Ontotext, carried out over tweets from May 2016, classified tweets as leave or remain on the basis of approximately 30 hashtags. Some of those were associated with leave and the rest with remain, and each tweet was classified as leave or remain based on whether it contained predominantly leave or predominantly remain hashtags.

Based on manually analysing a sample of random tweets with those hashtags, we found that this strategy does not always deliver a reliable assessment, since in many cases leave hashtags are used as a reference to the leave campaign, while the tweet itself is supportive of remain or neutral. The converse is also true, i.e. remain hashtags are used to refer to the remain stance/campaign. We have included some examples below.

A more reliable, even if somewhat more restrictive, approach is to consider the last hashtag in the tweet as the most indicative of its intended stance (pro-leave or pro-remain). This results in a higher precision sample of remain/leave tweets, which we can then analyse in more depth in terms of topics discussed and opinions expressed. 
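
A minimal sketch of this last-hashtag heuristic in Python (the hashtag lists below are illustrative, not the full sets used in the analyser):

    import re

    # Illustrative, not exhaustive, stance-bearing hashtags
    LEAVE_TAGS  = {"voteleave", "voteout", "leaveeu", "takecontrol"}
    REMAIN_TAGS = {"voteremain", "strongerin", "saferin", "votein"}

    def stance_from_last_hashtag(tweet):
        """Classify a tweet as 'leave', 'remain', or None using only its final hashtag."""
        hashtags = re.findall(r"#(\w+)", tweet.lower())
        if not hashtags:
            return None
        last = hashtags[-1]
        if last in LEAVE_TAGS:
            return "leave"
        if last in REMAIN_TAGS:
            return "remain"
        return None   # final hashtag is not a clear stance marker

    print(stance_from_last_hashtag("Polls open at 7am! #EURef #VoteRemain"))            # remain
    print(stance_from_last_hashtag("The #remain camp is wrong about this #VoteLeave"))  # leave
    print(stance_from_last_hashtag("Interesting debate tonight #EURef"))                # None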

Using this approach, amongst the 1.9 million tweets between June 13th and 19th, 5.5% (106 thousand) were identified as supporting the Leave campaign, while 4% (80 thousand) were identified as supporting the Remain campaign. Taken together, this constitutes just under a 10% sample, which we consider sufficient for the purposes of our analysis.

These results, albeit drawn from a smaller, high-precision sample, seem to indicate that the Leave campaign is receiving more coverage and support on Twitter, when compared to Remain. This is also consistent with the findings of the Ontotext study.

In subsequent posts we will look into the most frequently mentioned hashtags, the most active Twitter users, and the topics discussed in the Remain and Leave samples separately. 

What about #Brexit in particular?
The recent Ontotext study on May 2016 data used #Brexit as one of the key hashtags indicative of leave. Others have also used #Brexit in the same fashion.


In our more recent 6.5 million tweets (dated between 1 June and 19 June 2016), just under 1.7 million contain the #Brexit hashtag (26%). However, having examined a random sample of those manually (see examples below), we established that while many tweets did use #Brexit to indicate support for leave, there were also many cases where #Brexit referred to the referendum, or the leave/remain question, or the Brexit campaign as a whole. We have provided some such examples at the end of this blog post. We also found a sufficient number of examples where #Brexit appears at the end of tweets while still not indicating support for voting leave. 

Therefore, we chose to distinguish the #Brexit hashtag from all other leave hashtags and tagged tweets with a final #Brexit tag separately. This enables us, in subsequent analyses, to compare findings with and without considering #Brexit.  



Example Remain/Leave Hashtag Use


It doesnt matter who some of the dodgy leaders of #Remain and #Brexit are, they each only have ONE VOTE, like all of us public #EURef— Marcus Storm (@MarcsandSparks) 20 June 2016

Perfect question! "Why is #brexit ahead, despite all the experts supporting #remain?" #questiontime— Steve Parrott (@steveparrott50) 19 June 2016

Could the last decent politician (of any party) to leave the #Leave camp please turn off the lights.....#Bremain pic.twitter.com/zQjjoIXcyO— Dr Hamed Khan (@drhamedkhan) 19 June 2016


Today's @thesundaytimes #focus articles on #brexit say it all. #remain is forward-looking, #leave backward— Patrick White (@pbpwhite) 20 June 2016

Example Brexit Tweets
#Brexit probability declines as campaigns remain quiet https://t.co/qrAhURvRDk via @RJ_FXandRates pic.twitter.com/UnNV1NDnZv— Bloomberg London (@LondonBC) 17 June 2016
#VoteRemain #VoteLeave #InOrOut #EURef #StrongerIn -- Is #Brexit The End Of The World As We Know It? via @forbes https://t.co/lQ6Xgf0oEW— Jolly Roger (@EUGrassroots) 17 June 2016
Remaining #Brexit Polls scheduled releases pic.twitter.com/DKzBqjoGcs— Nicola Duke (@NicTrades) 17 June 2016
Blame austerity—not immigration—for bringing Britain to ‘breaking point’https://t.co/f3oKODbLSe#Brexit #EUref pic.twitter.com/lLJHOsUO7J— The Conversation (@ConversationUK) June 20, 2016
BREAK World's biggest carmaker #Ford tells staff of "deep concerns abt "uncertainty/potential downsides" of #Brexit pic.twitter.com/bYQ3LyIA6i— Beth Rigby (@BethRigby) June 20, 2016

Thanks to: Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team

Any mistakes are my own.
Categories: Blogroll

An ICML unworkshop

Machine Learning Blog - Mon, 2016-06-20 22:29

Following up on an interesting suggestion, we are creating a “Birds of a Feather Unworkshop” with a leftover room (Duffy/Columbia) on Thursday and Friday during the workshops. People interested in ad-hoc topics can post a time and place to meet and discuss. Details are here a little ways down.

Categories: Blogroll

Introducing the Brexit Analyser: real-time Twitter analysis with GATE

The GATE team has been busy lately building the real-time Brexit Analyser. It analyses tweets related to the forthcoming EU referendum as they come in, in order to track the referendum debate unfolding on Twitter. This research is being carried out as part of the SoBigData project.



The work follows on from our successful collaboration with Nesta on the Political Futures Tracker, which analysed tweets in real-time in the run up to the UK General Election in 2015. 

Unlike others, we do not try to predict the outcome of the referendum or answer the question of whether Twitter can be used as a substitute for opinion polls. Instead, our focus is on a more in-depth analysis of the referendum debate; the people and organisations who engage in those debates; what topics are discussed and opinion expressed, and who the top influencers are.

What does it do?
It analyses and indexes tweets as they come in (i.e. in real time), in order to identify commonly discussed topics, opinions expressed, and whether a tweet is expressing support for remaining or leaving the EU. It must be noted that not all tweets have a clear stance and also that not all tweets express a clear voting intention (e.g. "Brexit & Bremain"). More on this in subsequent posts!

In more detail, the Brexit Analyser uses text analytics and opinion mining techniques from GATE, in order to identify tweets expressing voting intentions, the topics discussed within, and the sentiment expressed towards these topics. Watch this space! 
The Data (So Far)
We are collecting tweets based on a number of referendum related hashtags and keywords, such as #voteremain, #voteleave, #brexit, #eureferendum.

The volume of original tweets, replies, and re-tweets per day collected so far is shown below. On average, this is close to half a million tweets per day (480 thousand), which is 1.6 times the tweets on 26 March 2015 (300,000), when the Battle For Number 10 interviews took place, in the run up to the May 2015 General Elections. 



In total, we have analysed just over 1.9 million tweets in the past 4 days, with 60% of those being re-tweets. On average, a tweet is re-tweeted 1.65 times. 

Subsequent posts will examine the distribution of original tweets, re-tweets, and replies specifically in tweets expressing a remain/leave voting intention.  

Hashtags: 1 million of those 1.9 million tweets contain at least one hashtag  (i.e. 56.5% of all tweets have hashtags). If only original tweets are considered (i.e. all replies and retweets are excluded), then there are 319 thousand tweets with hashtags amongst the original 678 thousand tweets (i.e. 47% of original tweets are hashtag bearing).

Analysing hashtags used in a Twitter debate is interesting, because they indicate commonly discussed topics, stance taken towards the referendum, and also key influencers. As they are easy to search for, hashtags help Twitter users participate in online debates, including other users they are not directly connected to.

Below we show some common hashtags on June 16, 2016. As can be seen, most are associated directly with the referendum and voting intentions, while others refer to politicians, parties, media, places, and events:




URLs: Interestingly, amongst the 1.9 million tweets only 134 thousand contain a URL (i.e. only 7%). Amongst the 1.1 million re-tweets, 11% contain a URL, which indicates that tweets with URLs tend to be retweeted more.

These low percentages suggest that the majority of tweets on the EU referendum are expressing opinions or addressing another user, rather than sharing information or providing external evidence. 

@Mentions: Indeed, 90 thousand (13%) of the original 678 thousand tweets contain a username mention. The 50 most mentioned users in those tweets are shown below. The size of the user name indicates frequency, i.e. the larger the text, the more frequently this username has been mentioned in tweets.

In subsequent posts we will provide information on the most frequently re-tweeted users and the most prolific Twitter users in the dataset. 



So What Does This Tell Us?
Without a doubt, there is a heavy volume of tweets on the EU referendum published daily. However, with only 6.8% of all tweets being replies and over 58% being re-tweets, this resembles an echo chamber more than a debate.

Pointers to external evidence/sources via URLs are scarce, as are user mentions. The most frequently mentioned users are predominantly media (e.g., BBC, Reuters, FT, the Sun, Huffington Post);  politicians playing a prominent role in the campaign (e.g. David Cameron,  Boris Johnson, Nigel Farage, Jeremy Corbyn); and campaign accounts created especially for the referendum (e.g. @StrongerIn, @Vote_Leave).    


Thanks to: Dominic Rout, Ian Roberts, Mark Greenwood, Diana Maynard, and the rest of the GATE Team


Categories: Blogroll

The ICML 2016 Space Fight

Machine Learning Blog - Sat, 2016-06-04 17:29

The space problem started long ago.

At ICML last year and the year before, the capacity needed to fit everyone on any single day was about 1500. My advice was to expect 2000 and have capacity for 2500 because “New York” and “Machine Learning”. Was history right? Or New York and buzz?

I was not involved in the venue negotiations, but my understanding is that they were difficult, with liabilities over $1M for IMLS, the nonprofit which oversees ICML year to year. The result was a conference plan with a maximum capacity of 1800 for the main conference, a bit less for workshops, and perhaps 1000 for tutorials.

Then the NIPS registration numbers came in: 3900 last winter. It’s important to understand here that a registration is not a person since not everyone registers for the entire event. Nevertheless, NIPS was very large with perhaps 3K people attending at any one time. Historically, NIPS is the conference most similar to ICML with a history of NIPS being a bit larger. Most people I know treat these conferences as indistinguishable other than timing: ICML in the summer and NIPS in the winter.

Given this, I had to revise my estimate up: we should really have capacity for 3000, not 2500. It also convinced everyone that we needed to negotiate for more space with the Marriott. This again took quite a while, with the result being a modest increase in capacity for the conference (to 2100) and the workshops, but nothing for the tutorials.

The situation with tutorials looked terrible while the situation with workshops looked poor. Acquiring more space at the Marriott looked near impossible. Tutorials require a large room, so we looked into the Kimmel Center at NYU, acquiring a large room and increasing capacity to 1450 for the tutorials. We also looked into additional rooms for workshops, finding one at Columbia and another at the Microsoft Technology Center, which has a large public-use room 2 blocks from the Marriott. Other leads did not pan out.

This allowed us to cover capacity through early registration (May 7th). Based on typical early vs. late registration distributions I was expecting registrations might need to close a bit early similar to what happened with KDD in 2014.

Then things blew up. Tutorial registration reached capacity the week of May 23rd, and then all registration stopped May 28th, 3 weeks before the conference. Aside from simply failing to meet demand this also creates lots of problems. What do you do with authors? And when I looked into things in detail for workshops I realized we were badly oversubscribed for some workshops. It’s always difficult to guess which distribution of room sizes is needed to support the spectrum of workshop interests in advance so there were serious problems. What could we do?

The first step was tutorial and main conference registration which reopened last Tuesday using some format changes which allowed us to increase capacity further. We will use simulcast to extra rooms to support larger audiences for tutorials and plenary talks allowing us to up the limit for tutorials to 1590 and for the main conference to 2400. We’ve also shifted the poster session to run in parallel with main tracks rather than in the evening. Now, every paper will have 3-4 designated hours during the day (ending at 7pm) for authors to talk to people individually. As a side benefit, this will also avoid the competition between posters and company-sponsored parties which have become common. We’ll see how this works as a format, but it was unavoidable here: even without increasing registration the existing evening poster session plan was a space disaster.

The workshop situation was much more difficult. I walked all over the nearby area on Wednesday, finding various spaces and getting quotes. I also realized that the largest room at the Crown Plaza could help with our tutorials: it was both bigger and much closer than NYU. On Thursday, we got contract offers from the promising venues and debated into the evening. On Friday morning at 6am the Marriott suddenly gave us a bunch of additional space for the workshops. Looking through things, it was enough to shift us from ‘oversubscribed’ to ‘crowded’ with little capacity to register more given natural interests. We developed a new plan on the fly, changed contracts, negotiated prices down, and signed Friday afternoon.

The local chairs (Marek Petrik and Peder Olsen) and Mary Ellen were working hard with me through this process. Disruptive venue changes 3 weeks before the conference are obviously not the recommended way of doing things:-) And yet it seems to be working out now, much better than I expected last weekend. Here’s the situation:

  1. Tutorials ~1600 registered with capacity for 1850. I expect this to run out of capacity, but it will take a little while. I don’t see a good way to increase capacity further.
  2. The main conference has ~2200 registered with capacity for 2400. Maybe this can be increased a little bit, but it is quite possible the main conference will run out of capacity as well. If it does, only authors will be allowed to register.
  3. Workshops ~1900 registered with capacity for 3000. Only the Deep Learning workshop requires a simulcast. It seems very unlikely that we’ll run out of capacity so this should be the least crowded part of the conference. We even have some left-over little rooms (capacity for 125 or less) that are looking for a creative use if you have one.

In this particular case, “New York” was both part of the problem and much of the solution. Where else can you walk around and find large rooms on short notice within 3 short blocks? That won’t generally be true in the future, so we need to think carefully about how to estimate attendance.

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 13:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 12:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 13:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Ross Levinsohn, EVP, Yahoo (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:46
Perhaps no man is braver than Ross Levinsohn, at least at Web 2. First of all, he's the top North American executive at a long-besieged and currently leaderless company, and second because he has not backed out of our conversation on Day One (this coming Monday). I spoke to Ross... (Go to Searchblog Main)
Categories: Blogroll

I Just Made a City...

Searchblog - Mon, 2011-10-10 14:41
...on the Web 2 Summit "Data Frame" map. It's kind of fun to think about your company (or any company) as a compendium of various data assets. We've added a "build your own city" feature to the map, and while there are a couple bugs to fix (I'd like... (Go to Searchblog Main)
Categories: Blogroll