Skip navigation.
Home
Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

<h2 style="background-color: white;

The Tools Behind Our Twitter Abuse Analysis with BuzzFeed
Or...How to Quantify Abuse in Tweets in 5 Working Days
When BuzzFeed approached us with the idea to quantify Twitter abuse towards politicians during the election campaign, we only had five working days, before the article had to be completed and go public.   

The goal was to use text analytics and analyse tweets replying to UK politicians, in the run up to the 2017 general election, in order to answer questions such as:

  • How wide spread is abuse received by politicians?
  • Who are the main politicians targeted by such abusive tweets?
  • Are there any party or gender differences?
  • Do abuse levels stay constant over time or not?  
So here I explain first how we collect the data for such studies and then how it gets analysed at scale and fast, all with our GATE-based open-source tools and their GATE Cloud text analytics-as-a-service deployment.

For researchers wishing more in-depth details, please read and cite our paper:

D. Maynard, I. Roberts, M. A. Greenwood, D. Rout and K. Bontcheva. A Framework for Real-time Semantic Social Media Analysis. Web Semantics: Science, Services and Agents on the World Wide Web, 2017 (in press). https://doi.org/10.1016/j.websem.2017.05.002, pre-print
Tweet Collection 
We already had all necessary tweets at hand, since, within an hour of #GE2017 being announced, I set up, using the GATE Cloud tweet collection service:
https://cloud.gate.ac.uk/shopfront/displayItem/twitter-collector 
the continuous collection of tweets by MPs, prominent politicians, parties, and candidates, as well as retweets and replies thereof. 

I also made a second twitter collector service running in parallel, to collect election related tweets based purely on hashtags and keywords (e.g. #GE2017, vote, election).
How We Analysed and Quantified Abuse 
Given the short 5 day deadline, we were pleased to have at hand the large-scale, real-time text analytics tools in GATE, Mimir/Prospector, and GATE Cloud. 

The starting point was the real-time text analysis pipeline from the Brexit research last year. That is capable of analysing up to 100 tweets per second (tps), although, in practice, the tweets usually were coming at the much lower 23 tps.  

This time, however, we adapted it with a new abuse analysis component, as well as some more up-to-date knowledge about the politicians (including the new prime minister). 




The analysis backbone was again GATE's TwitIE system, which consists of a tokenizer, normalizer, part-of-speech tagger, and a named entity recognizer. TwitIE is also available as-a-service on GATE Cloud, for easy integration and use.

Next, we added information about politicians, e.g. their names, gender, party, constituencies, etc. In this way, we could produce aggregate statistics, such as abuse-containing tweets aimed at Labour or Conservative male/female politicians. 

Next is a tweet geolocation component, which uses latitude/longitude, region, and user location metadata to geolocate tweets within the UK NUTS2 regions. This is not always possible, since many accounts and tweets lack such information, and this narrow down the sample significantly, should we choose to restrict by geo-location.

We also detect key themes and topics discussed in the tweets (more than one topic/theme can be contained in each tweet). Here we reused the module from the Brexit analyser.

The most exciting part was working with BuzzFeed's journalists to curate a set of abuse nouns typically aimed at people (e.g. twat), racist words, and milder insults (e.g. coward).  We decided to differentiate these from general obscene language and swearing, as these were not always targeting the politician. Nevertheless, they were included in the system, to produce a separate set of statistics. We introduced also basic sub-classification by kind (e.g. racial) and strength (e.g. mild, medium, strong), derived from an Ofcom research report on offensive language


Overall, we kept the processing pipeline as simple and efficient as possible, so it can run at 100 tweets per second even on a pretty basic server.  

The analysis results were fed into GATE Mimir, which indexes efficiently tweet text and all our linguistic annotations. Mimir has a powerful programming API for semantic search queries, which we use to drive the various interactive visualisations and to generate the necessary aggregate statistics behind them. 

For instance, we used Mimir queries to generate statistics and visualisations, based on time (e.g. most popular hashtags in abuse-containing tweets on 4 Jun); topic (e.g. the most talked about topics in such tweets), or target of the abusive tweet (e.g. the most frequently targeted politicians by party and gender). We could also navigate to the corresponding tweets behind these aggregate statistics, for a more in-depth analysis.

A rich sample of these statistics, associated visualisations, and abusive tweets is available in the BuzzFeed article.

Research carried out by:
Mark A. Greenwood, Ian Roberts, Dominic Rout, and myself, with ideas and other contributions from Diana Maynard and others from the GATE Team

Any mistakes are my own.
Categories: Blogroll

ICML is changing its constitution

Machine Learning Blog - Wed, 2017-07-19 15:01

Andrew McCallum has been leading an initiative to update the bylaws of IMLS, the organization which runs ICML. I expect most people aren’t interested in such details. However, the bylaws change rarely and can have an impact over a long period of time so they do have some real importance. I’d like to hear comment from anyone with a particular interest before this year’s ICML.

In my opinion, the most important aspect of the bylaws is the at-large election of members of the board which is preserved. Most of the changes between the old and new versions are aimed at better defining roles, committees, etc… to leave IMLS/ICML better organized.

Anyways, please comment if you have a concern or thoughts.

Categories: Blogroll

GATE, Java 9, and HDPI Monitors

Over the last couple of months a few people have mentioned that running GATE Developer on HDPI monitors is a bit of a pain. The problem is that Java (up to and including the latest version of Java 8) doesn't have any support for HDPI monitors. The only solution I'd heard people suggest was to reduce the resolution of the monitor before launching GATE, but as you can imagine this is far from an ideal solution.

Having recently upgraded my laptop I also ran into the same problem, and as this screenshot highlights, by default GATE Developer isn't at all usable on a HDPI screen.


A quick hunt around the web and you'll find all sorts of suggestions for getting Java 8 to work nicely with HDPI screens, but try as I might I couldn't get any of them to work for me; I'm running OpenJDK 8 under Ubuntu 16.04. Fortunately HDPI support is going to be built into Java 9. Unfortunately Java 9 still hasn't been officially released so you need to rely on an early access version.

In theory it should have been easy for me to see if Java 9 was a solution, but unfortunately the version of Java 9 in the Ubuntu 16.04 repositorie causes a segfault as soon as you try to run any Java GUI program making life more difficult than it needs to be.

The solution is to install the Oracle early access build of Java 9. You can either download the JDK manually, or follow these instructions under Ubuntu to install from the very useful Web Upd8 repository. Either way once installed, launching GATE gives a usable UI.


Unfortunately this isn't quite enough to solve the problem. Under the hood Java 9 introduces a modular component system (often referred to as Project Jigsaw) which includes new rules on encapsulation. The issue is that one of the libraries GATE uses for reading and writing applications, XStream, uses a number of tricks to access internal data that are prohibited under the new rules. The result is that you can't load or save applications which makes the GUI kind of pointless. Fortunately there is a command line option you can pass to the JVM that allows you to bypass the encapsulation rules. So to get GATE to work properly with Java 9 you need to add
--permit-illegal-accessto the command line. When launching the GUI this is easy to do by adding the flag as a new line in the gate.l4j.ini file which you will find in the GATE home folder.

There are two important things to note. Firstly this fix is only temporary as the command line flag will be removed in a later version of Java, and secondly depending how you are deploying GATE it can be difficult to alter the command line arguments (for example if deploying as a web app). Once Java 9 is officially released we'll look again at this problem to find a more permanent solution. Until then this gives you a way of using GATE on a HDPI monitor, but where possible (i.e. only on a HDPI monitor when you need the UI) we'd still advise using Java 8 and this hack as a last resort.
Categories: Blogroll

Machine Learning the Future Class

Machine Learning Blog - Mon, 2017-06-12 11:37

This spring, I taught a class on Machine Learning the Future at Cornell Tech covering a number of advanced topics in machine learning including online learning, joint (structured) prediction, active learning, contextual bandit learning, logarithmic time prediction, and parallel learning. Each of these classes was recorded from the laptop via Zoom and I just uploaded the recordings to Youtube.

In some ways, this class is a followup to the large scale learning class I taught with Yann LeCun 4 years ago. The videos for that class were taken down(*) so these lectures both update and replace shared subjects as well as having some new subjects.

Much of this material is fairly close to research so to assist other machine learning lecturers around the world in digesting the material, I’ve made all the source available as well. Feel free to use and improve.

(*) The NYU policy changed so that students could not be shown in classroom videos.

Categories: Blogroll

Fact over Fiction

Machine Learning Blog - Sat, 2017-04-22 15:23

Politics is a distracting affair which I generally believe it’s best to stay out of if you want to be able to concentrate on research. Nevertheless, the US presidential election looks like something that directly politicizes the idea and process of research by damaging the association of scientists & students, funding for basic research, and creating political censorship.

A core question here is: What to do? Today’s March for Science is a good step, but I’m not sure it will change many minds. Unlike most scientists, I grew up in a a county (Linn) which voted overwhelmingly for Trump. As a consequence, I feel like I must translate the mindset a bit. For the median household left behind over my lifetime a march by relatively affluent people protesting the government cutting expenses will not elicit much sympathy. Discussion about the overwhelming value of science may also fall on deaf ears simply because they have not seen the economic value personally. On the contrary, they have seen their economic situation flat or worsening for 4 decades with little prospect for things getting better. Similarly, I don’t expect history lessons on anti-intellectualism to make much of a dent. Fundamentally, scientists and science fans are a small fraction of the population.

What’s needed is a campaign that achieves broad agreement across the population and which will help. One of the roots of the March for Science is a belief in facts over fiction which may have the requisite scope. In particular, there seems to be a good case that the right to engage in mass disinformation has been enormously costly to the United States and is now a significant threat to civil stability. Internally, disinformation is a preferred tool for starting wars or for wealthy companies to push a deadly business model. Externally, disinformation is now being actively used to sway elections and is self-funding.

The election outcome is actually less important than the endemic disagreement that disinformation creates. When people simply believe in different facts about the world how can you expect them to agree? There probably are some good uses of mass disinformation somewhere, but I’m extremely skeptical the value exceeds the cost.

Is opposition to mass disinformation broad enough that it makes a good organizing principle? If mass disinformation was eliminated or greatly reduced it would be an enormous value to society, particularly to the disinformed. It would not address the fundamental economic stagnation of the median household in the United States, but it would remove a significant threat to civil society which may be necessary for such progress. Given a choice between the right to mass disinform and democracy, I choose democracy.

A real question is “how”? We are discussing an abridgment of freedom of speech so from a legal perspective the basis must rest on the balance between freedom of speech and other constitutional rights. Many abridgements exist like censuring a yell of “fire” in a crowded theater unnecessarily.

Voluntary efforts (as Facebook and Twitter have undertaken) are a start, but it seems unlikely to go far enough as many other “news” organizations have made no such commitments. A system where companies commit to informing over disinforming and in return become both more trusted and simultaneously liable for disinformation damages (due to the disinformed) as assessed by civil law may make sense. Right now organizations are mostly free to engage in disinformation as long as it is not directed at an individual where libel laws apply. Penalizing an organization for individual mistakes seems absurd, but a pattern of errors backed by scientific surveys verifying an anomalously misinformed status of viewers/readers/listeners is cause for action. Getting this right is obviously a tricky thing—we want a solution that a real news organization with an existing mimetic immune system prefers to the status quo because it curbs competitors that disinform. At the same time, there must be enough teeth to make disinformation uneconomical or the problem only grows.

Should disinformation have criminal penalties? One existing approach here uses RICO laws to counter disinformation from Tobacco companies. Reading the history, this took an amazing amount of time—enough that it was ineffective for a generation. It seems plausible that an act directly addressing disinformation may be helpful.

What about technical solutions? Technical solutions seem necessary for success, perhaps with changes to law incentivizing this. It’s important to understand that something going before the courts is inherently slow, particularly because courts tend to be deeply overloaded. A significant increase in the number of cases going before courts makes an approach nonviable in practice.

Would we regret this? There is a long history of governments abusing laws to censor inconvenient news sources so caution is warranted. Structuring new laws in a manner such that they cannot be abused is an important consideration. It is obviously important to leave satire fully intact which seems entirely possibly by making the fact that it is satire unmistakable. This entire discussion is also not relevant to individuals speaking to other individuals—that is not what creates a problem.

Is this possible? It might seem obvious that mass disinformation should be curbed but there should be no doubt that powerful forces will work to preserve mass disinformation by subtle and unethical means.

Overall, I fundamentally believe that people in a position to inform or disinform have a responsibility to inform. If they don’t want that responsibility, then they should abdicate the position to someone who does, similar in effect to the proposed fiduciary rule for investments. I’m open to any approach towards achieving this.

Edit: also at CACM.

Categories: Blogroll

The Decision Service is Hiring

Machine Learning Blog - Wed, 2017-04-12 09:31

The Decision Service is a first-in-the-world project making tractable reinforcement learning easily used by developers everywhere. We are hiring for devel opers, data scientist, and a product manager. Please consider joining us to do something interesting this life

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 13:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 12:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 13:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Ross Levinsohn, EVP, Yahoo (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:46
Perhaps no man is braver than Ross Levinsohn, at least at Web 2. First of all, he's the top North American executive at a long-besieged and currently leaderless company, and second because he has not backed out of our conversation on Day One (this coming Monday). I spoke to Ross... (Go to Searchblog Main)
Categories: Blogroll

I Just Made a City...

Searchblog - Mon, 2011-10-10 14:41
...on the Web 2 Summit "Data Frame" map. It's kind of fun to think about your company (or any company) as a compendium of various data assets. We've added a "build your own city" feature to the map, and while there are a couple bugs to fix (I'd like... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Vic Gundotra, SVP, Google (And Win Free Tix to Web 2)

Searchblog - Mon, 2011-10-10 14:03
Next up on Day 3 of Web 2 is Vic Gundotra, the man responsible for what Google CEO Larry Page calls the most exciting and important project at this company: Google+. It's been a long, long time since I've heard as varied a set of responses to any Google project... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview James Gleick, Author, The Information (And Win Free Tix to Web 2)

Searchblog - Sat, 2011-10-08 21:16
Day Three kicks off with James Gleick, the man who has written the book of the year, at least if you are a fan of our conference theme. As I wrote in my review of "The Information," Gleick's book tells the story of how, over the past five thousand or... (Go to Searchblog Main)
Categories: Blogroll

I Wish "Tapestry" Existed

Searchblog - Fri, 2011-10-07 15:34
(image) Early this year I wrote File Under: Metaservices, The Rise Of, in which I described a problem that has burdened the web forever, but to my mind is getting worse and worse. The crux: "...heavy users of the web depend on scores - sometimes hundreds - of services,... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Steve Ballmer, CEO of Microsoft (And Win Free Tix to Web 2)

Searchblog - Fri, 2011-10-07 13:17
Day Two at Web 2 Summit ends with my interview of Steve Ballmer. Now, the last one, some four years ago, had quite a funny moment. I asked Steve about how he intends to compete with Google on search. It's worth watching. He kind of turns purple. And not... (Go to Searchblog Main)
Categories: Blogroll

Me, On The Book And More

Searchblog - Thu, 2011-10-06 13:05
Thanks to Brian Solis for taking the time to sit down with me and talk both specifically about my upcoming book, as well as many general topics.... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Michael Roth, CEO of Interpublic Group (And Win Free Tix to Web 2)

Searchblog - Thu, 2011-10-06 12:37
What's the CEO of a major advertising holding company doing at Web 2 Summit? Well, come on down and find out. Marketing dollars are the oxygen in the Internet's bloodstream - the majority of our most celebrated startups got that way by providing marketing solutions to advertisers of all stripes.... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Mary Meeker of KPCB (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-05 15:00
For the first time in eight years, Mary Meeker will let me ask her a few questions after she does her famous market overview. Each year, Mary pushes the boundaries of how many slides she can cram into one High Order Bit, topping out at 70+ slides in ten or... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Dennis Crowley, CEO, Foursquare (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-05 13:06
Foursquare co-founder and CEO Dennis Crowley will give his first 1-1 interview on the Web 2 stage on the conference's second day, following a morning of High Order Bits and a conversation on privacy policy with leaders from government in both the US and Canada. After Crowley will be a... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Michael Dell, CEO, Dell (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-04 12:35
Not unlike Steve Jobs back in the 1990s, Michael Dell returned to the helm of his company at a crucial moment, when his namesake was seemingly rudderless. Back in 2007, Dell was losing marketshare to HP, Apple had not yet proven the monster it has since become in mobile, and... (Go to Searchblog Main)
Categories: Blogroll
Syndicate content