Skip navigation.
Semantic Software Lab
Concordia University
Montréal, Canada


GATE's submission wins 2nd place in United Nations General Assembly Resolutions Extraction and Elicitation Global Challenge

In May 2019 we submitted a prototype to the United Nations General Assembly Resolutions Extraction and Elicitation Global Challenge, which asked for submissions using mature natural language processing techniques to produce semantically enhanced, machine-readable documents from PDFs of UN GA resolutions, with particular interest in identifying named entities and items in certain thesauri and ontologies and in making use of the document structure (in particular, the distinction between preamble and operative paragraphs).
Our prototype included a customized GATE application designed to read PDFs of United Nations General Assembly resolutions and identify named entities, resolution adoption information (resolution number and adoption date), preamble sections, operative sections, and references to keywords and phrases in the English parts of the UN Bibliographical Information System thesaurus.
We downloaded and automatically annotated over 2800 resolution documents and pushed the results into a Mímir index to allow semantic search using combinations of the entities and sections identified, such as the following (more examples are provided in the documentation that we submitted):
  • find sentences in operative paragraphs containing a person and an UNBIS term;
  • find preamble paragraphs containing a person, an organization, and a date;
  • find combinations referring to a specific UNBIS term.

We also developed an easier to use web front end for exploring co-occurrences of keywords and semantic annotations.

We are excited to receive the second place award, along with an invitation to improve our work with more feedback and a "lessons learned" discussion with the panel. The panel highlighted in particular the submission of comprehensive and testable code, and the use of GATE as a mature respected framework.

Our GitHub site contains the information extraction and search front end software, licensed under the GPL-3.0 and available for anyone to download and use.
Categories: Blogroll

ICML has 3(!) Real World Reinforcement Learning Workshops

Machine Learning Blog - Fri, 2019-06-07 11:28

The first is Sunday afternon during the Industry Expo day. This one is meant to be quite practical, starting with an overview of Contextual Bandits and leading into how to apply the new Personalizer service, the first service in the world functionally supporting general contextual bandit learning.

The second is Friday morning. This one is more academic with many topics. I’ll personally be discussing research questions for real world RL.

The third one is Friday afternoon with more emphasis on sequences of decisions. I expect to here “imitation learning” multiple times

I’m planning to attend all 3. It’s great to see interest building in this direction, because Real World RL seems like the most promising direction for fruitfully expanding the scope of solvable machine learning problems.

Categories: Blogroll

Toxic Online Discussions during the UK European Parliament Election Campaign

The Brexit Party attracted the most engagement on Twitter in the run-up to the UK European Parliament election on May 23rd, their candidates receiving as many tweets as all the other parties combined. Brexit Party leader Nigel Farage was the most interacted-with UK candidate on Twitter, with over twice as many replies as the next most replied-to candidate, Andrew Adonis of the Labour Party.

We studied all tweets sent to or from (or retweets of or by) UK European Election candidates in the month of May, and classified them as abusive or not using the classifier presented here. It must be noted, in particular, that the classifier only identifies reliably whether a reply is abusive or not. It is not sufficiently accurate for us to reliably judge the target politician or party of this abusive reply. What this means is that we can only reliably identify which EP candidates triggered abuse-containing discussion threads on Twitter, but that often this abuse is actually aimed at other politicians or parties.

In addition to attracting the most replies, the Brexit Party candidates also triggered an unusually high level of abuse-containing Twitter discussions. In particular, we found that posts by Farage triggered almost six times as many abuse-containing Twitter threads than the next most replied to candidate, Gavin Esler of Change UK, during May 2019.

There is an important difference, however, in that that many of the abuse-containing replies to posts by Farage and the Brexit Party were actually abusive towards other politicians (most notably the prime minister and the leader of the Labour party) and not Farage himself. In contrast, abusive replies to Gavin Esler were primarily aimed at the politician himself, triggered by his use of the phrase "village idiot" in connection with the Leave Campaign.

Candidates from other parties that triggered unusually high levels of abuse-containing discussions were those from the UK Independence Party, now considered far right, and Change UK, a newly formed but unstable remain party. Change UK was the most active on Twitter, with candidates sending more tweets than other parties. Gavin Esler was the most replied-to Change UK candidate, and also received an unusually high level of abuse. The abuse often referred to his use of the phrase "village idiot" in connection with the leave campaign, which resulted in anger and resentment.

In contrast, MEP candidates from the Conservative and Labour Parties were not hubs of polarised, abuse-containing discussions on Twitter.

What these findings, unsurprisingly, demonstrate is that politicians and parties who themselves use divisive and abusive language, for example, to brand political opponents as “village idiots”, “traitors”, or as “desperate to betray”, are thus triggering the toxic online responses and deep political antagonism that we have witnessed.

After the Brexit Party, the next most replied-to MEP candidates were from the Labour partyAfter the Brexit Party, the next most replied-to party was Labour, according to the study, followed by Change UK.

MEP candidates from both the Liberal Democrats and the Green Party were also active on Twitter, with the Green MEP candidates second only to Change UK ones for number of tweets sent, but didn't get a lot of engagement in return. The Liberal Democrats in particular received a low number of replies. This may suggest that these parties became the choices of default for a population of discouraged remainers, as both made gains in the election. Both parties attracted a particularly civil tone of reply.

Brexit Party candidates were also the ones that replied most to those who tweeted them, rather than authoring original tweets or retweeting other tweets.

Acknowledgements: Research carried out by Genevieve Gorrell, Mehmet Bakir, and Kalina Bontcheva. This work was partially supported by the European Union under grant agreements No. 654024 SoBigData and No. 825297 WeVerify.
Categories: Blogroll

GATE at World Press Freedom Day

GATE at World Press freedom day: STRENGTHENING THE MONITORING OF SDG 16.10.1
In her role with CFOM (the University's Centre for Freedom of the Media, hosted in the department of Journalism Studies), Diana Maynard travelled to Ethiopia together with CFOM members Sara Torsner and Jackie Harrison to present their research at the World Press Freedom Day Academic Conference on the Safety of journalists in Addis Ababa, on 1 May, 2019. This ongoing research aims to facilitate the comprehensive monitoring of violations against journalists, in line with Sustainable Development Goal (SDG) 16.10.1. This is part of a collaborative project between CFOM and the press freedom organisation Free Press Unlimited, which aims to develop a methodology for systematic data collection on a range of attacks on journalists, and to provide a mechanism for dealing with missing, conflicting and potentially erroneous information.
Discussing possibilities for adopting NLP tools for developing a monitoring infrastructure that allows for the systematisation and organisation of a range of information and data sources related to violations against journalists, Diana proposed a set of areas of research that aim to explore this in more depth. These include: switching to an events-based methodology, reconciling data from multiple sources, and investigating information validity.

Whereas approaches to monitoring violations against journalists traditionally uses a person-based approach, recording information centred around an individual, we suggest that adopting an events-based methodology instead allows for the violation itself to be placed at the centre: ‘by enabling the contextualising and recording of in-depth information related to a single instance of violence such as a killing, including information about key actors and their interrelationship (victim, perpetrator and witness of a violation), the events-based approach enables the modelling of the highly complex structure of a violation. It also allows for the recording of the progression of subsequent violations as well as multiple violations experienced by the same victim (e.g. detention, torture and killing)’.
Event-based data model from HURIDOCS Source:
: Another area of research includes possibilities for reconciling information from different databases and sources of information on violations against journalists through NLP techniques. Such methods would allow for the assessment and compilation of partial and contradictory data about the elements constituting a given attack on a journalist. ‘By creating a central categorisation scheme we would essentially be able to facilitate the mapping and pooling of data from various sources into one data source, thus creating a monitoring infrastructure for SDG 16.10.1’, said Diana Maynard. Systematic data on a range of violations against journalists that are gathered in a methodologically systematic and transparent way would also be able to address issues of information validity and source verification: ‘Ultimately such data would facilitate the investigation of patterns, trends and early warnings, leading to a better understanding of the contexts in which threats to journalists can escalate into a killing undertaken with impunity’. We thus propose a framework for mapping between different datasets and event categorisation schemes in order to harmonise information.

In our proposed methodology, GATE tools can be used to extract information from the free text portions of existing databases and link them to external knowledge sources in order to acquire more detailed information about an event, and to enable semantic reasoning about entities and events, thereby helping to both reconcile information at different levels of granularity (e.g. Dublin vs Ireland; shooting vs killing) and to structure information for further search and analysis. 

Slides from the presentation are available here; the full journal paper is forthcoming.
The original article from which this post is adapted is available on the CFOM website
Categories: Blogroll

WeVerify: Algorithm-Supported Verification of Digital Content

Announcing WeVerify: a new project developing AI-based tools for computer-supported digital content verification. The WeVerify platform will provide an independent and community driven environment for the verification of online content, to be used to assist journalists in gathering and verifying quickly online content. Prof. Kalina Bontcheva will be serving as the Scientific Director of the project.

Online disinformation and fake media content have emerged as a serious threat to democracy, economy and society. Content verification is currently far from trivial, even for experienced journalists, human rights activists or media literacy scholars. Moreover, recent advances in artificial intelligence (deep learning) have enabled the creation of intelligent bots and highly realistic synthetic multimedia content. Consequently, it is extremely challenging for citizens and journalists to assess the credibility of online content, and to navigate the highly complex online information landscapes.

WeVerify aims to address the complex content verification challenges through a participatory verification approach, open source algorithms, low-overhead human-in-the-loop machine learning and intuitive visualizations. Social media and web content will be analysed and contextualised within the broader online ecosystem, in order to expose fabricated content, through cross-modal content verification, social network analysis, micro-targeted debunking and a blockchain-based public database of known fakes.

Add captionA key outcome will be the WeVerify platform for collaborative, decentralised content verification, tracking, and debunking.

The platform will be open source to engage communities and citizen journalists alongside newsroom and freelance journalists. To enable low-overhead integration with in-house content management systems and support more advanced newsroom needs, a premium version of the platform will also be offered. It will be furthermore supplemented by a digital companion to assist with verification tasks.

Results will be validated by professional journalists and debunking specialists from project partners (DW, AFP, DisinfoLab), external participants (e.g. members of the First Draft News network), the community of more than 2,700 users of the InVID verification plugin, and by media literacy, human rights and emergency response organisations.

The WeVerify website can be found at, and WeVerify can be found on Twitter @WeV3rify!
Categories: Blogroll

Coming Up: 12th GATE Summer School 17-21 June 2019

It is approaching that time of the year again! The GATE training course will be held from 17-21 June 2019 at the University of Sheffield, UK.

No previous experience or programming expertise is necessary, so it's suitable for anyone with an interest in text mining and using GATE, including people from humanities backgrounds, social sciences, etc.

This event will follow a similar format to that of the 2018 course, with one track Monday to Thursday, and two parallel tracks on Friday, all delivered by the GATE development team. You can read more about it and register here. Early bird registration is available at a discounted rate until 1 May.

The focus will be on mining text and social media content with GATE. Many of the hands on exercises will be focused on analysing news articles, tweets, and other textual content.

The planned schedule is as follows (NOTE: may still be subject to timetabling changes).
Single track from Monday to Thursday (9am - 5pm):
  • Monday: Module 1: Basic Information Extraction with GATE
    • Intro to GATE + Information Extraction (IE)
    • Corpus Annotation and Evaluation
    • Writing Information Extraction Patterns with JAPE
  • Tuesday: Module 2: Using GATE for social media analysis
    • Challenges for analysing social media, GATE for social media
    • Twitter intro + JSON structure
    • Language identification, tokenisation for Twitter
    • POS tagging and Information Extraction for Twitter
  • Wednesday: Module 3: Crowdsourcing, GATE Cloud/MIMIR, and Machine Learning
    • Crowdsourcing annotated social media content with the GATE crowdsourcing plugin
    • GATE Cloud, deploying your own IE pipeline at scale (how to process 5 million tweets in 30 mins)
    • GATE Mimir - how to index and search semantically annotated social media streams
    • Challenges of opinion mining in social media
    • Training Machine Learning Models for IE in GATE
  • Thursday: Module 4: Advanced IE and Opinion Mining in GATE
    • Advanced Information Extraction
    • Useful GATE components (plugins)
    • Opinion mining components and applications in GATE
On Friday, there is a choice of modules (9am - 5pm):
  • Module 5: GATE for developers
    • Basic GATE Embedded
    • Writing your own plugin
    • GATE in production - multi-threading, web applications, etc.
  • Module 6: GATE Applications
    • Building your own applications
    • Examples of some current GATE applications: social media summarisation, visualisation, Linked Open Data for IE, and more
These two modules are run in parallel, so you can only attend one of them. You will need to have some programming experience and knowledge of Java to follow Module 5 on the Friday. No particular expertise is needed for Module 6.
Hope to see you in Sheffield in June!
Categories: Blogroll
Syndicate content