Semantic Software Lab
Concordia University
Montréal, Canada


GATE team wins first prize in the Hyperpartisan News Detection Challenge

SemEval 2019 recently launched the Hyperpartisan News Detection Task in order to evaluate how well tools could automatically classify hyperpartisan news texts. The idea behind this is that "given a news text, the system must decide whether it follows a hyperpartisan argumentation, i.e. whether it exhibits blind, prejudiced, or unreasoning allegiance to one party, faction, cause, or person."

Below we see an example of (part of) two news stories about Donald Trump from the challenge data. The one on the left is considered to be hyperpartisan, as it shows a biased kind of viewpoint. The one on the right simply reports a story and is not considered hyperpartisan. The distinction is difficult even for humans, because there are no exact rules about what makes a story hyperpartisan.

In total, 322 teams registered to take part, of which 42 actually submitted an entry, including the GATE team consisting of Ye Jiang, Xingyi Song and Johann Petrak, with guidance from Kalina Bontcheva and Diana Maynard.

The main performance measure for the task is accuracy on a balanced set of articles; precision, recall, and F1-score were additionally measured for the hyperpartisan class. In the final submission, the GATE team's hyperpartisan classification algorithm achieved 0.822 accuracy on the manually annotated evaluation set and ranked first on the final leaderboard.

Our winning system was based on sentence representations built from averaged word embeddings generated by the pre-trained ELMo model, combined with a Convolutional Neural Network and Batch Normalization for training on the provided dataset. An averaged ensemble of models was then used to generate the final predictions.
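The two core ideas here, mean-pooled sentence vectors and probability-averaged ensembling, can be sketched in a few lines. This is an illustrative toy, not the team's actual code: the random vectors stand in for real ELMo token embeddings, and the model probabilities are made up.

```python
import numpy as np

# Toy stand-in vectors: in the real system each token vector would come
# from the pre-trained ELMo model; random vectors are used here purely
# for illustration.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(12, 1024))  # 12 tokens, 1024-dim

# Sentence representation: the average of the token embeddings.
sentence_vec = token_embeddings.mean(axis=0)

# Averaged ensemble: each trained model emits P(hyperpartisan); the
# final prediction averages those probabilities and thresholds at 0.5.
model_probs = np.array([0.71, 0.64, 0.58])  # hypothetical model outputs
ensemble_prob = float(model_probs.mean())
is_hyperpartisan = ensemble_prob > 0.5
```

In the full system the averaged sentence vectors would feed a CNN with Batch Normalization; only the pooling and ensembling steps are shown here.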

The source code and full system description are available on GitHub.

Categories: Blogroll

Russian Troll Factory: Sketches of a Propaganda Campaign

When Twitter shared a large archive of propaganda tweets late in 2018, we were excited to get access to over 9 million tweets from almost 4,000 unique Twitter accounts controlled by Russia's Internet Research Agency. The tweets were posted in 57 different languages, but most are in Russian (53.68%) and English (36.08%). The average account age is around four years, and the oldest accounts are as much as ten years old.
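Percentages like these come straight from counting the language tag on each tweet. A minimal sketch, using made-up counts chosen to reproduce the quoted shares (the real archive has ~9M tweets across 57 languages):

```python
from collections import Counter

# Hypothetical per-tweet language codes standing in for the archive's
# metadata; the counts are invented for illustration.
langs = ["ru"] * 5368 + ["en"] * 3608 + ["uk"] * 1024

counts = Counter(langs)
total = sum(counts.values())
# Percentage share of each language, rounded to two decimals.
shares = {lang: round(100 * n / total, 2) for lang, n in counts.items()}
# With these made-up counts: {"ru": 53.68, "en": 36.08, "uk": 10.24}
```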

A large amount of activity in both the English and Russian accounts is devoted to news provision. Many accounts also engage in hashtag games, which may be a way to establish an account and attract followers. Of particular interest, however, are the political trolls. Left trolls pose as individuals interested in the Black Lives Matter campaign. Right trolls are patriotic, anti-immigration Trump supporters. Among left and right trolls, several have achieved large follower numbers and even a degree of fame. Finally, there are fearmonger trolls, which propagate scares, and a small number of commercial trolls. The Russian-language accounts divide along similar lines, perhaps posing as individuals with opinions about Ukraine or western politics. These categories were proposed by Darren Linvill and Patrick Warren of Clemson University. In the word clouds below you can see the hashtags we found left and right trolls using.

Left Troll Hashtags
Right Troll Hashtags

Mehmet E. Bakir has created some interactive graphs enabling us to explore the data. In the network diagram at the start of the post you can see the network of mention/retweet/reply/quote counts we created from the highly followed accounts in the set. You can click through to an interactive version, where you can zoom in and explore different troll types.

In the graph below, you can see activity in different languages over time (interactive version here, or interact with the embedded version below; you may have to scroll right). It shows that the Russian language operation came first, with English language operations following after. The timing of this part of the activity coincides with Russia's interest in Ukraine.

In the graph below, also available here, you can see how different types of behavioural strategy pay off in terms of achieving higher numbers of retweets. Using Linvill and Warren's manually annotated data, Mehmet built a classifier that enabled us to classify all the accounts in the dataset. It is evident that the political trolls have by far the greatest impact in terms of retweets achieved, with left trolls being the most successful. Russia's interest in the Black Lives Matter campaign perhaps suggests that the first challenge for agents is to win a following, and that exploiting divisions in society is an effective way to do that. How that following is then used to influence minds is a separate question. You can see a pre-print of our paper describing our work so far, in the context of the broader picture of partisanship, propaganda and post-truth politics, here.


Teaching computers to understand the sentiment of tweets

As part of the EU SoBigData project, the GATE team hosts a number of short research visits, between 2 weeks and 2 months, for all kinds of data scientists (PhD students, researchers, academics, professionals) to come and work with us and to use our tools and/or datasets on a project involving text mining and social media analysis. Kristoffer Stensbo-Smidt visited us in the summer of 2018 from the University of Copenhagen, to work on developing machine learning tools for sentiment analysis of tweets, and was supervised by GATE team member Diana Maynard and by former team member Isabelle Augenstein, who is now at the University of Copenhagen. Kristoffer has a background in Machine Learning but had not worked in NLP before, so this visit helped him understand how to apply his skills to this kind of domain.

After his visit, Kristoffer wrote up an excellent summary of his research. He essentially tested a number of different approaches to processing text, and analysed how much of the sentiment they were able to identify. Given a tweet and an associated topic, the aim is to ascertain automatically whether the sentiment expressed about this topic is positive, negative or neutral. Kristoffer experimented with different word-embedding-based models in order to test how much information different word embeddings carry about the sentiment of a tweet. This involved choosing which embedding models to test, and how to transform the topic vectors. The main conclusion he drew from the work was that, in general, word embeddings contain a lot of useful information about sentiment, with newer embeddings containing significantly more. This is not particularly surprising, but shows the importance of advanced models for this task.
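To make the "embeddings carry sentiment information" idea concrete, here is a minimal sketch of an embedding-based sentiment probe. Everything is hypothetical: the embeddings are random stand-ins for a pre-trained model, and the nearest-centroid rule is a deliberately simple baseline, not Kristoffer's actual setup.

```python
import numpy as np

# Toy embedding lookup; real experiments would load pre-trained
# word embeddings (the vocabulary and vectors here are illustrative).
rng = np.random.default_rng(1)
vocab = ["good", "great", "bad", "awful", "battery", "phone"]
emb = {w: rng.normal(size=50) for w in vocab}

def tweet_vector(tokens):
    """Represent a tweet as the mean of its known word embeddings."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Nearest-centroid baseline: compare the tweet vector against the
# averaged embeddings of positive and negative seed words.
pos_centroid = np.mean([emb["good"], emb["great"]], axis=0)
neg_centroid = np.mean([emb["bad"], emb["awful"]], axis=0)

def sentiment(tokens):
    v = tweet_vector(tokens)
    if cosine(v, pos_centroid) >= cosine(v, neg_centroid):
        return "positive"
    return "negative"
```

A stronger embedding model would place sentiment-bearing words more cleanly around such centroids, which is one intuition behind newer embeddings carrying more sentiment information.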


3rd International Workshop on Rumours and Deception in Social Media (RDSM)

June 11, 2019 in Munich, Germany
Co-located with ICWSM 2019

Abstract
The 3rd edition of the RDSM workshop will focus in particular on online information disorder and its interplay with public opinion formation.

Social media is a valuable resource for mining all kinds of information, from opinions to factual content. However, social media also harbours serious threats to society, chief among them online information disorder and its power to shape public opinion. Known aspects include the spread of false rumours, fake news, and social attacks such as hate speech or other forms of harmful posts. The aim of this workshop is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of information disorder and manipulation of public opinion. The focus will be on themes such as the detection of fake news, the verification of rumours, and the understanding of their impact on public opinion. Furthermore, we place particular emphasis on the usefulness and trust aspects of automated solutions tackling these themes.
Workshop Theme and Topics
The aim of this workshop is to bring together researchers and practitioners interested in social media mining and analysis to deal with the emerging issues of veracity assessment, fake news detection and manipulation of public opinion. We invite researchers and practitioners to submit papers reporting results on these issues. Qualitative studies performing user studies on the challenges encountered with the use of social media, such as the veracity of information and fake news detection, as well as papers reporting new datasets, are also welcome. Finally, we also welcome studies reporting on the usefulness and trust of social media tools tackling the aforementioned problems.

Topics of interest include, but are not limited to:
  • Detection and tracking of rumours.
  • Rumour veracity classification.
  • Fact-checking social media.
  • Detection and analysis of disinformation, hoaxes and fake news.
  • Stance detection in social media.
  • Qualitative user studies assessing the use of social media.
  • Bots detection in social media.
  • Measuring public opinion through social media.
  • Assessing the impact of social media in public opinion.
  • Political analyses of social media.
  • Real-time social media mining.
  • NLP for social media analysis.
  • Network analysis and diffusion of dis/misinformation.
  • Usefulness and trust analysis of social media tools.
  • AI-generated fake content (image/text).

Workshop Program Format

We will have 1-2 experts in the field delivering keynote speeches, followed by 8-10 presentations of peer-reviewed submissions, organised into three sessions by subject (the first two sessions on online information disorder and public opinion, the third on usefulness and trust aspects). After the sessions we also plan a group-work activity (groups of 4-5 attendees), in which each group will sketch a social media tool for tackling, e.g., rumour verification or fake news detection. The emphasis of the sketch should be on aspects such as usefulness and trust. This should take no longer than 120 minutes (sketching plus presentation/discussion time). We will close the workshop with a summary and take-home messages (max. 15 minutes). Attendance will be open to all interested participants.

We welcome both full papers (5-8 pages) to be presented as oral talks and short papers (2-4 pages) to be presented as posters and demos.

Workshop Schedule/Important Dates
  • Submission deadline: March 25th 2019
  • Notification of Acceptance: April 12th 2019
  • Camera-Ready Versions Due: April 26th 2019
  • Workshop date: June 11, 2019

Submission Procedure
We invite two kinds of submissions:

-  Long papers/Brief Research Report (max 8 pages + 2 references)
-  Demos and posters (short papers) (max 4 pages + 2 references)
Proceedings of the workshop will be published jointly with other ICWSM workshops in a special issue of Frontiers in Big Data.

Papers must be submitted electronically in PDF format, or any other format supported by the submission site (click on "Submit your manuscript"). Note that submitting authors should choose one of the specific track organizers as their preferred Editor. Detailed information on the file submission requirements can be found on the submission site.

Submissions will be peer-reviewed by at least three members of the programme committee, and accepted papers will appear in the proceedings.
Workshop Organizers
Programme Committee (Tentative)
  • Nikolas Aletras, University of Sheffield, UK
  • Emilio Ferrara, University of Southern California, USA
  • Bahareh Heravi, University College Dublin, Ireland
  • Petya Osenova, Ontotext, Bulgaria
  • Damiano Spina, RMIT University, Australia
  • Peter Tolmie, Universität Siegen, Germany
  • Marcos Zampieri, University of Wolverhampton, UK
  • Milad Mirbabaie, University of Duisburg-Essen, Germany
  • Tobias Hecking, University of Duisburg-Essen, Germany
 Invited Speaker(s)
To be announced
Sponsors
This workshop is supported by the European Union under grant agreement No. 654024 (SoBigData), and by the EU co-funded Horizon 2020 project that deals with algorithm-supported verification of digital content.


SoBigData-funded travel grant for short-term visiting scholars

As part of SoBigData's Transnational Access (TNA) activities, the Department of Computer Science at the University of Sheffield is keen to host scholars from non-UK universities who would like to visit Sheffield to undertake a short period of research, as part of a scheme to promote international cooperation and the dissemination of knowledge. Grants are available to cover 1-2 months of research for scholars at non-UK universities/organisations. During the visit, scholars will join one of the following research projects:

  • Social media part-of-speech tagging in multiple languages — Part of speech is one of the most widely used linguistic features for analysing social media content. This project aims to build models that tag social media content with the universal POS tag set.
  • Social media named entity recognition in multiple languages — Named entities are generally presented differently in social media than in news articles, so NER systems trained on news articles do not perform well on social media. This project aims to build NER models for social media in different European languages.
  • Sentiment analysis for Twitter posts — Sentiment analysis is one of the basic components used to analyse societal debates. This project aims to build a sentiment analysis model for short and noisy Twitter posts.

What is covered (up to 4,500 euros):
  • Return flight/train tickets to Sheffield
  • Accommodation during the visiting period
  • Daily subsistence
  • GATE Summer School
  • A mentor from the GATE team

Deadlines:
  • Application deadline: 30 March 2019
  • Notification: within 2 months after submission

Eligibility Requirements: Candidates must:
  • hold a PhD degree or be enrolled in a doctoral programme offered by an educational institution recognised by that country's authorities
  • not be enrolled as a student, or employed, at a higher education institution in the United Kingdom
  • resume studies/work in their home country after the end of the grant period

How to apply:
Applicants should apply through SoBigData Transnational Access.
Submit the completed application form. For any questions related to the projects, please contact Xingyi Song.

FAQ on ICML 2019 Code Submission Policy

Machine Learning Blog - Wed, 2018-12-19 16:52

ICML 2019 has an option for supplementary code submission that the authors can use to provide additional evidence to bolster their experimental results. Since we have been getting a lot of questions about it, here is a Frequently Asked Questions for authors.

1. Is code submission mandatory?

No. Code submission is completely optional, and we anticipate that high quality papers whose results are judged by our reviewers to be credible will be accepted to ICML, even if code is not submitted.

2. Does submitted code need to be anonymized?

ICML is a double blind conference, and we expect authors to put in reasonable effort to anonymize the submitted code and institution. This means that author names and licenses that reveal the organization of the authors should be removed.

Please note that submitted code will not be made public; for example, only the reviewers, Area Chair and Senior Area Chair in charge will have access to it during the review period. If the paper gets accepted, we expect the authors to replace the submitted code with a non-anonymized version or a link to a public GitHub repository.

3. Are anonymous github links allowed?

Yes. However, they have to be on a branch that will not be modified after the submission deadline. Please include the GitHub link in a standalone text file within the submitted zip file.

4. How will the submitted code be used for decision-making?

The submitted code will be used as additional evidence provided by the authors to add more credibility to their results. We anticipate that high quality papers whose results are judged by our reviewers to be credible will be accepted to ICML, even if code is not submitted. However, if something is unclear in the paper, then code, if submitted, will provide an extra chance to the authors to clarify the details. To encourage code submission, we will also provide increased visibility to papers that submit code.

5. If code is submitted, do you expect it to be published with the rest of the supplementary? Or, could it be withdrawn later?

We expect submitted code to be published with the rest of the supplementary. However, if the paper gets accepted, then the authors will get a chance to update the code before it is published by adding author names, licenses, etc.

6. Do you expect the code to be standalone? For example, what if it is part of a much bigger codebase?

We expect your code to be readable and helpful to reviewers in verifying the credibility of your results. It is possible to do this through code that is not standalone — for example, with proper documentation.

7. What about pseudocode instead of code? Does that count as code submission?

Yes, we will count detailed pseudocode as code submission as it is helpful to reviewers in validating your results.

8. Do you expect authors to submit data?

We understand that many of our authors work with highly sensitive datasets, and are not asking for private data submission. If the dataset used is publicly available, there is no need to provide it. If the dataset is private, then the authors can submit a toy or simulated dataset to illustrate how the code works.

9. Who has access to my code?

Only the reviewers, Area Chair and Senior Area Chair assigned to your paper will have access to your code. We will instruct reviewers, Area Chair and Senior Area Chair to keep the code submissions confidential (just like the paper submissions), and delete all code submissions from their machine at the end of the review cycle. Please note that code submission is also completely optional.

10. I would like to revise my code/add code during author feedback. Is this permitted?

Unfortunately, no. But please remember that code submission is entirely optional.

The detailed FAQ as well other Author and Style instructions are available here.

Kamalika Chaudhuri and Ruslan Salakhutdinov
ICML 2019 Program Chairs


Open Call for SoBigData-funded Transnational Access!

The SoBigData project invites researchers and professionals to apply to participate in Short-Term Scientific Missions (STSMs) to carry forward their own big data projects. The Natural Language Processing (NLP) group at the University of Sheffield is taking part in this initiative and welcomes all applications.

Funding is available for STSMs (2 weeks to 2 months) of up to 4500 euros, covering daily subsistence, accommodation and flights. These bursaries are awarded on a competitive basis.

Research areas are varied but include studies involving societal debate, online misinformation and rumour analysis. A key topic is analysis of social media and newspaper articles to understand the state of public debate in terms of what is being discussed, how it is being discussed, who is discussing it, and how this discussion is being influenced. The effects of online disinformation campaigns (especially hyper-partisan content) and the use of bot accounts to perpetrate this disinformation are also of particular interest.

Applications are welcomed for visits between 1 November 2018 and 31 July 2019!

For specific details, eligibility criteria, and to apply, click here!


A Deep Neural Network Sentence Level Classification Method with Context Information

Today we're looking at the work done within the group which was reported in EMNLP2018: "A Deep Neural Network Sentence Level Classification Method with Context Information", authored by Xingyi Song, Johann Petrak and Angus Roberts, all of the University of Sheffield.
Xingyi, S., Petrak, J. & Roberts, A. A Deep Neural Network Sentence Level Classification Method with Context Information. in EMNLP2018 – 2018 Conference on Empirical Methods in Natural Language Processing 00, 0-000 (2018).
Understanding complex bodies of text is a difficult task, especially when the context of a statement can greatly influence its meaning. While methods exist that examine the context surrounding a phrase, the authors present a new approach that makes use of much larger contexts. This allows for greater confidence in the results, especially when dealing with complicated subject matter. Medical records are one such area, in which complex judgements on appropriate treatments are made across several sentences. It is therefore vital to fully understand the context of each individual statement in order to collate meaning, accurately gauge the sentiment of the entire body of text, and identify the conclusion that should be drawn from it.
Although grounded in its use in the medical domain, this new technique is more widely applicable. An evaluation in non-medical domains showed a solid improvement of over six percentage points over its nearest competitor, despite requiring 33% less training time.
But how does it work? At its core, this novel method analyses not only the target sentence but also an amount of text on either side of it. This context is encoded using an adapted Fixed-size Ordinally Forgetting Encoding (FOFE), turning it from a variable length context into a fixed length embedding. This is processed along with the target, before being concatenated and post-processed to produce an output. 
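The FOFE step is simple enough to sketch directly: each new token vector is added to a running code whose previous value is scaled by a forgetting factor, so a variable-length context collapses into a single fixed-size vector with earlier tokens exponentially down-weighted. A minimal sketch with a one-hot toy vocabulary (the real model would feed word embeddings in, and the choice of alpha is illustrative):

```python
import numpy as np

def fofe(token_vectors, alpha=0.5):
    """Fixed-size Ordinally Forgetting Encoding:
    z_t = alpha * z_{t-1} + e_t, folding a variable-length sequence
    into one fixed-size vector; earlier tokens are scaled down by
    successive powers of alpha, so word order is preserved."""
    z = np.zeros_like(token_vectors[0], dtype=float)
    for e in token_vectors:
        z = alpha * z + e
    return z

# One-hot toy vocabulary {a, b, c}; encode the sequence [a, b, c].
a, b, c = np.eye(3)
code = fofe([a, b, c], alpha=0.5)  # -> [0.25, 0.5, 1.0]
```

Note that reversing the sequence yields a different code, which is what lets the fixed-size vector retain ordering information without any extra computation per token.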
Experiments comparing the new technique to peer techniques showed markedly improved performance over LSTM-CNN methods, despite taking almost the same amount of time. The new Context-LSTM-CNN technique even surpassed an L-LSTM-CNN method despite a substantial reduction in required time.

[Table: average test accuracy and training time; best values in bold, standard deviations in parentheses.]

In conclusion, a new technique is presented, Context-LSTM-CNN, which combines the strengths of LSTM and CNN with the lightweight context encoding algorithm FOFE. The model shows a consistent improvement over both a non-context-based model and an LSTM context-encoded model on the sentence classification task.

ICML 2019: Some Changes and Call for Papers

Machine Learning Blog - Tue, 2018-11-27 22:52

The ICML 2019 Conference will be held from June 10-15 in Long Beach, CA — about a month earlier than last year. To encourage reproducibility as well as high quality submissions, this year we have three major changes in place.

There is an abstract submission deadline on Jan 18, 2019. Only submissions with proper abstracts will be allowed to submit a full paper, and placeholder abstracts will be removed. The full paper submission deadline is Jan 23, 2019.

This year, the author list at the paper submission deadline (Jan 23) is final. No changes will be permitted after this date for accepted papers.

Finally, to foster reproducibility, we highly encourage code submission with papers. Our submission form will have space for two optional supplementary files — a regular supplementary manuscript, and code. Reproducibility of results and easy accessibility of code will be taken into account in the decision-making process.

Our full Call for Papers is available here.

Kamalika Chaudhuri and Ruslan Salakhutdinov
ICML 2019 Program Chairs


Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms

This summer, we presented some of our latest work at SEMANTiCS 2018 in Vienna: "Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms".
Zhang, Z., Petrak, J. & Maynard, D. Adapted TextRank for Term Extraction: A Generic Method of Improving Automatic Term Extraction Algorithms. in SEMANTiCS 2018 – 14th International Conference on Semantic Systems 00, 0-000 (2018).

This work has been carried out in the context of the EU KNOWMAK project, where we're developing tools for multi-topic classification of text against an ontology, in order to attempt to map the state of European research output in key technologies.
Automatic Term Extraction (ATE) is a fundamental technique used in computational linguistics for recognising terms in text. Processing the terms collected from a text is a key step in understanding its content. There are many different ATE methods, but they all tend to work well only in one specific domain. In other words, there is no universal method which produces consistently good results, so we have to choose an appropriate method for the domain being targeted.
In this work, we have developed a novel method for ATE which addresses two major limitations: the fact that no single ATE method consistently performs well across all domains, and the fact that the majority of ATE methods are unsupervised. Our generic method, AdaText, improves the accuracy of existing ATE methods by revising the TextRank algorithm, using existing lexical resources for support. Given a target text, AdaText:
  1. Selects a subset of words based on their semantic relatedness to a set of seed words or phrases relevant to the domain, but not necessarily representative of the terms within the target text. 
  2. It then applies an adapted TextRank algorithm to create a graph for these words, and computes a text-level TextRank score for each selected word. 
  3. Finally, these scores are used to revise the score of a term candidate previously computed by an ATE method. 
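The revision step (step 3) can be sketched as follows. Everything here is a hypothetical illustration: the candidate terms, the base ATE scores, the TextRank scores, and in particular the simple averaging used to combine them (the paper's exact combination formula may differ).

```python
import numpy as np

# Hypothetical inputs: base scores from some existing ATE method for a
# few term candidates, and text-level TextRank scores for the words
# selected in steps 1-2.
ate_score = {"gene expression": 0.42, "cell line": 0.35, "result": 0.10}
textrank = {"gene": 0.9, "expression": 0.8, "cell": 0.7, "line": 0.4}

def revised_score(term):
    """Step 3 sketch: revise a candidate's ATE score using the mean
    TextRank score of its words (simple average shown here; words
    absent from the TextRank graph contribute 0)."""
    words = term.split()
    tr = np.mean([textrank.get(w, 0.0) for w in words])
    return (ate_score[term] + tr) / 2

# Re-rank the candidates by their revised scores.
ranking = sorted(ate_score, key=revised_score, reverse=True)
```

The effect is that candidates whose constituent words are semantically close to the domain seed lexicon get boosted relative to generic words like "result", regardless of which base ATE method produced the initial scores.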
This technique was trialled using a variety of parameters (such as the threshold of semantic similarity used to select words, as described in step 1) over two distinct datasets (GENIA and ACLv2, comprising Medline abstracts and ACL abstracts respectively). We also tested it with a wide variety of state-of-the-art ATE methods, including modified TFIDF, CValue, Basic, RAKE, Weirdness, LinkProbability, X2, GlossEx and PositiveUnlabeled.

The figures show a sample of performances in different datasets and using different ATE techniques. The base performance of the ATE method is represented by the black horizontal line. The horizontal axis represents the semantic similarity threshold used in step 1. The vertical axis shows average P@K for all five Ks considered.
This new generic combination approach can consistently improve the performance of the ATE method by 25 points, which is a significant increase. However, there is still room for improvement. In future work, we aim to optimise the selection of words from the TextRank graph, work on expanding TextRank to a graph of both words and phrases, and to explore how the size and source of the seed lexicon affects the performance of AdaText.  
