Semantic Software Lab
Concordia University
Montréal, Canada


Visualizations of Political Hate Speech on Twitter

Recently there's been some media interest in our work on abuse toward politicians. We performed an analysis of abusive replies on Twitter sent to MPs and candidates in the months leading up to the 2015 and 2017 UK elections, disaggregated by gender, political party, year, and geographical area, amongst other things. We've posted about this previously, and there's also a more technical publication here. In this post, we wanted to highlight our interactive visualizations of the data, which were created by Mark Greenwood. The thumbnails below give a flavour of them, but click through to access the interactive versions.
Abusive Replies
Sunburst diagrams showing the raw number of abusive replies sent to MPs before the 2015 and 2017 elections. Rather than showing all candidates, these only show the MPs who were elected (i.e. the successful candidates). These nicely show the proportion of abusive replies sent to each party/gender combination, but they give no sense of what proportion of each MP's replies were abusive. Interactive version here!
Increase in Abuse
An overlapping bar chart showing how the percentage of abuse received per party/gender by MPs has increased between 2015 and 2017. For each party/gender two bars are drawn. The height of the bar in the party colour represents the percentage of replies which were abusive in 2017. The height of the grey bar (drawn at the back) is the percentage of replies which were abusive in 2015, and its width shows the change in volume of abusive replies (i.e. the width is calculated by dividing the 2015 raw abusive reply count by that from 2017 to give a percentage which is then used to scale the width of the bar). So height shows change in proportion, width shows increase in volume. There is also a simple version of this graph which only shows the change in proportion (i.e. the widths of the two bars are the same). Original version here.
Geographical Distribution of Abuse
A map showing the geographical distribution of abusive replies. The map of the UK is divided into the NUTS 1 regions, and each region is coloured based on the percentage of abusive replies sent to MPs who represent that region. Data from both 2015 and 2017 can be displayed to see how the distribution of abuse has changed. Interactive version here!
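The width calculation described for the overlapping bar chart can be sketched in a few lines. This is only an illustration of the arithmetic, not the code behind the visualization: the function name, the pixel width and the counts are made up, and it assumes the party-coloured 2017 bar is drawn at full width.

```python
def grey_bar_width(abusive_2015: int, abusive_2017: int, full_width: float = 40.0) -> float:
    """Width of the grey (2015) bar, scaled by the 2015/2017 abusive reply ratio.

    Assumption: the 2017 bar is drawn at `full_width`, so the grey bar is
    narrower whenever the raw volume of abuse grew between the two elections.
    """
    if abusive_2017 <= 0:
        raise ValueError("the 2017 abusive reply count must be positive")
    ratio = abusive_2015 / abusive_2017
    return full_width * ratio


# Made-up counts, purely for illustration.
example_width = grey_bar_width(abusive_2015=500, abusive_2017=2000)
print(example_width)  # 10.0
```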
Categories: Blogroll

How difficult is it to understand my web pages? Using GATE to compute a complexity score for Web text.

The Web Science Summer School, which took place from 30 July to 4 August at the L3S Research Centre in Hannover, Germany, gave students a chance to learn about a number of tools and techniques related to web science. As part of this, team member Diana Maynard gave a keynote talk about applying text mining techniques to real-world applications such as sentiment and hate speech detection, and political social media analysis, followed by a 90-minute practical GATE tutorial where the students learnt to use ANNIE, TwitIE and sentiment analysis tools. The keynotes and tutorials throughout the week were complemented with group work, where the students were tasked with the question: “Can more meaningful indicators for text complexity be extracted from web pages?”. Here follows the account of one student team who, in the space of only 4 hours, managed to use GATE to complete the task, an extremely creditable performance given their very brief exposure to GATE.

After some discussion, our team decided to focus on a very practical problem: the readability metrics commonly used to assess the difficulty of a text do not account for the target audience or the narrative context. We believed a simple approach employing GATE could offer greater insights into how to identify the relevant features associated with text complexity. Everyone had an intuitive understanding of text complexity; it was when trying to map these ideas onto an objective framework that issues arose. Particular definitions of complexity, understandability, comprehensibility, and readability were mixed and matched when approaching this issue.
In our team vision, the complexity of a document is based not only on the structure of the sentences but also on the context of its narrative and the ease with which the targeted audience can understand it. In our model, the complexity score of a text is linked to the context of the text’s narrative. This means texts about certain narrative contexts (topics) are inherently harder to understand than others. How hard it is to understand a particular text is also related to the capabilities of the reader. Thus, texts on specific narrative contexts can be characterized to create a score of how hard they will be for certain audiences to understand.
To do this, we proposed the following process: 
  1. Create an instance lexicon for content complexity
    • Collect a set of texts from different narrative contexts that the audience may be expected to read, e.g. celebrity news, political news, sports news, medical information leaflets, coursebook fragments.
    • Identify the relevant entities in those texts, i.e. persons, locations, organizations, percentages, dates, and technical terms
    • Assess the complexity of each text by using crowdsourcing, e.g. have a sample of UK young adults assess the difficulty of the texts via ratings or procedures like CLOZE. 
    • Assign a complexity value to each entity in the lexicon based on the complexity values of the text it appeared in and its relevance to those texts.
  2. Assess the complexity of new texts
    • Identify the relevant entities in the text
    • Employ the entity complexity lexicon to compute an estimate value for the new text.
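The two stages above can be sketched as follows. This is a minimal illustration rather than the team's actual pipeline: it assumes each entity's weight is the plain average of the complexity scores of the pages it appears in, ignores entity relevance, and estimates a new text's complexity as the average weight of its known entities. The function names and toy data are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def build_lexicon(pages):
    """Map each entity to the mean complexity of the pages it appears in.

    `pages` is a list of (complexity_score, entities) pairs.
    """
    scores_by_entity = defaultdict(list)
    for complexity, entities in pages:
        for entity in set(entities):  # count each page once per entity
            scores_by_entity[entity].append(complexity)
    return {entity: mean(scores) for entity, scores in scores_by_entity.items()}


def estimate_complexity(entities, lexicon):
    """Average the lexicon weights of the known entities in a new text.

    Returns None when no entity of the text is in the lexicon.
    """
    known = [lexicon[e] for e in entities if e in lexicon]
    return mean(known) if known else None


# Toy data: page scores on a 1-10 scale, entities invented for illustration.
pages = [
    (2.0, ["Beyoncé", "London"]),
    (8.0, ["p-value", "London"]),
]
lexicon = build_lexicon(pages)
new_text_score = estimate_complexity(["London", "p-value", "unknown term"], lexicon)
print(lexicon["London"], new_text_score)  # 5.0 6.5
```

In practice the entity lists would come from an annotation tool such as GATE, and the page scores from human raters.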
During the allocated time, our team completed the first stage by creating an entity lexicon. We employed GATE to identify entities within an 11-webpage corpus.

Running the TermRaider plugin to identify the entities in the texts.

The corpus was composed of 9 Wikipedia pages and 2 academic articles, and each page's complexity was scored independently (on a 1-10 scale) by 4 team reviewers. Then, the entities were identified for each document by running the ANNIE and TermRaider plugins in the GATE GUI.

Employing ANNIC to search for entities linked to organizations, locations, persons, dates or percentages within the texts.

These entities were given a complexity score by computing the average complexity of the pages they appeared in. We obtained a set of 5312 entities which were exported to an XML file.

Result extract exported in XML format from the TermRaider plugin.

Once duplicates had been accounted for, our lexicon was composed of 906 weighted pairs.

Extract from the named entity lexicon after adding the weights based on the complexity scores of the pages they appeared in.

This lexicon was used to calculate a complexity score for a new page set, which showed significant divergence from the base (readability) score we were given at the start.

Comparison of the scores assigned by the lexicon (1-10) and the complexity score given to us as a base (0-1).

In general, the entities in a text are associated with the text's narrative context, e.g. celebrity news will include celebrity names and places, while scientific literature will reference percentages, ratios and error estimates. In our model, an annotation of the complexity of a sample of pages from several narrative contexts could be used to determine a complexity value for each relevant entity based on the complexity scores of the pages in which it appears, which can then be used to estimate complexity scores for new pages.
Given the time constraints we had, many of the activities were done based on naïve algorithms and within the limits of our resources. We have some ideas on how this approach could be explored further. First, we believe that any complexity score should take into account the audience's capability. In this case, the researcher should appreciate that determining the characteristics of the population they wish to explore is just as important as determining the narrative context and structure of the text. Asking teenagers to read mathematical formulae will yield different complexity scores than asking GPs or older adults.
An objective way of scoring the complexity of a text is to use a comprehension testing procedure like CLOZE, where every 5th word is replaced with a blank space which respondents are then asked to fill. Such a procedure can be used on crowdsourcing platforms like Mechanical Turk to create complexity lexicons for specific audiences: sample texts from diverse narrative contexts (topics) would be selected for assessment by the crowd, which would tell us how complex particular groups of people find certain texts (e.g. UK teenagers may find maths texts really difficult and tweets easy, but the complexity scores may reverse for older Mexican maths professors given the same texts).
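As a rough illustration of the CLOZE procedure, the sketch below blanks out every 5th word of a text and scores a set of responses. It is a simplification under stated assumptions: words are split on whitespace (so punctuation stays attached), and a response counts as correct only if it exactly matches the original word, ignoring case. The function names are hypothetical.

```python
def make_cloze(text: str, step: int = 5, blank: str = "_____"):
    """Replace every `step`-th word with a blank; return the test text and the answers."""
    words = text.split()
    answers = []
    for i in range(step - 1, len(words), step):
        answers.append(words[i])
        words[i] = blank
    return " ".join(words), answers


def cloze_score(responses, answers) -> float:
    """Fraction of blanks filled with exactly the original word (case-insensitive)."""
    if not answers:
        return 0.0
    correct = sum(r.strip().lower() == a.lower() for r, a in zip(responses, answers))
    return correct / len(answers)


cloze_text, answers = make_cloze("one two three four five six seven eight nine ten eleven")
print(cloze_text)  # one two three four _____ six seven eight nine _____ eleven
print(cloze_score(["five", "nine"], answers))  # 0.5
```

A lower average score for a given audience would then indicate a text that this audience finds harder to understand.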
Another aspect that could easily be improved is the use of centrality metrics like TextRank to determine which named entities are actually relevant to the text, based on their frequency and position within the narrative. Finally, a ranking algorithm like PageRank could be adapted to compute the complexity scores of the entity lexicon in a way that allows relevant entities to be identified using clustering algorithms.
Team:
Damianos Melidis, L3S Hannover, Germany
Latifah Alshammari, University of Bath, UK
Fernando Santos Sanchez, University of Southampton, UK
Ahmed Al-Ghez, University of Goettingen, Germany
Fatmah Bamashmoos, University of Bristol, UK

Slides from the group presentation

Students use GATE and Twitter to drive Lego robots—again!

At the university's Headstart Summer School in July 2018, 42 secondary school students (aged 16 and 17) from all over the UK (see below for maps) were taught to write Java programs to control Lego robots, using input from the robots (such as the sensor for detecting coloured marks on the floor) as well as operating the motors to move and turn.  The Department of Computer Science provided a Java library for driving the robots and taught the students to use it.

After they had successfully operated the robots, we ran a practical session on 10 and 11 July on "Controlling Robots with Tweets".  We presented a quick introduction to natural language processing (using computer programs to analyse human languages, such as English) and provided them with a bundle of software containing a version of the GATE Cloud Twitter Collector modified to run a special GATE application with a custom plugin to use the Java robot library to control the robots.

The bundle came with a simple "gazetteer" containing two lists of keywords:

and a basic JAPE grammar (set of rules) to make use of it.  JAPE is a specialized programming language used in GATE to match regular expressions over annotations in documents, such as the "Lookup" annotations created whenever the gazetteer finds a matching keyword in a document. (The annotations are similar to XML tags, except that GATE applications can create them as well as read them and they can overlap each other without restrictions.  Technically they form an annotation graph.)

The sample rule we provided would match any keyword from the "turn" list followed by any keyword from the "left" list (with optional other words in between, so that "turn to port", "take a left", "turn left" all work the same way) and then run the code to turn the robot's right motor (making it turn left in place).
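The matching behaviour of that rule can be approximated outside GATE. The sketch below is a plain Python stand-in, not JAPE and not the grammar given to the students: the keyword lists are guessed from the example phrases, the limit of two intervening words is an assumption, and `is_turn_left` is a hypothetical name.

```python
import re

# Hypothetical keyword lists; the real gazetteer contents are not shown in the post.
TURN_WORDS = {"turn", "take"}
LEFT_WORDS = {"left", "port"}
MAX_GAP = 2  # assumed limit on the number of other words allowed in between


def is_turn_left(tweet: str) -> bool:
    """True if a 'turn' keyword is followed by a 'left' keyword within MAX_GAP words."""
    tokens = re.findall(r"[a-z]+", tweet.lower())
    for i, token in enumerate(tokens):
        if token in TURN_WORDS:
            # The next MAX_GAP + 1 tokens: up to MAX_GAP fillers plus the keyword itself.
            window = tokens[i + 1 : i + 2 + MAX_GAP]
            if any(word in LEFT_WORDS for word in window):
                return True
    return False


for command in ("turn to port", "take a left", "turn left", "left turn"):
    print(command, is_turn_left(command))
```

In the real system the equivalent match would trigger the robot library call instead of returning a boolean.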

We showed them how to configure the Twitter Collector, follow their own accounts, and then run the collector with the sample GATE application.  Getting the system set up and working took a bit of work, but once the first few groups got their robot to move in response to a tweet, everyone cheered and quickly became more interested.  They then worked on extending the word lists and JAPE rules to cover a wider range of tweeted commands.

Some of the students had also developed interesting Java code the previous day, which they wanted to incorporate into the Twitter-controlled system.  We helped these students add their code to their own copies of the GATE plugin and re-load it so the JAPE rules could call their procedures.

We first ran this project in the Headstart course in July 2017; we made improvements for this year and it was a success again, so we plan to include it in Headstart 2019 too.
The following maps show where all the students and the female students came from.

This work is supported by the European Union's Horizon 2020 project SoBigData (grant agreement no. 654024).  Thanks to Genevieve Gorrell for the diagram illustrating how the system works.

Deep Learning in GATE

Few can have failed to notice the rise of deep learning over the last six or seven years, and its role in our emergence from the AI winter. Thanks in part to the increased speed offered by GPUs*, neural net approaches came into their own and out from under the shadow of the support vector machine, offering more scope than that and other previously popular methods, such as Random Forests and CRFs, for continued improvement as training data volumes increase. Natural language processing has traditionally been a multi-step endeavour, perhaps beginning with tokenization and parsing and working up to semantic processing such as question answering. In addition to being labour-intensive, this approach is also limiting, as each abstraction can only access the current step, and thus throws away potentially valuable information from previous steps. Deep learning offers the possibility to overcome these limitations by bringing a much greater number of parameters into play (much greater flexibility). Deep neural nets (DNNs) may learn end-to-end solutions, starting with raw data and producing sophisticated output. Furthermore they can encode much more complex dependencies than those we have seen in less parameterizable approaches--in other words, much more elaborate reasoning. And while we step back from the need to break down involved problems into pieces ourselves, a promising line of work finds that DNN "skills" are also "transferable"--models may for example be pre-trained on generic data, providing a basic language understanding that can then be put to use in other specialized contexts (multi-task learning).
For these reasons, deep learning is widely seen as key to continuing progress on a wide range of artificial intelligence tasks including natural language processing, so of course it is of great interest to us here in the GATE team. Classic GATE tasks such as entity recognition and sentence classification could be advanced by utilizing an approach with greater potential to learn a discriminative model, given sufficient training data. And by supporting the substitution of words with "embeddings" (DNN-derived vectors that capture relationships between words) trained on readily available unlabelled general or domain-specific data, we can bring some of the benefits of deep learning even to cases where training data are meagre by deep learning standards. Deep learning is therefore likely to be of benefit in any task but the most trivial, as long as you have the skills and a reasonable amount of data.
The Learning Framework is our ongoing project bringing current machine learning technologies to GATE, enabling users to leverage GATE's ecosystem of text processing offerings to create features to train learners, and to include these learners in text processing pipelines. The guiding vision for the Learning Framework has always been to offer an accessible interface that enables GATE users to get up and running quickly with machine learning, whilst at the same time supporting the most current and interesting of technologies. When it comes to deep learning, meeting these twin objectives is a little more challenging, but we have stepped up to the plate!

Deep learning framework in the GATE GUI
Previous machine learning algorithms would work their magic with comparatively little in the way of tweaking required. Deep learning is, however, an entirely different beast in this respect. In fact, it's more like an entire zoo! As discussed above, the advantage to DNNs is their massive flexibility, but this seriously stretches GATE's previous assumptions about how machine learning works. An integration needs to support the design of an architecture (a "shape" of neural net) and the tuning of many parameters, including dropout, optimization strategy, learning rate, momentum, and many more. All of these factors are critical in obtaining a good performance. The integration is still under (very) intensive development, but it is already possible to get something running relatively quickly with deep learning in GATE. Here are some current highlights:
  • Two of the most-used frameworks for Deep Learning can be used: PyTorch and Keras, both Python-based;
  • Support for both Linux and macOS (Windows is not yet supported);
  • A range of template architectures, which may produce acceptable results out of the box (though in many cases it will be necessary for the user to adapt the architecture, the parameters of the architecture, or other aspects of the DNN solution);
  • The possibility to work with an initial GATE-created model both inside and outside of GATE.
We encourage anyone who is interested to give it a try and to talk to us about it. There will always be more to add (current challenges include dropout, gradient clipping, L1/L2 weight regularization, attention, modified weight initialization, char-augmented LSTMs and LSTM-CRF architectures, to name a few) but much is achievable already. This is one of relatively few efforts globally to sift the essence out of this highly active research field and transform it into something relatively high level and generalizable across a range of NLP tasks, making state-of-the-art technologies accessible to non-specialists. There's some documentation available here.

At the same time, we've been applying deep learning in our research in several ways. In a forthcoming EMNLP paper, team member Xingyi Song and co-authors use the fixed-size ordinally-forgetting encoding (FOFE) approach to combine LSTM and CNN neural net architectures in a more computationally efficient way than previously, in order to make better use of context in sentence classification tasks. Together with researchers at KCL and South London and Maudsley NHS Trust, he's also demonstrated the value of this technology in the context of detection of suicidal ideation in medical records.
Furthermore, we have successfully used LSTMs for veracity verification of rumours spread on social media such as Twitter. Our approach makes use of only the tweet content, which it passes through LSTM units that learn to distinguish between true, false and unverifiable rumours. However, the unique part of our approach is that, before passing the tweet to the LSTM layer, it first looks within the tweet for recurring information that is typically used by others to spread rumours, and adjusts the input accordingly: words carrying useful information are kept as they are, and others are downweighted in terms of their contribution. This is achieved by means of an attention layer. We evaluated our approach on the RumourEval 2017 test data and achieved over 60% accuracy, which is currently the state-of-the-art performance for this task.
*Graphics Processing Units; technology driven by the demands of computer gamers that has been used to speed up deep learning approaches by as much as 250 times compared with CPUs.

What matters most to people around the world? Using the GATE social media toolkit to investigate wellbeing.

As part of the EU SoBigData project, the GATE team hosts a number of short research visits, between 2 weeks and 2 months, for all kinds of data scientists (PhD students, researchers,  academics, professionals) to come and work with us and use our tools and/or datasets on a project involving text mining and social media analysis. One such visitor was Economics PhD student Giuliano Resce from the University of Roma Tre in Italy. During his month-long visit, he worked with Diana Maynard on a project collecting and analysing millions of public tweets in 7 different languages, in order to understand the different societal priorities of people in different countries of the OECD. The work explored the different opinions on Twitter of people around the world about societal issues such as the environment, housing and life satisfaction.

OECD Better Life Index

Giuliano first used the GATE Twitter Collector to collect a set of tweets, and then processed them with the GATE social media analysis toolkit, using GATE Mimir to investigate the results. Topics were determined using the initial set of OECD topics, in 7 languages, which we then expanded for each language into a set of keywords for each topic using first existing lists from the GATE political tweets analyser and then Word2Vec to find more related keywords to those.
Better Life Index topic frequency at country level in Twitter (percentage)
The ensuing analysis of the tweets then enabled Giuliano to redesign Composite Indices for the OECD’s Better Life Index, a measure of well-being which gives a detailed overview of the social, economic and environmental performances of different countries. In turn, this redesign helps to better reflect the actual needs of the people. The idea is that the aggregate of millions of tweets may provide a representation of the different priorities among the eleven topics of the Better Life Index. By combining topic performances and related Twitter trends, they produced new evidence about the relationship between people’s priorities and policy makers’ activity in the BLI framework.

Rank in Composite BLI using local Twitter trends as Weights and using Equal Weights
A paper about the work has been published in the Journal of Technological Forecasting & Social Change.
More information about SoBigData TransNational Access Research visits

GATE and JSON: Now Supporting 280 Character Tweets!

We first added support for reading tweets stored as JSON objects to GATE in version 8, all the way back in 2014. This support has proved exceptionally useful both internally, in our own research, and to the many researchers outside of Sheffield who use GATE for analysing Twitter posts. Recent changes that Twitter have made to the way they represent Tweets as JSON objects, and the move to 280 character tweets, have led us to re-develop our support for Twitter JSON and to also develop a simpler JSON format for storing general text documents and annotations.

This work has resulted in two new (or re-developed) plugins: Format: JSON and Format: Twitter. Both are currently at version 8.6-SNAPSHOT and are offered in the default plugin list to users of GATE 8.6-SNAPSHOT.

The Format: JSON plugin contains both a document format and export support for a simple JSON document format inspired by the original Twitter JSON format. Essentially each document is stored as a JSON object with two properties, text and entities. The text field is simply the text of the document, while the entities field contains the annotations and their features. The format of this field is the same as that used by Twitter to store entities, namely a map from annotation type to an array of objects, each of which contains the offsets of the annotation and any other features. You can load documents using this format by specifying text/json as the mime type. If your JSON documents don't quite match this format you can still extract the text from them by specifying the path through the JSON to the text element as a dot-separated string as a parameter to the mime type. For example, assume the text in your document was in a field called text but this wasn't at the root of the JSON document but inside an object named document, then you would load this by specifying the mime type text/json;text-path=document.text. When saved, however, the text and any annotations would be stored at the top level. This format essentially mirrors the original Twitter JSON, but we will now be freezing this format as a general JSON format for GATE (i.e. it won't change if/when Twitter changes the way they store Tweets as JSON).
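To make the shape of this format concrete, here is a small Python sketch that builds one document in the simple format and resolves a dot-separated text path. It only illustrates the data layout and is not the plugin's code: the "indices" key for offsets follows Twitter's convention and is an assumption here, as are the annotation type, the feature name and the `extract_text` helper.

```python
import json


def extract_text(raw_json: str, text_path: str = "text") -> str:
    """Follow a dot-separated path through a JSON object and return the text found there."""
    node = json.loads(raw_json)
    for key in text_path.split("."):
        node = node[key]
    return node


# A document in the simple format: top-level "text" and "entities".
# "indices" (start and end offsets), "Location" and "kind" are illustrative assumptions.
simple_doc = json.dumps({
    "text": "Sheffield is in Yorkshire.",
    "entities": {
        "Location": [
            {"indices": [0, 9], "kind": "city"},
            {"indices": [16, 25], "kind": "county"},
        ]
    },
})

# A document whose text is nested, as in the text-path=document.text example.
nested_doc = json.dumps({"document": {"text": "Hello from a nested field."}})

print(extract_text(simple_doc))                   # text at the root
print(extract_text(nested_doc, "document.text"))  # text inside "document"
```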

As stated earlier, the new version of our Format: Twitter plugin now fully supports Twitter's new JSON format. This means we can correctly handle not only 280 character tweets but also quoted tweets. Essentially a single JSON object may now contain multiple tweets in a nested hierarchy. For example, you could have a retweet of a tweet which itself quotes another tweet. This is represented as three separate tweets in a single JSON object. Each top level tweet is loaded into a GATE document and covered with a Tweet annotation. Each of the tweets it contains is then added to the document and covered with a TweetSegment annotation. Each TweetSegment annotation has three features: textPath, entitiesPath, and tweetType. The last of these tells you the type of tweet (i.e. retweet, quoted etc.), whereas the first two give the dotted path through the JSON object to the fields from which text and entities were extracted to produce that segment. All the JSON data is added as nested features on the top level Tweet annotation. To use this format make sure to use the mime type text/x-json-twitter when loading documents into GATE.

So far we've only talked about loading single JSON objects as documents. Usually, however, you end up with a single file containing many JSON objects (often one per line) which you want to use to populate a corpus. For this use case we've added a new JSON corpus populator.

This populator allows you to select the JSON file you want to load, set the mime type to use to process each object within the file, and optionally provide a path to a field in the object that should be used to set the document name. In this example I'm loading Tweets so I've specified /id_str so that the name of the document is the ID of the tweet; paths are /-separated lists of fields specifying the route to the relevant field, and must start with a /.
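The idea of naming documents from a path into each object can be illustrated with a short Python sketch. This mimics the behaviour described above for a file with one JSON object per line; it is not the populator's implementation, and `resolve_path` and `named_documents` are hypothetical helpers.

```python
import json


def resolve_path(obj, path: str):
    """Resolve a /-separated path such as "/id_str" against a parsed JSON object."""
    if not path.startswith("/"):
        raise ValueError("path must start with '/'")
    node = obj
    for key in path[1:].split("/"):
        node = node[key]
    return node


def named_documents(lines, name_path: str = "/id_str"):
    """Yield (document name, parsed object) for each non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            obj = json.loads(line)
            yield str(resolve_path(obj, name_path)), obj


# Stand-in for the lines of a file holding one tweet per line.
sample_lines = [
    '{"id_str": "1001", "text": "first tweet"}',
    "",
    '{"id_str": "1002", "text": "second tweet"}',
]
names = [name for name, _ in named_documents(sample_lines)]
print(names)  # ['1001', '1002']
```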

The code for both plugins is still under active development (hence the -SNAPSHOT version number) while we improve error handling etc. so if you spot any issues or have suggestions for features we should add please do let us know. You can use the relevant issue trackers on GitHub for either the JSON or Twitter format plugins.

A Real World Reinforcement Learning Research Program

Machine Learning Blog - Fri, 2018-07-06 11:10

We are hiring for reinforcement learning related research at all levels and all MSR labs. If you are interested, apply, talk to me at COLT or ICML, or email me.

More generally though, I wanted to lay out a philosophy of research which differs from (and plausibly improves on) the current prevailing mode.

DeepMind and OpenAI have popularized an empirical approach where researchers modify algorithms and test them against simulated environments, including in self-play. They’ve achieved significant success in these simulated environments, greatly expanding the repertoire of ‘games solved by reinforcement learning’ which consisted of the singleton backgammon when I was a graduate student. Given the ambitious goals of these organizations, the more general plan seems to be “first solve games, then solve real problems”. There are some weaknesses to this approach, which I want to lay out next.

  • Broken API: One issue with this is that multi-step reinforcement learning is a broken API in the sense that it creates an interface for problem definitions that is unsolvable via currently popular algorithm families. In particular, you can create problems which are either ‘antishaped’, so local rewards mislead w.r.t. long term rewards, or keylock problems, as are common in Markov Decision Process lower bounds. I coded up simple versions of these problems a couple of years ago and have now put them on GitHub to be extra crisp. If you try to apply policy gradient or Q-learning style algorithms on these problems they commonly run into exponential (in the number of states) sample complexity. As a general principle, APIs which create exponential sample complexity are bad: they imply that individual applications require taking advantage of special structure in order to succeed.
  • Transference: Another significant issue is the degree of transference between solutions in simulation and the real world. “Transference” here potentially happens at several levels.
    • Do the algorithms carry over? One of the persistent issues with simulation-based approaches is that you don’t care about sample complexity that much; optimal performance at acceptable computational complexities is the typical goal. In real world applications, this is somewhat absurd: you really care about immediately doing something reasonable and optimizing from there.
    • Do the simulators carry over? For every simulator, there is a fidelity question which comes into play when you want to transfer a policy learned in the simulator into action in the real world. Real-time ray tracing and simulator quality more generally are advancing, but I’m not ready yet to trust a self-driving car trained in a simulated reality. It is unclear how to simulate the physics accurately: friction, for example, is known to be difficult, and more generally the representative variety of exogenous events in an open world seems quite difficult to implement.
  • Solution generality: When you test and discover that an algorithm works in a simulated world, you know that it works in the simulated world. If you try it in 30 simulated worlds and it works in all of them, it can still easily be the case that an algorithm fails on the 31st simulated world. How can you achieve confidence beyond the number of simulated worlds that you try and succeed on? There is some sense by which you can imagine generalization over an underlying process generating problems, but this seems like a shaky justification in practice, since the nature of the problems encountered seems to be a nonstationary development of an unknown future.
  • Value creation: Solutions of a ‘first A, then B’ flavor naturally take time to get to the end state where most of the real value is set to be realized. In the years before reaching applications in the real world, does the funding run out? We certainly hope not for the field of research but a danger does exist. Some discussion here including the comments is relevant.
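To give a feel for the keylock problems mentioned under the first point, here is a minimal sketch. It is not the code referred to above: it assumes a chain where each step has two actions, exactly one of which continues, and reward is given only if every step is correct. Under that assumption, uniformly random exploration succeeds with probability 2^-horizon per episode, so the number of episodes needed before any reward is seen grows exponentially with the length of the chain.

```python
import random


def run_episode(policy, secret):
    """Return 1 if the policy picks the secret action at every step, else 0.

    A single wrong action ends the episode with no reward, so there is no
    intermediate signal to guide learning.
    """
    for step, correct_action in enumerate(secret):
        if policy(step) != correct_action:
            return 0
    return 1


def random_success_rate(horizon, episodes, seed=0):
    """Estimate how often uniformly random binary actions open a lock of this length."""
    rng = random.Random(seed)
    secret = [rng.randint(0, 1) for _ in range(horizon)]
    wins = sum(
        run_episode(lambda step: rng.randint(0, 1), secret) for _ in range(episodes)
    )
    return wins / episodes


# Compare the empirical success rate with the 2^-horizon prediction.
for horizon in (2, 5, 10):
    print(horizon, random_success_rate(horizon, episodes=20_000), 2 ** -horizon)
```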

What’s an alternative?

Each of the issues above is addressable.

  • Build fundamental theories of what are statistically and computationally tractable sub-problems of Reinforcement Learning. These tractable sub-problems form the ‘APIs’ of systems for solving these problems. Examples of this include simpler (Contextual Bandits), intermediate (learning to search), and more advanced (Contextual Decision Process) problems.
  • Work on real-world problems. The obvious antidote to simulation is reality, driving both the need to create systems that work in reality as well as a research agenda around reality-centered issues like performance at low sample complexity. There are some significant difficulties with this—reinforcement style algorithms require interactive access to learn which often drives research towards companies with an infrastructure. Nevertheless, offline evaluation on real-world data does exist and the choice of emphasis in research directions is universal.
  • The combination of fundamental theories and a platform which distills learnings so they are not forgotten and always improved upon provides a stronger basis for expectation of generalization into the next problem.
  • The shortest path to creating valuable applications in the real world is to simply work on creating valuable applications in the real world. Doing this in a manner guided by other elements of the research program is just good sense.

The above must be applied in moderation—some emphasis on theory, some emphasis on real world applications, some emphasis on platforms, and some emphasis on empirics. This has been my research approach for a little over 10 years, ever since I started working on contextual bandits.

Let’s call the first research program ‘empirical simulation’ and the second research program ‘real fundamentals’. The empirical simulation approach has a clear strong advantage in that it creates impressive demos, which creates funding, which creates more research. The threshold for contribution to the empirical simulation approach may also be lower simply because it requires mastery of fewer elements, implying people can more easily participate in it. At the same time, the real fundamentals approach has clear advantages in addressing the weaknesses of the empirical simulation approach. At a concrete level, this means we have managed to define and create fundamentals through research while creating real-world applications and value radically more efficiently than the empirical simulation approach has achieved.

The ‘real fundamentals’ concept is behind the open positions above. These positions have been designed to come with both the colleagues and mandate to address the most difficult research problems along with the organizational leverage to change the world. For people interested in fundamentals and making things happen in the real world these are prime positions—please consider joining us.

Categories: Blogroll

11th GATE Training Course: Large Scale Text and Social Media Analytics with GATE

Every year for the last decade, the GATE team at Sheffield have been delivering summer courses helping people get to grips with GATE technology. One year we even ran a second course in Montreal! It's always a challenge deciding what to include. GATE has been around for almost a quarter of a century, and in that time it has organically grown to include a wide variety of technologies too numerous to cover in a week-long course, and to adapt to the changing needs of our users during one of the most technologically exciting periods in history. But under the capable leadership of Diana Maynard and Kalina Bontcheva, we've learned to squeeze the most useful material into the limited time available, helping beginners to get started with GATE without overwhelming them, as well as empowering more experienced users to see the potential to push it into new territory.

Recent years have seen a surge of interest in social media. These media offer potential for commercial users to deepen their understanding of their customers, and for researchers to explore and understand the ways in which these media are affecting society, as well as using social media data for various other research purposes. For this reason, we have positioned social media as a central theme for the course, which most students seem to find accessible and interesting. It provides an opportunity to showcase GATE's Twitter support, and draw examples from our own work on social media within the Societal Debates theme of SoBigData. However, there are also plenty of examples illustrating how GATE can be applied to other popular areas, such as analysis of news or medical text.

I've been teaching GATE's machine learning offering for most of the time the course has been running, and therefore I've had the opportunity to explore different ways of helping people to get a handle on what can seem an intimidating topic to those who aren't already familiar with it. Machine learning is challenging to teach to a mixed audience, because it's such a large field and the time is limited. It's also an important one though, as it's increasingly a part of the public discourse, and many students are excited to learn about the ways they can incorporate machine learning into their work using GATE. Johann Petrak has taken the lead on keeping the GATE Learning Framework up to date with the latest developments in this rapidly evolving field, and I'm always proud and excited to teach something new that's been added since the last course.

It's evident from the discussions during lunch and tea breaks that students are eager to tell us how they are using GATE, and how they would like to use it. I think one of the most valuable things about the course is this opportunity for conversation, and for us to be inspired by the range of uses to which GATE is being put. Here is some of the feedback we received from students this year:

"Last week was one of the most useful courses I have done. Overall I think it was pitched really well given the range of technical abilities."

"Thank you all for such an informative and well-delivered course. I was a little worried about whether I'd be able to pick it up as I don’t have a background in programming, but I learned so much and the trainers were all very helpful and patient."

Categories: Blogroll