Skip navigation.
Semantic Software Lab
Concordia University
Montréal, Canada


Pervasive Simulator Misuse with Reinforcement Learning

Machine Learning Blog - Wed, 2018-02-14 16:25

The surge of interest in reinforcement learning is great fun, but I often see confused choices in applying RL algorithms to solve problems. There are two purposes for which you might use a world simulator in reinforcement learning:

  1. Reinforcement Learning Research: You might be interested in creating reinforcement learning algorithms for the real world and use the simulator as a cheap alternative to actual real-world application.
  2. Problem Solving: You want to find a good policy solving a problem for which you have a good simulator.

In the first instance I have no problem, but in the second instance, I’m seeing many head-scratcher choices.

A reinforcement learning algorithm engaging in policy improvement from a continuous stream of experience needs to solve an opportunity-cost problem. (The RL lingo for opportunity-cost is “advantage”.) Thinking about this in the context of a 2-person game, at a given state, with your existing rollout policy, is taking the first action leading to a win 1/2 the time good or bad? It could be good since the player is well behind and every other action is worse. Or it could be bad since the player is well ahead and every other action is better. Understanding one action’s long term value relative to another’s is the essence of the opportunity cost trade-off at the core of many reinforcement learning algorithms.

If you have a choice between an algorithm that estimates the opportunity cost and one which observes the opportunity cost, which works better? Using observed opportunity-cost is an almost pure winner because it cuts out the effect of estimation error. In the real world you can’t observe the opportunity cost directly Groundhog day style. How many times have you left a conversation and thought to yourself: I wish I had said something else? A simulator is different though—you can reset a simulator. And when you do reset a simulator, you can directly observe the opportunity-cost of an action which can then directly drive learning updates.

If you are coming from viewpoint 1, using a “reset cheat” is unappealing since it doesn’t work in the real world and the goal is making algorithms which work in the real world. On the other hand, if you are operating from viewpoint 2, the “reset cheat” is a gigantic opportunity to dramatically improve learning algorithms. So, why are many people with goal 2 using goal 1 designed algorithms? I don’t know, but here are some hypotheses.

  1. Maybe people just aren’t aware that goal 2 style algorithms exist? They are out there. The most prominent examples of goal 2 style algorithms are from Learning to search and AlphaGo Zero.
  2. Maybe people are worried about the additional sample complexity of doing multiple rollouts from reset points? But these algorithm typically require little additional sample complexity in the worst case and can provide gigantic wins. People commonly use a discount factor d values future rewards t timesteps ahead with a discount of dt. Alternatively, you can terminate rollouts with probability 1 – d and value future rewards with no discount while preserving the expected value. Using this approach a rollout terminates after an expected 1/(1-d) timesteps bounding the cost of a reset and rollout. Since it is common to use very heavy discounting (e.g. d=0.9), the worst case additional sample complexity is only a small factor larger. On the upside, eliminating estimation error is can radically reduce sample complexity in theory and practice.
  3. Maybe the implementation overhead for a second family of algorithms is to difficult? But the choice of whether or not you use resets is far more important than “oh, we’ll just run things for 10x longer”. It can easily make or break the outcome.

Maybe there is some other reason? As I said above, this is head-scratcher that I find myself trying to address regularly.

Categories: Blogroll

<h2 dir="ltr" id="docs-internal-guid

Discerning Truth in the Age of Ubiquitous DisinformationInitial Reflection on My Evidence to the DCMS Enquiry on Fake News
Kalina Bontcheva (@kbontcheva)

The past few years have heralded the age of ubiquitous disinformation, which is posing serious questions over the role of social media and the Internet in modern democratic societies. Topics and examples abound, ranging from the Brexit referendum and the US presidential election to medical misinformation (e.g. miraculous cures for cancer). Social media are now routinely reinforcing their users’ confirmation bias, so often, little to no attention is paid to opposing views or critical reflections. Blatant lies often make the rounds, re-posted and shared thousands of times, and even jumping successfully sometimes in mainstream media. Debunks and corrections, on the other hand, receive comparatively little attention.

I often get asked: “So why is this happening?”

My short answer is - the 4Ps of the modern disinformation age: post-truth politics, online propaganda, polarised crowds,  and partisan media.

  1. Post-truth politics: The first societal and political challenge comes from the emergence of post-truth politics, where politicians, parties, and governments tend to frame key political issues in propaganda, instead of facts. Misleading claims are continuously repeated, even when proven untrue through fact-checking by media or independent experts (e.g. the VoteLeave claim that Britain was paying the EU £350 million a week). This has a highly corrosive effect on public trust.
  2. Online propaganda and fake news: State-backed (e.g. Russia Today), ideology-driven (e.g.  misogynistic or Islamophobic), and clickbait websites and social media accounts are all engaged in spreading misinformation, often with the intent to deepen social division and/or influence key political outcomes (e.g. the 2016 US presidential election).  
  3. Partisan media: The pressures of the 24-hour news cycle and today’s highly competitive online media landscape have resulted in lower reporting quality and opinion diversity, with misinformation, bias, and factual inaccuracies routinely creeping in.
  4. Polarised crowds: As more and more citizens turn to online sources as their primary source of news, the social media platforms and their advertising and content recommendation algorithms have enabled the creation of partisan camps and polarised crowds, characterised by flame wars and biased content sharing, which in turn, reinforces their prior beliefs (typically referred to as confirmation bias).  

On Tuesday (19 December 2017) I gave evidence in front of the Digital, Culture, Media, and Sports Committee (DCMS) as part of their enquiry into fake news (although I prefer the term disinformation) and automation (ako bots) - their ubiquity, impact on society and democracy, the role of platforms and technology in creating the problem, and briefly also - can we use existing technology to detect and neutralise the effect of bots and  disinformation.

The session lasted an hour, in which we had to answer 51 questions, spanning all these issues, so it meant each answer had to be kept very brief.   The full transcript is available here.

The list of questions was not given to us in advance, which, coupled with the need for short answers, left me with a number of additional points I would like to make. So this is the first of several blog posts where I will revisit some of these questions in more detail.

Let's get started with the first four questions (Q1 to Q4 in the transcript), which were about the availability and accuracy of technology for automatic detection of disinformation on social media platforms. In particular:
can such technology identify disinformation in real time (part of Q3) and should it be adopted by the social media platforms themselves (Q4).

TD;LR: Yes, in principle, but we are still far from having solved key socio-technical issues, so, when it comes to containing the spread of disinformation, we should not use this as yet another stick to beat the social media platforms with.

And here is why this is the case:

  • Non-trivial scalability: While some of our algorithms work in near real time on specific datasets (e.g. tweets about the Brexit referendum), applying them across all posts on all topics as Twitter would need to do, for example, is very far from trivial. Just to give a sense of the scale here - prior to 23 June 2016 (referendum day) we had to process fewer than 50 Brexit-related tweets per second, which was doable. Twitter, however, would need to process more  than 6,000 tweets per second which is a serious software engineering, computational, and algorithmic challenge.

  • Algorithms make mistakes, so while 90% accuracy intuitively sounds very promising, we must not forget the errors - 10%  in this case, or double that at 80% algorithm accuracy. On  6,000 tweets per second this 10% amounts to 600 wrongly labeled tweets per second rising to 1,200 for the lower accuracy algorithm. To make matters worse, automatic disinformation analysis often combines more than one algorithm - first to determine which story a post refers to and second - whether this is likely true, false, or uncertain. Unfortunately, when algorithms are executed in a sequence, errors have a cumulative effect.

  • These mistakes can be very costly: broadly speaking algorithms make two kinds of errors - false negatives (e.g.. disinformation is wrongly labelled as true or bot accounts wrongly identified as human) and false positives (e.g. correct information is wrongly labelled as disinformation or genuine users being wrongly identified as bots). False negatives are a problem on social platforms, because the high volume and velocity of social posts (e.g. 6,000 tweets per second on average) still leaves us with a lot of disinformation “in the wild”. If we draw an analogy with email spam - even though most of it is filtered out automatically, we are still receiving a significant proportion of spam messages. False positives, on the other hand, pose an even more significant problem, as they could be regarded as censorship. Facebook, for example, has a growing problem with some users having their accounts wrongly suspended.

Related posts on this blog: 
Categories: Blogroll

Vowpal Wabbit 8.5.0 & NIPS tutorial

Machine Learning Blog - Sun, 2017-12-03 11:45

Yesterday, I tagged VW version 8.5.0 which has many interactive learning improvements (both contextual bandit and active learning), better support for sparse models, and a new baseline reduction which I’m considering making a part of the default update rule.

If you want to know the details, we’ll be doing a mini-tutorial during the Friday lunch break at the Extreme Classification workshop at NIPS. Please join us if interested.

Edit: also announced at the Learning Systems workshop

Categories: Blogroll
Syndicate content