Blogroll
Thursday Signal - Repeat After Me: Apps Are (Currently) Myopic (Or...We've Seen This Movie Before...)
Video Chat on the Plane? Illegal? OK? Legal Gray Area?
Announcing The Fifth Annual CM Summit: Theme and Initial Lineup
Yahoo!’s Learning to Rank Challenge
[This is a refactoring of the material that should have been broken out from yesterday's discussion of the expected reciprocal rank evaluation metric.]
Yahoo! is hosting an online Learning to Rank Challenge. Sort of like a poor man’s Netflix (top prize US$8K, and requires you to make your own way to Haifa (in Israel) and register for a conference to present results in order to collect it).
Like Netflix, this is an incredibly rich data set tied to a real large-scale application. I can imagine how much editorial time went into creating it. There are over 500K real query relevance judgments!
Training Data
The supervised training data consists of 500K or so pairs of queries and documents represented as 750-dimensional vectors and provided “editorial grades” of relevance from 0 (completely irrelevant) to 4 (near perfect match). There’s also a smaller set of 20K or so pairs with some overlapping and some new features to evaluate adaptation.
Unfortunately, there’s no raw data, just the vectors.
Test Data
The contest involves ranking a set of documents with respect to a query, where the input is the set of document-query vectors. An evaluation set consists of a number of query-document pairs represented as vectors, where there may be any number of potential documents for each query. The systems must return rankings of the documents for each query.
Binary Logistic Regression for Ranking
In case you were curious, I’m going to tackle the learning to rank problem by generalizing LingPipe’s logistic regression models to allow probabilistic training. Then I can train editorial grade stopping probabilities directly. If I can predict those well, I can score well at ranking. I don’t see any advantage to training something like DCA compared to a binary logistic regression on a per-example basis here.
I’m not sure which regression packages that scale to this size data allow probabilistic training. It could always be faked by making an even bigger data set with positive and negative items.
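The "bigger data set" trick can be sketched as follows. This is only an illustration under my own assumptions: the function name and the (features, label, weight) triple layout are hypothetical, not part of LingPipe or any particular regression package. Each example with target probability p becomes a positive item weighted p and a negative item weighted 1 - p, and the weighted log loss over the expanded set matches the cross-entropy against the original probabilistic target.

```python
from fractions import Fraction


def expand_soft_examples(examples):
    """Turn (features, p) pairs into weighted binary examples.

    Hypothetical helper: each pair becomes a positive example with
    weight p and a negative example with weight 1 - p. Items whose
    weight would be zero are dropped.
    """
    expanded = []
    for features, p in examples:
        if not 0 <= p <= 1:
            raise ValueError("target probability must lie in [0, 1]")
        if p > 0:
            expanded.append((features, 1, p))
        if p < 1:
            expanded.append((features, 0, 1 - p))
    return expanded


# Two made-up feature vectors with stopping probabilities 7/16 and 0.
soft = [([0.2, 1.5], Fraction(7, 16)), ([0.9, 0.1], Fraction(0))]
binary = expand_soft_examples(soft)
for features, label, weight in binary:
    print(features, label, weight)
```

A package that accepts per-example weights could consume the triples directly; one that does not would need the items replicated in proportion to their weights, which is where the data set gets bigger.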
No IP Transfer, But a Non-Compete
I’d like to be able to play with these data sets and publish about them and release demos based on them without transferring any intellectual property to a third party other than a publication and rights to publicity. IANAL, but I don’t think I can do that with this data.
We still haven’t gone through the legal fine print, nor is it clear it will be worth our while to do so in the end. It’s expensive to run this stuff by lawyers to look for gotchas. And my own read makes it seem like a no-go.
It doesn’t appear from a quick read that they require the winners to give them the algorithm. They do require the winners have the rights to their submission. And they will take possession of copyright on submissions (presumably not including the algorithm itself). This is great if I’m reading it right.
On the other hand, the non-compete clause (4a) immediately jumped out at me as a deal breaker:
THE TEAM WILL NOT ENTER THE ALGORITHM NOR ANY OUTPUT OR RESULTS THEREOF IN ANY OTHER COMPETITION OR PROMOTION OFFERED BY ANYONE OTHER THAN THE SPONSOR FOR ONE YEAR AFTER THE CONCLUSION OF THE CONTEST PERIOD;
Not only are they shouting, I’m not sure what constitutes “the algorithm”. It’d be absurd to tie up LingPipe’s logistic regression by entering a contest. Yet it’s not clear what more there is to “THE ALGORITHM”.
And if we release something as part of LingPipe, our royalty-free license lets people do whatever they want with it. So I don’t see how we could comply.
The other issue that’s problematic for a company is (4g):
agrees to indemnify and hold the Contest Entities and their respective subsidiaries, affiliates, officers, directors, agents, co-branders or other partners, and any of their employees (collectively, the “Contest Indemnitees”), harmless from any and all claims, damages, expenses, costs (including reasonable attorneys’ fees) and liabilities (including settlements), brought or asserted by any third party against any of the Contest Indemnitees due to or arising out of the Team’s Submissions or Additional Materials, or any Team Member’s conduct during or in connection with this Contest, including but not limited to trademark, copyright, or other intellectual property rights, right of publicity, right of privacy and defamation.
One doesn’t like to put one’s company on the line for a contest that’ll at most net US$8K (less travel and conference registration) and minor publicity.
Google Public Data Explorer
Google Public Data Explorer [google.com] is the latest entry in the ongoing race to democratize data access and its representation for lay people. Similar to Many Eyes, Swivel, Tableau Public and many others, Google Public aims to make large datasets easy to explore, visualize and communicate. As a unique feature, the charts and maps are able to animate over time, so that any meaningful time-varying data changes become easier to understand. The goal is for students, journalists, policy makers and everyone else to play with the tool to create visualizations of public data, link to them, or embed them in their own webpages. Embedded charts are also updated automatically, so they always show the latest available data.
Google is not a particularly new player in this realm. After Google acquired Trendalyzer from the Gapminder foundation several years ago, a simple Chart API and more powerful Data Visualization API appeared that allowed for the generation of powerful, interactive information dashboards. More recently, Google added visualization graphs to search results. Other valuable visualization jewels in the now impressive Google treasure collection include: Google Insights for Search, Google Trends, Google Wonder Wheel, Google Flu Trends and Google Zeitgeist.
More detailed information is available on the official Google Blog. In the blog post, you will also find an interesting downloadable dataset of the 80 most popular data and statistics search topics, based on the aggregation of billions of queries people typed into Google search.
Weds. Signal: Get Me a Mobile Strategy or You're Fired!
Quick Hit: Mobile Report From Weisel
Chapelle, Metzler, Zhang, Grinspan (2009) Expected Reciprocal Rank for Graded Relevance
In this post, I want to discuss the evaluation metric, expected reciprocal rank (ERR), which is the basis of Yahoo!’s Learning to Rank Challenge. The ERR metric was introduced in:
- Chapelle, Olivier, Donald Metzler, Ya Zhang, and Pierre Grinspan. 2009. Expected reciprocal rank for graded relevance. In CIKM.
The metric was designed to model a user’s search behavior better than previous metrics, many of which are discussed in the paper and related to expected reciprocal rank.
The Cascade Model
Expected reciprocal rank is based on the cascade model of search (there are citations in the paper). The cascade model assumes a user scans through ranked search results in order, and for each document, evaluates whether the document satisfies the query, and if it does, stops the search.
This model presupposes that a single document can satisfy a search. This is a reasonable model for question answering (e.g. [belgium capital], [raveonettes concert dates]) or for navigation queries (e.g. [united airlines], [new york times]).
In a cascade model, a highly ranked document that is likely to satisfy the search will dominate the metric.
But the cascade model is a poor model for the kinds of search I often find myself doing, which is more research survey oriented (e.g. ["stochastic gradient" "logistic regression" (regularization OR prior)], [splice variant expression RNA-seq]). In these situations, I want enough documents to feel like I’ve covered a field up to whatever need I have.
Editorial Grades
To make a cascade model complete, we need to model how likely it is that a given document will satisfy a given user query. For the learning to rank challenge, this is done by assigning each query-document pair an “editorial grade” from 0 to 4, with 0 meaning irrelevant and 4 meaning highly relevant. These are translated into probabilities of the document satisfying the search by mapping a grade $g$ to $(2^g - 1)/2^4$, resulting in the following probabilities:
Grade   Satisfaction Probability
0       0
1       1/16
2       3/16
3       7/16
4       15/16
That is, when reviewing a search result for a query with an editorial grade of 3, there is a 7/16 chance the user will be satisfied with that document and hence terminate the search, and a 9/16 chance they will continue with the next item on the ranked list.
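The probabilities listed above are consistent with mapping a grade g to (2^g - 1)/16. A minimal sketch of that mapping follows; the function name and the `max_grade` parameter are my own, chosen for illustration.

```python
from fractions import Fraction

MAX_GRADE = 4  # highest editorial grade in the challenge data


def satisfaction_probability(grade, max_grade=MAX_GRADE):
    """Map an editorial grade to the chance that the user stops at the document."""
    if not 0 <= grade <= max_grade:
        raise ValueError("grade out of range")
    return Fraction(2 ** grade - 1, 2 ** max_grade)


table = [satisfaction_probability(g) for g in range(MAX_GRADE + 1)]
for grade, p in enumerate(table):
    print(grade, p)
```

Exact fractions are used here only so the output can be compared against the listed values by eye; floats would do for actual scoring.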
Expected Reciprocal Rank
Expected reciprocal rank is just the expectation of the reciprocal of the position of a result at which a user stops. Suppose for a query $q$, a system returns a ranked list of documents $d_1, \ldots, d_n$, where the probability that document $d_i$ satisfies the user query is given by the transform of the editorial grade assigned to the query-document pair, which we write $R_i$. If we let $S$ be a random variable denoting the rank at which we stop, the metric is the expectation of $1/S$,
$$\mathrm{ERR} = \mathbb{E}[1/S] = \sum_{k=1}^{n} \frac{1}{k} \, R_k \prod_{i=1}^{k-1} (1 - R_i).$$
Stopping at rank $k$ involves being satisfied with document $d_k$, the probability of which is $R_k$. It also involves not having been satisfied with any of the previous documents ranked $1, \ldots, k-1$, the probability of which is $\prod_{i=1}^{k-1} (1 - R_i)$. This is all multiplied by $1/k$, because it’s the inverse stopping rank whose expectation is being computed.
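As a rough sketch of the computation just described, the function below walks down a ranked list of editorial grades. At each rank it adds the probability of reaching that rank, times the probability of being satisfied there, times the reciprocal of the rank. The function name and the running `p_reach` variable are my own, and the grade-to-probability mapping assumed is (2^g - 1)/16.

```python
def expected_reciprocal_rank(grades, max_grade=4):
    """ERR for a ranked list of editorial grades, best-ranked document first."""
    err = 0.0
    p_reach = 1.0  # probability the user was not satisfied by any earlier document
    for rank, grade in enumerate(grades, start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        err += p_reach * p_satisfied / rank
        p_reach *= 1.0 - p_satisfied
    return err


# An irrelevant document followed by a top-grade one.
print(expected_reciprocal_rank([0, 4]))  # 0.46875, i.e. 1/2 * 15/16
```

Carrying `p_reach` forward avoids recomputing the product over the earlier documents at every rank, so the whole list is scored in a single pass.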
For instance, suppose my system returns documents D1, D2, and D3 for a query Q, where the editorial grades for the document-query pairs are 3, 2, and 4 respectively. The expected reciprocal rank is computed by
Rank k   1/Rank   Grade   p(satisfied by doc k)   p(stop at doc k)
1        1/1      3       7/16                    7/16
2        1/2      2       3/16                    3/16 * (1 - 7/16)
3        1/3      4       15/16                   15/16 * (1 - 3/16) * (1 - 7/16)
For instance, to stop at rank 2, I have to not be satisfied by the document at rank 1 and be satisfied by the document at rank 2.
We then just multiply the reciprocal ranks by the stop probabilities to get:
ERR = 1/1 * 7/16 + 1/2 * 3/16 * (1 - 7/16) + 1/3 * 15/16 * (1 - 3/16) * (1 - 7/16) = 0.63
Document Relevance Independence
One problem remaining for this metric (and others) is correlated documents. I often find myself doing a web search and getting lots and lots of hits that are essentially wrappers around the same PDF paper (e.g. CiteSeer, Google Scholar, ACM, several authors’ publication pages, departmental publications pages, etc). Assigning each of these an independent editorial grade does not make sense, because if one satisfies my need, they all will, and if one doesn’t satisfy my need, none of them will. They’re not exact duplicate pages, though, so it’s not quite just a deduplication problem. And of course, this is just the extreme end of correlation among results.
Lots of queries are ambiguous. For instance, consider the query [john langford], which is highly ambiguous. There’s an Austin photographer, a realtor, the machine learning blogger, etc. I probably want one of these, not all of them (unless I’m doing a research type search), so my chance of stopping at a given document is highly dependent on which John Langford is being mentioned. This is another example of non-independence of stopping criteria.
Reciprocal Rank vs. Rank
One nice feature of the expected reciprocal rank metric is that it always falls between 0 and 1, with 1 being best. This means that the loss due to messing up any single example is bounded. This is like 0/1 loss for classifiers, where the maximum loss for misclassifying a given example is 1.
Log loss for classifiers is the negative log probability of the correct response. This number isn’t bounded. Similarly, were we to measure expected rank rather than expected reciprocal rank, the loss for a single example would not be bounded other than by the number of possible matching docs. It actually seems to make more sense to me to use ranks because of their natural scale and ease of combination; expected rank measures the actual number of pages a user can be expected to consider.
With reciprocal rank, as the results lists get longer, the tail is inverse weighted. That is, finding a document at rank 20 adds at most 1/20 to the result. This makes a list of 20 documents with all zero grade documents (completely irrelevant) have ERR of 0, and a list of 20 docs with the first 19 zero grade and the last grade 4 (perfect) has ERR of 1/20 * 15/16. In general, if the perfect result is at position k and all previous results are completely irrelevant, the ERR is 1/k * 15/16. This means there’s a huge difference between rank 1 and 2 (ERR = 0.94 vs. ERR = 0.47), with a much smaller absolute difference between rank 10 and rank 20 (ERR = 0.094 vs. ERR = 0.047). This matters when we’re averaging the ERR values for a whole bunch of examples.
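To make the tail weighting concrete, here is a small sketch for a list whose only relevant document is a grade 4 one at rank k. It assumes the challenge’s mapping, under which a grade 4 document satisfies the user with probability 15/16; the function name is hypothetical.

```python
from fractions import Fraction


def err_single_perfect(k, max_grade=4):
    """ERR when ranks 1..k-1 hold grade 0 documents and rank k holds a top-grade one."""
    p_top = Fraction(2 ** max_grade - 1, 2 ** max_grade)
    return p_top / k


for k in (1, 2, 10, 20):
    print(k, float(err_single_perfect(k)))

# Absolute drop in ERR from rank 1 to 2, and from rank 10 to 20.
print(float(err_single_perfect(1) - err_single_perfect(2)))
print(float(err_single_perfect(10) - err_single_perfect(20)))
```

The first drop is ten times the second, which is why mistakes near the top of a list dominate an average of ERR values over many queries.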
With ranks, there’s the huge problem of no results being found, which provides a degenerate expected rank calculation. Somehow we need the expected rank calculation to return a large value if there are no relevant results. Maybe if everyone’s evaluating the same documents it won’t matter.
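One way to patch the degenerate case, offered purely as my own assumption and not something the challenge defines, is to charge a fixed penalty rank whenever the user gets through the whole list unsatisfied. The `no_stop_rank` parameter below is hypothetical, and the grade-to-probability mapping assumed is again (2^g - 1)/16.

```python
def expected_rank(grades, no_stop_rank, max_grade=4):
    """Expected stopping rank under the cascade model.

    no_stop_rank is a hypothetical penalty rank charged when the user
    is never satisfied by any document in the list.
    """
    expected = 0.0
    p_reach = 1.0  # probability no earlier document satisfied the user
    for rank, grade in enumerate(grades, start=1):
        p_satisfied = (2 ** grade - 1) / 2 ** max_grade
        expected += rank * p_reach * p_satisfied
        p_reach *= 1.0 - p_satisfied
    # Whatever probability mass is left never stopped at all.
    return expected + no_stop_rank * p_reach


print(expected_rank([0, 0, 0], no_stop_rank=100))  # all mass lands on the penalty
print(expected_rank([4], no_stop_rank=10))         # 1 * 15/16 + 10 * 1/16
```

The result depends heavily on the penalty chosen, which is exactly the scaling problem that the bounded reciprocal form sidesteps.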
Language Model Generated Injection Attacks: Cool/Disturbing LingPipe Application
Joshua Mason emailed us with a link to his (with a bunch of co-authors) recent ACM paper “English Shellcode” (http://www.cs.jhu.edu/~sam/ccs243-mason.pdf). Shell code attacks can attempt to seize control of a computer by masquerading as data. The standard defense is to look for tell-tale patterns in the data that reflect the syntax of assembly language instructions. It is sort of like spam filtering.
The filter would have to reject strings that looked like:
“7IIQZjJX0B1PABkBAZB2BA2AA0AAX8BBPux”
which would not be too hard if you knew to expect language data.
Mason et al. changed the code generation process so that lots of variants of the injection are tried but filtered against a language model of English based on the text of Wikipedia and Project Gutenberg.
The result is an injection attack that looks like:
“There is a major center of economic activity, such as Star Trek, including The Ed Sullivan Show. The former Soviet Union.”
This is way better than I would have thought possible and it is going to be very difficult to filter. It would be interesting to see how automatic essay grading software would score the above. It is gibberish, but sophisticated sounding gibberish.
And it used LingPipe for the language processing.
I am a firm believer in the white hats publicizing exploits before black hats deploy them surreptitiously. This one could be a real problem however.
Breck
Obama Loves Infographic Movies (and Edward Tufte)
Yesterday, I tweeted the quite surprising news that information graphics guru Edward Tufte will be joining the Recovery Independent Advisory Panel [edwardtufte.com], which aims to track and explain $787 billion in recovery stimulus funds in the US. I was convinced this news event had been given sufficient attention.
Little did I know. I just discovered the Obama campaign actually loves infographics so much that they recently featured an infographic-style movie on their official website [barackobama.com]. While not shockingly original, except maybe for the dramaturgical twist around the 0:38 mark, it is nevertheless nice to see those electoral campaign dollars at work in the area of (political) infographics. They even consider data graphs to be so persuasive that the public is invited to download the 'V' shaped job loss bar graph (PDF) as a Twitter icon.
Watch the movie below.
Tuesday Signal: The Internet Is A Human Right (And Spending Is Up. Yippee!)
Tim Berners-Lee: The Year Open Data Went Worldwide (TED Talk)
The very short TED talk by internet pioneer Tim Berners-Lee shows a few of the interesting results where open data gets mashed up in various compelling data visualizations. The talk includes many examples the avid infosthetics reader should already know, including Where Does my Money Go?, California Stimulus Map, data.gov.uk Newspaper and OpenStreetMap Edits.
Other projects, not blogged about before on this site, include Making Water a Matter of Race [time.com], Afghanistan Election Data [afghanistanelectiondata.org] and Haiti OpenStreetMap Edits [itoworld.blogspot.com].
Watch the movie below.
See also another recent TED talk about a powerful data visualization tool developed by Microsoft: "Live Labs Pivot: A Massive Interactive Zoom on Data".
Accusing Google's Business Practice through (another) Infographic Movie
It seems the classic infographic movie "Google Master Plan" from early 2007 is now finally outdated, as its main allegations have recently been updated by another infographic-style film, aptly titled The Beast File: Google [abc.net.au] (movie not viewable outside Australia, but watch a YouTube version below).
The new movie, which itself seems to be inspired by the visual zooming effects from the presentation software Prezi, defines Google as an advertising giant whose main goal is to track users and deliver targeted ads.
For the appropriate counterargument, you can read the following post at the official Google Operating System blog.
You can watch the movie below.
Database of Intentions Chart - Version 2, Updated for Commerce
Monday Signal: Block Those Ads!
The Database of Intentions Is Far Larger Than I Thought
Clavilux 2000: Generative Music Visualization Composition
Clavilux 2000 [jonasheuer.de] is a subtle music visualization installation that represents the playing of sounds by way of a simultaneous animation that can be interpreted.
For every note played on the keyboard, a stripe appears whose dimensions, position and color correspond to the way the particular key was struck. The length and vertical position of the stripe are mapped onto the velocity, while the stripe's width reflects the length of each note. By mapping the color wheel onto the circle of fifths, the colors give the viewer (and listener) an impression of the harmonic relations. Notes belonging to one specific tonality correspond to colors from one specific area of the color wheel. Therefore each key has its own color scheme and "wrong" notes stand out in contrasting colors. The more tonalities a piece of music has, the more colorful the resulting visualization will be.
As the stripes do not disappear, the resulting representation is able to convey insights about the composition as well as the specific performance: Which notes were played the most? Which were the loudest notes? Which range of the keys was played most? How harmonically constant was the music?
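The color mapping can be sketched roughly as below. This is an assumption-laden illustration, not the installation's actual code: it guesses that each step around the circle of fifths advances the hue by 30 degrees, starting with C at hue 0, and all names are made up.

```python
import colorsys

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def fifths_position(pitch_class):
    """Position of a pitch class (0 = C) on the circle of fifths (C=0, G=1, D=2, ...)."""
    return (7 * pitch_class) % 12


def note_hue_degrees(pitch_class):
    """Assumed mapping: one step on the circle of fifths equals 30 degrees of hue."""
    return fifths_position(pitch_class) * 30


def note_rgb(pitch_class):
    """Fully saturated RGB color for a pitch class under the assumed mapping."""
    r, g, b = colorsys.hsv_to_rgb(note_hue_degrees(pitch_class) / 360, 1.0, 1.0)
    return (round(r * 255), round(g * 255), round(b * 255))


c_major = [0, 2, 4, 5, 7, 9, 11]  # C D E F G A B
print(sorted(note_hue_degrees(pc) for pc in c_major))
print(NOTE_NAMES[6], note_hue_degrees(6), note_rgb(6))
```

Under this mapping the seven notes of C major occupy one contiguous arc of the wheel (hue 330 wrapping around through 150), while F# sits at 180, on the far side from C, so an out-of-key note would show up in a contrasting color.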
Watch the visualization in action below.
Live Labs Pivot: A Massive Interactive Zoom on Data (TED Talk)
"Viewing information and data in this way, is a lot like swimming in a living information infographic." During his very impressive TED talk, Gary Flake, Technical Fellow at Microsoft, demos the novel and still experimental Pivot [getpivot.com] technology. Pivot is a completely new way to browse and arrange massive amounts of images and data online. Built on the Seadragon zooming technology, it enables spectacular zooming in and out of web databases, and the discovery of patterns and links invisible in standard web browsing.
"Right now, in this world, we think about data as being this curse, we talk about the curse of information overload, drowning in data. What if we can turn that upside down, so that instead of navigating from one thing to the next, we get used to the habit of being able to go from many things to many things and then being able to see the patterns that were otherwise hidden."
Watch the video of the talk below.
What will it enable in the world of data visualization that was not possible before?
DaVis'10: Deadline Extended!
You might have seen our previous announcement of DAVis, the 5th International Symposium on Design and Aesthetics in Visualisation, co-located with the much bigger IEEE IV'10 conference in London.
For all those of you who missed the quite tight submission deadline, here is some very good news:
Submissions are still possible until March 21! Make sure to check out the rules about the style, maximum length, important dates, etc. at the IV'10 website (click on "Papers"). Here are some answers to the questions we have received so far: we accept long and short papers and posters, although long papers are much preferred. After successfully passing the double-blind peer-review process, all accepted submissions will be published as full academic publications in the official conference proceedings by IEEE and thus be available online for all to see.
Again, we are looking forward to your submissions. Do not hesitate to get in touch if you have any open questions!
This post was written by Moritz Stefaner, a researcher and freelance practitioner on the crossroads of design and information visualization. Occasionally, he blogs at well-formed-data.net.


