Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

That Was Fast: TellApart Implements A Searchblog Suggestion

Searchblog - Thu, 2010-09-02 13:37
Earlier this week I mused out loud about retargeting, suggesting that perhaps it's time for marketers to not just chase folks around the web in hopes they might irritate us into submission, but rather offer us the chance to politely say "Not right now, thanks." One of Searchblog's readers... (Go to Searchblog Main)
Categories: Blogroll

Google and AOL Renew Pact

Searchblog - Thu, 2010-09-02 12:26
One of the best cards in AOL's difficult hand has been its search deal with Google, which was up for renewal this year. This morning the two companies announced (ahead of a December deadline) that they were staying together, though I can only imagine the folks at Bing didn't... (Go to Searchblog Main)
Categories: Blogroll

Ping: "Facebook and Twitter meet iTunes" Except...

Searchblog - Thu, 2010-09-02 00:34
...as far as I can tell, they in fact don't ever meet. You can't leverage your networks on Facebook and Twitter in Ping. It's another closed Apple system, another Apple universe in a gilded gift box. It's not that Apple hates the web, it's just that Apple is better... (Go to Searchblog Main)
Categories: Blogroll

US Open Tennis Real-Time Data Visualization

Information aesthetics - Wed, 2010-09-01 20:56


On the heels of the many real-time sports visualizations that appeared alongside the recent FIFA World Cup, the US Open Pointstream [usopen.org] presents an original 3D-like way of exploring the statistical data generated during all the live tennis matches of one of the most famous sports events in the world.

Users can select individual matches that have already finished or are still in progress. A "Momentum Meter" shows who is on top of the match, while a series of filters at the bottom (e.g. ace, double fault, net point, break point, ...) allows for deeper analysis of the data. Visually, each player is distinguished by the color green or blue. Each ring represents a set, going from the inside to the outside. Each bar represents a point, with its height determined by the serving speed.

Beautiful or useful?

Categories: Blogroll

blekko Explains Itself: Exclusive Video (Update: Exclusive Invite)

Searchblog - Wed, 2010-09-01 01:21
And in case you are wondering what the big deal is, besides all the data you can mine, to my mind, it's the ability to cull the web - to "slash" the stuff you don't care about out of your search results. ... And to bone up on the various merits of the service, here are a few key links: Blekko: A Search Engine Which Is Also A Killer SEO Tool (SEL) TechCrunch Review: The Blekko Search Engine Prepares To Launch (TC) A new search engine Blekko search: first impressions (Economist) (Go to Searchblog Main)
Categories: Blogroll

Online Learning Algorithms that Work Harder

NLPers - Tue, 2010-08-31 21:09
It seems to be a general goal in practical online learning algorithm development to have the updates be very, very simple.  Perceptron is probably the simplest, and involves just a few adds.  Winnow takes a few multiplies.  MIRA takes a bit more, but still nothing hugely complicated.  Same with stochastic gradient descent algorithms for, e.g., hinge loss.
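
To make "just a few adds" concrete, here is a minimal sketch of one perceptron step on a sparse example. The function name, the dict-based feature representation and the toy stream are assumptions made for illustration; they are not taken from any particular implementation.

```python
def perceptron_update(weights, features, label):
    """One online perceptron step on a sparse example.

    weights  : dict feature -> weight, modified in place
    features : dict feature -> value
    label    : +1 or -1
    Returns True if the example was misclassified and the weights changed.
    """
    score = sum(weights.get(f, 0.0) * v for f, v in features.items())
    if label * score <= 0:  # mistake, or exactly on the decision boundary
        for f, v in features.items():
            # The whole update: one add per active feature.
            weights[f] = weights.get(f, 0.0) + label * v
        return True
    return False


# Toy stream of (features, label) pairs, made up for this example.
weights = {}
stream = [
    ({"good": 1.0}, +1),
    ({"bad": 1.0}, -1),
    ({"good": 1.0, "bad": 1.0}, +1),
]
mistakes = sum(perceptron_update(weights, x, y) for x, y in stream)
```

The cost per example is proportional to the number of active features, which is why the update itself is rarely where the time goes.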

I think this maybe used to make sense.  I'm not sure that it makes sense any more.  In particular, I would be happier with online algorithms that do more work per data point, but require only one pass over the data.  There are really only two examples I know of: the StreamSVM work that my student Piyush did with me and Suresh, and the confidence-weighted work by Mark Dredze, Koby Crammer and Fernando Pereira (note that they maybe weren't trying to make a one-pass algorithm, but it does seem to work well in that setting).

Why do I feel this way?

Well, if you look even at standard classification tasks, you'll find that if you have a highly optimized, dual-threaded implementation of stochastic gradient descent, then your bottleneck becomes I/O, not learning.  This is what John Langford observed in his Vowpal Wabbit implementation.  He has to do multiple passes.  He deals with the I/O bottleneck by creating an I/O friendly, proprietary version of the input file during the first pass, and then careening through it on subsequent passes.

In this case, basically what John is seeing is that I/O is too slow.  Or, phrased differently, learning is too fast :).  I never thought I'd say that, but I think it's true.  Especially when you consider that just having two threads is a pretty low requirement these days, it would be nice to put 8 or 16 threads to good use.

But I think the problem is actually quite a bit more severe.  You can tell this by realizing that the idealized world in which binary classifier algorithms usually get developed is, well, idealized.  In particular, someone has already gone through the effort of computing all your features for you.  Even running something simple like a tokenizer, stemmer and stop word remover over documents takes a non-negligible amount of time (to convince yourself: run it over Gigaword and see how long it takes!), easily much longer than a silly perceptron update.

So in the real world, you're probably going to be computing your features and learning on the fly.  (Or at least that's what I always do.)  In which case, if you have a few threads computing features and one thread learning, your learning thread is always going to be stalling, waiting for features.
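
A rough sketch of that setup, assuming a hypothetical `featurize` function standing in for the tokenizer, stemmer and stop word remover, a couple of feature threads, and a single learner thread doing a perceptron-style update. All names and the toy data are made up for illustration.

```python
import queue
import threading


def featurize(doc):
    # Hypothetical stand-in for tokenizing, stemming and stop word removal.
    return {tok: 1.0 for tok in doc.lower().split()}


def train_on_the_fly(labeled_docs, n_feature_threads=2):
    raw = queue.Queue()
    for item in labeled_docs:
        raw.put(item)
    ready = queue.Queue(maxsize=4)  # small buffer between the two stages

    def feature_worker():
        while True:
            try:
                doc, label = raw.get_nowait()
            except queue.Empty:
                break
            ready.put((featurize(doc), label))
        ready.put(None)  # sentinel: this worker has finished

    workers = [threading.Thread(target=feature_worker)
               for _ in range(n_feature_threads)]
    for t in workers:
        t.start()

    weights, seen, finished = {}, 0, 0
    while finished < n_feature_threads:
        item = ready.get()  # the learner stalls here whenever features are late
        if item is None:
            finished += 1
            continue
        feats, label = item
        score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
        if label * score <= 0:  # cheap perceptron-style update
            for f, v in feats.items():
                weights[f] = weights.get(f, 0.0) + label * v
        seen += 1

    for t in workers:
        t.join()
    return weights, seen


docs = [("good movie", 1), ("bad movie", -1), ("good plot", 1), ("bad plot", -1)]
weights, seen = train_on_the_fly(docs)
```

The blocking `ready.get()` call is where the stall shows up: the update costs a handful of adds, so the learner spends most of its time waiting on that line.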

One way to partially circumvent this is to do a variant of what John does: create a big scratch file as you go and write everything to this file on the first pass, so you can just read from it on subsequent passes.  In fact, I believe this is what Ryan McDonald does in MSTParser (he can correct me in the comments if I'm wrong :P).  I've never tried this myself because I am lazy.  Plus, it adds unnecessary complexity to your code, requires you to chew up disk, and of course adds its own delays since you now have to be writing to disk (which gives you tons of seeks to go back to where you were reading from initially).
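
For what it's worth, the scratch-file idea can be sketched in a few lines. This is not how Vowpal Wabbit or MSTParser do it; the pickle format, the `examples` generator and the `featurize` stand-in are all assumptions chosen to keep the example short.

```python
import os
import pickle
import tempfile

calls = {"featurize": 0}  # counts how often the expensive step runs


def featurize(doc):
    # Hypothetical expensive feature computation.
    calls["featurize"] += 1
    return sorted(set(doc.lower().split()))


def examples(labeled_docs, cache_path):
    """Yield (features, label): compute on the first pass, replay afterwards."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as fh:
            while True:
                try:
                    yield pickle.load(fh)
                except EOFError:
                    return
    else:
        tmp_path = cache_path + ".tmp"
        with open(tmp_path, "wb") as fh:
            for doc, label in labeled_docs:
                example = (featurize(doc), label)
                pickle.dump(example, fh)
                yield example
        os.replace(tmp_path, cache_path)  # publish only a complete cache


docs = [("the cat sat", 1), ("the dog ran", -1)]
cache = os.path.join(tempfile.mkdtemp(), "features.cache")
first_pass = list(examples(docs, cache))
second_pass = list(examples(docs, cache))
```

The second pass never calls `featurize`, but it does pay for the disk space and the extra writes during the first pass, which is exactly the trade-off described above.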

A similar problem crops up in structured problems.  Since you usually have to run inference to get a gradient, you end up spending way more time on your inference than your gradients.  (This is similar to the problems you run into when trying to parallelize the structured perceptron.)

Anyway, at the end of the day, I would probably be happier with an online algorithm that spent a little more energy per-example and required fewer passes; I hope someone will invent one for me!
Categories: Blogroll

Banks, Risk Disclosure and Text Analytics

Life Analytics Blog - Tue, 2010-08-31 04:52

A UK-based MSc student at Kingston Business School, Christos Gkemitzis, had an idea for his MSc project which immediately caught my attention: apply Text Analytics methods to annual reports issued by Banks, extract metrics on how these Banks handle their Credit and Interest rate risk as explained in these reports, and then test several hypotheses (e.g., do Banks with a higher risk profile disclose a larger amount of risk-related information than those with a lower risk profile?) and also identify any correlations:
  • between the size of the Bank and volume of risk disclosures
  • between the risk of the Bank's profile and volume of risk disclosures
  • between the profitability of the Bank and volume of risk disclosures

Essentially the problem is to automatically identify mentions of credit risk, but in a specific way:
1) Identify whether sentences mentioning risk refer to the present, past or future
2) Identify positive, negative or neutral sentiment mentions about Credit Risk
3) Identify qualitative versus quantitative information regarding the Bank's Credit Risk

For example, consider the following text, which is part of an actual Bank report:
"A substantial increase of credit risk and provisions is also expected, as from 2009 on, the economy will be entering a period of low growth."
The sentence above contains qualitative information ("substantial increase of credit risk and provisions") and negative Sentiment referring to the future ("also expected" and "will be entering a period of...").
while the following sentence :
"The Group’s ongoing efforts to manage efficiently credit risk led the level of loan losses to 3.3% in December 2008"
contains quantitative information ("level of loan losses to 3.3%") with a positive sentiment about Credit Risk handling in the past.
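
To give a feel for the three decisions on these two sentences, here is a deliberately naive keyword sketch. It is not the GATE pipeline used in the project; the cue lists, regular expressions and function name are assumptions invented for this example and would not survive contact with a real report.

```python
import re

# Hypothetical cue lists, chosen only to cover the two example sentences.
FUTURE_CUES = re.compile(r"\b(will|expected|shall)\b", re.IGNORECASE)
PAST_CUES = re.compile(r"\b(led|was|were|had)\b", re.IGNORECASE)
PERCENTAGE = re.compile(r"\d+(?:\.\d+)?\s*%")
NEGATIVE_CUES = ("increase of credit risk", "low growth", "deterioration")
POSITIVE_CUES = ("manage efficiently", "improved", "reduced")


def classify_risk_sentence(sentence):
    """Return (kind, sentiment, tense) for one sentence about credit risk."""
    text = sentence.lower()
    kind = "quantitative" if PERCENTAGE.search(text) else "qualitative"
    if FUTURE_CUES.search(text):
        tense = "future"
    elif PAST_CUES.search(text):
        tense = "past"
    else:
        tense = "present"
    negative = sum(cue in text for cue in NEGATIVE_CUES)
    positive = sum(cue in text for cue in POSITIVE_CUES)
    if negative > positive:
        sentiment = "negative"
    elif positive > negative:
        sentiment = "positive"
    else:
        sentiment = "neutral"
    return kind, sentiment, tense


first = classify_risk_sentence(
    "A substantial increase of credit risk and provisions is also expected, "
    "as from 2009 on, the economy will be entering a period of low growth."
)
second = classify_risk_sentence(
    "The Group's ongoing efforts to manage efficiently credit risk led the "
    "level of loan losses to 3.3% in December 2008"
)
```

On the two sentences above this gives qualitative/negative/future and quantitative/positive/past respectively, matching the reading in the text.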
After receiving some PDF samples of Bank reports from Christos, I began feeding these reports to the GATE Text Analysis toolkit in order to assess the feasibility of such analysis. After some tutorials through Skype, Christos, who had no prior knowledge of programming, started using the toolkit on his own in a very short amount of time. Here is a snapshot of GATE in action for the analysis:





The snapshot shows how GATE correctly identified a part of text that communicates a negative sentiment for Credit Risk in a qualitative manner for the future (notice that "QualitativeBadNewsFuture" is checked).
After running GATE on many documents, Christos had the necessary metrics (i.e., how many mentions of the different Risk types exist in each document) to test his hypotheses using a two-tailed Wilcoxon test. To identify correlations, the Spearman coefficient was also used.
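
As an illustration of the correlation step only, the Spearman coefficient is the Pearson correlation of the ranks. The sketch below computes it in plain Python with average ranks for ties. The numbers are invented for the example and are not results from this research.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank lists."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Made-up figures: bank size versus number of risk mentions per report.
bank_size = [10, 50, 20, 80, 35]
risk_mentions = [12, 40, 15, 30, 90]
rho = spearman(bank_size, risk_mentions)
```

Because only the ranks matter, the measure is insensitive to how skewed the raw counts are, which suits document-level mention counts.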
Since this work has not been submitted yet, I am not permitted to post its findings. The post does show, however, another application of Text Analytics and the many sources of unstructured information that could be mined for knowledge.
Categories: Blogroll

Why SKOS thesauri matter – the next generation of semantic technologies

Semantic Web Company - Tue, 2010-08-31 00:44

As a matter of fact, there are still a lot of “semantic technologies” around which do nothing more than pure statistical analysis of text. Sure, this is better than simple full text search, but there are still quite a lot of opportunities to improve search, especially when it comes to more sophisticated applications like “similarity search”, the search for similar documents to enable cross-reading or recommendation systems.

Providers of first generation semantic technologies calculate rather basic “semantic networks” by co-occurrence analysis, which sometimes yields disappointing results. Bearing in mind that Google just bought a company (“Google buys Metaweb“) which has been working on one of the largest knowledge bases in the world, we could assume that some of the last miles towards a semantic search engine can be achieved by applying thesauri or other structured knowledge bases.

A demo application was recently developed by the PoolParty team in which one can see how thesauri improve search results on top of second generation semantic technologies. With PoolParty, SKOS-based controlled vocabularies can be managed and also enriched with linked data. PoolParty Tag & Content Recommender analyzes virtually any text or website to recommend corresponding tags, concepts from (in this case) STW (Standard Thesaurus für Wirtschaft), DBpedia and respective articles from Wikipedia.

STW, which was developed by the German National Library of Economics (ZBW), provides vocabulary on any economic subject: about 6,000 standardized subject headings and about 18,000 entry terms to support individual keywords.

This background knowledge is used in this demo app to improve the search for similar documents dramatically:

Similarity between two documents can be calculated not only on a key-phrase basis but also on a rather conceptual basis. Even if two documents do not have one single word or phrase in common they can be identified as “similar documents”.

This can be achieved because thousands of important relations between economic subjects are represented in the domain specific thesaurus. Thus, in this special case best results are achieved with documents from economics (for instance from Econstor) but of course for other recommender systems thesauri from other domains can be used instead of STW.
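
A toy sketch of the idea, under stated assumptions: the term-to-concept table and the broader-concept links below are invented for illustration (they are not STW content), and Jaccard overlap is just one simple similarity measure, not necessarily what the demo app uses.

```python
# Hypothetical mini-thesaurus: entry terms map to concepts, and each concept
# may have a broader concept (in the spirit of the SKOS "broader" relation).
TERM_TO_CONCEPT = {
    "mortgage": "Credit",
    "loan": "Credit",
    "bond": "Securities",
    "equity": "Securities",
}
BROADER = {"Credit": "Finance", "Securities": "Finance"}


def concepts(text):
    """Concepts mentioned in the text, expanded along the broader links."""
    found = set()
    for token in text.lower().split():
        concept = TERM_TO_CONCEPT.get(token)
        while concept is not None:
            found.add(concept)
            concept = BROADER.get(concept)
    return found


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


doc_a = "mortgage rates rise"
doc_b = "loan demand falls"
doc_c = "bond yields drop"

word_similarity = jaccard(set(doc_a.split()), set(doc_b.split()))
concept_similarity_ab = jaccard(concepts(doc_a), concepts(doc_b))
concept_similarity_ac = jaccard(concepts(doc_a), concepts(doc_c))
```

The first two documents share no words, yet both map to the same concepts, so they come out as similar. The third overlaps only through the broader concept and scores lower.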

Nevertheless, this approach can also be improved, and this development is underway: SKOS thesauri enriched with Linked Data do an even better job. This kind of third generation semantic technology is currently being developed by the LASSO project and the LOD2 project, two innovative projects in the area of linked data and the semantic web.

Categories: Blogroll

On Retargeting: Fix The Conversation

Searchblog - Mon, 2010-08-30 12:28
The New York Times published a story on the practice of retargeting today, entitled "Retargeting Ads Follow Surfers to Other Sites." While not nearly as presumptively negative as the WSJ series on marketing and data, it's telling that the story is slugged with "adstalk" in the URL. Journalists and editors... (Go to Searchblog Main)
Categories: Blogroll

What is the benefit of freaking customers out?

Greg Linden's Blog - Mon, 2010-08-30 11:01
Miguel Helft and Tanzina Vega at the New York Times have a front page article today, "Retargeting Ads Follow Surfers to Other Sites", on a form of personalized web advertising now being called retargeting.

An excerpt: People have grown accustomed to being tracked online and shown ads for categories of products they have shown interest in, be it tennis or bank loans.

Increasingly, however, the ads tailored to them are for specific products that they have perused online. While the technique, which the ad industry calls personalized retargeting or remarketing, is not new, it is becoming more pervasive as companies like Google and Microsoft have entered the field. And retargeting has reached a level of precision that is leaving consumers with the palpable feeling that they are being watched as they roam the virtual aisles of online stores.

In remarketing, when a person visits an e-commerce site and looks at say, an Etienne Aigner Athena satchel on eBags.com, a cookie is placed into that person’s browser, linking it with the handbag. When that person, or someone using the same computer, visits another site, the advertising system creates an ad for that very purse.
The article later goes on to contrast this technique of following you around with products you looked at before with behavioral targeting like Google is doing, which learns your broader category interests and shows ads from those categories.

If the goal of the advertising is to be useful and relevant, though, I think both of these are missing the mark. What you want to do is help people discover something they want to buy. Since the item they looked at before obviously wasn't quite right -- they didn't buy it after all -- showing that again doesn't help. Showing closely related alternatives, items that people might buy after rejecting the first item, could be quite useful though.

As marketing exec Alan Pearlstein says at the end of the NYT article, "What is the benefit of freaking customers out?" Remarketing freaks people out. If we are going to do personalized advertising, the goal should be to have the advertising be useful, either by sharing value with consumers using coupons as Pearlstein suggests, or by helping consumers find something interesting that they wouldn't have discovered on their own.

But, publishers should be careful when working with these new ad startups. A startup has a huge incentive to maximize short-term revenue and little incentive to maximize relevance. For the startup, as long as it brings in more immediate revenue, it is perfectly fine to show annoying ads that freak customers out and drive many away. Publishers need to force the focus to be on the value of the ads to the consumer so their customers are happy, satisfied, and keep coming back.
Categories: Blogroll

Web 2 Summit Points of Control: The Map

Searchblog - Mon, 2010-08-30 01:09
(Cross posted from the Web 2 Summit Blog...) As themes for conferences go, Points of Control is one of our favorites. Our industry over the past year has been driven by increasingly direct conflicts between its major players: Apple has emerged as a major force in mobile and advertising platforms;... (Go to Searchblog Main)
Categories: Blogroll

Data Journalist David McCandless

Data Mining Blog - Sat, 2010-08-28 12:05

Nice talk by David McCandless on culture and data visualization.

An additional interesting take on military spending would be the relationship between the country in question and the statistics of the other countries in the world as military intentions are, in part, external (in part of his presentation, David talks about the importance of providing relative views of data rather than absolutes).

Categories: Blogroll

Calibrating Reviews and Ratings

NLPers - Fri, 2010-08-27 13:14
NIPS decisions are going out soon, and then we're done with submitting and reviewing for a blessed few months. Except for journals, of course.

If you're not interested in paper reviews, but are interested in sentiment analysis, please skip the first two paragraphs :).

One thing that anyone who has ever area chaired, or probably even ever reviewed, has noticed is that different people have different "baseline" ratings. Conferences try to adjust for this, for instance NIPS defines their 1-10 rating scale as something like "8 = Top 50% of papers accepted to NIPS" or something like that. Even so, some people are just harsher than others in scoring, and it seems like the area chair's job to calibrate for this. (For instance, I know I tend to be fairly harsh -- I probably only give one 5 (out of 5) for every ten papers I review, and I probably give two or three 1s in the same size batch. I have friends who never give a one -- except in the case of something just being wrong -- and often give 5s. Perhaps I should be nicer; I know CS tends to be harder on itself than other fields.) As an aside, this is one reason why I'm generally in favor of fewer reviewers and more reviews per reviewer: it allows easier calibration.
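
One simple way to calibrate, sketched here as an assumption rather than anything a conference actually does, is to standardize each reviewer's scores against that reviewer's own mean and spread. The reviewer names and scores below are invented.

```python
from statistics import mean, pstdev


def calibrate(scores_by_reviewer):
    """Convert each reviewer's raw scores to z-scores within that reviewer."""
    calibrated = {}
    for reviewer, scores in scores_by_reviewer.items():
        values = list(scores.values())
        mu = mean(values)
        sd = pstdev(values)
        calibrated[reviewer] = {
            paper: (score - mu) / sd if sd > 0 else 0.0
            for paper, score in scores.items()
        }
    return calibrated


# Made-up scores on a 1-5 scale: same ordering, different baselines.
raw = {
    "harsh": {"paperA": 1, "paperB": 2, "paperC": 3},
    "generous": {"paperA": 3, "paperB": 4, "paperC": 5},
}
z = calibrate(raw)
```

After standardizing, the harsh and the generous reviewer agree exactly on these three papers. The estimate of each baseline gets better with more reviews per reviewer, which is the calibration argument made above.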

There's also the issue of areas. Some areas simply seem to be harder to get papers into than others (which can lead to some gaming of the system). For instance, if I have a "new machine learning technique applied to parsing," do I want it reviewed by parsing people or machine learning people? How do you calibrate across areas, other than by some form of affirmative action for less-represented areas?

A similar phenomenon occurs in sentiment analysis, as was pointed out to me at ACL this year by Franz Och. The example he gives is very nice. If you go to TripAdvisor and look up The French Laundry, which is definitely one of the best restaurants in the U.S. (some people say the best), you'll see that it got 4.0/5.0 stars, and a 79% recommendation. On the other hand, if you look up In'N'Out Burger, an LA-based burger chain (which, having grown up in LA, was admittedly one of my favorite places to eat in high school, back when I ate stuff like that) you see another 4.0/5.0 stars and a 95% recommendation.

So now, we train a machine learning system to predict that the rating for The French Laundry is 79% and In'N'Out Burger is 95%. And we expect this to work?!

Probably the main issue here is calibrating for expectations. As a teacher, I've figured out quickly that managing student expectations is a big part of getting good teaching reviews. If you go to In'N'Out, and have expectations for a Big Mac, you'll be pleasantly surprised. If you go to The French Laundry with expectations of having a meal worth selling your soul, your children's souls, etc., for, then you'll probably be disappointed (though I can't really say: I've never been).

One way that a similar problem has been dealt with on Hotels.com is that they'll show you ratings for the hotel you're looking at, and statistics of ratings for other hotels within a 10 mile radius (or something). You could do something similar for restaurants, though distance probably isn't the right categorization: maybe price. For "$", In'N'Out is probably near the top, and for "$$$$" The French Laundry probably is.
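
Here is a small sketch of that kind of within-category comparison, using price tier as the category. Only the 79% and 95% figures come from the example above; the other restaurants, their ratings and all the tier assignments are invented placeholders.

```python
def standing_within_tier(ratings, tiers, name):
    """Fraction of same-tier peers rated no higher than `name` (None if no peers)."""
    tier = tiers[name]
    peers = [r for other, r in ratings.items()
             if other != name and tiers[other] == tier]
    if not peers:
        return None
    return sum(r <= ratings[name] for r in peers) / len(peers)


# Recommendation percentages; all entries except the first two are made up.
ratings = {
    "The French Laundry": 79,
    "In'N'Out Burger": 95,
    "Hypothetical Bistro A": 70,
    "Hypothetical Bistro B": 75,
    "Hypothetical Burger A": 80,
    "Hypothetical Burger B": 88,
}
tiers = {
    "The French Laundry": "$$$$",
    "In'N'Out Burger": "$",
    "Hypothetical Bistro A": "$$$$",
    "Hypothetical Bistro B": "$$$$",
    "Hypothetical Burger A": "$",
    "Hypothetical Burger B": "$",
}
laundry = standing_within_tier(ratings, tiers, "The French Laundry")
burger = standing_within_tier(ratings, tiers, "In'N'Out Burger")
```

With these placeholder peers, both restaurants sit at the top of their own tier even though their raw percentages differ by 16 points, which is the kind of relative signal a learner could be trained on instead.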

(Anticipating comments, I don't think this is just an "aspect" issue. I don't care how bad your palate is, even just considering the "quality of food" aspect, Laundry has to trump In'N'Out by a large margin.)

I think the problem is that in all of these cases -- papers, restaurants, hotels -- and others (movies, books, etc.) there simply isn't a total order on the "quality" of the objects you're looking at. (For instance, as soon as a book becomes a best seller, or is advocated by Oprah, I am probably less likely to read it.) There is maybe a situation-dependent order, and the distance to hotel, or "$" rating, or area classes are heuristics for describing this "situation." But without knowing the situation, or having a way to approximate it, I worry that we might be entering a garbage-in-garbage-out scenario here.
Categories: Blogroll

Gnar Gnar Epic Apple #FAIL

Searchblog - Fri, 2010-08-27 12:41
...that was the subject of an email sent to me by my Apple-loving son when the image above showed up on the family iPad (yes, we have an iPad, my wife insisted. It's really hers, but that's another story). The story goes like this. My son had a question... (Go to Searchblog Main)
Categories: Blogroll

The Week In Signal

Searchblog - Fri, 2010-08-27 01:36
Here you go, all you 188K or so RSS readers. I know you really count on this round up, so you know what I'm doing each night around ten PM.... Friday Signal: A Pre-Weekend Potpurri Thursday Signal: Google’s About FaceBook Weds. Signal: Valuable Point of View, Well Stated, Is... (Go to Searchblog Main)
Categories: Blogroll

Is Google Objective?

Searchblog - Fri, 2010-08-27 00:52
I was struck by this headline from TechCrunch: Has Google Purged Places Of Yelp? All Signs Point To Yes. The story is rather pedestrian - yet another dispute between a content and community service with the all powerful Google. Sure, it's Yelp, but at the end of the day, it's... (Go to Searchblog Main)
Categories: Blogroll

The Recorded Future is Here

Data Mining Blog - Wed, 2010-08-25 23:50

Recorded Future is a new venture which mines the web for statements that are associated with some time expressions. It then uses this corpus to describe the future in various geographies for various topics. In addition to the application of information extraction methods, they also present this information in creative visual displays.


 

The site is plenty full of jQuery goodness, but I did find the newbie experience a little puzzling (how do I navigate to the data visualization? not clear...)

Finally, I loved this quote from a satisfied customer:

"This definitely reduces time in figuring out what may or may not be happening in the future based on what has been happening in the past. It cuts that time in half. "

Advertising Executive

[HT Sundar]


 

Categories: Blogroll

* Your Mileage May Vary

LingPipe Blog - Wed, 2010-08-25 14:38

Alex Smola just introduced a blog, Adventures in Data, that’s focusing on many of the same issues as this one. His first few posts have been on coding tweaks and theorems for gradient descent for linear classifiers, a problem Yahoo!’s clearly put a lot of effort into.

Lazy Updates for SGD Regularization, Again

In the wide world of independently discovering algorithmic tricks, Smola has a post on Lazy Updates for Generic Regularization in SGD. It presents the same technique as I summarized in my pedantically named white paper Lazy Sparse Stochastic Gradient Descent for Regularized Multinomial Logistic Regression. It turns out to be a commonly used, but not often discussed technique.

Blocking’s Faster

I wrote a long comment on that post, explaining that I originally implemented it that way in LingPipe, but eventually found it faster in real large-scale applications to block the prior updates.

Your Mileage May Vary

To finally get to the point of the post (blame my love of rambling New Yorker articles), your mileage may vary.

The reason blocking works for us is that we have feature sets like character n-grams where each text being classified produces on the order of 1/1000 of all features. So if you update by block every 1000 epochs, it’s a win. And that’s not even counting the savings from dropping the memory and time overhead of keeping the record of sparseness, or the increased memory locality.
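
For readers who haven't seen the trick, here is a minimal sketch of the three variants for L2-regularized logistic regression with a constant learning rate. It is not LingPipe's or Smola's code: the function names, the dict-based weights and the toy data are assumptions, and the blocked variant reflects one reading of "blocking" (apply the accumulated shrinkage to every weight once per block).

```python
import math


def _prob(w, x):
    return 1.0 / (1.0 + math.exp(-sum(w.get(f, 0.0) * v for f, v in x.items())))


def sgd_eager(data, lr=0.1, lam=0.01):
    """Shrink every weight on every example (dense regularization)."""
    w, decay = {}, 1.0 - lr * lam
    for x, y in data:
        for f in w:
            w[f] *= decay
        p = _prob(w, x)
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + lr * (y - p) * v
    return w


def sgd_lazy(data, lr=0.1, lam=0.01):
    """Shrink a weight only when its feature shows up, catching up on skipped steps."""
    w, last, decay, t = {}, {}, 1.0 - lr * lam, 0
    for x, y in data:
        t += 1
        for f in x:
            if f in w:
                w[f] *= decay ** (t - last[f])
            last[f] = t  # the "record of sparseness"
        p = _prob(w, x)
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + lr * (y - p) * v
    for f in w:  # final catch-up so all weights are current
        w[f] *= decay ** (t - last[f])
    return w


def sgd_blocked(data, lr=0.1, lam=0.01, block=2):
    """Shrink all weights once per block; an approximation, with no bookkeeping."""
    w, decay = {}, 1.0 - lr * lam
    for t, (x, y) in enumerate(data, start=1):
        p = _prob(w, x)
        for f, v in x.items():
            w[f] = w.get(f, 0.0) + lr * (y - p) * v
        if t % block == 0:
            for f in w:
                w[f] *= decay ** block
    return w


# Toy sparse examples with labels in {0, 1}, made up for this sketch.
data = [({"a": 1.0, "b": 1.0}, 1), ({"c": 1.0}, 0), ({"a": 1.0}, 0),
        ({"b": 1.0, "c": 1.0}, 1), ({"d": 1.0}, 1)]
eager, lazy, blocked = sgd_eager(data), sgd_lazy(data), sgd_blocked(data)
```

In this sketch the eager and lazy versions agree up to floating-point rounding, while the blocked version is only close: it scores examples with weights that are slightly out of date, in exchange for dropping the `last` table entirely.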

Horses for Courses

The only possible conclusion is that it’s horses for courses. Each application needs its own implementation for optimal results.

As another example, I can estimate my annotation models using Gibbs sampling in BUGS in 20 or 30 minutes. I can estimate them in my custom Java program in a second or two. I only wrote the program because I needed to scale to a named-entity corpus and BUGS fell over. As yet another example, I’m finding coding up genomic data very similar to text data (edit distance, Bayesian models) and also very different (4-character languages, super-long strings).

Barry Richards, the department head of Cog Sci when I was a grad student, gave a career retrospective talk about all the work he did in optimizing planning systems. Turns out each application required a whole new kind of software. As a former logician, he found the lack of generality very dispiriting.

Not being able to resist one further idiom, this sounds like a probletunity to me.

Threading Blog Discussions

This post is a nice illustration of Fernando Pereira’s comment on Hal Daumé III’s blog about the problematic nature of threading discussions across blogs.

That Zeitgeist Again

I need to generalize my earlier post about the scientific zeitgeist to include coding tricks. The zeitgeist post didn’t even discuss my rediscoveries in the context of SGD!


Categories: Blogroll

Boosted Decision Trees for Deep Learning

Machine Learning Blog - Mon, 2010-08-23 13:18

About 4 years ago, I speculated that decision trees qualify as a deep learning algorithm because they can make decisions which are substantially nonlinear in the input representation. Ping Li has proved this correct empirically at UAI, by showing that boosted decision trees can beat deep belief networks on versions of MNIST which are artificially hardened so as to make them solvable only by deep learning algorithms.

This is an important point, because the ability to solve these sorts of problems is probably the best objective definition of a deep learning algorithm we have. I’m not that surprised. In my experience, if you can accept the computational drawbacks of a boosted decision tree, they can achieve pretty good performance.
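
As a tiny illustration of the "nonlinear in the input representation" point, and not of the method in the UAI paper, here is a sketch of AdaBoost over depth-one trees (stumps) on made-up one-dimensional data where the positive class is an interval. No single threshold separates the classes, but a weighted vote of three stumps does.

```python
import math


def stump_predict(stump, x):
    threshold, sign = stump
    return sign if x > threshold else -sign


def fit_stump(xs, ys, w):
    """Pick the threshold and polarity with the lowest weighted error."""
    pts = sorted(set(xs))
    thresholds = [pts[0] - 1.0] + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    best, best_err = None, float("inf")
    for th in thresholds:
        for sign in (+1, -1):
            err = sum(wi for xi, yi, wi in zip(xs, ys, w)
                      if stump_predict((th, sign), xi) != yi)
            if err < best_err:
                best, best_err = (th, sign), err
    return best, best_err


def adaboost(xs, ys, rounds):
    n = len(xs)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump, err = fit_stump(xs, ys, w)
        err = min(max(err, 1e-12), 1 - 1e-12)  # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Up-weight the examples this stump got wrong, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, xi))
             for xi, yi, wi in zip(xs, ys, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble


def predict(ensemble, x):
    score = sum(alpha * stump_predict(stump, x) for alpha, stump in ensemble)
    return 1 if score > 0 else -1


# Toy data: the positive class is the middle interval.
xs = [0.1, 0.2, 0.4, 0.5, 0.6, 0.8, 0.9]
ys = [-1, -1, 1, 1, 1, -1, -1]
single = adaboost(xs, ys, rounds=1)
boosted = adaboost(xs, ys, rounds=3)
single_correct = sum(predict(single, x) == y for x, y in zip(xs, ys))
boosted_correct = sum(predict(boosted, x) == y for x, y in zip(xs, ys))
```

The computational drawback is visible even here: every round scans all thresholds over all examples, and real trees over many features multiply that cost.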

Geoff Hinton once told me that the great thing about deep belief networks is that they work. I understand that Ping had very substantial difficulty in getting this published, so I hope some reviewers step up to the standard of valuing what works.

Categories: Blogroll