Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

Revealing the Impact of Super Bowl Advertising on Social Media

Information aesthetics - Fri, 2012-02-03 13:28


The interactive dashboard at Brandwatch Super Bowl [brandwatch.com] shows the true impact of the highly expensive advertising that is shown during the Super Bowl, in particular on social online media.

Each so-called 'worm' represents a unique sponsor (including brands like Pepsi, Mars, Walt Disney or H&M). The accompanying number stands for the number of tweets that were made about that brand or their products over the last 28 days (and, yes, the 'worm' who has possession of the ball is winning).

The additional display on the right reveals the complete ranking of all tracked brands, complete with time-based sparklines, the positive versus negative sentiment of the daily tweets, and the most popular keywords that were used.

As a result, one can already attempt to estimate what will be the most anticipated ads for this Sunday, in addition to their expected content.

Categories: Blogroll

Histograms — matplotlib vs. R

When possible, I like to use R for its really, really good statistical visualization capabilities. I’m doing a modeling project in Python right now (R is too slow, bad at large data, bad at structured data, etc.), and in comparison to base R, the matplotlib library is just painful. I wrote a toy Metropolis sampler for a triangle distribution and all I want to see is whether it looks like it’s working. For the same dataset, here are histograms with default settings. (Python: pylab.hist(d), R: hist(d))
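For readers who want data to feed into either histogram call, here is a minimal sketch of a toy Metropolis sampler of the kind described. The post's own code is not shown, so the details are assumptions: a triangular target on the integers 1 to 19 peaking at 10, a symmetric ±1 random-walk proposal, and the hypothetical function names `triangle_weight` and `metropolis`.

```python
import random

def triangle_weight(k, lo=1, hi=19):
    """Unnormalised triangular weight on the integers lo..hi, peaking at the midpoint."""
    if k < lo or k > hi:
        return 0
    mid = (lo + hi) // 2
    return (mid - lo + 1) - abs(k - mid)

def metropolis(n_samples, start=10, seed=0):
    """Random-walk Metropolis with a symmetric +/-1 proposal (an assumed design)."""
    rng = random.Random(seed)
    x = start
    samples = []
    for _ in range(n_samples):
        proposal = x + rng.choice((-1, 1))
        # The proposal is symmetric, so the acceptance ratio is target(new) / target(old).
        # Proposals outside 1..19 have weight 0 and are therefore always rejected.
        if rng.random() < triangle_weight(proposal) / triangle_weight(x):
            x = proposal
        samples.append(x)
    return samples

d = metropolis(50_000)
```

If the sampler works, a histogram of `d` should look like a triangle centred on 10, which is exactly the shape the plots are meant to confirm.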

I want to know whether my Metropolis sampler is working; those two plots give a very different idea. Of course, you could say this is an unfair comparison, since matplotlib is only using 10 bins, while R is using 18 here — and it’s always important to vary the bin size a few times when looking at histograms. But R’s defaults really are better: it actually uses an adaptive bin size, and the heuristic worked, choosing a reasonable number for the data. The hist() manual says it’s from Sturges (1926). It’s hard to find other computer software that cites 100 year old papers for its design decisions — and where it matters. (Old versions of R used to yell at you when you made a pie chart, citing perceptual studies that humans are really bad at interpreting them (here). This is what originally made me love R.)

Second, R is much smarter about breakpoints. In the following plots, I’ve manually set the number of bins to 10, and then 30 for each.

The second one is now OK for matplotlib — it’s good enough to figure out what’s going on — though still a little lame. Why the gaps?

The problem is that my data are discrete — they’re all integers from 1 through 19 — and I think matplotlib is naively carving up that range into bins, which sometimes lumps together two integers, and sometimes gets zero of them. I understand this is the simple naive implementation, and you could say it’s my fault that I shouldn’t have used the pylab histogram function for this type of data — but it’s really not as good as whatever R is doing, which works rather well here, and I didn’t have to waste time thinking about the internals of the algorithm. For reference, here is the correct visualization of the data (R: plot(table(d))). Note that R’s original Sturges breakpoints did make one error: the first two values got combined into one bin.
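The gap effect can be reproduced without any plotting library. The sketch below is not matplotlib's actual implementation. It is a plain equal-width binning, which is what the paragraph above guesses is happening, next to a per-integer count and Sturges' rule of ceil(log2(n)) + 1 bins. The data set `d` is a hypothetical stand-in with a perfect triangular shape, and all function names are made up for illustration.

```python
import math
from collections import Counter

def sturges_bin_count(n):
    """Sturges' rule: suggested number of bins for n observations."""
    return math.ceil(math.log2(n)) + 1

def equal_width_counts(values, bins):
    """Naive binning: split [min, max] into `bins` equal-width intervals."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        # The maximum value falls on the last edge, so clamp it into the final bin.
        i = min(int((v - lo) / width), bins - 1)
        counts[i] += 1
    return counts

def per_value_counts(values):
    """One count per integer value, like R's plot(table(d))."""
    c = Counter(values)
    return [c[k] for k in range(min(values), max(values) + 1)]

# Stand-in data: integer k in 1..19 appears 10 - |k - 10| times (100 values in total).
d = [k for k in range(1, 20) for _ in range(10 - abs(k - 10))]

print(sturges_bin_count(len(d)))   # suggested bin count for 100 observations
print(equal_width_counts(d, 30))   # 30 bins over 19 distinct values: some must be empty
print(per_value_counts(d))         # the clean triangle
```

With 30 bins and only 19 distinct values, at least 11 bins are necessarily empty, which is where the gaps come from. R's `hist()` documentation notes that the bin count is only a suggestion because the breakpoints are then set to "pretty" values, which is one reason its result can differ from a naive split.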

Lessons: (1) always vary the bin sizes for histograms, especially if you’re using naive breakpoint selection, and (2) don’t ignore a century’s worth of statistical research on these issues. And since it’s hard to learn a century’s worth of statistics, just use R, where they’ve compiled it in for you.

Categories: Blogroll

Revealing the Energy Consumption of Each Building in New York

Information aesthetics - Thu, 2012-02-02 14:16


The remarkably detailed map [columbia.edu] developed by the Modi Research Group of the Earth Institute at Columbia University reveals the total annual building energy consumption of New York, at both the block and 'taxlot' level (which is nearly at building level).

The map was built using MapBox. The total energy consumption is expressed in kilowatt hours (kWh) per square meter of land area. The data actually was not retrieved from utility companies, but calculated via an elaborate statistical model that is based on current large-scale estimates (e.g. the average energy use by ZIP code) in addition to lower-scale, estimated parameters (like the type and size of the building). Hovering over individual blocks or lots shows more detailed information, such as the type of energy being used, for which purpose (e.g. heating and cooling, electricity or hot water) and in what quantity.

More detailed information is available here. Via NYTimes Green, WSJ Blog and Co.Exist. Thanks, Adam!


Categories: Blogroll

Reuters Social Pulse

Data Mining Blog - Thu, 2012-02-02 13:57

Briefly - I just saw that Reuters launched a site which connects the world of journalism with that of the social web: Reuters Social Pulse. This has similarities with what I've been experimenting with at the d8taplex news page, which leverages bit.ly data and identifies over 500 Reuters journalists' Twitter profiles.

Categories: Blogroll

Comparing the Fundraising Performance of the US Presidential Candidates

Information aesthetics - Thu, 2012-02-02 10:17


The NYTimes released a competitive dashboard of sorts, titled "The 2012 Money Race: Compare the Candidates" [nytimes.com]. Basically, the interactive graphic allows readers to compare the fundraising performance of two presidential candidates side by side. Another recent graphic [nytimes.com] lists the hundreds of organizations and people that fund the so-called Super PACs that are officially not controlled by those very candidates.

As also explained in the accompanying press article, both infographics reveal how President Obama continues to outraise all of the candidates currently seeking the Republican nomination. It is also remarkable, however, that some checks seem to come from sources obscured from public view, like those with only a post office box for a headquarters, and no known employees.

If you would like to play around with the data yourself, you should be able to find it at the Federal Election Commission website (which, in fact, also publishes simple but interactive infographics of its own).

Categories: Blogroll

Automatic text analytics using DBpedia and PoolParty – A Live Demo

Semantic Web Company - Thu, 2012-02-02 06:22

Let me show you which steps have to be taken to generate a high-quality text mining application, ready to be used to annotate and to categorize any kind of text or documents covering nearly any domain. With our approach of thesaurus based text mining your documents can also be linked to the world of linked (open) data; enrich your documents with data from the LOD cloud!

Step 1. Generate a thesaurus by using a linked data source like DBpedia

As recently reported, SWC has developed a tool called SKOSsy which can be used to extract seed thesauri from DBpedia. In our example I will generate a knowledge model describing the domain of “digital photography”. This step took around 15 minutes.

Step 2. Load the thesaurus into PoolParty and improve it to your needs

After the seed thesaurus has been loaded into PoolParty Thesaurus Manager you have many possibilities to enhance the knowledge model further: Add more categories, synonyms, relations etc. In this example I use the seed-thesaurus without any further improvements. This step took approximately 2 minutes.

Step 3. Generate an automatic text extractor on top of your thesaurus

This step took a couple of seconds and produced a fast and reliable text mining application on top of PoolParty Extractor, ready to be used to enrich your documents with data from the LOD cloud.

You can try it out here: PPX Live-Demo

To try the extractor on your own, please take a look at the image above, which shows a proper configuration. You have to insert the following UUID in the form: d35d4ddb-adc3-4ea5-b027-deacac03e391

Since our example is all about ‘digital photography’, we recommend using text samples (or some fragments) like these to test the quality of PPX-based text analytics:

Let us know what you think about this straightforward approach and about the quality of the results. We believe that thesaurus-based text mining is in many cases an alternative to other approaches, especially if you want to enrich your content with information from the upcoming web of data.
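To make the idea of thesaurus-based annotation concrete, here is a minimal sketch of the general technique. It is not the PoolParty Extractor API and does not reproduce how PPX works internally. The tiny thesaurus, its concept names and labels, and the `annotate` function are all made up for illustration. Each concept carries a list of labels (preferred and alternative), and a text is tagged with every concept whose labels occur in it.

```python
import re

# Hypothetical seed thesaurus for "digital photography": concept -> labels.
THESAURUS = {
    "DSLR camera": ["dslr", "digital slr"],
    "Image sensor": ["image sensor", "ccd", "cmos"],
    "Aperture": ["aperture", "f-stop"],
}

def annotate(text, thesaurus):
    """Return {concept: number of label matches} for concepts found in the text."""
    hits = {}
    for concept, labels in thesaurus.items():
        # Match any label as a whole word or phrase, ignoring case.
        pattern = r"\b(?:" + "|".join(re.escape(label) for label in labels) + r")\b"
        count = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if count:
            hits[concept] = count
    return hits

sample = "My DSLR has a CMOS image sensor; a wide aperture helps in low light."
print(annotate(sample, THESAURUS))
```

In a linked data setting each concept would be identified by a URI rather than a plain string, which is what allows the annotated document to be connected to other data sets.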

Of course we would be happy to generate other demos in the areas of your interest! Just get in contact with us by using our contact form.

Categories: Blogroll

Some Tweet, Some Don't - at Reuters

Data Mining Blog - Thu, 2012-02-02 00:45

A minor update to the d8taplex news site adds:

  1. An updated set of journalist and editor author identities - now numbering over 500
  2. An indication of the number of followers each tweeting author has
  3. An indication as to whether the journalist has recently tweeted

The latter is indicated by a slightly augmented Twitter logo. Below, we can see that Dan Levine has recently tweeted.

Ultimately, I'd like to see if I can glean useful information from journalists' tweets relating to the stories they report on.

Categories: Blogroll

Google+ Is Like 401K For Search

Data Mining Blog - Tue, 2012-01-31 23:56

I'm trying to figure out which camera to go for: the GoPro or the Contour. When I searched on Google for a way to compare them, it suggested that I ask on Google+:

I then waited for someone to answer. I got nothing. Of course, I'm sure if I had the right set of connections I would have found a sweet expert opinion that could have helped with my decision and so, in a sense, it is my fault for not cultivating my network.

In the USA, a major part of the population's strategy for retirement is the 401K account. This is an account which provides the benefits of tax sheltered investment for retirement. The only catch is that it is up to you to manage your funds to ensure that you will have any money at all when you retire. There is nothing stopping you from making a bad bet and losing it all. This is in contrast to the more staid approach of entrusting your retirement to some sort of national scheme that guarantees a pension and puts the onus on the government to ensure it.

I'm not an expert in fund management.

If Google were smart, it would have some ability to predict if my connections had the remotest chance of producing an answer (if not, why suggest that I ask the question?). Or else, perhaps it could suggest to me someone directly to ask in Google+?

Finally, let's not forget the hilarity that ensues when one plays with this:



Categories: Blogroll

Spot: Visualizing Twitter Dynamics as Particles

Information aesthetics - Tue, 2012-01-31 14:18


Spot [neoformix.com] by Jeff Clark is a comprehensive real-time Twitter visualization that uses a particle metaphor to represent unique tweets.

According to a set of user-defined keywords, the visualization gathers and displays the latest 200 tweets. The corresponding particles are then organized in various spatial configurations to visually filter the available information by different parameters, such as commonality of words, time, people, or the actual tool that was used to send the tweet.

The tool also allows users to explore the tweets from a specific list, such as the "Top 100" of the datavis community, for instance.

More detailed information is available here.

See also Revisit and Digg Labs Swarm. Alternatively, check out Visual Digg Explorer by the same designer.

Categories: Blogroll

ICML Posters and Scope

Machine Learning Blog - Mon, 2012-01-30 23:21

Normally, I don’t indulge in posters for ICML, but this year is naturally an exception for me. If you want one, there are a small number left here, if you sign up before February.

It also seems worthwhile to give some sense of the scope and reviewing criteria for ICML for authors considering submitting papers. At ICML, the (very large) program committee does the reviewing which informs final decisions by area chairs on most papers. Program chairs set up the process, deal with exceptions or disagreements, and provide advice for the reviewing process. Providing advice is tricky (and easily misleading) because a conference is a community, and in the end the aggregate interests of the community determine the conference. Nevertheless, as a program chair this year it seems worthwhile to state the overall philosophy I have and what I plan to encourage (and occasionally discourage).

At the highest level, I believe ICML exists to further research into machine learning, which I generally think of as turning observations into useful predictions. Research is greatly varied in general, but in all cases it involves answering an interesting question for which the answer was not previously known. Interesting questions are generally natural: they can be stated easily and other people plausibly encounter them. Interesting questions are generally also ones for which there are multiple plausible wrong answers. The definition of “interesting” is otherwise hard to pin down, because it does and must change over time.

ICML is a broad conference which incorporates the interests of many different groups of people with different tastes in the research they prefer. It’s broad enough that most people don’t appreciate all the papers. That’s ok as long as there is some higher level appreciation for which directions of research benefit the community. Some common flavors are:

  1. ML for X In general, Machine Learning is a core field of study with many applications. Often, it’s a good idea to publish within a conference focused on that area, but particularly when no such conference exists, ICML is a solid choice for a place to publish. One example of this kind of thing is Machine Learning for Sustainability, where the CCC will be giving a few travel grants. Here the core question is typically “How?” Exhibiting new things that you can do with ML provides good reference points for what is possible, provides a sense of what works, and compelling new ideas about what to work on can be valuable to the community.

    There are several ways that papers of this sort can bounce. Perhaps X is insufficiently interesting, the results are unconvincing, or the method of solution is considered too straight-forward. I consider the first and second criteria sound, but am inclined toward leniency on the third, since there is often quite a bit of work in figuring out how to frame the problem so that the solution happens to be easy.

  2. New Algorithms Often, authors find that existing learning algorithms for solving some problem are lacking in some way, so they propose new better algorithms. This is plausibly the most common category of paper at ICML, so there is quite a bit of variety. The most straight-forward version proposes a new algorithm for a well-studied problem. For these papers it’s important to have an empirical comparison to existing baselines.

    It’s easy for an empirical comparison to go wrong. Some authors use synthetic datasets which do not seem significant to me, because good results on such datasets may not transfer to real-world problems well as the real world tends to be quite a bit more complex than the synthetic processes which are natural to program. Instead, it’s important to show good results on real datasets. One problem with relying on real datasets is dataset selection—choosing the dataset for which your algorithm seems to perform best. You can avoid this by choosing datasets in some clearly unbiased manner and by evaluating on many standard datasets. Another way to fail is with a poor choice of baseline. This is tricky, because three reviewers might consider three different baselines the most natural one. Asking around a bit when developing the paper might help here, but in the end this can be a tough judgement call: Is the paper convincing enough that people interested in solving the problem should use this algorithm?

    Another class of new algorithms papers is new algorithms for new areas of machine learning, blending into the previous category. Here, there typically are relatively few (perhaps just one) dataset available and there may be no (or only implausibly bad) baselines. For papers like this, one way I’ve seen difficulties is when authors are very invested in a particular approach to solving the problem. If you have defined the problem too narrowly, broadening the definition of the problem can help you see appropriate baselines. Another difficulty I’ve observed is reviewers used to the well-studied problems reject an interesting paper because (essentially) they assume that the authors left out a good baseline which does not exist. To prevent the first, authors who ask around might get some valuable early feedback. For the second, it’s a difficulty we are aware of and will consider asking reviewers to judge on the merits of ML for X.

  3. Algorithmic studies A relatively rare but potentially valuable form of paper is an algorithmic study. Here, the authors do not propose a new algorithm, but instead do a comprehensive empirical comparison of different algorithms. The standards here are quite high—the empirical comparison needs to be first-class to convince people, so the empirical comparison comments under new algorithms apply strongly.
  4. New Theory Good theory can enlighten us about what is (or might be) possible. It can also help us build robust learning algorithms, where we design learning algorithms so that they provably solve some large class of problems. I am personally most interested in theory that helps us design new learning algorithms, but broadly interested in what is possible. I’m most interested in the question answered, while the means (and language) should only be as complex as necessary so the theory can be understood as widely as possible.

    In many areas of CS theory, double blind reviewing is rare, so theory-oriented people may be unfamiliar with it. An important consequence is that complete proofs must be included either in the paper or supplemental material so that proof checking is fully feasible.

    Another way that I’ve seen theory papers run into trouble is when it is a post-hoc justification for an algorithm. In essence, authors who choose to analyze an existing algorithm are sometimes forced to make many unnatural assumptions for the theory to be correct. There generally isn’t an easy fix if you arrive at this point.

  5. n of the above It is common for ICML papers to be multicategory. At the extreme, you might have a new algorithm which solves a new X well, empirically and theoretically. Reviewers can fall into a trap where they are most interested in 1 of the 4 questions answered above, and find 1/4 of the paper devoted to their question relatively weak compared to the paper that devotes all the pages to the same question.

    We are aware of this, and will encourage it to be taken into account.

  6. The exception The set of papers I expect to see at ICML is more diverse than the above—there are often exceptions of one sort or another. For these exceptions, it often becomes a judgment call: Does this paper significantly further research into machine learning? Papers with little potential audience probably don’t while fun/interesting/useful things that we didn’t think of do.

Further comments or questions are welcome.

Categories: Blogroll

Visualizing the Demographic Reach of Yahoo! Homepage Stories

Information aesthetics - Mon, 2012-01-30 09:44


Yahoo! recently released a dedicated data visualization website [yahoo.com] to highlight their Content Optimization and Relevance Engine (C.O.R.E.), a service that aims to personalize the Yahoo! experience depending on a collection of demographic (e.g. gender and age) and geographic (e.g. cities) variables, in combination with personal interests (e.g. Finance, Sports, Health).

The visualization, developed by Periscopic, allows users to explore the relevant content of the homepage according to a set of user-selected parameters, so that one can answer questions like: "What is the most popular story for females between 18 and 24?". The content shown reaches back up to 24 hours. Each separate story can be further analyzed in terms of popularity over time, plus the demographic reach among the audience that actually clicked on it. Notably, the floating 3D globe of particles consists of news stories.

Via @johnmaeda.

See also Yahoo Visualizes Real-Time Email Subject Line Keywords and Destinations.

Categories: Blogroll

Google Provides Links to Journalists' G+ Profile

Data Mining Blog - Sun, 2012-01-29 11:49

Google News - which aggregates, clusters and ranks news articles from many sources - recently started including links to the Google+ profiles of journalists.

Here we see an example of a link to Shayndi Raice who, with Randall Smith, wrote this article on the Facebook IPO published by the Wall Street Journal.

However, the recall of the links is very low and, in fact, this link to Shayndi Raice is the only one currently visible on the Google News front page.


There are a number of sites which track and accumulate social identities for journalists. MuckRack, for example, has around 160 twitter identities for Reuters journalists - including Felix Salmon who has around 50k followers - and many more profiles for the Wall Street Journal.

Over on the d8taplex news page, I've included twitter identities for Reuters contributors. I started out by simply using the twitter identities made available from the Reuters pages of the journalists. However, curiosity got the better of me and I ended up mining twitter for journalists. This has so far yielded over 500 Reuters journalists' twitter profiles (not all of which are currently available on the site, note).

In looking at these accounts, a large number are not active at all on twitter, while others, like Sam Youngman, provide commentary and even photos associated with the area of news that they cover.

Categories: Blogroll

Facebook Likes Burgers

Data Mining Blog - Sat, 2012-01-28 21:48

I found this data mining error on Silo Breaker somehow amusing:


The article does mention McDonalds in comparison to Facebook, but Silo Breaker didn't figure out that the article wasn't about McDonalds per se.

Categories: Blogroll

Why COLT?

Machine Learning Blog - Sat, 2012-01-28 20:01

By Shie and Nati

Following John’s advertisement for submitting to ICML, we thought it appropriate to highlight the advantages of COLT, and the reasons it is often the best place for theory papers. We would like to emphasize that we both respect ICML, and are active in ICML, both as authors and as area chairs, and certainly are not arguing that ICML is a bad place for your papers. For many papers, ICML is the best venue. But for many theory papers, COLT is a better and more appropriate place.

Why should you submit to COLT?

By-and-large, theory papers go to COLT. This is the tradition of the field and most theory papers are sent to COLT. This is the place to present your ground-breaking theorems and new models that will shape the theory of machine learning. COLT is more focused than ICML, with a single-track session. Unlike ICML, the norm in COLT is for people to sit through most sessions, and hear most of the talks presented. There is also often a lively discussion following paper presentations. If you want theory people to know of your work, you should submit to COLT.

Additionally, this year COLT and ICML are tightly co-located, with joint plenary sessions (i.e. some COLT papers will be presented in a plenary session to the entire combined COLT/ICML audience, as will some ICML papers), and many other opportunities for exposure to the wider ICML audience. And so, by submitting to COLT, you have the potential of reaching both the captive theory audience at COLT and the wider ML audience at ICML.

The advantages of sending to COLT:

  1. Rigorous review process.

    The COLT program committee is comprised entirely of established, mostly fairly senior, researchers. Program committee members read and review papers themselves, or potentially use a sub-reviewer that they know personally and carefully select for the paper, but still check and maintain responsibility for the review. Your paper will get reviewed by at least three program committee members, who will likely be experts on the topics covered by the paper. This is in contrast to ICML (and most other ML conferences) where area chairs (of similar seniority to the COLT program committee) only manage the review process, but reviewers are assigned based on load-balancing considerations and the primary reviewing is done by a very wide set of reviewers, frequently students, who are often not the most relevant experts.

    COLT reviews are typically detailed and technical details are checked. The reviewing process is less rushed and program committee members (and sub-reviewers where appropriate) are expected to do a careful job on each and every paper.

    All papers are then discussed by the program committee, and there are generally significant and meaningful discussions on papers. This also means the COLT reviewing process is far from having a “single point of failure”, as the paper will be carefully considered and argued for by multiple (senior) program committee members. We believe this yields a more consistently high quality program, with much less randomness in the paper selection process, which in turn translates to high respect for accepted COLT papers.

  2. COLT is not double blind, but also not exactly single blind. Program committee members have access to the author identities (as do area chairs in ICML), as this is essential in order to select sub-reviewers. However, the author names do not appear on the papers, both in order to reduce the effect of first impressions, and to allow program committee members to utilize reviewers who are truly blind to the authors’ identities.

    It should be noted that the COLT anonymization guidelines are a bit more relaxed, which we hope makes it easier to create an anonymized version for conference submission (authors are still allowed, and even encouraged, to post their papers online, with their names on them of course).

  3. COLT does not have a dedicated rebuttal phase. Frankly, with the higher quality, less random, reviews, we feel it is not needed, and the hassle to authors and program committee members is not worth it. However, the tradition in COLT, which we plan to follow, is to contact authors as needed during the review and discussion process to ask for clarification on issues that came up during review. In particular, if a concern is raised on the soundness or other technical aspect of a paper, the authors will be contacted to give them a chance to set things straight. But no, there is no generic author response where authors can argue and plead for acceptance.
Categories: Blogroll

YapMap: Breck’s Fun New Project to Improve Search

LingPipe Blog - Fri, 2012-01-27 21:20
I have signed on as chief scientist at YapMap. It is a part time position that grew out of me being on their advisory board for the past 3 years. Try the search interface for the forums below: Automotive Forums YapMap search for Low Carb Breakfast on Diabetes Daily A screen shot of the interface: [...]
Categories: Blogroll

More quick links

Greg Linden's Blog - Fri, 2012-01-27 19:29
More of what has caught my attention lately:
  • Laptops with Kinect sensors are coming. Worth paying attention to, gesturing in air to issue commands, a very different UX could be built on top of this ([1] [2])

  • "Each streaming subscriber is worth only $2.40 in profit each quarter to Netflix, compared to $17.32 for each DVD subscriber. The old business was very lucrative. The new business kind of sucks." ([1])

  • "You're not going to get content owners to license ... for less than what they get from the cable companies ... [if you will] use that cheap content to destroy the cable companies' business model." ([1])

  • "Federal officials approached Google with evidence of its employees' wrongdoing ... Google agreed to pay $500 million to ... ward off criminal charges against the company." ([1])

  • Google is spending nearly $1B every quarter buying new servers and data centers. That buys a lot of machines. ([1] [2])

  • Education startups are suddenly very, very hot. ([1] [2] [3] [4])

  • "Tiny directional antennas at the top of each rack ... send and receive data. A central controller monitors traffic patterns, finds network bottlenecks, configures the antennas and turns on the wireless links when more bandwidth is required ... The design sped up traffic by at least 45 percent." ([1])

  • "Wimpy cores are fine, but if you go down to the wimpiest range, your gains really have to be enormous if you want to consider all the aggravation -- and the hit to their productivity -- that your software engineers face." ([1])

  • A Facebook engineer explains why it is actually the right thing for Facebook to produce buggy code ([1])

  • "How sex, bombs, and burgers shaped our world" ([1])

  • "There is a monolithic view that this generation of technology I.P.O.'s is completely broken." ([1])

  • Just three engineers built and run Instagram, which has 14 million users, 150 million photos, several terabytes of data, and hundreds of machines. ([1] [2])

  • Startup founders "say that if they'd known when they were starting their company about the obstacles they'd have to overcome, they might never have started it." ([1])

  • Two 17-year-olds used a weather balloon to send a little Lego astronaut and a video camera 15 miles into the stratosphere. Very fun. ([1])
Categories: Blogroll

Bitly Users Vote with their Clicks on Vatican Scandal

Data Mining Blog - Thu, 2012-01-26 21:37

By far the biggest story right now according to clicks on the bit.ly links to articles published by Reuters (as shown on d8taplex) is:


This story doesn't make it on to Google News front page or even their page of world news. It's not on the BBC's front page or in its European news section. Could be quite simply due to the fact that Reuters got the news late and everyone has moved on, but I see an article on HuffPo on this (from the same source) posted 3 hours ago.

 

Categories: Blogroll

The State of the Union Address 2012 - Infographically Enhanced

Information aesthetics - Wed, 2012-01-25 16:41


Similar to the original approach in 2011, this year's State of the Union was made available online in a so-called 'enhanced' version, which basically consisted of a split-screen video that shows President Obama giving his speech on one side, and a large collection of contextual information and facts, as well as infographics, on the other. In other words: 1 hour and 5 minutes worth of high-level political facts, captured in 102 unique slides, of which about 26 can be labelled as visualization of some kind.

You can watch the 'enhanced' version of the State of the Union 2012 below or at the White House website [whitehouse.gov] itself.

Importantly, there are already some critical reviews available, about this year's infographics [thewhyaxis.info], but also about those of last year [fastfedora.com]. These reviews are quite worthwhile to read, as they point out some potential discrepancies in terms of the graphical representation, and the narrative it tries to convey.

See also Obama Presidency by Numbers: Contrasting Statements with Statistics and Obama Loves Infographics.

Categories: Blogroll

Time Maps: Morphing a Country According to its Travel Time

Information aesthetics - Wed, 2012-01-25 15:30


In the current age, we tend to think in time rather than absolute distance when estimating our itineraries. Accordingly, the beautiful Timemaps [timemaps.nl] by Vincent Meertens of Graph[s]ic shows the required travel times within The Netherlands by public transportation through morphing its silhouette along a colorful, circular time measure.

Users are able to select any train station location (by clicking inside the map), and time of day (via a slider). As a result, the map will expand at night, and shrink in the morning due to the availability of trains. The color coding corresponds to the number of hours (see legend below the map). If all goes well, the map should even be made available on iOS and Android.

See also Travel Time Tube Map and Worldmapper.

Via @ajdant.

Categories: Blogroll

Text Analytics for Telecommunications - Part 1

Life Analytics Blog - Wed, 2012-01-25 06:37
As discussed in the previous post, performing Text Analytics for a language for which no tools exist is not an easy task. The Case Study which I will present at the 9th European Text Analytics Summit is about analyzing and understanding thousands of non-English Facebook posts and Tweets for Telco Brands and their Topics, leading to what is known as Competitive Intelligence.

The Telcos used for the Case Study are Telenor, MT:S and VIP Mobile, which are located in Serbia. The analysis aims to identify the perception Customers have of each of the three Companies mentioned and to understand the Positive and Negative elements of each Telco as captured from the Voice of the Customers - Subscribers.

By analyzing several thousands of Tweets and Facebook posts and comments we can have a first glimpse of Competitive Intelligence. For example, when we wish to identify which words frequently occur with mentions of postpaid packages, this is what we find:




Red boxes show Telco Brands - notice "mts" and "mtsa", which point to the same Telco, namely mt:s. Blue boxes indicate similar words that should be merged. From a first look at the results above we see that:
a) mt:s is found more frequently when users mention PostPaid packages.

b) Telenor and VIP Mobile are not found as frequently as MT:S in PostPaid package conversations.

c) We see several problems from insufficient pre-processing: Kredit and Kredita (=credit) should merge into one word, and the same applies to telefona - telefon, internet - interneta and mts - mtsa.



Notice that we can perform the same High-level analysis for several Telco Topics such as Network, Billing, Customer Care, Promotions, Questions of subscribers and so on. The next task is to identify the reason(s) why MT:S was found to have more mentions about PostPaid packages. Note that at this point we do not know why this is so: it could be that MT:S prices of prepaid packages are high or very cheap, or that something else is happening that needs to be identified.


The Serbian Language poses extra work because it is a highly inflected language: even the endings of Brand names change according to usage. Consider the following:

U mts-u (at mts)
Sa mts-om (With mts)
Bez mts-a (Without mts)


It is evident that a highly inflected language explodes our feature space, and for this reason R can come to the rescue with some success. We can use R for changing several synonyms to one word, removing (Serbian) stop words, removing URLs and performing several other pre-processing steps that are necessary prior to an extensive analysis. More in the next post.
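The pre-processing steps listed above can be sketched in a few lines. The post itself uses R, so the following is a Python illustration of the same idea and not the author's code. The variant mapping is taken from the word forms mentioned in this post. The stop-word list, the sample posts and the `normalize` function are hypothetical.

```python
import re
from collections import Counter

URL_RE = re.compile(r"https?://\S+")

# Inflected or variant forms mapped to one canonical token (forms taken from the post).
CANONICAL = {
    "mtsa": "mts", "mts-u": "mts", "mts-om": "mts", "mts-a": "mts", "mt:s": "mts",
    "kredita": "kredit", "telefona": "telefon", "interneta": "internet",
}

# Illustrative stop words only; a real Serbian list would be much longer.
STOP_WORDS = {"u", "sa", "bez"}

def normalize(post):
    """Lowercase, drop URLs, strip punctuation, merge variants, remove stop words."""
    text = URL_RE.sub(" ", post.lower())
    tokens = [t.strip(".,!?;()\"'") for t in text.split()]
    tokens = [CANONICAL.get(t, t) for t in tokens if t]
    return [t for t in tokens if t not in STOP_WORDS]

# Made-up posts that only reuse the word forms discussed above.
posts = [
    "Bez mts-a, sa MTS-om! http://example.com/x Kredita",
    "U mts-u internet, interneta, telefona",
]
counts = Counter(token for post in posts for token in normalize(post))
print(counts)
```

After normalization the different surface forms of the brand collapse into a single feature, which is the reduction of the feature space the paragraph above refers to. A dictionary like `CANONICAL` only covers forms that have been listed by hand, so a stemmer or lemmatizer would be needed for broader coverage.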
Categories: Blogroll