Semantic Software Lab
Concordia University
Montréal, Canada


A World of Terror: the Impact of Terror in the World

Information aesthetics - Thu, 2014-08-28 13:11

A World of Terror by Periscopic shows the reach, frequency and impact of about 25 terrorist groups around the world.

The visualization consists of 25 smartly organized pixel plots that are displayed as ordered small multiples. Ranging from Al-Qa'ida and the Taliban to lesser-known organizations like Boko Haram, the plots reveal which groups are deadlier, more recently active, or historically more active. In addition, all data can be filtered over time.

The data is based on the Global Terrorism Database (GTD), the most comprehensive and open-source collection of terrorism data available.

Categories: Blogroll

Metrico: a Puzzle Action Game based on Infographics

Information aesthetics - Wed, 2014-08-27 06:57

Metrico, designed by Dutch game design studio Digital Dreams, is a recently released video game for the PlayStation Vita.

Described as an "atmospheric puzzle action game with a mindset of its own", its visual style has been completely based on the world of infographics. In essence, the concept of infographics seems to work as a gameplay environment not just because of its pretty aesthetics, but also because of its natural interaction with (visual) data.

Consequently, in Metrico, each action is quantified and explicitly shown, such as the number of times an avatar needs to jump up and down or shoot a projectile. Metrico's goal is thus similar to that of most infographics: enticing users to make sense of a complex system.

Via Wired.

Categories: Blogroll

The Many Factors Influencing Breast Cancer Incidence

Information aesthetics - Mon, 2014-08-25 12:47

A Model of Breast Cancer Causation, designed by 'do good with data' visualization studio Periscopic, illustrates many of the factors that can lead to breast cancer and how they may interact with others.

The interactive circos graph is meant to demonstrate the complexity of breast cancer causation, both to educate the general public and possibly to stimulate new scientific research in this direction. Users can explore the different influencing factors by domain, predicted correlation strength, and the quality of the evidence behind them.

See also:
- Health InfoScape: Illustrating the Relationships between Disease Conditions
- Visualizing the Major Health Issues Facing Americans Today

Categories: Blogroll

How we Sleep (and How we Awake after an Earthquake)

Information aesthetics - Mon, 2014-08-25 12:31

Since we already know at what angle people hold their face when taking a selfie in different cities, we now also know how they sleep differently: Which Cities Get the Most Sleep? by interactive graphics editor Stuart A. Thompson of the Wall Street Journal compares the sleeping habits of citizens of different cities.

On the topic of sleep, Jawbone also just released an interesting graph revealing how the recent Napa earthquake affected the sleep of local residents. Indeed, the distance to the epicenter seems to correlate with the number of people who awoke, and with the time it took them to get back to sleep.

As the visualizations are based on a vast dataset released by Jawbone, the makers of a digitized wristband that tracks motion and sleep behavior, the data is not necessarily representative of the general population.

Categories: Blogroll

In Out, In Out, Shake It All About

Code from an English Coffee Drinker - Sat, 2014-08-23 05:04
In a very abstract sense, text analysis can be divided into three main tasks: load some text, process it, export the result. Out of the box GATE (both the GUI and the API) provides excellent support for both loading documents and processing them, but until now we haven't provided many options when it comes to exporting processed documents.

Traditionally GATE has provided two methods of exporting processed documents: a lossless XML format that can be reloaded into GATE but is rather verbose, or the "save preserving format" option which essentially outputs XML representing the original document (i.e. the annotations in the Original markups set) plus the annotations generated by your application. Neither of these options was particularly useful if you wanted to pass the output on to some other process and, without a standard export API, this left people having to write custom processing resources just to export their results.

To try and improve the support for exporting documents, recent nightly builds of GATE now include a common export API in the gate.DocumentExporter class. Before we go any further it is worth mentioning that this code is in a nightly build so is subject to change before the next release of GATE. Having said that, I have now used it to implement exporters for a number of different formats so I don't expect the API to change drastically.

If you are a GATE user, rather than a software developer, then all you need to know is that an exporter is very similar to the existing idea of document formats. This means that they are CREOLE resources and so new exporters are made available by loading a plugin. Once an exporter has been loaded it will be added to the "Save as..." menu of both documents and corpora, and by default exporters for GATE XML and Inline XML (i.e. the old "Save preserving format") are provided even when no plugins have been loaded.

If you are a developer and want to make use of an existing exporter, then hopefully the API should be easy to use. For example, to get hold of the exporter for GATE XML and to write a document to a file the following two lines will suffice:
DocumentExporter exporter =

exporter.export(document, file);

There is also a three-argument form of the export method that takes a FeatureMap that can be used to configure an exporter. For example, the annotation types the Inline XML exporter saves are configured this way. The possible configuration options for an exporter should be contained in its documentation, but possibly the easiest way to see how it can be configured is to try it from the GUI.

If you are a developer and want to add a new export format to GATE, then this is fairly straightforward; if you already know how to produce other GATE resources then it should be really easy. Essentially you need to extend gate.DocumentExporter to provide an implementation of its one abstract method. A simple example showing an exporter for GATE XML is given below:
import gate.Document;
import gate.DocumentExporter;
import gate.FeatureMap;
import gate.corpora.DocumentStaxUtils;
import gate.creole.metadata.AutoInstance;
import gate.creole.metadata.CreoleResource;
import java.io.IOException;
import java.io.OutputStream;
import javax.xml.stream.XMLStreamException;

@CreoleResource(name = "GATE XML Exporter",
    tool = true, autoinstances = @AutoInstance, icon = "GATEXML")
public class GateXMLExporter extends DocumentExporter {

  public GateXMLExporter() {
    // exporter name, default file extension, MIME type
    super("GATE XML", "xml", "text/xml");
  }

  public void export(Document doc, OutputStream out, FeatureMap options)
      throws IOException {
    try {
      DocumentStaxUtils.writeDocument(doc, out, "");
    } catch(XMLStreamException e) {
      throw new IOException(e);
    }
  }
}

As I said earlier, this API is still a work in progress and won't be frozen until the next release of GATE, but the current nightly build now contains export support for Fast Infoset compressed XML (I've talked about this before), JSON inspired by the format Twitter uses, and HTML5 Microdata (an updated version of the code I discussed before). A number of other exporters are also under development and will hopefully be made available shortly.

Hopefully if you use GATE you will find this new support useful and please do let us have any feedback you might have so we can improve the support before the next release when the API will be frozen.
Categories: Blogroll

The Feltron Annual Report of 2013 on Communication

Information aesthetics - Thu, 2014-08-21 10:29

Each year, Nicholas Felton releases a personal annual report, and the one for 2013 was just released. These reports always stand out because of their immense sense of data-centric detail and an always original infographic style.

This year, the report focuses on communication data, as it aspires to uncover patterns and insights within a large collection of tracked conversations, SMS, telephone calls, email, Facebook messages and even physical mail.

See also the annual reports of:
- 2012
- 2010 and 2011
- 2010 (about his father's life)
- 2009
- 2008
- 2007
- 2006
- 2005

Categories: Blogroll

US Domestic Migration Charted as Ordered Stacked Area Graphs

Information aesthetics - Mon, 2014-08-18 16:06

The interactive infographic Where We Came From, State by State by Gregor Aisch, Robert Gebeloff and Kevin Quealy reveals how US citizens have moved between different US states since the year 1900.

The migration data is based on Census data, which was used to compare the state of residence versus the state of birth of a representative sample of Census forms. The visualization technique resembles that of organically shaped, stacked area graphs, also known as stream graphs or ThemeRiver.

See also:
- Ebb and Flow of Movies
- lastgraph
- 2008 Movie Box Revenue
- What People in Tokyo are Doing on a Tuesday
- Memetracker: Tracking News Phrases over the Web
- DailyRadar TrendMap: Interactive Stacked Line Graph of Popular Trends
- Twitter Activity during the 2012 European Football Tournament

Categories: Blogroll

oneSecond: Printing Every Tweet Created During a Single Second

Information aesthetics - Wed, 2014-08-13 12:36

#oneSecond by graphic design student Philipp Adrian aggregates all the tweets sent at exactly 14:47:36 GMT on 9 November 2012.

The 5522 Twitter messages are categorized and ordered in 4 different books. Every user is part of each book but, depending on the categorization, their position within the book changes.

Accordingly, the book "My Message is..." contains the content of each message, ordered by its language. The size and order of the tweet is derived from the number of followers (recipients).

The book "My Color is..." shows each user's Twitter account color, ordered by the timezone the tweet was sent in.

The book "My Description is..." shows how each user describes himself on his profile, of which the size and order is derived from the Klout score.

Finally, the book "My Name is..." lists the avatar that each user chose to represent him or herself, ordered by the number of tweets the user sent.

Categories: Blogroll

Charting Culture: 2000 Years of Cultural History in 5 Minutes

Information aesthetics - Tue, 2014-08-12 14:59

Charting Culture shows the geographical movements of over 120,000 individuals who were notable enough in their lifetimes that the dates and locations of their births and deaths were recorded.

The animation commences around 600 BC and ends in 2012, and tracks the lives of people like Leonardo da Vinci or Jett Travolta -- son of the actor John Travolta. It presents each person's birth place as a blue dot and their death as a red dot. Developed by Mauro Martino, research manager of the Cognitive Visualization Lab in IBM's Watson Group, the animated map is based on data retrieved from the Google-owned knowledge base, Freebase, a community-curated database of well-known people, places, and things.

More scientific information can be found in the Science paper "A network framework of cultural history", which was spearheaded by Maximilian Schich and his team.


Categories: Blogroll

Amsterdam City Dashboard: a City as Urban Statistics

Information aesthetics - Mon, 2014-08-11 11:37

Amsterdam City Dashboard presents the city of Amsterdam through the lens of data, including demographic statistics, traffic reports, noise readings or political messages.

The small collection of information graphics is divided into distinct domains, such as transport, environment, statistics, economy, social, cultural and security. All data is shown in near real-time, based on blocks of 24 hours. Larger dots and darker colors symbolize higher values, whereas an interactive map provides a geographic reference.

Based on the Linked Data API from the CitySDK project, this dashboard should be easily transferable to the data repositories from other cities.

See also City Dashboard: Aggregating All Spatial Data for Cities in the UK.

Categories: Blogroll

Quick links

Greg Linden's Blog - Tue, 2014-08-05 12:29
What caught my attention lately:
  • Great idea for walking directions: "At times, we do not [want] the fastest route ... When walking, we generally prefer tiny streets with trees over large avenues with cars ... [We] suggest routes that are not only short but also emotionally pleasant." ([1] [2] [3])

  • Cool idea for a drone that autonomously flies a small distance above and behind you while filming in HD ([1] [2])

  • "OkCupid doesn’t really know what it’s doing. Neither does any other website. It’s not like people have been building these things for very long, or you can go look up a blueprint or something. Most ideas are bad. Even good ideas could be better. Experiments are how you sort all this out." ([1] [2])

  • "Amazon’s cloud revenue now runs almost on par with VMware (VMW), which posted revenue of $5.2 billion last year" ([1])

  • Walmart is getting more aggressive about competing with Amazon on personalization and recommendations ([1])

  • It's important to realize that Amazon could have been a small bookstore on the Web ([1])

  • A lot of us thought the Amazon logo was phallic when it was introduced (worse, it was animated and actually grew from left-to-right). Remarkably, it's lived on for 14 years now. ([1])

  • A big problem with layoffs is not only do you lose some of the people you intended to lay off, but also some of your best employees will pick that time to leave. People with good options won't wait around to experience the chaos and fear; they'll just leave. ([1])

  • "A brand-name USB stick [claims to be] a computer keyboard [device] ... [and then] opens a command window on an attached computer and enters commands that cause it to download and install malicious software." ([1])

  • Financial services and poor computer security: "Our assumption was that, generally speaking, the financial sector had its act together much more" ([1] [2])

  • "NSA employees [were] passing around nude photos that were intercepted in the course of their daily work" ([1] [2])

  • Google Cloud googler says, "It should always be cheaper to run in the cloud no matter what your workload" but that the pricing isn't there yet ([1])

  • Details on Google's remarkably large and fast data warehouse ([1] [2])

  • Cool augmented reality game intended to be played as a passenger in a moving car that creates the terrain and enemies you see in the game based on the stores and buildings around you in the real world ([1])

  • "Astronomers of the 2020s will be swimming in petabytes of data streaming from space and the ground ... [such as] a 3,200-megapixel camera, which will produce an image of the entire sky every few days and over 10 years will produce a movie of the universe, swamping astronomers with data that will enable them to spot everything that moves or blinks in the heavens, including asteroids and supernova explosions." ([1])

  • Data are or data is: "'datum' isn't a word we ever use. So it makes no sense to use the plural when the singular doesn't exist." ([1])

  • The "If Google was a guy" series from CollegeHumor is hilarious (but probably NSFW) ([1] [2] [3])

  • Funny Dilbert comics on a Turing test for management ([1] [2])

  • Cathartic Xkcd comic on defending your thesis ([1])
Categories: Blogroll

Open Machine Learning Workshop, August 22

Machine Learning Blog - Sat, 2014-07-26 09:14

On August 22, we are planning to have an Open Machine Learning Workshop at MSR, New York City taking advantage of CJ Lin and others in town for KDD.

If you are interested, please email msrnycrsvp at and say “I want to come” so we can get a count of attendees for refreshments.

Categories: Blogroll

From Taxonomies over Ontologies to Knowledge Graphs

Semantic Web Company - Tue, 2014-07-15 04:57

With the rise of linked data and the semantic web, concepts and terms like ‘ontology’, ‘vocabulary’, ‘thesaurus’ or ‘taxonomy’ are being picked up frequently by information managers, search engine specialists or data engineers to describe ‘knowledge models’ in general. In many cases the terms are used without any specific meaning which brings a lot of people to the basic question:

What are the differences between a taxonomy, a thesaurus, an ontology and a knowledge graph?

This article should shed light on this discussion by guiding you through an example which starts off from a taxonomy, introduces an ontology and finally exposes a knowledge graph (linked data graph) to be used as the basis for semantic applications.

1. Taxonomies and thesauri

Taxonomies and thesauri are closely related species of controlled vocabularies to describe relations between concepts and their labels including synonyms, most often in various languages. Such structures can be used as a basis for domain-specific entity extraction or text categorization services. Here is an example of a taxonomy created with PoolParty Thesaurus Server which is about the Apollo programme:

The nodes of a taxonomy represent various types of ‘things’ (so-called ‘resources’): The topmost level (orange) is the root node of the taxonomy, purple nodes are so-called ‘concept schemes’ followed by ‘top concepts’ (dark green) and ordinary ‘concepts’ (light green). In 2009 W3C introduced the Simple Knowledge Organization System (SKOS) as a standard for the creation and publication of taxonomies and thesauri. The SKOS ontology comprises only a few classes and properties. The most important types of resources are: Concept, ConceptScheme and Collection. Hierarchical relations between concepts are ‘broader’ and its inverse ‘narrower’. Thesauri most often cover also non-hierarchical relations between concepts like the symmetric property ‘related’. Every concept has at least one ‘preferred label’ and can have numerous synonyms (‘alternative labels’). Whereas a taxonomy could be envisaged as a tree, thesauri most often have polyhierarchies: a concept can be the child-node of more than one node. A thesaurus should be envisaged rather as a network (graph) of nodes than a simple tree by including polyhierarchical and also non-hierarchical relations between concepts.
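
To make these relations concrete, here is a minimal in-memory sketch in Java. It is not the SKOS standard itself and not PoolParty's API: the class SkosSketch, its methods and the example labels ("Moon landings", "Apollo XI") are assumptions made up for illustration. It only shows that ‘broader’/‘narrower’ are inverses, that ‘related’ is symmetric, and that a concept with two parents turns the tree into a graph.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

// Illustrative in-memory model of SKOS-style concepts (not a real SKOS library).
final class SkosSketch {

    // A concept with one preferred label, optional synonyms and its relations.
    static final class Concept {
        final String prefLabel;
        final Set<String> altLabels = new LinkedHashSet<>();
        final Set<Concept> broader = new LinkedHashSet<>();
        final Set<Concept> narrower = new LinkedHashSet<>();
        final Set<Concept> related = new LinkedHashSet<>();

        Concept(String prefLabel) {
            this.prefLabel = prefLabel;
        }
    }

    // 'broader' and 'narrower' are inverses, so both directions are recorded.
    static void addBroader(Concept child, Concept parent) {
        child.broader.add(parent);
        parent.narrower.add(child);
    }

    // 'related' is symmetric, so it is stored on both concepts.
    static void addRelated(Concept a, Concept b) {
        a.related.add(b);
        b.related.add(a);
    }

    // More than one parent means the structure is a graph, not a tree.
    static boolean isPolyhierarchical(Concept c) {
        return c.broader.size() > 1;
    }

    // Builds a tiny Apollo-themed thesaurus; all labels are made up for this sketch.
    static Map<String, Concept> buildExample() {
        Map<String, Concept> concepts = new LinkedHashMap<>();
        for (String label : new String[] {
                "Apollo missions", "Moon landings", "Apollo 11", "Neil Armstrong"}) {
            concepts.put(label, new Concept(label));
        }
        concepts.get("Apollo 11").altLabels.add("Apollo XI");
        addBroader(concepts.get("Apollo 11"), concepts.get("Apollo missions"));
        addBroader(concepts.get("Apollo 11"), concepts.get("Moon landings"));
        addRelated(concepts.get("Neil Armstrong"), concepts.get("Apollo 11"));
        return concepts;
    }

    public static void main(String[] args) {
        Map<String, Concept> concepts = buildExample();
        Concept apollo11 = concepts.get("Apollo 11");
        System.out.println(apollo11.prefLabel + " has "
                + apollo11.broader.size() + " broader concepts");
        System.out.println("polyhierarchical: " + isPolyhierarchical(apollo11));
    }
}
```

In this sketch ‘Apollo 11’ sits under two parents, which is exactly the polyhierarchy case where a plain tree no longer suffices.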

2. Ontologies

Ontologies are perceived as being complex in contrast to the rather simple taxonomies and thesauri. Limitations of taxonomies and SKOS-based vocabularies in general become obvious as soon as one tries to describe a specific relation between two concepts: ‘Neil Armstrong’ is not only unspecifically ‘related’ to ‘Apollo 11’, he was ‘commander of’ this particular Apollo mission. Therefore we have to extend the SKOS ontology by two classes (‘Astronaut’ and ‘Mission’) and the property ‘commander of’ which is the inverse of ‘commanded by’.
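
The extension described above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: OntologySketch and its methods are hypothetical names, statements are stored as plain string triples, and no real ontology language or reasoner is involved. The point is that a specific property with a declared inverse says more than an unspecific ‘related’ link.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch of typed resources and a named property with an inverse.
final class OntologySketch {
    private final Map<String, String> types = new HashMap<>();
    private final Map<String, String> inverses = new HashMap<>();
    private final Set<List<String>> triples = new HashSet<>();

    // Assigns a resource to a class such as 'Astronaut' or 'Mission'.
    void classify(String resource, String className) {
        types.put(resource, className);
    }

    // Declares two properties as inverses of each other.
    void declareInverse(String property, String inverse) {
        inverses.put(property, inverse);
        inverses.put(inverse, property);
    }

    // Stores a statement and, if the property has an inverse, the inverse statement.
    void state(String subject, String property, String object) {
        triples.add(Arrays.asList(subject, property, object));
        String inverse = inverses.get(property);
        if (inverse != null) {
            triples.add(Arrays.asList(object, inverse, subject));
        }
    }

    boolean holds(String subject, String property, String object) {
        return triples.contains(Arrays.asList(subject, property, object));
    }

    String typeOf(String resource) {
        return types.get(resource);
    }

    // The example from the text: Neil Armstrong was commander of Apollo 11.
    static OntologySketch apolloExample() {
        OntologySketch model = new OntologySketch();
        model.classify("Neil Armstrong", "Astronaut");
        model.classify("Apollo 11", "Mission");
        model.declareInverse("commander of", "commanded by");
        model.state("Neil Armstrong", "commander of", "Apollo 11");
        return model;
    }

    public static void main(String[] args) {
        OntologySketch model = apolloExample();
        System.out.println(model.holds("Apollo 11", "commanded by", "Neil Armstrong"));
    }
}
```

Only one statement is entered by hand; the ‘commanded by’ direction follows from the inverse declaration.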

The SKOS concept with the preferred label ‘Buzz Aldrin’ has to be classified as an ‘Astronaut’ in order to be described by specific relations and attributes like ‘is lunar module pilot of’ or ‘birthDate’. Introducing additional ontologies to expand the expressivity of SKOS-based vocabularies follows the ‘pay-as-you-go’ strategy of the linked data community. The PoolParty knowledge modelling approach suggests starting with SKOS and then extending this simple knowledge model with other knowledge graphs, ontologies, annotated documents and legacy data. This paradigm can be remembered by a rule named ‘Start SKOS, grow big’.

3. Knowledge Graphs

Knowledge graphs are all around (e.g. DBpedia, Freebase, etc.). Based on W3C’s Semantic Web Standards such graphs can be used to further enrich your SKOS knowledge models. In combination with an ontology, specific knowledge about a certain resource can be obtained with a simple SPARQL query. As an example, the fact that Neil Armstrong was born on August 5th, 1930 can be retrieved from DBpedia. Watch this YouTube video which demonstrates how ‘linked data harvesting’ works with PoolParty.
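
As a rough idea of what such a ‘simple SPARQL query’ might look like, the Java sketch below only builds the query text and a request URL; it does not contact any server. The endpoint address https://dbpedia.org/sparql, the dbo:birthDate property and the resource name Neil_Armstrong are assumptions based on how DBpedia is commonly queried, not something stated in this article, and the class DbpediaQuerySketch is a made-up name.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Builds (but does not send) a SPARQL query for a person's birth date.
final class DbpediaQuerySketch {
    // Assumed public endpoint; adjust if you use a different SPARQL service.
    static final String ENDPOINT = "https://dbpedia.org/sparql";

    // resourceLocalName is the last part of the DBpedia resource URI,
    // e.g. "Neil_Armstrong"; it is inserted as-is, so it must be a valid local name.
    static String birthDateQuery(String resourceLocalName) {
        return "PREFIX dbo: <http://dbpedia.org/ontology/>\n"
             + "PREFIX dbr: <http://dbpedia.org/resource/>\n"
             + "SELECT ?birthDate WHERE {\n"
             + "  dbr:" + resourceLocalName + " dbo:birthDate ?birthDate .\n"
             + "}";
    }

    // URL-encodes the query into a GET request URL for the endpoint.
    static String requestUrl(String query) {
        return ENDPOINT + "?query=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = birthDateQuery("Neil_Armstrong");
        System.out.println(query);
        System.out.println(requestUrl(query));
    }
}
```

The printed URL could be opened in a browser or fetched with any HTTP client; the ontology is what makes the property name in the query meaningful.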

Knowledge graphs could be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets.

Why should I transform my content and data into a large knowledge graph?

The answer is simple: to be able to make complex queries over the entirety of all kinds of information. By breaking up the data silos there is a high probability that query results become more valid.

With PoolParty Semantic Integrator, content and documents from SharePoint, Confluence, Drupal etc. can be transformed automatically to integrate them into enterprise knowledge graphs.

Taxonomies, thesauri, ontologies, linked data graphs including enterprise content and legacy data – all kinds of information could become part of an enterprise knowledge graph which can be stored in a linked data warehouse. Based on technologies like Virtuoso, such data warehouses have the ability to serve as a complex question answering system with excellent performance and scalability.

4. Conclusion

In the early days of the semantic web, we constantly discussed whether taxonomies, ontologies or linked data graphs would be part of the solution. Again and again discussions like ‘Did the current data-driven world kill ontologies?’ are being led. My proposal is: try to combine all of those. Embrace every method which makes meaningful information out of data. Stop denouncing communities which don’t follow one or the other aspect of the semantic web (e.g. reasoning or SKOS). Let’s put the pieces together – together!


Categories: Blogroll

The perfect candidate

Machine Learning Blog - Mon, 2014-07-14 15:04

The last several years have seen a phenomenal growth in machine learning, such that this earlier post from 2007 is understated. Machine learning jobs aren’t just growing on trees, they are growing everywhere. The core dynamic is a digitizing world, which makes people who know how to use data effectively a very hot commodity. In the present state, anyone reasonably familiar with some machine learning tools and a master’s level of education can get a good job at many companies while PhD students coming out sometimes have bidding wars and many professors have created startups.

Despite this, hiring in good research positions can be challenging. A good research position is one where you can:

  • Spend the majority of your time working on research questions that interest you.
  • Work with other like-minded people.
  • For several years.

I see these as critical—research is hard enough that you cannot expect to succeed without devoting the majority of your time. You cannot hope to succeed without personal interest. Other like-minded people are typically necessary in finding the solutions of the hardest problems. And, typically you must work for several years before seeing significant success. There are exceptions to everything, but these criteria are the working norm of successful research I see.

The set of good research positions is expanding, but at a much slower pace than the many applied scientist types of positions. This makes good sense as the pool of people able to do interesting research grows only slowly, and anyone funding this should think quite hard before making the necessary expensive commitment for success.

But, with the above said, what makes a good candidate for a research position? People have many diverse preferences, so I can only speak for myself with any authority. There are several things I do and don’t look for.

  1. Something new. Any good candidate should have something worth teaching. For a PhD candidate, the subject of your research is deeply dependent on your advisor. It is not necessary that you do something different from your advisor’s research direction, but it is necessary that you own (and can speak authoritatively about) a significant advance.
  2. Something other than papers. It is quite possible to persist indefinitely in academia while only writing papers, but it does not show a real interest in what you are doing beyond survival. Why are you doing it? What is the purpose? Some people code. Some people solve particular applications. There are other things as well, but these make the difference.
  3. A difficult long-term goal. A goal suggests interest, but more importantly it makes research accumulate. Some people do research without a goal, solving whatever problems happen to pass by that they can solve. Very smart people can do well in research careers with a random walk amongst research problems. But people with a goal can have their research accumulate in a much stronger fashion than a random walk through research problems. I’m not an extremist here—solving off goal problems is fine and desirable, but having a long-term goal makes a long-term difference.
  4. A portfolio of coauthors. This shows that you are the sort of person able to and interested in working with other people, as is very often necessary for success. This can be particularly difficult for some PhD candidates whose advisors expect them to work exclusively with (or for) them. Summer internships are both a strong tradition and a great opportunity here.
  5. I rarely trust recommendations, because I find them very difficult to interpret. When the candidate selects the writers, the most interesting bit is who the writers are. Letters default positive, but the degree of default varies from writer to writer. Occasionally, a recommendation says something surprising, but do you trust the recommender’s judgement? In some cases yes, but in many cases you do not know the writer.

Meeting the above criteria within the context of a PhD is extraordinarily difficult. The good news is that you can “fail” with a job that is better in just about every way.

Anytime criteria are discussed, it’s worth asking: should you optimize for them? In another context, lines of code is a terrible metric to optimize when judging programmer productivity. Here, I believe optimizing for (1), (2), (3), and (4) is beneficial and worthwhile for PhD students.

Categories: Blogroll

Bing hearts World Cup 2014, Google - not so much

Data Mining Blog - Sat, 2014-07-12 15:19

While Google has been doing a great job of their front page animations (today's is very nice, illustrating how Brazil and The Netherlands are on their way to Russia for 2018), Bing appears to be far more attentive to actually answering questions about the competition. For example:

Compared to Bing's

Google's answer brings up some interesting news articles, but Bing brings up stats on the teams and even a prediction of who will win (Cortana - which is driving these predictions - has been doing a perfect job of predicting game outcomes).

Categories: Blogroll

More quick links

Greg Linden's Blog - Thu, 2014-07-10 10:03
More of what caught my attention lately:
  • Crazy cool and the first time I've seen ultrasound used for device-to-device communication outside of research: "Chromecast will be able to pair without Wi-Fi, or even Bluetooth, via an unusual method: ultrasonic tones." ([1])

  • A 3D printer that can print in "any weldable material" including titanium, aluminum, and stainless steel ([1])

  • "You teach Baxter [an inexpensive industrial robot] how to do something by grabbing an arm and showing it what you want, sort of like how you would teach a child to paint" ([1])

  • When trying to use the wisdom of the crowds, you're better off using only the best part of the crowd. ([1])

  • "Americans now appear to trust internet news about as much as newspapers and television news ... not because confidence in internet news is rising, but because confidence in TV news and newspapers has plummeted over the years." ([1])

  • "Microsoft is basically 'done' with Windows 8.x. Regardless of how usable or functional it is or isn't, it has become Microsoft's Vista 2.0 -- something from which Microsoft needs to distance itself." ([1])

  • Google Flights now lets you see everywhere you can fly out of a city (including limiting to non-stops only) and how much it would cost ([1] [2] [3] [4])

  • "Entering the fulfillment center in Phoenix feels like venturing into a realm where the machines, not the humans, are in charge ... The place radiates a non-human intelligence, an overarching brain dictating the most minute movements of everyone within its reach." ([1])

  • Google's location history feature is both fascinating and frightening. If you own an Android device, go to location history, set it to 30 days, and see the detail on where you have been. While it's true that many have this kind of data, it may surprise you to see it all at once.

  • "Vodafone, one of the world's largest mobile phone groups, has revealed the existence of secret wires that allow government agencies to listen to all conversations on its networks, saying they are widely used in some of the 29 countries in which it operates in Europe and beyond." ([1])

  • Many "users actually do not attach any significant economic value to the security of their systems" ([1] [2])

  • "Ensuring that our patent system 'promotes the progress of science,' rather than impedes it, consistent with the constitutional mandate underlying our intellectual property system" ([1])

  • Smartphones may have hit the limit on how much improvements to screen resolution matter, meaning they will have to compete on other features (like sensors or voice recognition) ([1])

  • "Project Tango can see the world around it in 3D. This would allow developers to make augmented-reality apps that line up perfectly with the real world or make an app that can 3D scan an object or environment." ([1])

  • The selling point of smartwatches is paying $200 to not have to pull your phone out of your pocket, and that might be a tough sell. ([1])

  • "As programmers will tell you, the building part is often not the hardest part: It's figuring out what to build. 'Unless you can think about the ways computers can solve problems, you can't even know how to ask the questions that need to be answered'" ([1])

  • "[No] lectures, discussion sections, midterms ... a pre-test for each subject area ... given a mentor with a graduate degree in the field ... [and] textbooks, tutorials, and other resources. Eventually, they're assessed on how well they understand the concepts." ([1])

  • "A naked mole rat has never once been observed to develop cancer" ([1])

  • Hilarious Colbert Report on the Hachette mess, particularly loved the bit on "Customers who enjoyed this also bought this" at 3:00 in the video ([1])

  • Humor from The Onion: "We want $100 from you, so we’re just going to take it. As a cable subscriber, you really have no other option here" ([1])

  • Humor from the Borowitz Report: "It never would have occurred to me that an enormous corporation with the ability to track over half a billion customers would ever exploit that advantage in any way." ([1])
Categories: Blogroll

Becoming a Data Scientist : A RoadMap

Life Analytics Blog - Thu, 2014-07-03 09:52
I receive a lot of questions regarding which books one should read to become a Data Miner / Data Scientist. Here is a suggested reading list and also a proposed RoadMap (apart from the requirement of having an appropriate University degree) for becoming a Data Scientist.
Before going further, it appears that a Data Scientist should possess an awful lot of skills : Statistics, Programming, Databases, Presentation Skills, Knowledge of Data Cleaning and Transformations.
The skills that ideally you should acquire are as follows :

1) Sound Statistical Understanding and Data Pre-Processing
2) Know the Pitfalls : You must be aware of the Biases that could affect you as an analyst and  also the common mistakes made during Statistical Analysis
3) Understand how several Machine Learning / Statistical Techniques work.
4) Time Series Forecasting
5) Computer Programming (R, Java, Python, Scala)
6) Databases (SQL and NoSQL Databases)
7) Web Scraping (Apache Nutch, Scrapy, JSoup)
8) Text Data

Statistical Understanding : A good Introductory Book is Fundamental Statistics for the Behavioral Sciences by Howell. Also IBM SPSS for Introductory Statistics - Use and Interpretation and IBM SPSS For Intermediate Statistics by Morgan et al. Although all of these books (especially the latter two) are heavy on IBM SPSS Software, they provide a good introduction to key statistical concepts, while the books by Morgan et al give a methodology to use with a practical example of analyzing the High School and Beyond Dataset.

Data Pre-Processing : I must re-iterate the importance of thoroughly checking and identifying problems within your Data. Data Pre-processing guards against the possibility of feeding erroneous data to a Machine Learning / Statistical Algorithm but also transforms data in such a way so that an algorithm can extract/identify patterns more easily. Suggested Books :

  •  Data Preparation for Data Mining by Dorian Pyle
  • Mining Imperfect Data: Dealing with Contamination and Incomplete Records by Pearson
  • Exploratory Data Mining and Data Cleaning by Johnson and Dasu
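
As a rough illustration of the kind of checks described above, here is a minimal Python sketch. The column of ages, the valid range of 0 to 120 and the helper names profile_column and standardize are all hypothetical, chosen only for the example. It flags missing and impossible values first, then rescales the remaining values so that an algorithm sees them on a common scale.

```python
import statistics

def profile_column(values, lower=None, upper=None):
    """Summarise common problems in one numeric column (None marks a missing value)."""
    missing_rows = [i for i, v in enumerate(values) if v is None]
    present = [v for v in values if v is not None]
    out_of_range = [
        v for v in present
        if (lower is not None and v < lower) or (upper is not None and v > upper)
    ]
    return {
        "missing_rows": missing_rows,
        "out_of_range": out_of_range,
        "n_valid": len(present) - len(out_of_range),
    }

def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

# Hypothetical column: one gap and one impossible age.
ages = [34, 29, None, 41, 230, 38]
report = profile_column(ages, lower=0, upper=120)
clean = [v for v in ages if v is not None and 0 <= v <= 120]
scaled = standardize(clean)

print(report)
print(scaled)
```

Whether a suspicious value should be dropped, corrected or kept is a judgment call that depends on the Data; the sketch only shows how to surface such values before modelling.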

Know the Pitfalls : There are many cases of Statistical Misuse and biases that may affect your work even if -at times- you are not consciously aware of them. This has happened to me on various occasions. Actually, this blog contains a couple of examples of Statistical Misuse even though I tried (and keep trying) to highlight limitations due to the nature of Data as much as I can. Big Data is another technology where caution is warranted. For example, see : Statistical Truisms in the Age of Big Data and The Hidden biases of Big Data.

Some more examples :

-Quora Question : What are common fallacies or mistakes made by beginners in Statistics / Machine Learning / Data Analysis

-Identifying and Overcoming Common Data Mining Mistakes by SAS Institute

The following Book is suggested :

  • Common Errors in Statistics (and how to avoid them) by P. Good and J. Harding

In case you are into Financial Forecasting I strongly suggest reading Evidence-Based Technical Analysis by David Aronson, which is heavy on how Data Mining Bias (and several other cognitive biases) may affect your Analysis.

Understand how several Machine Learning / Statistical Algorithms work : You must be able to understand the pros and cons of each algorithm. Does the algorithm that you are about to try handle noise well? How does it scale? What kind of optimizations can be performed? Which Data transformations are necessary? Here is an example for fine-tuning Regression SVMs:

Practical Selection of SVM Parameters and Noise Estimation for SVM Regression 

Another book which deserves attention is Applied Predictive Modeling by Kuhn and Johnson, which also gives numerous examples on using the caret R Package which -among other things- has extended Parameter Optimization capabilities.

When it comes to getting to know Machine Learning/ Statistical Algorithms I'd suggest the following books  :

  • Data Mining : Practical Machine Learning Tools and Techniques by Witten and Frank
  • The Elements of Statistical Learning by Hastie, Tibshirani, Friedman

Time Series Forecasting : In many situations you might have to identify and predict trends from Time Series Data. A very good Introductory Book is Forecasting : Principles and Practice by Hyndman and Athanasopoulos which contains sections on Time Series Forecasting. Time Series Analysis and its Applications with R Examples by Shumway and Stoffer is another book with Practical Examples and R Code as the title suggests.

If you are more interested in Time Series Forecasting I would also suggest the ForeCA (Forecastable Component Analysis) R package written by Georg Goerg -working at Google at the moment of writing- which tells you how forecastable a Time Series is (Ω = 0 : white noise, therefore not forecastable; Ω = 100 : Sinusoid, perfectly forecastable).
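
To give a feel for what such a forecastability score measures, here is a minimal Python sketch. It is not the ForeCA implementation: it assumes the score is 100 times one minus the normalised entropy of the spectrum, and it uses a raw periodogram as a crude spectrum estimate. The function name forecastability and the test series are made up for the example.

```python
import cmath
import math
import random

def forecastability(series):
    """Return 100 * (1 - normalised spectral entropy), using a raw periodogram."""
    n = len(series)
    mean = sum(series) / n
    centered = [x - mean for x in series]
    # Periodogram at the positive Fourier frequencies k = 1 .. n // 2 (direct DFT).
    power = []
    for k in range(1, n // 2 + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(centered)
        )
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0:
        raise ValueError("a constant series has no spectrum to analyse")
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return 100.0 * (1.0 - entropy / math.log(len(probs)))

n = 128
# A pure sinusoid with exactly 8 cycles: all power sits at one frequency.
sine = [math.sin(2 * math.pi * 8 * t / n) for t in range(n)]
# Gaussian white noise: power is spread over all frequencies.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(n)]

omega_sine = forecastability(sine)
omega_noise = forecastability(noise)
print(round(omega_sine, 1), round(omega_noise, 1))
```

Because the raw periodogram is a noisy estimate, white noise scores low with this sketch but not exactly 0; a smoother spectrum estimate would be needed to get closer to the ideal values.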

Computer Programming Knowledge: This is another essential skill. It allows you to use several Data Science Tools/APIs that require -mainly- Java and Python skills. Scala also appears to be becoming an important Programming Language for Data Science. R Knowledge is considered a "must". Having prior knowledge of Programming gives you the edge if you wish to learn a new Programming Language. You should also constantly be looking for Trends on programming language requirements (see Finding the right Skillset for Big Data Jobs). It appears that -currently- Java is the most sought-after Computer Language, followed by Python and SQL. It is also useful to look at Google Trends, but interestingly "Python" is not available as a Programming Language Topic at the moment of writing.

Database Knowledge : In my experience this is a very important skill to have. More often than not, Database Administrators (or other IT Engineers) that are supposed to extract Data for you are just too busy to do that. That means that you must have the knowledge to connect to a Database, Optimize a Query and perform several Queries/Transformations to get the Data that you want in the format that you want.
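
The connect / query / transform cycle can be sketched in a few lines of Python. This is only an illustration: it uses an in-memory SQLite database from the standard library as a stand-in for a real server, and the orders table, its columns and the index name are hypothetical.

```python
import sqlite3

# In-memory SQLite database as a stand-in for a real database connection.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, order_date TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("alice", 120.0, "2014-05-02"),
        ("bob", 35.5, "2014-05-03"),
        ("alice", 60.0, "2014-06-11"),
        ("carol", 15.0, "2014-06-12"),
        ("bob", 99.0, "2013-12-30"),
    ],
)
# An index on the filtered column is one common way to speed up a query.
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

# Filter, aggregate and sort inside the database, so that only the
# summarised rows are transferred to the analysis environment.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total_spent
    FROM orders
    WHERE order_date >= ?
    GROUP BY customer
    ORDER BY total_spent DESC
    """,
    ("2014-01-01",),
).fetchall()
conn.close()

print(rows)
```

Passing the date as a parameter (the ? placeholder) rather than pasting it into the SQL string keeps the query reusable and avoids quoting problems.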

Web Scraping: This is a useful skill to have. There are tons of useful Data which you can access if you know how to write code to access and extract information from the Web. You should get to know HTML Elements and XPath. Some examples of Software that can be used for this purpose :

-Apache Nutch
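
To show how HTML Elements and XPath fit together, here is a minimal Python sketch that uses only the standard library. The page fragment, its class names and its URLs are invented for the example, and the fragment is deliberately well-formed so that an XML parser can read it. Real pages are often messier and usually need a more forgiving HTML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment standing in for a downloaded page.
page = """
<html>
  <body>
    <div class="post">
      <h2><a href="/posts/1">First post</a></h2>
      <span class="date">2014-07-03</span>
    </div>
    <div class="post">
      <h2><a href="/posts/2">Second post</a></h2>
      <span class="date">2014-06-24</span>
    </div>
    <div class="sidebar"><a href="/about">About</a></div>
  </body>
</html>
""".strip()

root = ET.fromstring(page)

posts = []
# XPath: every <div> with class="post", anywhere below the root element.
for div in root.findall(".//div[@class='post']"):
    link = div.find("./h2/a")                   # the title link inside the post
    date = div.find("./span[@class='date']")    # the date element of the post
    posts.append({"title": link.text, "url": link.get("href"), "date": date.text})

print(posts)
```

Note that ElementTree supports only a subset of XPath; the sidebar link is skipped because its parent div does not match the class filter.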

Text Data: Text Data contain valuable information : Consumer Opinions, Sentiment, Intentions to name just a few. Information Extraction and Text Analytics are important Technologies that a Data Scientist should ideally know.

Information Extraction :


Text Analytics :

-The "tm" R Package
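
A common first step in Text Analytics is to turn raw text into a matrix of term counts per document. The following Python sketch builds such a matrix by hand with the standard library. The three example reviews, the tiny stopword list and the tokenize helper are hypothetical, and dedicated text mining packages offer far more (stemming, weighting, sparse storage).

```python
import re
from collections import Counter

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "is", "and", "a", "but", "was", "it", "i"}

def tokenize(text):
    """Lowercase the text, split it into words and drop stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]

# Hypothetical consumer opinions.
docs = [
    "The battery is great and the screen is great",
    "The battery died, but support was helpful",
    "Great support, great price",
]

counts = [Counter(tokenize(d)) for d in docs]
vocab = sorted(set().union(*counts))
# One row per document, one column per term in the vocabulary.
matrix = [[c[term] for term in vocab] for c in counts]

print(vocab)
for row in matrix:
    print(row)
```

Once the text is in this numeric form, the usual Machine Learning / Statistical Techniques (classification, clustering and so on) can be applied to it.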

The following Books are suggested :

  • Introduction to Information Retrieval by Manning, Raghavan and Schütze
  • Handbook of Natural Language Processing by Indurkhya, Damerau (Editors)
  • The Text Mining HandBook - Advanced Approaches in Analyzing Unstructured Data by Feldman and Sanger

Finally here are some Books that should not be missed by any Data Scientist :

  • Data Mining and Statistics for Decision Making by Stéphane Tufféry (A personal favorite)
  • Introduction to Data Mining by Tan, Steinbach, Kumar 
  • Applied Predictive Modeling by Kuhn, Johnson
  • Data Mining with R - Learning with Case Studies by Torgo
  • Principles of Data Mining by Bramer

Categories: Blogroll

Interesting papers at ICML 2014

Machine Learning Blog - Tue, 2014-06-24 19:56

This year’s ICML had several papers which I want to read through more carefully and understand better.

  1. Chun-Liang Li, Hsuan-Tien Lin, Condensed Filter Tree for Cost-Sensitive Multi-Label Classification. Several tricks accumulate to give a new approach for addressing cost sensitive multilabel classification.
  2. Nikos Karampatziakis and Paul Mineiro, Discriminative Features via Generalized Eigenvectors. An efficient, effective eigenvalue solution for supervised learning yields compelling nonlinear performance on several datasets.
  3. Nir Ailon, Zohar Karnin, Thorsten Joachims, Reducing Dueling Bandits to Cardinal Bandits. An effective method for reducing dueling bandits to normal bandits that extends to contextual situations.
  4. Pedro Pinheiro, Ronan Collobert, Recurrent Convolutional Neural Networks for Scene Labeling. Image parsing remains a challenge, and this is plausibly a step forward.
  5. Cicero Dos Santos, Bianca Zadrozny, Learning Character-level Representations for Part-of-Speech Tagging. Word morphology is clearly useful information, and yet almost all ML-for-NLP applications ignore it or hard-code it (by stemming).
  6. Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Statistically efficient interactive learning is now computationally feasible. I wish this one had been done in time for the NIPS tutorial.
  7. David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, Deterministic Policy Gradient Algorithms. A reduction in variance from working out the deterministic limit of policy gradient makes policy gradient approaches look much more attractive.

Edit: added one that I forgot.

Categories: Blogroll

An ICML proposal: yearly surveys

Machine Learning Blog - Wed, 2014-06-18 08:55

I’d like to propose that ICML conducts a yearly survey similar to the one from 2010 or 2012 which is reported to all.

The key reason for this is information: I expect everyone participating in ICML has some baseline interest in how ICML is doing. Everyone involved has personal anecdotal information, but we all understand that a few examples can be highly misleading.

Aside from satisfying everyone’s joint curiosity, I believe this could improve ICML itself. Consider for example reviewing. Every program chair comes in with ideas for how to make reviewing better. Some succeed, but nearly all are forgotten by the next round of program chairs. Making survey information available will help quantify success and correlate it with design decisions.

The key question to ask for this is “who?” The reason why surveys don’t happen more often is that it has been the responsibility of program chairs who are typically badly overloaded. I believe we should address this by shifting the responsibility to a multiyear position, similar to or the same as a webmaster. This may imply a small cost to the community (<$1/participant) for someone’s time to do and record the survey, but I believe it’s a worthwhile cost.

I plan to bring this up with the IMLS board in Beijing, but would like to invite any comments or thoughts.

Categories: Blogroll

Quick links

Greg Linden's Blog - Thu, 2014-06-05 15:50
What caught my attention lately:
  • Fun data: "How to tell someone's age when all you know is her name" ([1])

  • "The possibility of proper tricorder technology in the future, scanning a bit of someone's blood and telling you if they have any diseases or anomalous genetic conditions" ([1])

  • Will self-driving vehicles appear first in trucking? ([1])

  • "Apple's moves into the world of fashion and wearable computing" ([1])

  • "Few people try to or want to use tablets like laptops" ([1] [2])

  • "While managers do indeed add value to a company, there’s no particular reason to believe that they add more value to a company than the people who report to them ... [You want] an organization where fairly-compensated people work together as a team, rather than trying to work out the best way to make money for themselves at the expense of their colleagues." ([1])

  • "Each meeting ... spawns even more meetings ... The solution ... reduce default meeting length from 60 to 30 minutes ... limit meetings to seven or fewer participants ... agendas with clear objectives ... materials ... distributed in advance .. on-time start ... early ending, especially if the meeting is going nowhere ... remove ... unnecessary supervisors." ([1])

  • Fun article on the history of the modern office: "The cubicle was actually intended to be this liberating design, and it basically became perverted" ([1])

  • "We were wrong about the first-time shoppers. They did mind registering. They resented having to register when they encountered the page. As one shopper told us, 'I'm not here to enter into a relationship. I just want to buy something.'" ([1])

  • Private investment in broadband infrastructure is actually dropping in the US ([1])

  • "Not only are packets being dropped, but all those not being dropped are also subject to delay. ... They are deliberately harming the service they deliver to their paying customers ... Shouldn't a broadband consumer network with near monopoly control over their customers be expected, if not obligated, to deliver a better experience than this?" ([1])

  • Fascinating data on cancer shows a surprising lack of linear relationship between aging and cancer ([1] [2])

  • "A wayward spacecraft ISEE-3/ICE was returning to fly past Earth after many decades of wandering through space. It was still operational, and could potentially be sent on a new mission, but NASA no longer had the equipment to talk to it ... crowdfunding project ... commandeer the spacecraft ... awfully long shot ... They are now in command of the ISEE-3 spacecraft." ([1])

  • I love the caption on this comic: "Somebody please do this and post it on YouTube so I can live vicariously through your awesomeness." ([1])

  • Hilarious SMBC comic on privacy and technology ([1])

  • Great SMBC comic: "Wanna play the Bayesian drinking game?" ([1])

  • Hilarious John Oliver segment on net neutrality ([1]) directs people to FCC website to comment, crashing FCC website ([2])

  • Very funny, from The Onion: "New Facebook Feature Scans Profile To Pinpoint Exactly When Things Went Wrong" ([1])
Categories: Blogroll