Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

Vowpal Wabbit 7.8 at NIPS

Machine Learning Blog - Sat, 2014-12-06 21:46

I just created Vowpal Wabbit 7.8, and we are planning to have an increasingly less heretical followup tutorial during the non-“ski break” at the NIPS Optimization workshop. Please join us if interested.

I always feel like things are going slowly, but in the last year there have been many changes overall. Notes for 7.7 are here. Since then, there are several areas of improvement as well as generalized bug fixes and refactoring.

  1. Learning to Search: Hal completely rewrote the learning to search system, enough that the numbers here are looking obsolete. Kai-Wei has also created several advanced applications for entity-relation and dependency parsing, which are promising.
  2. Languages: Hal also created a good Python library, which includes callbacks for learning to search. You can now develop advanced structured prediction solutions in a nice language. Jonathan Morra also contributed an initial Java interface.
  3. Exploration: The contextual bandit subsystem now allows evaluation of an arbitrary policy, and an exploration library is now factored out into an independent library (principally by Luong with help from Sid and Sarah). This is critical for real applications because randomization must happen at the point of decision.
  4. Reductions: The learning reductions subsystem has continued to mature, although the perfectionist in me is still dissatisfied. As a consequence, it’s now pretty easy to program new reductions, and the efficiency of these reductions has generally improved. Several new ones are cooking.
  5. Online Learning: Alekh added an online SVM implementation based on LaSVM. This is known to parallelize well via the para-active approach.

This project has grown quite a bit—about 30 different people have contributed to VW since the last release, and there is now a VW meetup (December 8th!) in the bay area that I wish I could attend.

Categories: Blogroll

Allreduce (or MPI) vs. Parameter server approaches

Machine Learning Blog - Fri, 2014-11-28 18:01

In the last 7 years or so there has been quite a bit of work on parallel machine learning approaches, enough that I felt like a summary might be helpful both for myself and others. In each case, I put in the earliest known citation. If I missed something please comment.

One basic dividing line between parallel approaches is single-machine vs. multi-machine. Multi-machine approaches offer the potential for much greater improvements than single-machine approaches, but generally suffer from a lower bandwidth between components of the parallelized process.

Amongst single machine approaches, GPU-based learning is the dominant form of parallelism. For many algorithms, this can provide an easy 10x speedup, with the limits being programming (GPUs are special), the amount of GPU RAM (12GB for a K40), the bandwidth to the GPU interface, and your algorithms needing care as new architectures come out. I’m not sure who first started using GPUs for machine learning.

Another important characteristic of parallel approaches is deterministic vs. nondeterministic. When you run the same algorithm twice, do you always get the same result? Leon Bottou tells me that he thinks reproducibility is worth a factor of 2. I personally would rate it somewhat higher, just because debugging is such an intrinsic part of using machine learning algorithms and the debuggability of nondeterministic algorithms is greatly impaired.

  1. MPI gradient aggregation (See here (2007).) Accumulate gradient statistics in parallel and use a good solver to find a good solution. There are two weaknesses here:
    1. Batch solvers are slow compared to online gradient descent approaches, at least for the first pass.
    2. Large datasets typically do not sit in MPI clusters. There are good reasons for this—MPI clusters are typically not designed for heavy data work.
  2. Map-Reduce statistical query algorithms. The first paper (2007) of this sort was single machine, but it obviously applied to the map-reduce clusters of the time, starting the Mahout project. This addressed the second problem of the MPI approach, but not the first (batch algorithms are slow), and created a new problem (iteration and communication are slow in a map-reduce setting).
  3. Parameter averaging. (see here (2010)). Use online learning algorithms and then average parameter values. This dealt with both of the drawbacks of the MPI approach as applied to convex learning, but is limited to convex(ish) approaches and may take a long time to converge on datasets where a second order optimization is needed. Iteration in a map-reduce paradigm remains awkward/slow in many cases.
  4. Graph-based approaches. (see here (2010)). Learning algorithms that are represented by graphs can be partitioned across compute nodes and then operated on with parallel algorithms. This allows models larger than the state of a single machine. It covers the many learning algorithms that can be represented as graphs, but there are also many that cannot be represented effectively this way.
  5. Parameter server approaches. (see here (2010)). This is distinct from graph-based approaches in that parameter storage and updating is broken out as a primary component. This allows models larger than the state of a single machine. Parameter server approaches require nondeterminism to be performant. There has been quite a bit of follow-up work on parameter server approaches, including shockingly inefficient systems (2012) and more efficient systems (2014), although they remain less efficient than GPUs.
  6. Allreduce approaches. (see here (2011)) Allreduce is an MPI primitive which allows normal sequential code to work in parallel, implying very low programming overhead. This allows parameter averaging, gradient aggregation, and iteration (a minimal sketch of the averaging pattern follows this list). The fundamental drawbacks are poor performance under imbalanced loads and difficulty with models that exceed working memory in size. A refined version of this approach has been used for speech recognition (2014).
  7. GPU+MPI approaches. (see here (2013)) GPUs are good and MPI is good, so GPU+MPI should be good. It is good, although there are caveats related to the phenomenal amount of computation a GPU provides compared to the communication available, even with a good interconnect. See the speech recognition paper above for a remediation.

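To make the parameter averaging / allreduce pattern concrete, here is a minimal sketch in Python using mpi4py and NumPy (both are assumptions on my part; this is not VW's implementation, and local_sgd_pass is a made-up stand-in for whatever per-node online learner you actually run). Each node trains on its own shard of the data, and a single Allreduce then leaves the averaged parameters on every node.

    # Minimal sketch of allreduce-based parameter averaging (hypothetical, not VW's code).
    # Launch with something like: mpiexec -n 4 python average.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    dim = 1000
    w = np.zeros(dim)  # local model parameters

    def local_sgd_pass(w, rank):
        # Stand-in for one online-learning pass over this node's data shard.
        rng = np.random.default_rng(rank)
        for _ in range(100):
            x = rng.standard_normal(dim)
            y = 1.0 if x[0] > 0 else -1.0   # synthetic label
            if y * w.dot(x) < 1.0:          # hinge-loss subgradient step
                w = w + 0.01 * y * x
        return w

    w = local_sgd_pass(w, rank)

    # Allreduce sums the local parameter vectors across all nodes; dividing by the
    # node count gives every node the averaged model, with the sequential training
    # code above left completely untouched.
    w_avg = np.empty_like(w)
    comm.Allreduce(w, w_avg, op=MPI.SUM)
    w_avg /= size

The same two lines implement gradient aggregation if you Allreduce gradients instead of parameters, which is why the primitive maps so directly onto items 1, 3 and 6 above.
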
Most of these papers are about planting a flag rather than determining what the best approach to parallelization is. This makes determining how to parallelize learning algorithms rather unclear. My present approach remains case-based.

  1. Don’t do it for the sake of parallelization. Have some other real goal in mind that justifies the effort. Parallelization is both simple and subtle, which makes it unrewarding unless you really need it. Strongly consider the programming complexity of approaches if you decide to proceed.
  2. If you are locked into a particular piece of hardware or cluster software, then you don’t have much choice—make the best of it. For some people this is an MPI cluster, Hadoop, or an SQL cluster.
  3. If your data can easily be copied onto a single machine, then a GPU-based approach seems like a good idea. GPU programming is nontrivial, but many people have done it at this point.
  4. If your data is of a multimachine scale, you must do some form of cluster parallelism.
    1. Graph-based approaches can be the right answer when your graph is not too deeply interconnected.
    2. Allreduce-based approaches appear effective and easy to use in many other cases. I wish every cluster came with an allreduce library.
      1. If you are parsing-limited (i.e. for linear representations), then a CPU cluster is fine.
      2. If you are compute-limited, then a cluster of GPUs seems the way to go.

The above leaves out parameter server approaches, which is controversial since a huge amount of effort has been invested in them and reasonable people can disagree. The caveats that matter for me are:

  1. It might be that the right way to parallelize in a cluster has nothing to do with the right way to parallelize on a single machine, but this seems implausible.
  2. Success/effort appears relatively low. It’s surprising that you can effectively compete with mature parameter server approaches on compute-heavy tasks using a single GPU.

Note that I’m not claiming parameter servers are useless—I think they could be effective if applied in situations where you cannot prebalance the compute load of parallel elements. But the extent to which this applies in a datacenter computation seems both manageable and a flaw of the datacenter that will be reduced with time.

Categories: Blogroll

Automatic Semantic Tagging for Drupal CMS launched

Semantic Web Company - Fri, 2014-11-28 09:30

REEEP [1] and CTCN [2] have recently launched Climate Tagger, a new tool to automatically scan, label, sort and catalogue datasets and document collections. Climate Tagger now incorporates a Drupal Module for automatic annotation of Drupal content nodes. Climate Tagger addresses knowledge-driven organizations in the climate and development arenas, providing automated functionality to streamline, catalogue and link their Climate Compatible Development data and information resources.

Climate Tagger for Drupal is a simple, FREE and easy-to-use way to integrate the well-known Reegle Tagging API [3], originally developed in 2011 with the support of CDKN [4] (now part of the Climate Tagger suite as the Climate Tagger API), into any web site based on the Drupal Content Management System [5]. Climate Tagger is backed by the expansive Climate Compatible Development Thesaurus, developed by experts in multiple fields and continuously updated to remain current (explore the thesaurus at http://www.reegle.info/glossary). The thesaurus is available in English, French, Spanish, German and Portuguese, and can connect content on different portals published in these different languages.

Climate Tagger for Drupal can be fine-tuned to individual (and existing) configuration of any Drupal 7 installation by:

  • determining which content types and fields will be automatically tagged
  • scheduling “batch jobs” for automatic updating (also for already existing contents; where the option is available to re-tag all content or only tag with new concepts found via a thesaurus expansion / update)
  • automatically limiting and managing volumes of tag results based on individually chosen scoring thresholds
  • blending with manual tagging


“Climate Tagger [6] brings together the semantic power of Semantic Web Company’s PoolParty Semantic Suite [7] with the domain expertise of REEEP and CTCN, resulting in an automatic annotation module for Drupal 7 with an accuracy never seen before,” states Martin Kaltenböck, Managing Partner of Semantic Web Company [8], which acts as the technology provider behind the module.

“Climate Tagger is the result of a shared commitment to breaking down the ‘information silos’ that exist in the climate compatible development community, and to providing concrete solutions that can be implemented right now, anywhere,” said REEEP Director General Martin Hiller. “Together with CTCN and SWC we have laid the foundations for a system that can be continuously improved and expanded to bring new sectors, systems and organizations into the climate knowledge community.”

For the Open Data and Linked Open Data communities, a Climate Tagger plugin for CKAN [9] has also been published. It was developed by NREL [10] in cooperation with the CTCN, harnessing the same taxonomy and expert-vetted thesaurus behind the Climate Tagger, helping connect open data to climate compatible content through the simultaneous use of these tools.

REEEP Director General Martin Hiller and CTCN Director Jukka Uosukainen will be talking about Climate Tagger at the COP20 side event hosted by the Climate Knowledge Brokers Group in Lima [11], Peru, on Monday, December 1st at 4:45pm.

Further reading and downloads

About REEEP:

REEEP invests in clean energy markets in developing countries to lower CO2 emissions and build prosperity. Based on a strategic portfolio of high-impact projects, REEEP works to generate energy access, improve lives and economic opportunities, build sustainable markets, and combat climate change.

REEEP understands market change from a practice, policy and financial perspective. We monitor, evaluate and learn from our portfolio to understand opportunities and barriers to success within markets. These insights then influence policy, increase public and private investment, and inform our portfolio strategy to build scale within and replication across markets. REEEP is committed to open access to knowledge to support entrepreneurship, innovation and policy improvements to empower market shifts across the developing world.

About the CTCN

The Climate Technology Centre & Network facilitates the transfer of climate technologies by providing technical assistance, improving access to technology knowledge, and fostering collaboration among climate technology stakeholders. The CTCN is the operational arm of the UNFCCC Technology Mechanism and is hosted by the United Nations Environment Programme (UNEP) in collaboration with the United Nations Industrial Development Organization (UNIDO) and 11 independent, regional organizations with expertise in climate technologies.

About Semantic Web Company

Semantic Web Company (SWC, http://www.semantic-web.at) is a technology provider headquartered in Vienna (Austria). SWC supports organizations from all industrial sectors worldwide to improve their information and data management. Their products have outstanding capabilities to extract meaning from structured and unstructured data by making use of linked data technologies.

Categories: Blogroll

Introducing the Linked Data Business Cube

Semantic Web Company - Fri, 2014-11-28 07:09

With the increasing availability of semantic data on the World Wide Web and its reutilization for commercial purposes, questions arise about the economic value of interlinked data and the business models that can be built on top of it. The Linked Data Business Cube provides a systematic approach to conceptualizing business models for Linked Data assets. Similar to an OLAP Cube, the Linked Data Business Cube provides an integrated view on stakeholders (x-axis), revenue models (y-axis) and Linked Data assets (z-axis), thus allowing one to systematically investigate the specificities of various Linked Data business models.


Mapping Revenue Models to Linked Data Assets

By mapping revenue models to Linked Data assets we can modify the Linked Data Business Cube as illustrated in the figure below.

The figure indicates that the opportunities to derive direct revenues rise with the increasing business value of a resource. Assets that are easily substitutable generate little incentive for direct revenues but can be used to trigger indirect revenues. This basically applies to instance data and metadata. On the other hand, assets that are unique and difficult to imitate and substitute, i.e. in terms of the competence and investments necessary to provide the service, carry the highest potential for direct revenues. This applies to assets like content, service and technology. Generally speaking, the higher the value proposition of an asset – in terms of added value – the higher the willingness to pay.

Ontologies seem to function as a “mediating layer” between “low-incentive assets” and “high-incentive assets”. This means that ontologies as a precondition for the provision and utilization of Linked Data can be capitalized in a variety of ways, depending on the business strategy of the Linked Data provider.

It is important to note that each revenue model has specific merits and flaws and requires certain preconditions to work properly. Additionally, revenue models often occur in combination, as they are functionally complementary.

Mapping Revenue Models to Stakeholders

A Linked Data ecosystem is usually comprised of several stakeholders that engage in the value creation process. The cube can help us to elaborate the most reasonable business model for each stakeholder.

Summing up, Linked Data generates new business opportunities, but the commercialization of Linked Data is very context specific. Revenue models change in accordance with the various assets involved and the stakeholders who make use of them. Knowing these circumstances is crucial to establishing successful business models, but doing so requires a holistic and interconnected understanding of the value creation process and of the specific benefits and limitations Linked Data generates at each step of the value chain.

Read more: Asset Creation and Commercialization of Interlinked Data

Categories: Blogroll

Blogger Stole My Thumbnails

Code from an English Coffee Drinker - Thu, 2014-11-13 05:19
I really hoped I was done writing posts about how Blogger was messing with our blogs, but unfortunately here I am writing another post. It seems that something in the way images are added to our posts has changed, and this in turn means that Blogger no longer generates and includes a thumbnail for each post in the feeds produced for a blog. Whilst this doesn't affect the look of your blog, it will make it look less interesting when viewed in a feed reader or in a blog list on another blog, as only the title of the posts will appear.

The problem appears to be that when you upload and embed an image, for some reason Blogger is omitting the http: from the beginning of the image URL. This means that all the image URLs now start with just //. This is perfectly valid (it's defined in RFC 1808) and is often referred to as a scheme-relative URL. What this means is that the URL is resolved relative to the scheme (http, https, ftp, file, etc.) of the page in which it is embedded.

I'm guessing this change means that Blogger is looking at switching some blogs to be served over https instead of http. Leaving the scheme off the image URL means that the image will be served using the same scheme as the page, regardless of what that is. The problem though is that the code Blogger uses to generate the post feeds seems to ignore images whose URLs don't start with http:, meaning that no thumbnails are generated.

For now the easy fix is to manually add the http: to the image URLs (or at least to the URL of the image you want to use as the thumbnail for the post). Of course it would be better if Blogger just fixed their code to spot these URLs properly and include them in the post feed.
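
If a lot of posts are affected, the same manual fix can be scripted. The snippet below is just a rough sketch of the idea (my own, not a Blogger feature): it rewrites scheme-relative src attributes in a post's HTML into explicit http: URLs, after which the corrected HTML still has to be pasted back into the post editor.

    import re

    def fix_image_urls(post_html):
        # Turn src="//host/img.png" into src="http://host/img.png";
        # URLs that already carry a scheme are left alone.
        return re.sub(r'src="//', 'src="http://', post_html)

    print(fix_image_urls('<img src="//1.bp.blogspot.com/example/photo.jpg" />'))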

Updated 18th November 2014: It looks as if this problem has been resolved as I've just tried posting on one of my other blogs and the http: bit is back.
Categories: Blogroll

Becky’s and My Annotation Paper in TACL

LingPipe Blog - Wed, 2014-10-29 17:05
Finally something published after all these years working on the problem: Rebecca J. Passonneau and Bob Carpenter. 2014. The Benefits of a Model of Annotation. Transactions of the Association for Computational Linguistics (TACL) 2(Oct):311−326. [pdf] (Presented at EMNLP 2014.) Becky just presented it at EMNLP this week. I love the TACL concept and lobbied Michael Collins […]
Categories: Blogroll

Sequence Data Mining for Health Applications

Life Analytics Blog - Thu, 2014-10-16 07:48
An often overlooked type of Analysis is Sequence Data Mining (or Sequential Pattern Mining).


Sequence Data Mining is a type of Analysis which aims at extracting patterns from sequences of Events. We can also see Sequence Data Mining as an Associations Discovery Analysis with a Temporal Element.

Sequence Data Mining has many potential applications (Web Page Analytics, Complaint Events, Business Processes) but here we will show an application for Health. I believe that this type of Analysis will become even more important as wearable technology is used more widely and therefore more Data of this kind is generated.

Consider the following hypothetical scenario:
A 30-year-old male patient complains about several symptoms which, for simplicity, we will name Symptom1, Symptom2, Symptom3, etc.

His Doctor tries to identify what is going on, and the patient has all the necessary blood work done, which finds no problems. After thorough evaluation the Doctor believes that his patient suffers from Chronic Fatigue Syndrome. Under the Doctor's supervision the patient will record his symptoms along with the different supplements he takes to understand more about his condition. Several events (e.g. a visit to the gym, a stressful event) will also be taken into consideration to see if any patterns emerge.

  • How can we easily record Data for the scenario above?
  • Can we extract sequences of events that occur more frequently than mere chance?
  • Can we identify which sequences of Events / Food / Medication may potentially lead to specific Symptoms or to a lack of Symptoms?

Looking at the problem through the eyes of a Data Scientist, we have:
  • A series of Events that happen during a day: a Stressful event, a Sedentary day, Cardio workouts, Weight Lifting, Abrupt Weather Deterioration, etc.
  • A number of Symptoms: Headaches, "Brain Fog", Mood problems, Insomnia, Arthralgia, etc.

Let's begin with Data Collection. We first suggest that the patient use an Android app called MyLogsPro (or some other equivalent application) to easily record information as it happens:

So if the patient feels a specific Symptom, he presses the relevant Symptom button on his mobile device. The same applies for any events that have happened and any Food or Medication taken. As the day passes, we have the following data collected:


The snapshot shows what happened starting on the 20th of August 2014: our patient logged the intake of Medication (at 08:22 AM) and/or Supplements upon waking up, then a Food entry was added at 08:47. At 11:06 the patient had a Symptom and immediately reached for his phone and pressed the relevant Symptom button (Symptom No 4).
After many days of Data Collection we decide that it's time to analyze this information. We export the data from the application as a CSV file which looks as follows:


We will use KNIME to read the CSV file, change the contents of the entries so that an Algorithm can read the events, and then perform Sequence Data Mining. We have the following layout:


The File Reader reads the .csv file. In the Pre-processing block (shown in yellow), a String Manipulation node removes the colon (:) from the time field (e.g. 12:10 becomes 1210), the Sorter sorts the data by date and then by time, and a Java snippet uses the replaceAll() function to remove all leading zeros from the time field (e.g. 0010 becomes 10).
The R Snippet loads the cSPADE Algorithm and then uses it to extract sequence patterns.
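
For readers who don't use KNIME, roughly the same pre-processing can be sketched in Python with pandas (an assumption on my part; the file and column names below are made up, and the actual pattern extraction in this post is done with cSPADE from R, which this sketch does not replace):

    import pandas as pd

    # The exported CSV is assumed to have date, time and event columns, in that order.
    df = pd.read_csv("mylogspro_export.csv", names=["date", "time", "event"])

    # Parse the date (format assumed to be dd/mm/yy) so sorting works across months.
    df["date"] = pd.to_datetime(df["date"], format="%d/%m/%y")
    # Remove the colon from the time field (e.g. 12:10 becomes 1210) ...
    df["time"] = df["time"].str.replace(":", "", regex=False)
    # ... sort by date and then by time (times are still zero-padded, so string order is fine) ...
    df = df.sort_values(["date", "time"])
    # ... and strip leading zeros (e.g. 0010 becomes 10), as the Java snippet does.
    df["time"] = df["time"].str.lstrip("0").replace("", "0")

    df.to_csv("events_clean.csv", index=False)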

After executing the stream we get the following output:

The information consists of two outputs: the first is a list of sequences along with their support, and the second contains the output from rule induction, which gives us two more useful metrics (namely the lift and the confidence for each rule).

We immediately notice an interesting entry in the first output:

Medication1->Symptom2

and on the second output we see that this particular rule has a lift of 1.4 and a confidence of 0.8.
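
As a reminder of what those metrics mean, here is a small self-contained sketch (with made-up toy data, not the patient's actual log) that computes support, confidence and lift for a two-event sequence rule such as Medication1 -> Symptom2 over per-day event sequences:

    # Toy per-day event sequences, in time order within each day (made-up data).
    days = [
        ["Medication1", "Food", "Symptom2"],
        ["Medication1", "Symptom2"],
        ["Food", "Symptom4"],
        ["Medication1", "Food"],
        ["Symptom2"],
    ]

    def occurs_in_order(seq, a, b):
        # True if event a appears somewhere before event b within the day.
        for i, e in enumerate(seq):
            if e == a and b in seq[i + 1:]:
                return True
        return False

    def rule_metrics(days, a, b):
        n = len(days)
        sup_ab = sum(occurs_in_order(d, a, b) for d in days) / n   # support of a -> b
        sup_a = sum(a in d for d in days) / n                      # support of a
        sup_b = sum(b in d for d in days) / n                      # support of b
        confidence = sup_ab / sup_a if sup_a else 0.0
        lift = confidence / sup_b if sup_b else 0.0
        return sup_ab, confidence, lift

    print(rule_metrics(days, "Medication1", "Symptom2"))

A lift above 1 means the consequent follows the antecedent more often than its base rate alone would predict, which is why the 1.4 / 0.8 figures above look interesting but still need the sanity checks discussed next.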

However, as Data Scientists we should always double-check the extracted knowledge and must be aware of pitfalls. Let's see some examples (the list is not exhaustive):

1) The algorithm does not account for time as it should. As an example, consider the following entries:

10/09/14,08:00,Medication1
10/09/14,08:05,Symptom2

We assume that Medication1 is taken by mouth and needs 60 minutes to be properly dissolved, and that these entries occur frequently enough in that order in our data set. Even though the algorithm might show a statistically significant pattern, it is not logical to hypothesize that Medication1 could be related to Symptom2. The Analyst should first examine each of these entries to see what proportion of the records has a time difference of, say, 60 minutes or more (a rough sketch of such a check appears after the next example).

Apart from the example shown above, we must also consider the opposite effect. Consider these entries:

10/09/14,08:00,Medication1
...
...
...
10/09/14,21:05,Symptom2

In other words: is it possible for a Medication taken in the morning to generate a Symptom 12 hours later?
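
Both concerns can be checked directly from the raw log before trusting the rule. The sketch below (my own illustration; the 60-minute and 12-hour thresholds are just the assumptions from the examples above) measures, for each Medication1 entry, the gap to the next Symptom2 entry on the same day and reports what proportion falls inside the plausible window:

    from datetime import datetime, timedelta

    # (date, time, event) rows as exported from the logging app (toy values).
    rows = [
        ("10/09/14", "08:00", "Medication1"),
        ("10/09/14", "08:05", "Symptom2"),
        ("11/09/14", "08:00", "Medication1"),
        ("11/09/14", "10:30", "Symptom2"),
    ]

    def to_dt(date, time):
        # Date format assumed to be dd/mm/yy, as in the examples above.
        return datetime.strptime(date + " " + time, "%d/%m/%y %H:%M")

    gaps = []
    for i, (d1, t1, e1) in enumerate(rows):
        if e1 != "Medication1":
            continue
        for d2, t2, e2 in rows[i + 1:]:
            if d2 == d1 and e2 == "Symptom2":
                gaps.append(to_dt(d2, t2) - to_dt(d1, t1))
                break

    # Plausible window: at least 60 minutes (time to dissolve), at most 12 hours.
    lo, hi = timedelta(minutes=60), timedelta(hours=12)
    plausible = sum(lo <= g <= hi for g in gaps)
    print(plausible, "of", len(gaps), "Medication1 -> Symptom2 gaps fall inside the window")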


2) The algorithm is not able to account for the compounding effect of a Medication. For example, the patient might have low levels of Taurine, and for this level to be replenished a certain number of days of Taurine supplementation is needed. The algorithm cannot account for this possibility.


3) The patient should also input entries of "No Symptoms". It is not clear, however, when this should be done (e.g. at the end of each day? assess every 6 hours and add 2 entries accordingly?).


However, this does not mean that a Sequence Mining algorithm should not be used under these circumstances. This technique can generate several potentially interesting hypotheses which Doctors and/or Researchers may wish to pursue further.

Categories: Blogroll

Conference on Digital Experimentation

Machine Learning Blog - Sat, 2014-10-11 16:30

I just attended CODE. The set of people interested in digital experimentation have very diverse backgrounds encompassing theory, machine learning, social science, economics, and industry so this seems like a good subject for a new conference. I hope it continues.

I found several talks interesting.

  • Eytan Bakshy talked about PlanOut, which is a language/platform for flexibly specifying experiments.
  • Ron Kohavi talked about EXP, which is a heavily used A/B testing platform.
  • Susan Athey talked about long-term vs. short-term metrics, which seems both important to address, a constant problem, and not yet systematically solved.

There was a panel about the ongoing Facebook experimentation controversy. The issue here is complex. My understanding is that Facebook users have some expected ownership of the content they create, and hence aren’t comfortable with the content being used in unexpected ways. On the other hand, experimentation is so necessary to the functioning of all large modern internet sites that banning it or slowing down the process by a factor of a million (as some advocated) would badly degrade the future of these sites in practice.

My belief is that what’s lacking is education and trust. W.r.t. education, people need to understand that experimentation is unavoidable when trying to figure out how to optimize an enormously complex system, as there is just no other way to systematically make 1000 right decisions as is necessary for basic things like choosing the best homepage/search result/etc… W.r.t. trust, companies are not particularly good at creating trust in general, but finding the right mechanism for doing so seems critical. I would point out Vanguard as a company that managed to successfully create trust by design.

Categories: Blogroll

The Longform Manifesto

Data Mining Blog - Fri, 2014-09-26 00:37

Sometimes a title for a blog post suggests itself to me which seems so self-contained that it takes real effort to actually write the post ('Machine Intelligence, not Machine Learning is the Next Big Thing' is another in this line). The idea behind the (or a) Longform Manifesto is as follows. I have become aware of late of the sense of deterioration that is associated with the mobile 'revolution' and the info snacking, casual gaming and interrupt-driven lifestyle that it has entailed. The behaviours are perfectly illustrated in this scene from Portlandia:


With a daughter who has now come of technological age (she has a cell phone) it has become important to me to remind myself what content consumption was like before this mobile mess appeared.

We read books, we watched movies, we listened to music. But, of course, we haven't stopped doing that. Rather, we have started all this other stuff, and the problem is that this is influencing how we approach longform content. I find myself watching bits of movies, or listening to bits of music or reading parts of essays.

The Longform Manifesto, through the definition of longform content and the discipline and commitment needed to consume it as it was meant to be consumed, helps to dilute and remove the behaviour-degrading influence of mobile technology. Someone should write it.

Categories: Blogroll

No more MSR Silicon Valley

Machine Learning Blog - Fri, 2014-09-19 19:44

This news report is correct, the Microsoft Research Silicon Valley center has been cut. The New York lab has not been directly affected although obviously cross-lab collaborations are impacted, and we sympathize deeply with those involved. Most of the rest of MSR is not directly affected.

I’m not privy to the rationale behind the decision, but in my opinion there are some very strong people in the various groups (Algorithms, Architecture, Distributed Systems, Privacy, Software tools, Web Search), and I expect offers have started raining on them. In my experience, this is harrowing in the short term, yet I know that most of my previous colleagues ended up happier after the troubles hit Yahoo! Research 2 1/2 years ago.

Categories: Blogroll

Visualizing Publicly Available US Government Data Online

Information aesthetics - Fri, 2014-09-19 03:11


Brightpoint Consulting recently released a small collection of interactive visualizations based on open, publicly available data from the US government. Characterized by a rather organic graphic design style and color palette, each visualization makes a socially and politically relevant dataset easily accessible.

The custom chord diagram titled Political Influence [brightpointinc.com] highlights the monetary contributions made by the top Political Action Committees (PACs) for the 2012 congressional election cycle, for the House of Representatives and the Senate.

The hierarchical browser 2013 Federal Budget [brightpointinc.com] reveals the major flows of spending in the US government, at the federal, state, and local level, such as the relationship of spending between education and defense.

The circular flow chart United States Trade Deficit [brightpointinc.com] shows the US Trade Deficit over the last 11 years by month. The United States sells goods to the countries at the top, while, vice versa, the countries at the bottom sell goods to the US. The dollar amount in the middle represents the cumulative deficit over this period of time.

Categories: Blogroll

The Disappearing Planet: Comparing the Extinction Rates of Animals

Information aesthetics - Thu, 2014-09-18 15:05


The subtly designed A Disappearing Planet [propublica.org] by freelance data journalist Anna Flagg reveals the extinction rates of animals, driven by a variety of human-caused effects, including climate change, habitat destruction and species displacement.

Divided into mammals, reptiles, amphibians and birds, the interactive bar graph allows users to browse horizontally through the vast amount of species by order and family, and vertically by genus.

Species at risk are highlighted in red, so that dense clusters denote related families (e.g. bears, parrots, turtles) that are especially threatened over the next 100 years.

Categories: Blogroll

Scottish Independence : Bing Predicts 'No'

Data Mining Blog - Thu, 2014-09-18 11:58

Bing's prediction team has a feature live on the site right now that predicts Scotland will not become an independent nation as a result of today's referendum.

Categories: Blogroll

GitHut: the Universe of Programming Languages across GitHub

Information aesthetics - Fri, 2014-09-12 10:37


GitHut [githut.info], developed by Microsoft data visualization designer Carlo Zapponi, is an interactive small multiples visualization revealing the complexity of the wide range of programming languages used across the repositories hosted on GitHub.

GitHub is a web-based repository service which offers the distributed revision control and source code management (SCM) functionality of Git, and enjoys more than 3 million users.

Accordingly, by representing the distribution and frequency of programming languages, one can observe the continuous quest for better ways to solve problems, to facilitate collaboration between people and to reuse the effort of others.

Programming languages are ranked by various parameters, ranging from the number of active repositories to new pushes, forks or issues. The data can be filtered over discrete moments in time, while evolutions can be explored by a collection of timelines.

Categories: Blogroll

Pi Visualized as a Public Urban Art Mural

Information aesthetics - Wed, 2014-09-10 15:23


Visualize Pi [tumblr.com] is a mural project that aimed to use popular mathematics to connect Brooklyn students to the community with a visualization of Pi. It was funded by a successful Kickstarter project as proposed by visual artist Ellie Balk, The Green School students, staff and Assistant Principal Nathan Affield.

The mural seems to consist of different parts. A reflective line graph, reminiscent of a sound wave, represents the number Pi (3.14159...) by way of colors that are coded by the sequence of the prime numbers found in Pi (2,3,5,7), as well as height.

Additionally, a golden spiral was drawn based on the Fibonacci Sequence, as an exploration of the relationship between the golden ratio and Pi. The number Pi was represented in a color-coded graph within the golden spiral. In this, the numbers are seen as color blocks that vary in size proportionately within the shrinking space of the spiral, representing the 'shape' of Pi.

"By focusing on the single, transcendental concept of Pi across courses, the mathematics department plans to not only deepen student understanding of shape and irrational number, but more importantly, connect these foundational mental schema for students while dealing with the concrete issues of neighborhood beautification and how proportion can inform aesthetic which can in turn improve quality of life."

A few more similar urban / public visualization projects can be found at Balk's project page, e.g. showing weather patterns, emotion histograms or sound waves.

Via @mariuswatz.

Categories: Blogroll

The Key Players in the Middle East and their Relationships

Information aesthetics - Wed, 2014-09-10 14:48


Whom Likes Whom in the Middle-East? [informationisbeautiful.net] by David McCandless and UniversLab is a force-directed network visualisation of the key players & notable relationships in the Middle East.

Next to its expressive aesthetic, the interactive features allow users to highlight individual nodes and their direct connections to others, as well as to filter between the kinds of possible relationships, such as "hate", "strained", "good" or "love".

Reminds me a bit of Mapping the Relationships between the Artists who Invented Abstraction.

Categories: Blogroll

SEMANTiCS – the emergence of a European Marketplace for the Semantic Web

Semantic Web Company - Mon, 2014-09-08 06:34

The SEMANTiCS conference celebrated its 10th anniversary this September in Leipzig. And this year's venue opened a new age for the Semantic Web in Europe – a marketplace for the next generation of semantic technologies was born.

As Phil Archer stated in his keynote, the Semantic Web is now mature, and academia and industry can be proud of the achievements so far. Exactly that fact provided the thread for the conference: real-world use cases demonstrated by industry representatives, new and already running applied projects presented by the leading consortia in the field, and a vivid academic community showing the next ideas and developments in the field. So this year's SEMANTiCS conference brought together the European community in Semantic Web technology – both from academia and industry.

  • Papers and Presentations: 45 (50% of them industry talks)
  • Posters: 10 (out of 22)
  • A marketplace with 11 permanent booths
  • Presented Vocabularies at the 1st Vocabulary Carnival: 24
  • Attendance: 225
  • Geographic Coverage: 21 countries

This year's SEMANTiCS was co-located and connected with a couple of other related events, like the German ISKO, the Multilingual Linked Open Data for Enterprises (MLODE 2014) and the 2nd DBpedia Community Meeting 2014. These wisely connected gatherings brought people together and allowed transdisciplinary exchange.

To recapitulate: this SEMANTiCS has opened up new perspectives on Semantic Technologies when it comes to

  • industry use
  • problem solving capacity
  • next generation development
  • knowledge about top companies, institutes and people in the sector
Categories: Blogroll

Visits: Mapping the Places you Have Visited

Information aesthetics - Thu, 2014-09-04 08:02


Visits [v.isits.in] automatically visualizes personal location histories, trips and travels by aggregating one's geotagged Flickr collection with a Google Maps history. Developed by Alice Thudt, Dominikus Baur and Prof. Sheelagh Carpendale, the map runs locally in the browser, so no sensitive data is uploaded to external servers.

The timeline visualization goes beyond the classical pin representations, which tend to overlap and are relatively hard to read. Instead, the data is shown as 'map-timelines', a combination of maps with a timeline that conveys location histories as sequences of maps: the bigger the map, the longer the stay. This way, the temporal sequence is clear, as the trip starts with the map on the left and continues towards the right.

A place slider allows adjusting the map granularity, ranging from street level to country level.

Read the academic research here [PDF], or watch an explanatory video below.

Categories: Blogroll