Semantic Software Lab
Concordia University
Montréal, Canada


Document Classification with Lucene

LingPipe Blog - Thu, 2014-04-10 09:25
As promised in my last post, this post shows you how to use Lucene’s ranked search results and document store to build a simple classifier. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene. The data and source code for this example are contained in the source […]
Categories: Blogroll

Automatically Generating HTML5 Microdata

Code from an English Coffee Drinker - Mon, 2014-04-07 19:21
The majority of the code I write as part of my day job revolves around trying to extract useful semantic information from text. Typical examples of what is referred to as "semantic annotation" include spotting that a sequence of characters represents the name of a person, organization or location and probably then linking this to an ontology or some other knowledge source. While extracting the information can be a task in itself, usually you want to do something with the information, often to enrich the original text document or make it more accessible in some way. In the applications we develop this usually entails indexing the document along with the semantic annotations to allow for a richer search experience and I've blogged about this approach before. Such an approach assumes, however, that the consumer of the semantic annotations will be a human, but what if another computer programme wants to make use of the information we have just extracted? The answer is to use some form of common machine readable encoding.

While there is already an awful lot of text in the world, more and more is being produced every day, usually in electronic form, and usually published on the internet. Given that we could never read all this text we rely on search engines, such as Google, to help us pinpoint useful or interesting documents. These search engines rely on two main things to find the documents we are interested in, the text and the links between the documents, but what if we could tell them what some of the text actually means?

In the newest version of the HTML specification (which is a work in progress usually referred to as HTML5) web pages can contain semantic information encoded as HTML Microdata. I'm not going to go into the details of how this works as there are already a number of great descriptions available, including this one.

HTML5 Microdata is, in essence, a way of embedding semantic information in a web page, but it doesn't tell a human or a machine what any of the information means, especially as different people could embed the same information using different identifiers or in different ways. What is needed is a common vocabulary that can be used to embed information about common concepts, and currently many of the major search engines have settled on using as the common vocabulary.

When I first heard about, back in 2011, I thought it was a great idea, and wrote some code that could be used to embed the output of a GATE application within a HTML page as microdata. Unfortunately the approach I adopted was, to put it bluntly, hacky. So while I had proved it was possible the code was left to rot in a dark corner of my SVN repository.

I was recently reminded of HTML5 microdata and in particular when one of my colleagues tweeted a link to this interesting article. In response I was daft enough to admit that I had some code that would allow people to automatically embed the relevant microdata into existing web pages. It wasn't long before I'd had a number of people making it clear that they would be interested in me finishing and releasing the code.

I'm currently in a hotel in London as I'm due to teach a two day GATE course starting tomorrow (if you want to learn all about GATE then you might be interested in our week long course to be held in Sheffield in June) and rather than watching TV or having a drink in the bar I thought I'd make a start on tidying up the code I started on almost three years ago.

Before we go any further I should point out that while the code works the current interface isn't the most user friendly. As such I've not added this to the main GATE distribution as yet. I'm hoping that any of you who give it a try can leave me feedback so I can finish cleaning things up and integrate it properly. Having said that here is what I have so far...

I find that worked examples usually help convey my ideas better than prose so let's start with a simple HTML page:
<title>This is a test document</title>
<h1>This is a test document</h1>
Mark Greenwood works in Sheffield for the University of Sheffield.
He is currently in a Premier Inn in London, killing time by working on a GATE plugin to allow annotations to be embedded within a HTML document using HTML microdata and the model.
</html>
As you can see, this contains a number of obvious entities (people, organizations and locations) that could be described using the vocabulary (people, organizations and locations are just some of the concepts) and which would be found by simply running ANNIE over the document.

Once we have sensible annotations for such a document, probably from running ANNIE, and a mapping between the annotations and their features and the vocabulary then it is fairly easy to produce a version of this HTML document with the annotations embedded as microdata. The current version of my code generates the following file:
<title>This is a test document</title>
<h1>This is a test document</h1>
<span itemscope="itemscope" itemtype=""><meta content="male" itemprop="gender"/><meta content="Mark Greenwood" itemprop="name"/>Mark Greenwood</span> works in <span itemscope="itemscope" itemtype=""><meta content="Sheffield" itemprop="name"/>Sheffield</span> for the <span itemscope="itemscope" itemtype=""><meta content="University of Sheffield" itemprop="name"/>University of Sheffield</span>.
He is currently in a <span itemscope="itemscope" itemtype=""><meta content="Premier" itemprop="name"/>Premier</span> Inn in <span itemscope="itemscope" itemtype=""><meta content="London" itemprop="name"/>London</span>, killing time by working on a GATE plugin to allow <span itemscope="itemscope" itemtype=""><meta content="female" itemprop="gender"/><meta content="ANNIE" itemprop="name"/>ANNIE</span> annotations to be embedded within a HTML document using HTML microdata and the model.
</html>
This works nicely and the embedded data can be extracted by the search engines, as proved using the Google rich snippets tool.

As I said earlier while the code works, the current integration with the rest of GATE definitely needs improving. If you load the plugin (details below) then right clicking on a document will allow you to Export as HTML5 Microdata... but it won't allow you to customize the mapping between annotations and a vocabulary. Currently the ANNIE annotations are mapped to the vocabulary using a config file in the resources folder. If you want to change the mapping you have to change this file. In the future I plan to add some form of editor (or at least the ability to choose a different file) as well as the ability to export a corpus not just a single file.
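To make the idea of mapping annotations to a vocabulary more concrete, here is a minimal sketch of how annotated spans could be wrapped in microdata markup. It is not the plugin's actual code: the class MicrodataSketch, its Span type, the embed method and the example.org vocabulary URLs are all hypothetical names made up for illustration, and the real plugin reads its mapping from a config file rather than a map built in code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: wraps annotated spans of plain text in HTML microdata. */
final class MicrodataSketch {

  /** A stand-off annotation: character offsets plus an annotation type. */
  static final class Span {
    final int start;
    final int end;
    final String type;

    Span(int start, int end, String type) {
      this.start = start;
      this.end = end;
      this.type = type;
    }
  }

  /** Escapes the characters that are unsafe in HTML text and attribute values. */
  static String escape(String s) {
    return s.replace("&", "&amp;").replace("<", "&lt;")
            .replace(">", "&gt;").replace("\"", "&quot;");
  }

  /**
   * Returns the text with each mapped annotation wrapped in a microdata span.
   * Annotations whose type has no mapping, or which overlap an earlier one,
   * are left as plain text.
   */
  static String embed(String text, List<Span> spans, Map<String, String> itemTypes) {
    List<Span> sorted = new ArrayList<Span>(spans);
    sorted.sort(Comparator.comparingInt(s -> s.start));

    StringBuilder out = new StringBuilder();
    int pos = 0;
    for (Span s : sorted) {
      String itemType = itemTypes.get(s.type);
      if (itemType == null || s.start < pos) {
        continue;
      }
      out.append(escape(text.substring(pos, s.start)));
      String covered = escape(text.substring(s.start, s.end));
      out.append("<span itemscope=\"itemscope\" itemtype=\"")
         .append(escape(itemType)).append("\">")
         .append("<meta content=\"").append(covered).append("\" itemprop=\"name\"/>")
         .append(covered).append("</span>");
      pos = s.end;
    }
    out.append(escape(text.substring(pos)));
    return out.toString();
  }

  public static void main(String[] args) {
    // Assumed mapping from annotation types to (made-up) vocabulary URLs.
    Map<String, String> itemTypes = new LinkedHashMap<String, String>();
    itemTypes.put("Person", "http://example.org/vocab/Person");
    itemTypes.put("Location", "http://example.org/vocab/Place");

    String text = "Mark Greenwood works in Sheffield.";
    List<Span> spans = Arrays.asList(
        new Span(0, 14, "Person"), new Span(24, 33, "Location"));
    System.out.println(embed(text, spans, itemTypes));
  }
}
```

Keeping the mapping in a plain map mirrors the idea of a replaceable config file: changing which vocabulary type an annotation maps to does not touch the code that writes the markup.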

So if you have got all the way to here then you probably want to get your hands on the current plugin, so here it is. Simply load it into GATE in the usual way and it will add the right-click menu option to documents (you'll need to use a nightly build of GATE, or a recent SVN checkout, as it uses the resource helpers that haven't yet made it into a release version).

Hopefully you'll find it useful but please do let me know what you think, and if you have any suggestions for improvements, especially around the integration with the GATE Developer GUI.
Categories: Blogroll

Quick links

Greg Linden's Blog - Wed, 2014-04-02 12:42
What has caught my attention lately:
  • Dilbert on A/B testing: "Bend to my will and choose the orange button, you mindless click-puppets!" ([1])

  • Major performance increases on smartphones are disappearing, which will slow sales and reduce revenues ([1] [2])

  • Price war in cloud services ([1])

  • On Facebook buying Oculus: "The dominant reaction to the move could be summed up in three letters: WTF" ([1] [2])

  • Remember this? "Companies could cause their stock prices to increase by simply adding an 'e-' prefix to their name or a '.com' to the end, which one author called 'prefix investing'" ([1] [2])

  • VCs favor pitches from attractive men ([1] [2])

  • "We've known for a while that email providers could look into your inbox, but the assumption was that they wouldn't" ([1] [2])

  • Bad new trend: Apps that covertly mine Bitcoins for someone else ([1] [2])

  • More companies should do this: Run large scale surveys of employees to discover what makes people happy and productive ([1])

  • Combining dissimilar fields is hard, but can also lead to discovering lots of low hanging fruit (at least from where you are standing) that no one else has picked ([1])

  • Good idea from a recent Google paper: Mine the web to build up knowledge of objects that are likely and unlikely to co-occur, then use that to accept or reject candidates during object recognition ([1] [2])

  • Cool throwback idea from a recent MSR paper: Old school circuit-switched networks in the data center using cheap commodity FPGAs ([1] [2])

  • "There doesn't need to be a protective shell around our researchers where they think great thoughts" ([1] [2])

  • Surprisingly compelling results: Generate likely 3D models of facial appearance solely from DNA ([1] [2])

  • Stem cells used to grow strong muscles that repair themselves when damaged ([1])

  • The ancient Greeks and Persians had to occasionally fight off lions ([1] [2])

  • Great visualization of conditional probability ([1])

  • Galleries of hilariously useless items ([1] [2])
Categories: Blogroll

American Physical Society Taxonomy – Case Study

Semantic Web Company - Wed, 2014-04-02 07:37

Joseph A Busch

Taxonomy Strategies has been working with the American Physical Society (APS) to develop a new faceted classification scheme.

The proposed scheme includes several discrete sets of categories called facets whose values can be combined to express concepts such as existing Physics and Astronomy Classification Scheme (PACS) codes, as well as new concepts that have not yet emerged, or have been difficult to express with the existing PACS.

PACS codes formed a single-hierarchy classification scheme, designed to assign the “one best” category that an item will be classified under. Classification schemes come from the need to physically locate objects in one dimension, for example in a library where a book will be shelved in one and only one location, among an ordered set of other books. Traditional journal tables of contents similarly place each article in a given issue in a specific location among an ordered set of other articles, certainly a necessary constraint with paper journals and still useful online as a comfortable and familiar context for readers.

However, the real world of concepts is multi-dimensional. In collapsing to one dimension, a classification scheme makes essentially arbitrary choices that have the effect of placing some related items close together while leaving other related items in very distant bins. It also has the effect of repeating the terms associated with the last dimension in many different contexts, leading to an appearance of significant redundancy and complexity in locating terms.

A faceted taxonomy attempts to identify each stand-alone concept through the term or terms commonly associated with it, and have it mean the same thing whenever used. Hierarchy in a taxonomy is useful to group related terms together; however the intention is not to attempt to identify an item such as an article or book by a single concept, but rather to assign multiple concepts to represent the meaning. In that way, related items can be closely associated along multiple dimensions corresponding to each assigned concept. Where previously a single PACS code was used to indicate the research area, now two, three, or more of the new concepts may be needed (although often a single new concept will be sufficient). Applying the new taxonomy requires a different mindset and approach from the way APS has been accustomed to working with PACS; however it also enables significant new capabilities for publishing and working with all types of content including articles, papers and websites.
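The difference between one "best" category and several assigned concepts can be sketched in a few lines of code. The example below is only an illustration of the general idea, not of the APS taxonomy or of PoolParty: the class FacetedIndexSketch, its methods, the paper ids and the concept names are all invented for the purpose.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Hypothetical sketch: items carry several concepts instead of one category code. */
final class FacetedIndexSketch {

  /** Maps each concept to the ids of the items tagged with it. */
  private final Map<String, Set<String>> itemsByConcept =
      new HashMap<String, Set<String>>();

  /** Assigns any number of concepts to one item. */
  void tag(String itemId, String... concepts) {
    for (String concept : concepts) {
      itemsByConcept.computeIfAbsent(concept, k -> new TreeSet<String>()).add(itemId);
    }
  }

  /** Returns the items that carry every one of the given concepts. */
  Set<String> itemsWithAll(String... concepts) {
    Set<String> result = null;
    for (String concept : concepts) {
      Set<String> ids =
          itemsByConcept.getOrDefault(concept, Collections.<String>emptySet());
      if (result == null) {
        result = new TreeSet<String>(ids);
      } else {
        result.retainAll(ids);
      }
    }
    return result == null ? new TreeSet<String>() : result;
  }

  public static void main(String[] args) {
    FacetedIndexSketch index = new FacetedIndexSketch();
    // Made-up items and concepts, one concept per (imagined) facet.
    index.tag("paper-1", "superconductivity", "neutron scattering");
    index.tag("paper-2", "superconductivity", "numerical simulation");
    index.tag("paper-3", "granular flow", "numerical simulation");

    // paper-2 is reachable along either dimension, or by combining both.
    System.out.println(index.itemsWithAll("superconductivity"));
    System.out.println(index.itemsWithAll("numerical simulation"));
    System.out.println(index.itemsWithAll("superconductivity", "numerical simulation"));
    System.out.println(Arrays.asList("done"));
  }
}
```

With a single-hierarchy code, paper-2 would have to sit under either its subject or its method; here it stays close to paper-1 along one dimension and to paper-3 along the other.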

To build and maintain the faceted taxonomy, APS has acquired the PoolParty taxonomy management tool. PoolParty will enable APS editorial staff to create, retrieve, update and delete taxonomy term records. The tool will support the various thesaurus, knowledge organization system and ontology standards for concepts, relationships, alternate terms etc. It will also provide methods for:

  • Associating taxonomy terms with content items, and storing that association in a content index record.
  • Automated indexing to suggest taxonomy terms that should be associated with content items, and text mining to suggest terms to potentially be added to the taxonomy.
  • Integrating taxonomy term look-up, browse and navigation in a selection user interface that, for example, authors and the general public could use.
  • Implementing a feedback user interface allowing authors and the general public to suggest terms, record the source of the suggestion, and inform the user on the disposition of their suggestion.

Arthur Smith, project manager for the new APS taxonomy, notes: “PoolParty allows our subject matter experts to immediately visualize the layout of the taxonomy, to add new concepts, suggest alternatives, and to map out the relationships and mappings to other concept schemes that we need. While our project is still in an early stage, the software tool is already proving very useful.”


Taxonomy Strategies is an information management consultancy that specializes in applying taxonomies, metadata, automatic classification, and other information retrieval technologies to the needs of business and other organizations.

The American Physical Society is a non-profit membership organization working to advance and diffuse the knowledge of physics through its outstanding research journals, scientific meetings, and education, outreach, advocacy and international activities. APS represents over 50,000 members, including physicists in academia, national laboratories and industry in the United States and throughout the world. Society offices are located in College Park, MD (Headquarters), Ridge, NY, and Washington, DC.

Categories: Blogroll

Why SKOS should be a focal point of your linked data strategy

Semantic Web Company - Mon, 2014-03-31 07:04

The Simple Knowledge Organization System (SKOS) has become one of the ‘sweet spots’ in the linked data ecosystem in recent years. Especially when semantic web technologies are being adapted for the requirements of enterprises or public administration, SKOS has played a central role in creating knowledge graphs.

In this webinar, key people from the Semantic Web Company will describe why controlled vocabularies based on SKOS play a central role in a linked data strategy, and how SKOS can be enriched by ontologies and linked data to further improve semantic information management.

SKOS unfolds its potential at the intersection of three disciplines and their methods:

  • library sciences: taxonomy and thesaurus management
  • information sciences: knowledge engineering and ontology management
  • computational linguistics: text mining and entity extraction

Linked Data based IT-architectures cover all three aspects and provide means for agile data, information, and knowledge management.

In this webinar, you will learn about the following questions and topics:

  • How does SKOS build the foundation of enterprise knowledge graphs to be enriched by additional vocabularies and ontologies?
  • How can knowledge graphs be used to build the backbone of metadata services in organisations?
  • How can text mining be used to create high-quality taxonomies and thesauri?
  • How can knowledge graphs be used for enterprise information integration?

Based on PoolParty Semantic Suite, you will see several live demos of end-user applications based on linked data and of PoolParty’s latest release which provides outstanding facilities for professional linked data management, including taxonomy, thesaurus and ontology management.

Register here:


Categories: Blogroll

Bouncing Graphic Replays Human Heartbeat Dynamics of Yesterday

Information aesthetics - Thu, 2014-03-27 15:51

One Human Heartbeat by data scientist and communicator Jen Lowe displays the dynamics of Jen's heartbeat from about one day ago.

The data is captured by a Basis B1 band, which is able to detect one's heart rate by measuring the pulse and blood flow, and then records the average heart rate for each minute. As the data currently can only be accessed via a USB connection, the data shown on the webpage is from exactly 24 hours ago.

Next to the obvious, bright red spiral of life/death in the middle of the screen, a small, numerical countdown counter reveals how many heart beats are left (at least in comparison to the US average life expectancy).

See also Heart Beat Bracelet Display and Heart Beat Water Bowl.

Categories: Blogroll

Browser Plugin Maps Your Browser History as a Favicon Tapestry

Information aesthetics - Mon, 2014-03-24 15:27

Iconic History by Carnegie Mellon University interaction design student Shan Huang is as simple as it is beautifully revealing.

The Chrome browser plugin came about as an accidental discovery while developing a quite sophisticated 3D webpage bookshelf for a particular course work assignment. It fetches the corresponding favicon for each URL that was visited, and compiles all icons into a huge tapestry, in a sequence that is identical to the historical access order. As each icon is still linked to the original URL, one is able to return to the original website.

Via FastCoDesign.

Categories: Blogroll

LEGO Calendar: a Tangible Wall-Mounted Planner that Can be Digitized

Information aesthetics - Wed, 2014-03-19 16:13

The LEGO Calendar, developed by design and invention studio Vitamins, is a wall-mounted time planner that can simply be photographed to create an online, digital counterpart.

The calendar is big, visible, tactile and flexible, as it makes the most of the tangibility of physical objects, and the ubiquity of digital platforms. It also looks neat and tidy, while keeping a certain degree of anonymity, not revealing client names or project information to casual passers-by.

See also:
. 3D Infographic Maps Built with Lego
. New York in Lego
. Lego-Based Time Tracking
. Fight Club Narrative in Lego

Categories: Blogroll

HubCab: Mapping All Taxi Trips in New York during 2011

Information aesthetics - Tue, 2014-03-18 16:19

The densely populated yet beautiful HubCab by MIT Senseable Lab is an interactive map that captures the more than 170 million unique taxi trips that were made by around 13,500 taxi cabs within the City of New York in 2011.

The map shows exactly how - and when - taxis picked up or dropped off individuals, thereby highlighting particular zones of condensed pickup and drop-off activities during specific times of day.

The map led to the development of the concept of "shareability networks", which allows for the efficient modeling and optimization of the trip-sharing opportunities. The corresponding sharing benefits consider the total fare savings to passengers, the distance savings in travelled miles, and the CO2 emission savings in kg of CO2 that result from potentially shared trips.

See also CabSpotting by Stamen Design and Tracking Taxi Flow Across the City by NYTimes.

Categories: Blogroll

CODE_n: Architectural-Scale Data Visualizations Shown at CeBit 2014

Information aesthetics - Mon, 2014-03-17 16:40

I guess that CODE_n, developed by design agency Kram/Weisshaar, is best appreciated when perceived in the flesh, that is at the Hannover Fairgrounds during CeBit 2014 in Hannover, Germany.

CODE_n consists of more than 3,000 square meters (approx. 33,000 ft2) of ink-jet printed textile membranes, stretching more than 260 meters of floor-to-ceiling tera-pixel graphics.

The 12.5 terapixel, 90-meter long wall-like canopy titled "Retrospective Trending", shows over 400 lexical frequency timelines ranging from the years 1800 to 2008, each generated using Google's Ngram tool. The hundreds of search terms relate to ethnographic themes of politics, economics, engineering, science, technology, mathematics, and philosophy, resulting in the output of historical trajectories of word usage over time.

The 6.2 terapixel "Hydrosphere Hyperwall" is a visualization of the global ocean as dynamic pathways, polychrome swathes of sea climate, data-collecting swarms of mini robots and sea animals, as well as plumes of narrow current systems. NASA's ECCO2 maps were interwoven with directional arrows that specify wind direction and data vectors that represent buoys, cargo floats, research ships, wave gliders, sea creatures and research stations.

Finally, the 6.6 terapixel "Human Connectome" is a morphological map of the human brain. Consisting of several million multi-coloured fibre bundles and white matter tracts that were captured by diffusion-MRIs, the structural descriptions of the human mind were generated at 40 times the scale of the human body. The 3D map of human neural connections visualizes brain dynamics on an ultra-macro scale as well as the infinitesimal cell-scale.

The question remains... what will they do with these textiles after CeBit is over?


Photos by David Levene.

Categories: Blogroll

At Sixes And Sevens

Code from an English Coffee Drinker - Wed, 2014-03-12 15:46
At work we are slowly getting ready for a major new release of GATE. In preparation for the release I've been doing a bit of code cleanup and upgrading some of the libraries that we use. After every change I've been running the test suite and unfortunately some of the tests would intermittently fail. Given that none of the other members of the team had reported failing tests and that they were always running successfully on our Jenkins build server I decided the problem must be something related to my computer. My solution then was simply to ignore the failing tests as long as they weren't relevant to the code I was working on, and then have the build server do the final test for me. This worked, but it was exceedingly frustrating that I couldn't track down the problem. Yesterday I couldn't ignore the problem any longer because the same tests suddenly started to randomly fail on the build server as well as my computer and so I set about investigating the problem.

The tests in question are all part of a single standard JUnit test suite that was originally written back in 2001 and which have been running perfectly ever since. Essentially these tests run the main ANNIE components over four test documents checking the result at each stage. Each component is checked in a different test within the suite. Now if you know anything about unit testing you can probably already hear alarm bells ringing. For those of you that don't know what unit testing is, essentially each test should check a single component of the system (i.e. a unit) and should be independent from every other test. In this instance while each test checked a separate component, each relied on all the previous tests in the suite having run successfully.

Now while dependencies between tests aren't ideal, that still doesn't explain why they should have worked fine for twelve years but were now failing. And why did they start failing on the build server long after they had been failing on my machine? I eventually tracked the change that caused them to fail when run on the build server back to the upgrade from version 4.10 to 4.11 of JUnit but even with the help of a web search I couldn't figure out what the problem was.

Given that I'd looked at the test results from my machine so many times and not spotted any problems I roped in some colleagues to run the tests for me on their own machines and send me the results to see if I could spot a pattern. The first thing that was obvious was that when using version 4.10 of JUnit the tests only failed for those people running Java 7. GATE only requires Java 6 and those people still with a Java 6 install, which includes the build server (so that we don't accidentally introduce any Java 7 dependencies), were not seeing any failures. If, however, we upgraded JUnit to version 4.11 everyone started to see random failures. The other thing that I eventually spotted was that when the tests failed, the logs seemed to suggest that they had been run in a random order which, given the unfortunate links between the tests, would explain why they then failed. Armed with all this extra information I went back to searching the web and this time I was able to find the problem and an easy solution.

Given that unit tests are meant to be independent from one another, there isn't actually anything within the test suite that stipulates the order in which they should run, but it seems that it always used to be the case that the tests were run in the order in which they were defined in the source code. The tests are extracted from the suite by looking for all methods that start with the word test, and these are extracted from the class definition using the Method[] getDeclaredMethods() method from java.lang.Class. The documentation for this method includes the following description:

Returns an array of Method objects reflecting all the methods declared by the class or interface represented by this Class object. This includes public, protected, default (package) access, and private methods, but excludes inherited methods. The elements in the array returned are not sorted and are not in any particular order.
This makes it more than clear that we should never have assumed that the tests would be run in the same order they were defined, but it turns out that this was the order in which the methods were returned when using the Sun/Oracle versions of Java up to and including Java 6 (update 30 is the last version I've tested). I've written the following simple piece of code that shows the order of the extracted tests as well as info on the version of Java being used:
import java.lang.reflect.Method;

public class AtSixesAndSevens {
  public static void main(String args[]) {
    // Report the Java version details in the same layout as "java -version"
    System.out.println("java version \"" +
        System.getProperty("java.version") + "\"");
    System.out.println(System.getProperty("java.runtime.name") +
        " (build " + System.getProperty("java.runtime.version") + ")");
    System.out.println(System.getProperty("java.vm.name") +
        " (build " + System.getProperty("java.vm.version") + " " +
        System.getProperty("java.vm.info") + ")\n");

    // Print the test methods in the order in which reflection returns them
    for(Method m : AtSixesAndSevens.class.getDeclaredMethods()) {
      if(m.getName().startsWith("test")) {
        System.out.println(m.getName());
      }
    }
  }

  public void testTokenizer() {}
  public void testGazetteer() {}
  public void testSplitter() {}
  public void testTagger() {}
  public void testTransducer() {}
  public void testCustomConstraintDefs() {}
  public void testOrthomatcher() {}
  public void testAllPR() {}
}
Running this on the build server gives the following output:
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02 mixed mode)

testAllPR
While running it on my machine results in a random ordering of the test methods, as you can see here:
java version "1.7.0_51"
OpenJDK Runtime Environment (build 1.7.0_51-b00)
OpenJDK 64-Bit Server VM (build 24.45-b08 mixed mode)

testAllPR
Interestingly it would seem that the order only changes when the class is re-compiled, which suggests that the ordering may be related to how the methods are stored in the class file, but understanding the inner workings of the class file format is well beyond me. Even more interestingly it seems that even with Java 6 you can see a random ordering if you aren't using a distribution from Sun/Oracle, as here is the output from running under the Java 6 version of OpenJDK:
java version "1.6.0_30"
OpenJDK Runtime Environment (build 1.6.0_30-b30)
OpenJDK 64-Bit Server VM (build 23.25-b01 mixed mode)

testSplitter
So this explains why switching from Java 6 to Java 7 could easily cause these related tests to fail, but why should upgrading from JUnit version 4.10 to 4.11 while staying with Java 6 cause a problem?

It turns out that in the new version of JUnit the developers decided to change the default behaviour away from relying on the non-deterministic method ordering provided by Java. Their default approach is now to use a deterministic ordering to guarantee the tests are always run in the same order; as far as I can tell this orders the methods by sorting on the hashCodes of the method names. While this may at least remove the randomness from the test order it doesn't keep them in the same order they are defined in the source file, and so our tests were always failing under JUnit 4.11. Fortunately the developers also allow you to add a class level annotation to force the methods to be ordered alphabetically. I've now renamed the tests so that when sorted alphabetically they are run in the right order (by adding a three digit number after the initial test in their name), and the class definition now looks like:
@FixMethodOrder(MethodSorters.NAME_ASCENDING)
public class TestPR extends TestCase {
}
So I guess there are two morals to this story. Firstly, unit tests are called unit tests for a reason and they really should be independent of one another. More importantly, reading the documentation for the language or library you are using, and not making assumptions about how it works (especially when the documentation tells you not to rely on something always being true), would make life easier.
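The effect of ordering tests by name can be tried out without JUnit on the classpath. The following is only a sketch of the idea, using plain reflection and a sort: it is not JUnit's implementation, and the class NameOrderSketch, the helper orderedTestNames and the numbered method names are hypothetical stand-ins for the renamed tests.

```java
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch: a name-based test order that does not depend on the JVM. */
final class NameOrderSketch {

  // Deliberately declared out of order; the number after "test" fixes the run order.
  public void test002Gazetteer() {}
  public void test001Tokenizer() {}
  public void test003Splitter() {}

  /** Collects the names of the declared test methods and sorts them alphabetically. */
  static List<String> orderedTestNames(Class<?> testClass) {
    List<String> names = new ArrayList<String>();
    // getDeclaredMethods() makes no promise about order, so we impose one ourselves.
    for (Method m : testClass.getDeclaredMethods()) {
      if (m.getName().startsWith("test")) {
        names.add(m.getName());
      }
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    for (String name : orderedTestNames(NameOrderSketch.class)) {
      System.out.println(name);
    }
  }
}
```

Whatever order reflection happens to return on a given JVM, the sorted list comes out the same, which is why a numeric prefix in the method names is enough to pin down the sequence.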
Categories: Blogroll

HereHere: Mapping the Concerns of NY Citizens as an Iconographic Map

Information aesthetics - Tue, 2014-03-11 16:36

Here Here, developed by Future Social Experiences (FuSE) Labs at Microsoft Research, expresses neighborhood-specific public data by mapping it as text labels and cartoon-like iconography.

The data is based on New York City's 311 non-emergency data stream, consisting of the concerns and issues as reported by New York residents via email, phone calls, or text messages. Each day, HereHere pulls this 311 data for each neighborhood and identifies the most compelling, important 311 request types, after which the system generates appropriate cartoons and text that represent a neighborhood's typical reactions.

The iconographic communication approach is coined as 'characterization', and hypothesized to bring immediacy and a human scale to an otherwise overwhelming amount of abstract information. Next to developing an intriguing publicly available map, FuSE Labs wants to understand how this characterization can be a tool for data engagement, and aims to measure the impact of how people relate to their community when they can interact with data in this way.

More detailed information is also available here. Via Engadget.

Categories: Blogroll

Sorting: Understanding How Famous Sorting Algorithms Work

Information aesthetics - Tue, 2014-03-11 16:10

There are quite a few visualizations of sorting algorithms out there. "Sorting", developed by Nokia data visualization designer Carlo Zapponi, brings some innovation to this field by tackling the issue educationally (explaining each algorithm step by step) as well as artistically.

The project was initiated to create visual representations of sorting algorithms with the hope of discovering patterns in their visual footprints. It provides an interactive walk-through that guides the reader step by step through the process of ordering a list of integers for a selection of sorting algorithms.
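A step-by-step walk-through of this kind needs the intermediate states of the list, not just the sorted result. As a rough sketch of how such states might be captured (this is not taken from the project itself; insertion sort is simply a convenient choice, and SortTraceSketch and insertionSortSteps are made-up names), one can record a snapshot after every pass:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: records the state of a list after each insertion sort pass. */
final class SortTraceSketch {

  /** Sorts a copy of the input and returns the initial state plus one snapshot per pass. */
  static List<int[]> insertionSortSteps(int[] input) {
    int[] a = input.clone();
    List<int[]> steps = new ArrayList<int[]>();
    steps.add(a.clone());
    for (int i = 1; i < a.length; i++) {
      int key = a[i];
      int j = i - 1;
      // Shift larger elements one slot to the right to make room for the key.
      while (j >= 0 && a[j] > key) {
        a[j + 1] = a[j];
        j--;
      }
      a[j + 1] = key;
      steps.add(a.clone());
    }
    return steps;
  }

  public static void main(String[] args) {
    // Each printed line is one frame that a visualization could draw.
    for (int[] step : insertionSortSteps(new int[] {3, 1, 2})) {
      System.out.println(Arrays.toString(step));
    }
  }
}
```

Each snapshot corresponds to one frame of a visualization, so plotting the snapshots side by side gives the kind of visual footprint the project is looking for.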

Categories: Blogroll

The New York ML Symposium, take 2

Machine Learning Blog - Tue, 2014-03-11 14:30

The 2014 New York Machine Learning Symposium is finally happening on March 28th at the New York Academy of Sciences. Every invited speaker interests me personally.

We’ve been somewhat disorganized in advertising this. As a consequence, anyone who has not submitted an abstract but would like to do so may send one directly to me (title NYASMLS) by Friday March 14. I will forward them to the rest of the committee for consideration.

Categories: Blogroll

Lucene 4 Essentials for Text Search and Indexing

LingPipe Blog - Sat, 2014-03-08 07:22
Here’s a short-ish introduction to the Lucene search engine which shows you how to use the current API to develop search over a collection of texts. Most of this post is excerpted from Text Processing in Java, Chapter 7, Text Search with Lucene. Lucene Overview Apache Lucene is a search library written in Java. It’s […]
Categories: Blogroll

Exploring Ball Locations and Player Behaviors in Basketball

Information aesthetics - Wed, 2014-03-05 16:06

"Game on!" by Fathom Information Design is an exploratory visualization prototype that allows users to parse through a basketball game's data, to investigate the behaviors and patterns in terms of the statistics and locations of players.

Based on a vast collection of performance statistics as well as real-time tracked ball and player positions, the tool allows one to explore some potentially interesting patterns, such as each player's recurrent locations, standings, and alignments according to their team position, the concentration of movement around the 3-point mark, any personally preferred shooting spots, or the fact that players tend to transition into offence along the sides of the court.

The actual data was acquired by linking noteworthy game event markings to a smart computer-vision algorithm that analyzes top-down video footage, which results in a large set of X, Y, Z positions for each player and the ball for every video frame.

See also:
. The 3D Trajectories of the Tennis Ball during the Final ATP Matches
. The NYTimes Visualization of Live World Cup Football Statistics
. VisualSport: Social Visualization of (Live) World Cup Football Statistics
. Adidas Match Tracker: Experience Soccer Games Like a Data Geek
. Guardian Interactive Chalkboards: Map and Share Soccer Game Events

Categories: Blogroll

More quick links

Greg Linden's Blog - Mon, 2014-03-03 19:47
More of what caught my attention recently:
  • Cool new tech, especially for mobile, detecting gesture movements from the changes they make to ambient wireless signals, uses a fraction of the power of other techniques ([1] [2] [3])

  • Also for mobile: "The big trick here is ... two [camera] lenses with two different focal lengths. One lens is wide-angle, while the other is at 3x zoom ... magnify more distant subjects ... improved low-light performance ... noise is reduced ... just as we would if we had one big imaging sensor instead of two little ones ... [and] depth analysis allows ... [auto] blurring out of backgrounds in portrait shots, quicker autofocus, and augmented reality." ([1])

  • "These are not the first artificial muscles to have been created, but they are among the first that are inexpensive and store large amounts of energy" ([1])

  • "Tesla is a glimpse into a future where cars and computers coexist in seamless harmony" ([1])

  • "Fields from anthropology to zoology are becoming information fields. Those who can bend the power of the computer to their will – computational thinking but computer science in greater depth – will be positioned for greater success than those who can’t." ([1] [2])

  • The CEOs of Amazon, Facebook, Google, Microsoft, Twitter, Netflix, and Yahoo have CS degrees

  • Details on fixing What's so impressive is how much they changed the culture in such a short time, from a hierarchical structure where no one would take any responsibility to an egalitarian one where everyone was focused on solving problems. ([1])

  • Clever idea, advertise to find experts on the Web and then get them to answer questions for free by enticing them into playing a little quiz game ([1] [2])

  • "A key to Google’s epic success was the discipline the company maintained around its hiring ... During his first seven years, the executive team met every week to review every single hiring candidate." ([1] [2])

  • "Peter Norvig, Google's research director, said recently that the company employs 'less than 50% but certainly more than 5%' of the world's leading experts on machine learning" ([1])

  • Yahoo is trying to rebuild its research group, which was destroyed by its previous CEO ([1] [2] [3] [4] [5] [6])

  • Software increasingly needs to be aware of its power consumption, the cost of power, and the availability of power, and be able to reduce its power consumption when necessary ([1] [2])

  • "Viewers with a buffer-free experience watch 226% more and viewers receiving better picture quality watch 25% longer" ([1])

  • Gaming the most popular lists in the app stores: "Total estimated cost to reach the top ten list: $96,000" ([1] [2])

  • "The Rapiscan 522 B x-ray system used to scan carry-on baggage in airports worldwide ... runs on the outdated Windows 98 operating system, stores user credentials in plain text, and includes a feature called Threat Image Projection used to train screeners by injecting .bmp images of contraband ... [that] could allow a bad guy to project phony images on the X-ray display." ([1])

  • "It would appear that a surprising number of people use webcam conversations to show intimate parts of their body to the other person." ([1])

  • "Ohhh there's not another cable company, is there? Oh that's right we're the only one in town." ([1])

  • It "sounds like it's straight out of a sci-fi horror flick: they thawed some 30,000-year-old permafrost and allowed any viruses present to infect some cells" ([1])

  • Very funny if you (or your kids) are a fan of Portal, educational too, and done by NASA ([1])

  • NPR's "Wait Wait" did a segment on Amazon's "Customers who bought this", very funny ([1])
Categories: Blogroll

Google Maps Gallery Highlights Specialized Maps based on Public Data

Information aesthetics - Mon, 2014-03-03 16:45

Google recently launched a dedicated Maps Gallery to showcase a collection of hand-picked maps from several preferred organizations, such as the National Geographic, the U.S. Geological Survey or the City of Edmonton. The goal is that, in the future, people will find most maps not through the gallery, but via the standard search results.

The included maps range from the somewhat unappealing population statistics map based on data from the World Bank, through an intriguing overview map of all fast-food locations in the US, to the beautifully rendered Dominican Republic AdventureMap by the National Geographic.

Participants who apply for the program and are selected by Google receive free access to the enterprise version of Google Maps Engine, which includes specific connectors that facilitate easy import of public data.

Via TechCrunch, The Verge, CNET, and many others.

Categories: Blogroll

On The Role Data Visualization Plays in the Scientific Process

Information aesthetics - Mon, 2014-03-03 16:09

In a new exhibition titled Beautiful Science: Picturing Data, Inspiring Insight, the British Library pays homage to the important role data visualization plays in the scientific process.

The exhibition can be visited from 20 February until 26 May 2014, and contains works ranging from John Snow's plotting of the 1854 London cholera infections on a map to colourful depictions of the Tree of Life.

"Science is beautiful... but we can also bring an aesthetic to it which makes it so much more impactful and can allow your ideas to have a much greater reach"

In a Nature Video, curator Johanna Kieniewicz explores some of the beautiful examples of visualizations that are exhibited.

Categories: Blogroll

SEMANTiCS 2014: Call for Industry Presentations

Semantic Web Company - Mon, 2014-03-03 03:44

SEMANTiCS 2014 will take place in Leipzig (Germany) on September 4-5 this year. The International Conference on Semantic Systems will be co-located with several workshops and other meetings, e.g. the 2nd DBpedia community meeting.

The SEMANTiCS conference (formerly ‘I-Semantics’) focuses on transfer and industry-related applications of semantic systems and linked data.
Here are some of the options for end-users, vendors and experts to get involved (besides participating as a regular attendee and the option to submit a paper):

  1. Submit an Industry Presentation:
  2. Sponsoring / Marketplace / Exhibition:
  3. Become a reviewer:

The organizing committee would be happy to have you on board at SEMANTiCS 2014 in Leipzig.

Categories: Blogroll