Semantic Software Lab, Concordia University, Montréal, Canada
### Moving house (reMaker unCompany)

Computing Text - Mon, 2013-12-09 05:28
The bags are packed, the electric is turned off and the van is rammed to the gills...

Metaphorically speaking, that is. I've revamped my personal pages, and set up a blog there -- hamish.gate.ac.uk -- so head on over. There's coffee on the stove.

The new house? A bijou residence on the slopes of Mount Physical, with the catchy title of reMaker unCompany, to put a roof over some stuff I've been working on in relation to the new world of digital manufacture, single-board computers and the like.

See you soon.

### Immersion Reveals How People are Connected via Email

Information aesthetics - Mon, 2013-12-02 14:14

Immersion [mit.edu] is a quite revealing visualization tool of which the NSA - or your own national security agency - can only be jealous... Developed by MIT students Daniel Smilkov, Deepak Jagdish and César Hidalgo, Immersion generates a time-varying network visualization of all your email contacts, based on how you historically communicated with them.

Immersion is able to aggregate and analyze the "From", "To", "Cc" and "Timestamp" data of all the messages in any (authorized) Gmail, MS Exchange or Yahoo email account. It then singles out the 'collaborators': people with whom one has exchanged at least 3 email messages in each direction, both sent and received.
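As a rough illustration of the 'collaborator' rule described above, the sketch below counts messages per contact in each direction and keeps only the contacts that reach the threshold both ways. The `Message` class, the `collaborators` method and the sample addresses are hypothetical and are not taken from Immersion's own code.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the "collaborator" filter; not Immersion's actual code.
final class CollaboratorFilter {

    /** Minimal stand-in for the From/To header data of a single message. */
    static final class Message {
        final String from;
        final String to;

        Message(String from, String to) {
            this.from = from;
            this.to = to;
        }
    }

    /**
     * Returns the contacts to whom the owner has sent, and from whom the owner
     * has received, at least minEach messages.
     */
    static Set<String> collaborators(String owner, List<Message> messages, int minEach) {
        Map<String, Integer> sent = new HashMap<>();
        Map<String, Integer> received = new HashMap<>();
        for (Message m : messages) {
            if (m.from.equals(owner)) {
                sent.merge(m.to, 1, Integer::sum);
            } else if (m.to.equals(owner)) {
                received.merge(m.from, 1, Integer::sum);
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : sent.entrySet()) {
            int receivedCount = received.getOrDefault(entry.getKey(), 0);
            if (entry.getValue() >= minEach && receivedCount >= minEach) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<Message> mailbox = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            mailbox.add(new Message("me@example.org", "ann@example.org"));
            mailbox.add(new Message("ann@example.org", "me@example.org"));
            mailbox.add(new Message("me@example.org", "bob@example.org"));
        }
        mailbox.add(new Message("bob@example.org", "me@example.org"));
        // Only ann meets the threshold in both directions.
        System.out.println(collaborators("me@example.org", mailbox, 3));
    }
}
```

A real implementation would also need to handle "Cc" recipients and multiple addresses per person, which this sketch leaves out.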

As a result, people who appeared together in the headers of email messages are visually connected and placed closer together than others. A timeline slider at the bottom of the screen then makes it possible to investigate any changes in social email connectivity over time. Individual names can be selected for further detailed exploration.

For those too paranoid to trust 3 MIT students with their personal data, there is a demo account available here.

### The Border of Search

Data Mining Blog - Mon, 2013-12-02 00:24

There are still many search scenarios where this is the expectation of the users. One that comes (frequently) to mind is finding solutions to coding issues or API usage on coding forums and question answering sites like StackOverflow.com.

However, there is a relatively new class of queries for which search engines like Bing and Google provide satisfaction immediately. For example, a query like 'who wrote To Kill a Mockingbird' produces the following results:

There are many other examples of this type of direct question answering.

The question is: how will user expectations change, and how will that affect the long tail of sites on the web? For example, the following query does not yet produce a direct answer in any search engine: 'How much is 1 pound from 1811 worth in current money?' (a question a daughter might ask a father when watching a period drama on the beeb). A search engine can, however, and will provide a link to a site which can answer the question (e.g. Measuring Worth).

I would argue that this is directly where we are headed.

Related post: Did Web Search Kill Artificial Intelligence?

### NIPS tutorials and Vowpal Wabbit 7.4

Machine Learning Blog - Sun, 2013-12-01 16:53

At NIPS I’m giving a tutorial on Learning to Interact. In essence this is about dealing with causality in a contextual bandit framework. Relative to previous tutorials, I’ll be covering several new results that changed my understanding of the nature of the problem. Note that Judea Pearl and Elias Bareinboim have a tutorial on causality. This might appear similar, but is quite different in practice. Pearl and Bareinboim’s tutorial will be about the general concepts while mine will be about total mastery of the simplest nontrivial case, including code. Luckily, they have the right order. I recommend going to both.

I also just released version 7.4 of Vowpal Wabbit. When I was a frustrated learning theorist, I did not understand why people were not using learning reductions to solve problems. I’ve been slowly discovering why with VW, and addressing the issues. One of the issues is that machine learning itself was not automatic enough, while another is that creating a very low overhead process for doing learning reductions is vitally important. These have been addressed well enough that we are starting to see compelling results. Various changes:

• The internal learning reduction interface has been substantially improved. It’s now pretty easy to write a new learning reduction. binary.cc provides a good example. This is a very simple reduction which just binarizes the prediction. More improvements are coming, but this is good enough that other people have started contributing reductions.
• Zhen Qin had a very productive internship with Vaclav Petricek at eharmony resulting in several systemic modifications and some new reductions, including:
1. A direct hash inversion implementation for use in debugging.
2. A holdout system which takes over for progressive validation when multiple passes over data are used. This keeps the printouts ‘honest’.
3. An online bootstrap mechanism which efficiently provides some understanding of prediction variations and which can sometimes effectively trade computational time for increased accuracy via ensembling. This will be discussed at the biglearn workshop at NIPS.
4. A top-k reduction which chooses the top-k of any set of base instances.
• Hal Daume has a new implementation of Searn (and Dagger, the codes are unified) which makes structured prediction solutions far more natural. He has optimized this quite thoroughly (exercising the reduction stack in the process), resulting in this pretty graph.

Here, CRF++ is a commonly used conditional random field code, SVMstruct is an SVM-style approach to classification, and CRF SGD is an online learning CRF approach. All of these methods use the same features. Fully optimized code is typically rough, but this one is less than 100 lines.

I’m trying to put together a tutorial on these things at NIPS during the workshop break on the 9th and will add details as that resolves for those interested enough to skip out on skiing.

Edit: The VW tutorial will take place during the break at the big learning workshop from 1:30pm – 3pm at Harveys Emerald Bay B.

### Why Aren't They Spamming The Chinese?

Code from an English Coffee Drinker - Sat, 2013-11-30 07:02
Whilst trying to drink my first cup of coffee this morning, I was rudely interrupted by click-jacking malware affecting my wife’s computer. All she was trying to do was look at some Google search results, but clicking on them would take her to a suspicious looking shopping search site. From a little bit of Googling it looked as if it might be a real nasty trojan which would have taken ages to clean up. Fortunately it turned out that all the pages she was having the problem with had been infected with the same bit of malicious JavaScript. I'm not sure how (probably through a malicious banner ad or something) but a reference to the following JavaScript had been inserted at the very end (after the </html>) of each affected page:

    if (navigator.language)
      var language = navigator.language;
    else
      var language = navigator.browserLanguage;

    if(language.indexOf('zh') == -1) {
      var where = document.referrer;
      if (regexp.test(where)) { // regexp, the search engine pattern, is not shown here
        window.location.href="http://www.bbc.co.uk/news";
      }
    }

To make the script easier to read I've reformatted it, and replaced the redirect with a safe URL (who doesn't trust the BBC?) rather than giving the spammers free advertising, but I haven't changed any of the functional aspects of the script.

Essentially all it does is check the URL that you were on when you clicked the link leading you to the current page, and if that looks like a search results page from one of 14 different companies, then it redirects you. The regular expression it uses to check the referring page is simple yet effective and will catch any of the sub-domains of these search services as well. What I find weird is why the script checks the language of the browser.

The first four lines of the script get the language the browser is using. There are two ways of doing this, depending on which browser you are using, hence the if statement. On my machine this gets me en-US (which means I need to figure out why it has switched from en-GB, which is what I thought I'd set it to). Line 6 then checks to make sure the language doesn't include the string zh, which according to Wikipedia is Chinese. I'm assuming that the spammers behind the script are Chinese and don't want to be inconvenienced by their own script, but it seems odd, especially when you consider that at least one of the search engines covered by the regular expression (118114 on many different top-level domains) seems to be a Chinese site.
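To make the two checks concrete, here is a minimal Java sketch of the same decision logic: skip browsers whose language contains zh, then test the referrer against a pattern that also accepts sub-domains. The class name, method name and the three search engine names in the pattern are illustrative assumptions; the pattern is a stand-in and not the regular expression the spammers actually used.

```java
import java.util.regex.Pattern;

// Hypothetical re-creation of the script's decision logic; the pattern below
// is an illustrative stand-in, not the spammers' actual regular expression.
final class ReferrerCheck {

    // Matches http(s) URLs whose host is one of the listed names, optionally
    // preceded by sub-domains and followed by any top-level domain.
    private static final Pattern SEARCH_REFERRER = Pattern.compile(
        "^https?://([^/]+\\.)?(google|bing|yahoo)\\.[a-z.]+(/.*)?$",
        Pattern.CASE_INSENSITIVE);

    /** Returns true when the script described above would redirect the visitor. */
    static boolean wouldRedirect(String browserLanguage, String referrer) {
        if (browserLanguage == null || referrer == null) {
            return false;
        }
        // Same test as language.indexOf('zh') == -1 in the JavaScript.
        if (browserLanguage.contains("zh")) {
            return false;
        }
        return SEARCH_REFERRER.matcher(referrer).matches();
    }

    public static void main(String[] args) {
        String fromSearch = "https://www.google.co.uk/search?q=coffee";
        System.out.println(wouldRedirect("en-US", fromSearch));
        System.out.println(wouldRedirect("zh-CN", fromSearch));
        System.out.println(wouldRedirect("en-US", "http://example.com/page"));
    }
}
```

An empty referrer never matches the pattern, which is why suppressing the referer header defeats this kind of script.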

Looking at this script there is of course another way to defeat it, other than disabling JavaScript. One of the privacy or security options in most browsers concerns the referer (yes I know it is spelt wrong, but that is the correct spelling in the HTTP spec) header. Essentially this header tells a web server the page you were on when you clicked the link leading to the page you are requesting. Some sites use this to provide functionality, so disabling it can cause problems, but it does protect against scripts like this one. Because it can cause problems it's often an advanced setting; for example, here are the details for Firefox.
### Game Maven: Learn to code by writing games

Greg Linden's Blog - Wed, 2013-11-27 17:26
I just launched Game Maven from Crunchzilla. It’s a new interactive tutorial -- part of the series that includes Code Monster and Code Maven -- that is a step-by-step walkthrough of writing the code for three casual video games.

The games themselves are really fun. One is a simple vaguely Asteroids-like base defense game. The second is a sort of Angry Birds-like cannon game complete with physics and particle system effects. The third is a platformer in the spirit of Mario Bros with auto-generated infinite levels.

Game Maven is an interactive tutorial using live code. Players learn step-by-step how to build each game, getting a chance to customize and play with the games as they build them. Game Maven is an immersive educational experience with a focus on action over explanation. Players build right away with code, learning about coding by coding. Be brave, make mistakes, try things, and see what happens.

Game Maven assumes some programming experience and an interest in writing games. It’s for adults and older teens (age 16+). It’s designed for a variety of motivation levels, from those that just click through the lessons and skip most of the work, to those that do every lesson, understand every line of code, and spend hours customizing the games. Everyone learns from the experience.

If you have a teenager interested in coding and writing games, please let them know about Game Maven. Please tell your friends with teenagers about Game Maven. If you want to play with writing games yourself, go give it a try. And please let me know what you think!
### McKinsey Urban World App Reveals City Statistics on the Smart Phone

Information aesthetics - Wed, 2013-11-27 15:56

The Urban World App [mckinsey.com] by global management consulting firm McKinsey only runs on the iOS and Android operating systems.

The smoothly animated application consists of an interactive, 3D globe that shows the population, income and GDP statistics of about 2,600 different cities around the world. Two separate cities can be selected and compared in terms of their historical 2010 data, as well as a possible future scenario for the year 2025. Two alternative visualizations focus on revealing the evolution of GDP and the global economic center of gravity over the last 2,000 years.

### Linked data based search: Make use of linked data to provide means for complex queries

Semantic Web Company - Wed, 2013-11-27 12:02

Two live demos of PoolParty Semantic Integrator demonstrate new ways to retrieve information based on linked data technologies

Linked data graphs can be used to annotate and categorize documents. By transforming text into RDF graphs and linking them with LOD sources like DBpedia, Geonames, MeSH etc., completely new ways of querying large document repositories become possible.

An online demo illustrates those principles: Imagine you were an information officer at the Global Health Observatory of the World Health Organisation. You inform policy makers about the global situation in specific disease areas to direct support to the required health support programs. For your research you need data about disease prevalence in relation to socioeconomic factors.

Datasets and technology

About 160,000 scientific abstracts from PubMed, linked to three different disease categories, were collected. Abstracts were automatically annotated with PoolParty Extractor, based on terms from the Medical Subject Headings (MeSH) and Geonames that are organized in a SKOS thesaurus, managed with PoolParty Thesaurus Server. Abstracts were transformed to RDF and stored in a Virtuoso RDF store. In the next step, it is easy to combine these data sets within the triple store with large linked data sources like DBpedia, Geonames or Yago. The use of linked data makes it easy to, for example, group annotated countries by the Human Development Index (HDI). The hierarchical structure of the thesaurus was used to collect all concepts that are connected to a specific disease.

This demo uses the sgvizler library to visualize SPARQL results. AngularJS was used to dynamically replace variables in SPARQL query templates.

Another example of linked data based search in the field of renewable energy can be tried out here.

### A Game Tree of Rafael Nadal's Tennis Matches in Season 2013

Information aesthetics - Tue, 2013-11-26 15:49

The Game Tree Interactive [gamesetmap.com] visualization by Damien Demaj analyzes the more than 600 service games played by Rafael Nadal during the Grand Slams, Masters 1000 and World Tour Finals in the year 2013. In particular, this visual summary shows where Nadal's history breaking season was won and rarely lost, point by point.

Each point is color-coded to reflect the momentum in each game: blue represents positive momentum, while red denotes negative momentum (and white is neutral). Accordingly, a match that was dominated by Nadal is highlighted with a thicker, outside flow that passes through the 'positive' points of the Game Tree, while the opposite shows a flow that enters more 'neutral' or outside 'negative' points.

### Where Students Study Abroad via the European Erasmus Network

Information aesthetics - Mon, 2013-11-25 14:36

Erasmus is a European Union student exchange program that allows more than 75,000 European students to travel abroad and study in a country different from their home. The network consists of over 2,900 universities, which are interconnected by over 90,000 partnerships.

Christian Gross and Sebastian Sadowski have implemented an interactive visualization of the network [ahoi.in], revealing all the academic institutes that participate in the program, in addition to the student flows of the 'top 100' receiving and sending universities.

The second series of visualizations reveals the reasons why students go abroad, on a per country level. This information is contrasted with the average temperature, Consumer Price Index (CPI) and distance to the home country.

Machine Learning Blog - Thu, 2013-11-21 13:13

I was not as personally close to Ben as Sam, but the level of tragedy is similar and I can’t help but be greatly saddened by the loss.

Various news stories have coverage, but the synopsis is that he had a heart attack on Sunday and is survived by his wife Anat and daughter Aviv. There is discussion of creating a memorial fund for them, which I hope comes to fruition, and plan to contribute to.

I will remember Ben as someone who thought carefully and comprehensively about new ways to do things, then fought hard and successfully for what he believed in. It is an ideal we strive for, that Ben accomplished.

Machine Learning Blog - Sat, 2013-11-09 10:54

Several strong graduates are on the job market this year.

• Alekh Agarwal made the most scalable public learning algorithm as an intern two years ago. He has a deep and broad understanding of optimization and learning as well as the ability and will to make things happen programming-wise. I’ve been privileged to have Alekh visiting me in NY where he will be sorely missed.
• John Duchi created Adagrad, which is a commonly helpful improvement over online gradient descent that is seeing wide adoption, including in Vowpal Wabbit. He has a similarly deep and broad understanding of optimization and learning with significant industry experience at Google. Alekh and John have often coauthored.
• Stephane Ross visited me a year ago over the summer, implementing many new algorithms and working out the first scale free online update rule, which is now the default in Vowpal Wabbit. Stephane is not on the market: Google robbed the cradle successfully. I’m sure that he will do great things.
• Anna Choromanska visited me this summer, where we worked on extreme multiclass classification. She is very good at focusing on a problem and grinding it into submission both in theory and in practice—I can see why she wins awards for her work. Anna’s future in research is quite promising.

I also wanted to mention some postdoc openings in machine learning.

### Serializing To A Human Interface Device

Code from an English Coffee Drinker - Sat, 2013-11-09 10:33
If you've read my previous post you'll know that I've been looking at a cheap and simple way of adding serial communication to a breadboard Arduino clone (such as this one). To summarise the situation so far: adding true RS-232 serial communication is both expensive and difficult, as the required part is only available as a surface mount component, but I discovered V-USB, which allows me to emulate low speed USB devices. The end result was that I managed to use V-USB to emulate a USB keyboard. Being able to pass data from the Arduino to the PC by simply emulating key presses is useful, but a) it is rather slow, b) different keyboard mappings lead to different characters being typed, and, more importantly, c) it doesn't allow me to send data to the Arduino. So on we go...

Let's start with what I haven't managed to achieve: a USB CDC ACM device for RS-232 communication. Unfortunately CDC ACM devices require bulk endpoints (these allow for large sporadic transfers using all remaining available bandwidth, but with no guarantees on bandwidth or latency) and these are only officially supported for full and high speed USB devices. V-USB only allows me to emulate low speed USB devices, and while most operating systems used to allow low speed devices to create bulk endpoints, even though this is contrary to the spec, modern versions of Linux (and possibly Windows) do not. I did manage to get a device configured correctly, but as soon as I plugged it in the bulk endpoints were detected and converted to interrupt endpoints, which stopped the device from working. However, all is not lost, as I do have a solution which I think is just as good: serializing data to and from a generic USB Human Interface Device.

The USB specification defines the USB Human Interface Device (HID) class to support, as the name suggests, devices with some form of human interface. This doesn't mean sticking a USB cable into your arm, but rather defines common devices such as keyboards, mice and game controllers as well as devices like exercise machines, audio controllers and medical instruments. While such devices may communicate data in a variety of forms it all passes to and from the device using the same protocol. This means that when you plug any such device into practically any computer with a USB port it will be recognised and basic drivers will be loaded.

Writing code to communicate with a USB HID device isn't that much more complex than interfacing with a classic serial port and given the standard driver support we can rely on the operating system taking care of most of the communication for us.

For what follows I'm assuming the same basic USB circuit that I described in the previous post as we know it works and it is cheap to build.

Now we have the circuit let's move on to the software we need to write. Unlike with the USBKeyboard library, that powered The Caffeine Button, we will need both firmware for the Arduino and host software that will run on the PC and interface with the basic HID drivers the operating system provides. Given that you can't test the host software until we have a working device we'll start by looking at the firmware.

The first thing you have to do when constructing a HID is to define its descriptor. The descriptor is how the device presents itself to the operating system and defines the type of device as well as the size and type of any communication messages. Now you will probably never need to edit this but I thought it was worth showing you the full descriptor we are using:

    PROGMEM char usbHidReportDescriptor[USB_CFG_HID_REPORT_DESCRIPTOR_LENGTH] = {
      0x06, 0x00, 0xff,       // USAGE_PAGE (Vendor Defined Page 1)
      0x09, 0x01,             // USAGE (Vendor Usage 1)
      0xa1, 0x01,             // COLLECTION (Application)
      0x15, 0x00,             //   LOGICAL_MINIMUM (0)
      0x26, 0xff, 0x00,       //   LOGICAL_MAXIMUM (255)
      0x75, 0x08,             //   REPORT_SIZE (8)
      0x95, OUT_BUFFER_SIZE,  //   REPORT_COUNT (currently 8)
      0x09, 0x00,             //   USAGE (Undefined)
      0x82, 0x02, 0x01,       //   INPUT (Data,Var,Abs,Buf)
      0x95, IN_BUFFER_SIZE,   //   REPORT_COUNT (currently 32)
      0x09, 0x00,             //   USAGE (Undefined)
      0xb2, 0x02, 0x01,       //   FEATURE (Data,Var,Abs,Buf)
      0xc0                    // END_COLLECTION
    };

In this descriptor we define two reports of different sizes which we will use for transferring data to and from the device. The first thing to point out is that the specification defines input and output with respect to the host PC and not the device. So an input message is actually used for writing out from the device rather than for receiving data. Given this, we can see that the descriptor defines two message types. Firstly (lines 7 to 10) we define an 8 byte input report (OUT_BUFFER_SIZE is defined as 8, and the size is defined in bits, so we have 8 bits times the count to give 8 bytes), which means we can write 8 byte blocks of data back to the PC we are connected to. The second message type is defined as a FEATURE message of 32 bytes (because IN_BUFFER_SIZE is defined as 32 and the REPORT_SIZE hasn't been redefined, so it is still 8 bits), which we will use for passing data from the PC to the USB device.

As I said, you will probably never need to edit this structure, especially as you can tweak the message sizes, if necessary, by adjusting the two constants instead. If you do decide to change the descriptor it is worth noting that some operating systems are more forgiving than others. For example, with Linux, if you have defined a message of 8 bytes but only have two to send then you can do that and everything will work. Under Windows, however, if you only send two bytes the device will simply stop functioning altogether, so you will need to pad the message to be exactly 8 bytes. This also means that it is easy to check that your descriptor matches what you are actually doing by quickly testing under Windows (I've been doing this with a copy of Windows XP running under VirtualBox).

Now that we know the size of the messages we will send and receive we still need to decide upon their format, i.e. the protocol we will use for our data that we are sending on top of the USB protocol. If we were only interested in sending textual data then we could send null terminated data (i.e. put a zero value byte into the array after the last byte of data), but if we want to send arbitrary bytes then using 0 as an end of data marker seems an odd choice. For this reason I've opted to set the first byte of each message to the length of the valid data in the array. This is both simple to use and results in firmware code that is slightly simpler (and hence smaller) than checking for the null terminator. This does of course mean that in an 8 byte message we can only fit 7 bytes of actual data plus the length marker (this is no different than with null terminated data of course). If you know the messages you want to send will always be of a fixed length then tweaking the buffer sizes to suit might make for a more efficient transfer of data. In general, as you will see shortly, as a user of the library a lot of these details are dealt with for you.
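The message layout described above can be sketched in a few lines of Java. This is only an illustration of the framing idea (length in the first byte, data after it, zero padding to the full report size); the class and method names are hypothetical and are not part of the USBSerial library.

```java
import java.util.Arrays;

// Sketch of the length-prefixed message layout described above.
// Class and method names are hypothetical, not part of the USBSerial library.
final class LengthPrefixedFrame {

    /** Packs the payload into a zero-padded frame of exactly frameSize bytes. */
    static byte[] pack(byte[] payload, int frameSize) {
        if (frameSize < 2 || frameSize > 256) {
            throw new IllegalArgumentException("frame size must be between 2 and 256");
        }
        if (payload.length > frameSize - 1) {
            throw new IllegalArgumentException("payload does not fit in one frame");
        }
        byte[] frame = new byte[frameSize];   // zero-filled, so short messages are padded
        frame[0] = (byte) payload.length;     // first byte: number of valid data bytes
        System.arraycopy(payload, 0, frame, 1, payload.length);
        return frame;
    }

    /** Extracts the valid data bytes from a frame produced by pack. */
    static byte[] unpack(byte[] frame) {
        if (frame.length == 0) {
            throw new IllegalArgumentException("empty frame");
        }
        int length = frame[0] & 0xFF;         // treat the length byte as unsigned
        if (length > frame.length - 1) {
            throw new IllegalArgumentException("length byte exceeds frame size");
        }
        return Arrays.copyOfRange(frame, 1, 1 + length);
    }

    public static void main(String[] args) {
        // Arbitrary bytes, including zero, survive the round trip.
        byte[] frame = pack(new byte[] {0, 1, 2}, 8);
        System.out.println(Arrays.toString(frame));
        System.out.println(Arrays.toString(unpack(frame)));
    }
}
```

Because the frame is always padded to its full size, the same layout satisfies the stricter behaviour described for Windows while still carrying at most 7 data bytes in an 8 byte message.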

From the very beginning my aim was to find a drop-in replacement for the standard Arduino Serial object and so I've made the USBSerial library implement the same Stream interface. This means you can use any of the methods defined in the Stream interface for reading and writing data and the details about buffer sizes etc. are hidden within the library.

To show how easy the library is to use, here is a simple example where the sketch simply echoes back any bytes that it is sent.

    #include <USBSerial.h>

    char buffer[IN_BUFFER_SIZE];

    void setup() {
      USBSerial.begin();
    }

    void loop() {
      USBSerial.update();
      if(USBSerial.available() > 0) {
        // read only what has already arrived (readBytes comes from the Stream interface)
        int size = USBSerial.readBytes(buffer, min(USBSerial.available(), IN_BUFFER_SIZE));
        if (size > 0) {
          USBSerial.write((const uint8_t*)buffer, size);
        }
      }
    }

Note that I've used the same IN_BUFFER_SIZE constant in this example as within the library itself, as there is no reason to define a buffer that is bigger than we can ever expect to fill. The only line that you wouldn't find in a similar example using the standard Serial object is line 10, where we make sure that the USB connection is up to date (you need to do this approximately once every 50ms to keep the connection alive). Before we move on to looking at the host software there are a few things you need to know before trying to use the library.

Unfortunately the V-USB part of the library needs customizing for each project, so you can't simply drop the library into the Arduino sketchbook folder, as the USB manufacturer and product identifiers have to be unique for different devices. These identifiers are set in the usbconfig.h file. V-USB doesn't actually provide a copy of usbconfig.h; what is provided is a file called usbconfig-prototype.h, which you can copy and rename as a starting point. I've already done a lot of the configuration for you by editing usbconfig-prototype.h, leaving just four lines you need to edit for yourself. Firstly you need to set the vendor name property by editing lines 244 and 245:

    #define USB_CFG_VENDOR_NAME 'o', 'b', 'd', 'e', 'v', '.', 'a', 't'
    #define USB_CFG_VENDOR_NAME_LEN 8

and then the device name by editing lines 254 and 256:

    #define USB_CFG_DEVICE_NAME 'T', 'e', 'm', 'p', 'l', 'a', 't', 'e'
    #define USB_CFG_DEVICE_NAME_LEN 8

These values have to be changed and can't be set to any random value because as part of the V-USB license agreement you need to conform to the following rules (taken verbatim from the file USB-IDs-for-free.txt):

(2) The textual manufacturer identification MUST contain either an Internet domain name (e.g. "mycompany.com") registered and owned by you, or an e-mail address under your control (e.g. "myname@gmx.net"). You can embed the domain name or e-mail address in any string you like, e.g. "Objective Development http://www.obdev.at/vusb/".

(3) You are responsible for retaining ownership of the domain or e-mail address for as long as any of your products are in use.

(4) You may choose any string for the textual product identification, as long as this string is unique within the scope of your textual manufacturer identification.

Once properly configured you should be able to compile (I recommend using arduino-mk instead of the Arduino IDE) and use the library without issue, and without understanding how it actually works internally (if you are interested in the details then both my code and the V-USB library contain vast amounts of code comments which should help you get a better understanding), so let's move on to looking at the host software.

As I've already mentioned connecting the device to a PC usually causes generic HID drivers to be loaded by the operating system. This means that you should be able to use any programming language you like to write the host software as long as it can talk to the generic USB drivers. I've included host software written in Java using javahidapi but, for instance, you could also use PyUSB if you prefer to program using Python. The important thing to remember is the protocol for passing data that we defined earlier: data to the USB device is sent as 32 byte feature requests with the first byte being the length of the valid data in the rest of the array, while data from the USB device is in 8 byte chunks again with the first byte being the length of the valid data.
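As a sketch of what the host side has to do before handing data to whichever HID library is in use, the Java below splits an arbitrary byte array into 32 byte feature reports, each carrying its own length byte. The class and method names are hypothetical, and the code deliberately stops short of calling javahidapi or any other HID library; some libraries additionally expect a report ID byte in front of the data, which is not shown here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that prepares host-to-device data as 32 byte feature
// reports; it is not part of the USBSerial host library.
final class ReportChunker {

    // Matches IN_BUFFER_SIZE (32) on the firmware side.
    static final int FEATURE_REPORT_SIZE = 32;

    /** Splits data into zero-padded reports whose first byte is the valid length. */
    static List<byte[]> toFeatureReports(byte[] data) {
        int maxPayload = FEATURE_REPORT_SIZE - 1;   // one byte is used for the length
        List<byte[]> reports = new ArrayList<>();
        for (int offset = 0; offset < data.length; offset += maxPayload) {
            int length = Math.min(maxPayload, data.length - offset);
            byte[] report = new byte[FEATURE_REPORT_SIZE];
            report[0] = (byte) length;
            System.arraycopy(data, offset, report, 1, length);
            reports.add(report);
        }
        return reports;
    }

    public static void main(String[] args) {
        byte[] data = new byte[40];                 // needs two reports: 31 + 9 bytes
        for (byte[] report : toFeatureReports(data)) {
            System.out.println("report of " + report.length
                + " bytes carrying " + (report[0] & 0xFF) + " data bytes");
        }
    }
}
```

Data coming back from the device follows the same pattern with 8 byte chunks, so the receiving side only needs to read the first byte of each chunk to know how many of the remaining bytes are valid.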

As with the firmware code we have already discussed, I've written a simple Java library to hide most of the details behind standard interfaces, which allow you to read and write data using the standard Java InputStream and OutputStream interfaces. Full details of the available methods can be found in the Javadoc but a simple echo example shows most of the important details.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.io.PrintStream;

    import englishcoffeedrinker.arduino.USBSerial;

    public class SimpleEcho {
      public static void main(String[] args) throws Exception {
        // get an instance of the USBSerial class for the specified device
        USBSerial serial =
          USBSerial.getInstance("englishcoffeedrinker.co.uk", "EchoChamber");

        // open the underlying USB connection to the device
        serial.connect();

        // create an output stream to write characters to the device
        PrintStream out = new PrintStream(serial.getOutputStream());

        // send a simple message
        out.println("hello world!");

        // ensure the message has been sent and not buffered internally somewhere
        out.flush();

        // create a reader for getting characters back from the device (getInputStream() is assumed to mirror getOutputStream())
        BufferedReader in = new BufferedReader(
          new InputStreamReader(serial.getInputStream()));

        String line;
        while((line = in.readLine()) == null) {
          // keep checking the device until a line of text is returned
        }

        // display the message sent from the device
        System.out.println(line);

        // we have finished so disconnect our connection to the device
        serial.disconnect();
      }
    }

Essentially, lines 10 to 14 get an instance of the USBSerial class for a specific device, in this case found via the manufacturer and product identifiers although other methods are available, and then open the connection. Lines 16 to 23 then use the OutputStream to write data to the device while lines 26 to 35 read it back, with line 38 cleaning up by closing the connection. For anyone who is happy programming in Java this should look no different than reading or writing to and from any other type of stream, which means it should be easy to integrate within any project where you want to communicate with an Arduino.

To make life a little easier I've also included a slightly more useful demo application that effectively reproduces the Serial Monitor from the Arduino IDE. You can see it running here connected to a device running the simple echo sketch from earlier in this post, but it should work with any device that uses the USBSerial library.

I've included another example with the USBSerial library that shows you can use this for more than just echo commands. This is the CmdMsg example, which is an example I've talked about before on this blog, but this version uses the USBSerial library, and hence can be controlled through this new USB Serial Monitor, rather than using the standard Serial library.

If you've read all the way to here then I'm guessing you might want to know where you can get the library from. It is available under the GPL licence (a restriction imposed because I'm using the free USB vendor and product IDs from Objective Development) from my SVN repository. Do let me know if you find it useful or if you have any suggestions for improvement.

When I set out to try and add serial support to a breadboarded Arduino (specifically this circuit) I did have a device I wanted to build in mind, so I'm sure at some point I'll blog again about using this library in a real device rather than just the simple examples included with the library that do nothing more than prove everything works.
Greg Linden's Blog - Fri, 2013-11-01 16:44
What caught my attention lately:
• Jeff Bezos on what innovators need: "A willingness to fail. A willingness to be misunderstood. And maintaining a childlike wonder in the world." ([1])

• "Agreement feels good -- hey, we get along great! -- but it's not the best for innovation. Why? Because if everybody has the same idea, then you only have one idea." ([1])

• "It's amazing the amount of difference a cultural intolerance to bullshit can make" ([1])

• "Two engineers with close ties to Google exploded in profanity when they saw the drawing" ([1] [2])

• "Amazon has boundless ambition. It wants to eat global retail ... the largest retailer in the history of the world ... [It is] a secret in plain sight." ([1])

• I love Duolingo and Duolingo's business model: "students receive high-quality, completely free language education, and organizations get translation services powered by the students ... two major publishers are financing our operation by [our students] translating their content." ([1] [2])

• "Pinterest now valued at $3.8 billion, on a constant valuation ratio of infinity times revenues" ([1])

• "Microsoft is continuing to fire on all cylinders with its enterprise products and services and has a ways to go on the consumer side" ([1])

• The "screamingly obvious ... solution ... Offer regular Windows on regular computers, offer TileWorld on tablets" ([1] [2])

• "Apple is trying to maintain premium pricing in a market in which competitors are increasingly selling high-quality iPad alternatives for significantly lower prices." ([1])

• Patents should only be used "to promote the progress of science and useful arts" ([1] [2] [3])

• "The most popular films are not even available through preferred legal channels" ([1] [2] [3])

• I do not think it means what you think it means: "No banner ads on the Google homepage or Web search results pages… ever." ([1] [2] [3])

• Gesture recognition (like the Kinect, but just using a laptop or tablet's existing webcam) is going to become more common? ([1])

• Instead of trying to recycle plastic, turn it into oil: "Each barrel of oil costs about $10 to produce" ([1] [2])

• Very clever idea for moving small robots: "No external moving parts. Nonetheless, they’re able to climb over and around one another, leap through the air, roll across the ground ... [a] flywheel is braked, it imparts its angular momentum to the cube" ([1])

• Frightening that you can see results this large with electrical stimulation of the brain on an area known to be important for compliance with social norms ([1])

• Four-legged, all terrain drones coming to a war near you ([1] [2])

• Curious claim about US education: "When controlling for demographic factors, public schools are doing a better job academically than private schools" ([1])

• Strange, I don't remember this, back in 2001, Jeff Bezos did a weird Taco Bell ad. ([1])

• A dark but hilarious Halloween comic from SMBC: "I'm expectation of death" ([1])

• The Onion: "CEO worked his way up from son of CEO" ([1] [2])
### tanh is a rescaled logistic sigmoid function

This confused me for a while when I first learned it, so in case it helps anyone else:

The logistic sigmoid function, a.k.a. the inverse logit function, is

$g(x) = \frac{ e^x }{1 + e^x}$

Its outputs range from 0 to 1, and are often interpreted as probabilities (in, say, logistic regression).

The tanh function, a.k.a. hyperbolic tangent function, is a rescaling of the logistic sigmoid, such that its outputs range from -1 to 1. (There’s horizontal stretching as well.)

$tanh(x) = 2 g(2x) - 1$

It’s easy to show the above leads to the standard definition $$tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$. The (-1,+1) output range tends to be more convenient for neural networks, so tanh functions show up there a lot.
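For anyone who wants the intermediate steps, substituting the definition of $g$ into $2 g(2x) - 1$ and putting everything over a common denominator gives:

$2 g(2x) - 1 = \frac{2 e^{2x}}{1 + e^{2x}} - 1 = \frac{2 e^{2x} - (1 + e^{2x})}{1 + e^{2x}} = \frac{e^{2x} - 1}{e^{2x} + 1}$

Multiplying the numerator and denominator by $e^{-x}$ then recovers the standard form:

$\frac{e^{2x} - 1}{e^{2x} + 1} \cdot \frac{e^{-x}}{e^{-x}} = \frac{e^x - e^{-x}}{e^x + e^{-x}}$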

The two functions are plotted below. Blue is the logistic function, and red is tanh.

### The Caffeine Button

Code from an English Coffee Drinker - Sat, 2013-10-26 12:04
In a couple of previous posts (here and here) I've shown how easy and cheap it is to go from a prototyped setup using an Arduino to a standalone circuit built from just a handful of components. While those posts were more of an academic exercise to prove it was possible, I've also now built, and blogged about, such a circuit that I'm actually using in anger. The problem is that an actual Arduino isn't just an Atmel ATMega328P-PU; it also has accompanying electronics which enable you to talk to a computer via the USB connection, which is great both for debugging and for interfacing external hardware to a PC.

When you connect an Arduino to a PC what actually happens is that the supporting circuitry creates a USB CDC ACM device which emulates a good old fashioned RS-232 serial port. If I wanted to add similar functionality to my standalone circuits then the most common way of doing so would be to use an FT232RL, but the chip alone would almost double the cost of the circuit. It is also only available as a surface mount part, which makes it difficult to experiment with on a breadboard, and I'm not sure my soldering skills are good enough to deal with surface mount parts either.

After pondering this for a bit and doing a little research I came across a potential solution in the form of V-USB. V-USB is a software-only implementation of low-speed USB for Atmel microcontrollers, such as the ATMega328P-PU. Unfortunately the distribution doesn't directly support the Arduino (it supports the ATMega328P-PU but not through the Arduino IDE etc.). I did, however, find a previous attempt to add Arduino support, although this project seems to have been abandoned as it hasn't seen any updates in over three years. Still, it gave me a good point to start from.

So far I haven't managed to emulate a serial port, but I have managed to make the Arduino behave like a USB keyboard which means that I can use it for debugging by having it pretend to press lots of keys in sequence, which is better than nothing. Before we get to looking at how to make use of V-USB we need to wire a USB plug up to the Arduino.

USB Connection Parts List

| Part | Unit Cost | Quantity |
|---|---|---|
| USB Socket, Type B | £0.50 | 1 |
| 3.6V, 0.5W Zener Diode | £0.071 | 2 |
| 68Ω Resistor | £0.008 | 2 |
| 2.2kΩ Resistor | £0.008 | 1 |

As you can see it isn't a particularly complex circuit to build. Essentially we have the two data lines, D- and D+, linked to pins 2 and 4 of the Arduino and, as the USB spec states that these run on 3.3V, we restrict the voltage using two 3.6V zener diodes to step down from the 5V output of the Arduino. The Arduino itself is powered from the other two USB pins which provide 5V/GND (this also means that you could use USB to power a standalone ATMega328P-PU without needing a voltage regulator and smoothing capacitors). The final connection is between pin 5 and D- via a 2.2kΩ pull-up resistor, which allows the connection to be connected/disconnected from within software (in theory you can link this to V+ instead if you don't need this flexibility, but I found that this meant that the hardware wasn't always recognized correctly when it was connected to a PC).

For the USB plug itself, I opted to use a Type B plug (the same as the Arduino) so that I could get away with a single cable snaking across my desk, but you should be able to use any USB plug with the same circuit. One thing to be careful with when constructing the circuit is that in most cases the metal case of the USB plug is connected to the ground pin, so make sure that none of the other connections catch the case, otherwise you might end up with them pulled to ground, which will cause weird things to happen. Depending on which pin is involved, either your PC won't recognize you have anything plugged in or it might end up thinking you have a high speed or high power device connected and things won't work properly.

Now we have the hardware what we need is some working software we can upload to the Arduino. As I said above I'm using a previous attempt to get this working as my starting point. This code hasn't been updated for over three years and I'm guessing that a number of things have changed within the Arduino libraries since then as the code didn't just work. After a bit of trial and error I discovered that the problems were mostly related to timer code that had been added to get V-USB to work with the Arduino and which was now doing the exact opposite. Removing this code got me to a point where when I plugged in the cable the Arduino was now being recognized as a USB v1.01 HID Keyboard, which I know is my device because of the vendor and product strings being shown in this screenshot.

Having got the code running I then took the plunge to update the version of V-USB being used to the latest version (currently 20121206) which, after a few little tweaks, was a success. Unfortunately because of the way the library has to be configured a single copy can't be shared between multiple projects, but in the rest of this post I'll explain how to customize the library and show you a fully worked example: TheCaffeineButton.

Firstly, I've had problems compiling the library through the Arduino IDE, so I'd recommend compiling any sketches that use the library via the arduino-mk project that I discussed in a previous post. This allows you to have libraries local to a sketch by putting them in a libs subfolder. A basic sketch to use the library then looks like the following:
// pull in the USB Keyboard library
#include <USBKeyboard.h>

void setup() {
  // TODO: setup like stuff in here...
}

void loop() {
  // poll the USB connection
  USBKeyboard.update();

  // TODO: whatever you want in here...
}

You simply include the library and then poll the connection every time through the main loop by calling USBKeyboard.update(). As I mentioned though, the library needs customizing for each project and so if you try and compile the sketch at this point you'll end up with an error:
libs/USBKeyboard/usbdrv.h:12:23: fatal error: usbconfig.h: No such file or directory

V-USB doesn't provide a copy of usbconfig.h as it is project specific; what they do provide is a file called usbconfig-prototype.h which you can copy and rename as a starting point. I've already done a lot of the configuration for you by editing usbconfig-prototype.h, leaving just four lines you need to edit for yourself. Firstly you need to set the vendor name property by editing lines 244 and 245:

#define USB_CFG_VENDOR_NAME 'o', 'b', 'd', 'e', 'v', '.', 'a', 't'
#define USB_CFG_VENDOR_NAME_LEN 8

and then the device name by editing lines 254 and 256:

#define USB_CFG_DEVICE_NAME 'T', 'e', 'm', 'p', 'l', 'a', 't', 'e'
#define USB_CFG_DEVICE_NAME_LEN 8

These values have to be changed and can't be set to any random value because as part of the V-USB license agreement you need to conform to the following rules (taken verbatim from the file USB-IDs-for-free.txt):

(2) The textual manufacturer identification MUST contain either an Internet domain name (e.g. "mycompany.com") registered and owned by you, or an e-mail address under your control (e.g. "myname@gmx.net"). You can embed the domain name or e-mail address in any string you like, e.g. "Objective Development http://www.obdev.at/vusb/".

(3) You are responsible for retaining ownership of the domain or e-mail address for as long as any of your products are in use.

(4) You may choose any string for the textual product identification, as long as this string is unique within the scope of your textual manufacturer identification.

If you glance back at the screenshot I showed earlier of my example device being recognized then you can see that I followed these rules by setting the vendor name to englishcoffeedrinker.co.uk and the device name to TheCaffeineButton.
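As a rough sketch, the four edited lines for those two names might look something like the following. The character lists here are reconstructed from the names mentioned above rather than copied from the actual usbconfig.h, so treat them as an illustration. The important detail is that each _LEN value has to match the number of characters in the list it belongs to.

```cpp
// Illustrative values only: reconstructed from the vendor and device
// names quoted in the text, not taken from the real usbconfig.h.

// "englishcoffeedrinker.co.uk" is 26 characters long
#define USB_CFG_VENDOR_NAME 'e', 'n', 'g', 'l', 'i', 's', 'h', 'c', 'o', 'f', 'f', 'e', 'e', 'd', 'r', 'i', 'n', 'k', 'e', 'r', '.', 'c', 'o', '.', 'u', 'k'
#define USB_CFG_VENDOR_NAME_LEN 26

// "TheCaffeineButton" is 17 characters long
#define USB_CFG_DEVICE_NAME 'T', 'h', 'e', 'C', 'a', 'f', 'f', 'e', 'i', 'n', 'e', 'B', 'u', 't', 't', 'o', 'n'
#define USB_CFG_DEVICE_NAME_LEN 17
```

Miscounting the length is an easy mistake to make when a name is written out one character at a time, so it is worth double checking both numbers after editing.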

So having created a valid usbconfig.h file you can now compile the basic sketch shown above. Of course this does nothing other than allow the Arduino to be recognized as a USB keyboard. If you want to actually do something interesting then you need a little more code. As an example, I'll finally introduce you to The Caffeine Button!

Caffeine (usually in the form of coffee) is my one true addiction and so I thought I'd build a device with a single button that when pressed types out the chemical formula for caffeine: C8H10N4O2. The full sketch is fairly simple and assumes the basic circuit above with a push button between pin 12 and ground (it uses the internal pull-up resistor to keep the part list down to the single button):
// pull in the USB Keyboard library
#include <USBKeyboard.h>

// using Bounce makes working with buttons nice and easy
// http://www.arduino.cc/playground/Code/Bounce
#include <Bounce.h>

// we have the button on pin 12
#define BUTTON_PIN 12

// create a Bounce instance to manage the button
Bounce button = Bounce(BUTTON_PIN, 5);

void setup() {
  // initialise the pin the button is connected to
  pinMode(BUTTON_PIN, INPUT);

  // enable the internal pull-up resistor
  digitalWrite(BUTTON_PIN, HIGH);
}

void loop() {
  // poll the USB connection
  USBKeyboard.update();

  // check the status of the button
  button.update();

  if (button.fallingEdge()) {
    // if the button has just been pressed then...

    // ...print out the formula for caffeine
    USBKeyboard.print("C8H10N4O2");
  }
}

There are actually four different methods you can use to emulate pressing keys:
void write(byte keycode);
void write(byte keycode, byte modifiers);
void print(const char* text);
void println(const char* text);

The first two methods send a USB key usage code (with or without a modifier, such as the shift key) and then release the key, while the last two methods translate alphanumeric characters (and space) into a sequence of keystrokes (followed by the enter key in the println case) to make the library a little easier to use. Unfortunately there is no guarantee that my simple sketch will always result in C8H10N4O2 being displayed when you press the button.

When you press a key on a keyboard a keycode is sent to the computer, and it is the computer that determines which letter has been pressed. This allows the same physical keyboard to be used for different languages simply by changing the printed labels on each key. Unfortunately this makes it impossible to translate a string into a sequence of keycodes which will always display the same on every computer. For example, if you send keycode 28 on a computer with an English keyboard mapping you'll get a 'y', but if the keyboard mapping is German you'll get a 'z'. I've defined a bunch of constants in USBKeyboard.h (e.g. KEY_A) which work for English, and I've used the same mapping in the print and println methods. If you can set the keyboard mapping for the device to English then these methods will work properly for you; if not, you might need to tweak the mappings to get what you need. You can find the full list of mappings in chapter 10 of the Universal Serial Bus HID Usage Tables document should you need more details.
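To make the translation step a little more concrete, here is a minimal sketch in plain C++ of how a character could be turned into a usage code plus modifier under an English mapping. It is not the code from the USBKeyboard library: the Keystroke struct, toKeystroke function and MOD_LEFT_SHIFT constant are hypothetical names made up for this illustration. The numeric values come from the Keyboard/Keypad page of the HID Usage Tables, where 'a' to 'z' are 4 to 29 and bit 1 of the modifier byte is Left Shift.

```cpp
#include <cstdint>

// One keystroke as it would be sent over USB: a HID usage code plus a
// modifier bitmask. All names here are hypothetical, for illustration only.
struct Keystroke {
    std::uint8_t keycode;    // 0 means "no mapping for this character"
    std::uint8_t modifiers;  // HID modifier bitmask
};

// Left Shift is bit 1 of the HID keyboard modifier byte
const std::uint8_t MOD_LEFT_SHIFT = 0x02;

// Translate a character into a keystroke, assuming an English (US) mapping.
Keystroke toKeystroke(char c) {
    if (c >= 'a' && c <= 'z') {
        // usage codes 0x04..0x1D are the letters a..z
        return { static_cast<std::uint8_t>(0x04 + (c - 'a')), 0 };
    }
    if (c >= 'A' && c <= 'Z') {
        // same usage codes as lower case, with shift held down
        return { static_cast<std::uint8_t>(0x04 + (c - 'A')), MOD_LEFT_SHIFT };
    }
    if (c >= '1' && c <= '9') {
        // usage codes 0x1E..0x26 are the digits 1..9
        return { static_cast<std::uint8_t>(0x1E + (c - '1')), 0 };
    }
    if (c == '0') {
        return { 0x27, 0 };  // zero comes after nine on the keyboard
    }
    if (c == ' ') {
        return { 0x2C, 0 };  // space bar
    }
    return { 0, 0 };  // anything else is unmapped in this sketch
}
```

With this mapping toKeystroke('y') gives usage code 28, which is exactly the code that a German layout displays as 'z'. Supporting another layout would mean swapping in a different table rather than changing anything on the USB side.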

So here we have the final working item. As you can see I built the main USB circuit onto a prototyping shield so I can experiment with lots of different circuits without having to keep recreating the basics every time, and in this case have simply jammed the button between pins 12 and ground.

If you've read all the way to here then I'm guessing you might want to know where you can get the library from. It is available under the GPL licence (a restriction imposed because I'm using the free USB vendor and product IDs from Objective Development) from my SVN repository.

Whilst I haven't yet been able to emulate a serial port I haven't given up and when/if I'm successful I'm sure there will be a post about it and another Arduino library for you to play with.
### European Data Forum’s Call for Contributions

Semantic Web Company - Thu, 2013-10-24 03:26

The European Data Forum (EDF) is an annual meeting place for industry, research, policy makers, and community initiatives to discuss the challenges and opportunities of data in Europe, especially in the light of recent developments such as Open Data, Linked Data and Big Data.

EDF 2014 will be held in Athens, Greece on March 19-20, 2014. The program will consist of a mixture of presentations, panels and networking sessions by industry, academics, policy makers, and community initiatives. Topics will cover a wide spectrum of research and technology development, applications, and socio-economic aspects of the data value chain.

Call for Contributions

EDF 2014 is seeking inspiring presentations addressing the topics listed below:

• Innovative research and technology for Open Data, Linked Data and Big Data
• Applications
• Socio-economic and policy issues
• Data visions for the future

Proposals for presentations should be submitted as a single PDF file at https://www.easychair.org/conferences/?conf=edf2014 or sent via e-mail to edf2014@data-forum.eu. Proposals will be reviewed by the Organizing Committee of EDF 2014 according to their relevance to the scope and purpose of the event.

Important Dates

• Submission of proposals: December 10, 2013, 22:00 CET
• Notification of acceptance or rejection: early January 2014
• Full EDF2014 program available: end of January 2014

### Farewell to d8taplex.com

Data Mining Blog - Tue, 2013-10-22 12:20

Last night I shut down d8taplex.com. This was a site that I used to demonstrate a number of experimental systems that I'd been playing with. These included:


### #newsHACK

Code from an English Coffee Drinker - Sun, 2013-10-20 04:14
Last week I spent two interesting days at the BBC's #newsHACK event at Shoreditch Town Hall in London. The BBC have recently started to open up APIs to a lot of useful indexing and annotation work that they have been doing and this event was aimed at getting news organizations and other interested parties using some of the APIs to produce interesting tech demos.

Our team consisted of Ian Roberts, Dominic Rout and myself from the University of Sheffield and Helen Lippell from the Press Association. We were kind of the odd ones out as our expertise centres around processing large amounts of text to expose interesting information and to make that searchable or useable in some way, which is exactly what some of the BBC APIs already did. None of us claim to be great user interface people so there was no point us trying to generate a really fancy interface over the BBC APIs. In the end we decided that we would play to our strengths and so our demo (which you can go and play with) tried to go one step further than the existing BBC APIs by using the power of Mímir to allow for complex searches over text, annotations and Linked Open Data, giving journalists a deeper view into the news archive when writing a story. One of our examples (that seemed to go down well) was: imagine you are writing a story about a CEO who has just been given a £5m bonus and you want to find other people who have been awarded more in the past.

Our demo didn't win any of the categories but we certainly didn't embarrass ourselves and there were some really good ideas presented. It's worth looking at the list of hacks as some of them have really cool demos you can play with.
### I was wrong on Netflix

Greg Linden's Blog - Sat, 2013-10-12 10:20
I was wrong on Netflix. A few years ago, I saw their pricing changes (which overpriced DVD rentals) and streaming catalog changes (switching to only buying whatever they could get cheap, very few hit movies in there, and then making a little of their own content) and thought this new strategy of becoming HBO-lite was headed for disaster.

But Netflix has done well. People add them on as an addition to cable and DVDs, not as a replacement, essentially as an HBO-lite. Competitors -- like Amazon, Hulu, and YouTube -- aren't acting as a replacement to Netflix, but, at best, an addition. Redbox is eating away at Netflix's neglected but profitable DVD business, but that hasn't hurt Netflix as much as I thought it would. And no one -- not even Apple, Amazon, Sony, Hulu, Walmart/Vudu, or Google/YouTube -- has been able to offer a full cable TV replacement, a streaming service with a massive, nearly exhaustive catalog of high quality content.

I was wrong about Netflix's new strategy. They've done pretty well with their HBO-lite strategy of being one of several (and the most popular) streaming services for TV-like entertainment. There is still the question of what happens when someone launches something with a much better UX and catalog than Netflix, but that may never happen.
