The political visualization Words & Votes [sandyhookpromise.org], developed by digital agency R/GA for non-profit organization Sandy Hook Promise, provides a comprehensive look into the opinions of congressional representatives on the issue of gun violence.
More specifically, the visualization classifies each member of Congress as "Neutral," on the side of "Gun Safety," or on the side of "Gun Rights," and maps the evolution of these positions over time on a vertical timeline.
Each position is based on two separate sources of information: an analysis of the tweets sent by members of Congress, and their voting record on Capitol Hill on laws and bills relating to gun violence.
Individual members of Congress can be explored in terms of how influential or vocal they are, or filtered by address, zip code, or home town.
The movie shown below, developed by real-time trading software developer Nanex, shows the stock trading activity in Johnson & Johnson (JNJ) as it occurred during a single half-second on May 2, 2013.
Each colored box represents one unique exchange. The white box at the bottom of the screen shows the National Best Bid/Offer, which often changes drastically within a fraction of a second. The moving shapes represent quote changes resulting from a change to the top of the book at each exchange. The time at the bottom of the screen is Eastern Time in HH:MM:SS:mmm, slowed down to make it possible to observe what goes on at the millisecond level (1/1000th of a second).
In the movie, one can observe how High Frequency Traders (HFTs) jam thousands of quotes at the millisecond level, and how every exchange must process every quote from the others to ensure proper trade-through price protection. This complex web of technology must run flawlessly every millisecond of the trading day, or arbitrage (HFT profit) opportunities will appear. It is easy for HFTs to cause delays in one or more of the connections between exchanges, and whenever any connection is not running perfectly, they tend to profit from the resulting price discrepancies.
L.A. Street QualGrades [latimes.com], developed by the Los Angeles Times Data Desk, maps the pavement quality rating for each of the 68,000 street segments in L.A., the largest municipal street system in the US with about 6,500 miles of paved roadway.
Using a state-of-the-art van equipped with cameras and lasers, the Bureau of Street Services graded every single segment of L.A.'s vast street network from A (dark green) to F (dark pink). The grades were based on a 100-point scale called the "pavement condition index".
The YouTube Trends Map [youtube.com] is a visualization of the most shared and viewed videos in various regions across the United States over the last 12 to 24 hours. It accompanies the more analytical Trends Dashboard to provide a full overview of the rising videos and trends on YouTube in terms of actual views or shares, filtered by geographical location, gender, or age of the viewers.
The demographic information of viewers is based solely on the information reported by registered, logged-in users in their YouTube account profiles. Next to the geographical map, the Trends Map also includes a series of horizontal bar graphs, each representing a graphical summary of the top videos for a different demographic. Within each bar, a video is represented by a colorful segment whose colors are drawn from the video's thumbnail. The width of a video's segment reflects the number of regions on the map where the video is #1.
Blogging is dead. To the extent that it lives, it is dominated by professional journalists and writers backed by major organizations, or it has transformed into microblogging. The original vision of an amateur form of journalism -- long articles written and published without an organization or editor -- has become archaic.
I have been writing on this blog since 2004. At its peak, this blog had about 10k regular readers. Over a decade, I have watched blogging rise and fall.
Nowadays, my posts here on this blog often get less attention than my tweets on Twitter. 140 characters that take two minutes to spew out sometimes get more attention than an article that takes four hours of thoughtful analysis, careful reading, and tight writing.
There is nothing wrong with people moving on. Professional journalists now use blogs to air early research or analysis that will later make it into a full print article. Companies use blogs to announce changes or new features. Many use microblogging as a useful means of quick communication. That is good.
But there was something charming about so many people trying to be amateur journalists. Journalistic writing is a skill; it emphasizes clear, tight, concise writing. That so many were attempting it and practicing it had a lot of value, both in the skills bloggers gained and in the sometimes candid and insightful articles they produced.
I find my blogging here too useful to me to stop doing it. I have also embraced microblogging in its many forms. Yet I am left wondering if there is something we are all missing, something shorter than blogging, longer than tweets, and different from both, that would encourage thoughtful, useful, relevant mass communication.
We are still far from ideal. A few years ago, it used to be that millions of blog and press articles flew past, some of which might pile up in an RSS reader, a few of which might get read. Now, millions of tweets, thousands of Facebook posts, and millions of articles fly past, some of which might be seen in an app, a few of which might get read. Attention is random; being seen is luck of the draw. We are far from ideal.
Attention should flow to relevant and useful writing. I should see writings that are personally relevant and useful to me. When a friend does something I want to know about, when a colleague reads an article I should read too, when a company announces a useful change to a product I use, when a well-written article important for my work is published from a reputable source, when a major event occurs in the world, those should be brought to my attention.
Blogging wasn't that, but neither is microblogging. We need to build something that focuses our attention, improves our communication, and finally solves the problems blogging and microblogging failed to solve.
Here is one of our exciting just-finished ACL papers. David and I designed an algorithm that learns different types of character personas — “Protagonist”, “Love Interest”, etc — that are used in movies.
To do this we collected a brand new dataset: 42,306 plot summaries of movies from Wikipedia, along with metadata like box office revenue and genre. We ran these through parsing and coreference analysis to also create a dataset of movie characters, linked with Freebase records of the actors who portray them. Did you see that NYT article on quantitative analysis of film scripts? This dataset could answer all sorts of things they assert in that article — for example, do movies with bowling scenes really make less money? We have released the data here.
Our focus, though, is on narrative analysis. We investigate character personas: familiar character types that are repeated over and over in stories, like “Hero” or “Villain”; maybe grand mythical archetypes like “Trickster” or “Wise Old Man”; or much more specific ones, like “Sassy Best Friend” or “Obstructionist Bureaucrat” or “Guy in Red Shirt Who Always Gets Killed”. They are defined in part by what they do and who they are — which we can glean from their actions and descriptions in plot summaries.
Our model clusters movie characters, learning posteriors like this:
Each box is one automatically learned persona cluster, along with actions and attribute words that pertain to it. For example, characters like Dracula and The Joker are always “hatching” things (hatching plans, presumably).
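For intuition only, here is a toy sketch of the underlying idea — not our actual model, which is a latent-variable probabilistic model, and the characters and action words below are invented for illustration. Each character is represented by a bag of the action and attribute words attached to them, and characters are then clustered:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

# Invented character -> action/attribute words, standing in for what we
# extract from plot summaries via parsing and coreference.
characters = {
    "Dracula":       "hatches plans kidnaps threatens escapes",
    "The Joker":     "hatches schemes taunts kidnaps escapes",
    "Action Hero":   "rescues fights defeats saves protects",
    "Love Interest": "falls loves marries kisses waits",
}

names = list(characters)
X = CountVectorizer().fit_transform(characters[n] for n in names)

# Cluster into k "personas"; the real model also conditions on metadata
# such as genre and the actor's age and gender.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
for name, label in zip(names, km.labels_):
    print(label, name)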
One of our models takes the metadata features, like movie genre and gender and age of an actor, and associates them with different personas. For example, we learn the types of characters in romantic comedies versus action movies. Here are a few examples of my favorite learned personas:
One of the best things I learned about during this project was the website TVTropes (which we use to compare our model against).
We’ll be at ACL this summer to present the paper. We’ve posted it online too:
Learn how to make use of PoolParty, step by step! Each video is about a specific feature or functionality of PoolParty Thesaurus Server or PoolParty Extractor. By walking through all the modules you learn how to use PoolParty for your Semantic Information Management.
This series of tutorials has been produced in cooperation with our partner Term Management, LLC.
ICML registration is also available, at about 3x higher cost. My understanding is that this is partly due to the costs of a larger conference being harder to contain, partly due to ICML lasting twice as long with tutorials and workshops, and partly because the conference organizers were a bit over-conservative in various ways.
Bolides - Visualizing Meteorites [bolid.es] by data visualization designer Carlo Zapponi visualizes all historical occurrences of meteorites that collided with the Earth and were eye-witnessed as they fell and hit the ground.
The visualization consists of a linear timeline: the top denotes the number of meteorite sightings per year, and the bottom shows their mass, estimated in kilograms. The dataset includes 34,513 recordings of 'found' and 'fell' meteorites that have not been classified as doubtful or discredited. A meteorite is classified as 'fell' if it has been observed by people or automated devices during its fall.
The most recent problem I've faced involved producing a PDF from an SVG file (authored using Inkscape) which included linked digital photos. The SVG file itself is just 77kB in size, and the five photos it links to total just over 7MB. Even if the images were simply included as is, I wouldn't expect the resulting PDF to be more than 8MB in size, although I would hope that some reasonable downsampling would be applied. Printing to a PDF from within Inkscape, however, results in a file of 22MB in size! That's over three times the size of the included photos, which is just ridiculous. Exporting to a PDF instead of printing results in a file of the same size, so isn't much help.
Now I don't know enough about the inner workings of a PDF file, but a 22MB file (which is incidentally a single A4 page) just isn't right, and certainly isn't sensible for trying to send around via e-mail etc. I've tried searching the web for suggestions, but none of the ideas I've found have turned out to be much use; either they don't alter the size of the file, or they reduce the quality so far that the text becomes unreadable.
The first alternative I tried was to produce a PNG image from the SVG instead of a PDF. While an image file wasn't really what I wanted, I thought it might be a good place to start. Using Inkscape to produce a 300 DPI image resulted in a file of just 1.7MB in size; a file, I should add, with a quality that is perfectly adequate. The problem with an image file is that it is easy to alter and can be a pain to print at the right size, so I'd still like to use a PDF file. Fortunately, on Linux at least, it is easy to create a PDF from a PNG, using the simple command

convert input.png output.pdf

Given that Inkscape can also be controlled from the command line, I can easily write a reusable script that converts an SVG directly to a PDF of a reasonable size:

inkscape -f "$1.svg" -e out.png -C -d 300   # export the page area as a 300 DPI PNG
convert out.png "$1.pdf"                    # wrap the PNG in a single-page PDF
I can then use this by passing in the name of the SVG file (minus the .svg extension) and out will pop a PDF file. Now I know this approach isn't perfect but given my requirements (a PDF of a reasonable size that is really only needed for printing) this works well. I'm intending to see if I can find a better solution (one that will keep the text as text) so if you have any suggestions please feel free to leave a comment.
Linked (Open) Data has reached the European Publishing Industry – but is it the 'Real Linked Data'? A short review of the Publishers' Forum 2013
Invited by Helmut von Berg, Director at Klopotek & Partner (Klopotek is THE European vendor for publishing production software), I had the chance to participate and speak at this year's Publishers' Forum 2013 at the Concorde Hotel in Berlin on 22–23 April 2013.
Coming from the semantic web / linked (open) data community to this publishing industry event, with about 320 participants (mainly decision makers) from small to huge publishers across Europe, made me really curious in the run-up to the Forum: what would be the most important issues for innovative publishing processes, and what would be the hypes and hopes of a sector in the middle of a big change, moving from paper publishing straight into the world of today's data economy?
And then in Berlin, on Monday morning, came the big surprise: already the opening keynotes by David Worlock, Outsell, UK ("The Atomization of Everything") and Dan Pollock, Nature Publishing Group, UK ("Networked Publishing is Open for Business") mentioned topics such as the Semantic Web, Linked (Open) Data, and even RDF and triple stores – pointing out, last but not least, that publishers' content needs to be atomized down to the 'data level', where it can then be used successfully for new and innovative business models to serve existing and future customers.
As I had participated in the European Data Forum 2013 (EDF2013) just a few days before the Publishers' Forum, my first thought was: WOW – publishers have arrived in the modern data economy (already following the data value chain)! I enjoyed talking to David Worlock in the coffee break, telling him my thoughts and that I would run a workshop that afternoon on 'Enterprise Terminology as a basis for powerful semantic services for publishers' (see slides on slideshare). His answer: 'Yes Martin, it seems that I was singing your song'.
The following 1.5 days of the Publishers' Forum 2013 were full of presentations, workshops, and discussions about innovative publishing processes, new business models for publishers, and innovative approaches and services – full of terms I know well: metadata management, semantics, contextualisation and, very often, Big Data and Linked (Open) Data. I listened very carefully to all of this, and at some point it became clear that this discussion needed a closer look: many of the talks and presentations were using the above-mentioned terms, principles, and technologies only as marketing buzzwords. A deeper look showed that there was no semantic web technology in place!
Hey, Linked Data does NOT mean establishing something like a relation or link between 'an author and a publication' inside a repository or database – Linked (Open) Data is a well-established and specified methodology based on W3C semantic web standards:
Tim Berners-Lee outlined four principles of linked data in his Design Issues: Linked Data as follows:
- Use URIs to denote things.
- Use HTTP URIs so that these things can be referred to and looked up (“dereferenced”) by people and user agents.
- Provide useful information about the thing when its URI is dereferenced, leveraging standards such as RDF*, SPARQL.
- Include links to other related things (using their URIs) when publishing data on the Web.
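To make these principles concrete, here is a minimal sketch of my own (not from the Forum), using the Python rdflib library; the DBpedia URI is just a convenient public example. It dereferences an HTTP URI and follows the links in the returned data:

from rdflib import Graph, URIRef

# 1. A URI denotes a thing (here, the city of Berlin in DBpedia).
thing = URIRef("http://dbpedia.org/resource/Berlin")

# 2./3. The HTTP URI can be looked up; dereferencing returns useful RDF.
g = Graph()
g.parse(thing)

# 4. The returned data links to other related things via their URIs.
for _, p, o in g.triples((thing, None, None)):
    if isinstance(o, URIRef):
        print(p, "->", o)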
Please read in more detail here:
- W3C Wiki
- Linked Open Data – The Essentials – a quick start guide for decision makers
- Or the FP7 project working on Linked Open Data: LOD2
Being a bit of an evangelist for Linked (Open) Data, I think such a hype can be very dangerous for the publishing industry. I see a very strong need for these companies to adopt innovative content and data management approaches very quickly, to ensure competitiveness today as well as competitive advantage tomorrow – but not using the respective standards (that is, only having the packaging and marketing brochures branded with them) cannot fulfill these hopes in the mid and long term!
I would therefore like to point out that 'Linked Data' does not always seem to be 'Linked Data', and I strongly recommend taking a look at the well-proven standards. When selecting IT consultants and IT vendors (that is, your IT partners – another interesting message taken home from the Forum was that publishers and IT vendors should co-operate more closely in the future, in the form of sustainable partnerships), make sure that these partners have already worked, and continue to work, with these standards and mechanisms!
Christian Dirschl (Wolters Kluwer) presenting the WKD Use Case on Enterprise Terminologies
By the way, I had a great workshop on Monday afternoon together with Christian Dirschl from Wolters Kluwer Germany (WKD), discussing applications built on top of enterprise terminologies (controlled vocabularies using real linked (open) data principles). And the Semantic Web Company (SWC) is already a partner of the publisher WKD – a partnership that seems to become more fruitful and sustainable every day, using real linked (open) data…
Waar is de Koning? [waarisdekoning.nl], which can be translated as "Where is the King", was designed by Interactive Design Agency Clever Franke to map the movements and activities of the crowds as they gather in the city of Amsterdam today.
Based on actual anonymized usage data from the mobile phone antennas present in the city, as well as the density of geo-located tweets, the map aims to inform the public during the ceremonies and festivities surrounding the crowning of the Dutch King Willem-Alexander.
As is usual in the Netherlands, the map naturally features the color orange.
Ranging from one day to one eon, and framing the time periods in which different kinds of species emerged on Earth, the timeline ribbon acts like a dynamic stacked bar chart that enables easy comparison.
- "Employees who ate at cafeteria tables designed for 12 were more productive than those at tables for four, thanks to more chance conversations and larger social networks. That, along with things like companywide lunch hours and the cafes Google is so fond of, can boost individual productivity by as much as 25 percent." ()
- "Managers avoid dealing with low performers (because they believe the conversation will be difficult), and instead assign work to the employees they enjoy — i.e. high performers ... They end up 'burning out' those same high performers." ()
- "Is it really true that using someone else's invention is the actually the same thing as stealing their sheep? If I steal your sheep, you don't have them any more. If I use your idea, you still have the idea, but are less able to profit from using it. The two concepts may be cousins, but they not identical." ()
- Clever and simple idea: Attach a little flash memory and a small battery to memory chips ( )
- Another clever and simple idea: On touchscreens (like your phone), make a knuckle or nail tap like a right mouse click so it does something different (  )
- Most data visualizations would be more clear done as a simple bar chart ()
- When someone comes back to a search result page after hitting the back button, you should add more search results to the bottom of the page ()
- For the first time, more smartphone ship than dumbphones, which has big implications, especially for the developing world ( )
- You can identify people based on just four locations sampled from a mobility trace (cell towers and Wifi nearby) from their cell phone ()
- "The problem is that Apple has not been able to sustain its high margin levels" ( )
- Humor (from The Onion): Weeping Tim Cook spotted screaming for help at Steve Jobs' tombstone ()
- Amazingly arrogant executive hired from Apple didn't understand customer base or think he had to, destroyed a major retailer ( )
- Amazon moves against Google ( ) and Google moves against Amazon (   )
- Very soon, only big players -- like Amazon, Facebook, and Google -- will be able to do personalized advertising. A change to third-party cookies will kill off all startups working on personalized advertising, but major websites get an exemption. ( )
- A new compression library from Google designed for web content, can be decompressed by existing software so no changes required on the client side to use it, just need to recompress the static content on the server to save about 5% in bandwidth ()
- eBay successfully moves away from auctions. "Auctions ... are less than 10% of what we do." ()
- "At this point, unfortunately, it seems clear that the Windows 8 launch not only failed to provide a positive boost to the PC market, but appears to have slowed the market ... Radical changes to elements like the user interface and higher costs had made PCs less attractive compared with tablets and other devices." ( )
- A MacBook Pro runs Windows faster than any PC laptop (but only because PCs have so much crapware installed) ( )
- "Aereo's founders realized that [a court] ruling offered a blueprint for building [an IPTV] service that wouldn't require the permission of broadcasters. In Aereo's server rooms are row after row of tiny antennas mounted on circuit boards. When a user wants to view or record a television program, Aereo assigns him an antenna exclusively for his own use." ()
- The vast majority of people have simple taxes, so simple that the IRS could just mail you a tax return, you'd look it over to make sure everything is correct and sign it, and you'd be done. Why don't we have that? Apparently, "it's been opposed for years by the company behind the most popular consumer tax software—Intuit, maker of TurboTax." ()
- Why Redfin has been unable to undermine the absurdly high 6% commission when you sell your home ( )
- "Personal finance courses ... have no effect on financial outcomes ... [but] additional training in mathematics [does]" ()
- "Graduate school in the humanities: Just don't go" ( )
- At least so far, MOOCs (like Coursera and Udacity) seem to only work for people who are already highly motivated, which isn't the group in the most need ()
- Seems to be increasing evidence that some autoimmune diseases (including allergies) are rooted in a bored immune system incorrectly prioritizing threats. Almost a parallel with anxiety disorders, your immune system is seeing threats where none exist, incorrectly prioritizing dangers. ( )
- "Deep waters have absorbed a surprising amount of heat -- and they are doing so at an increasing rate over the last decade" ()
- "Resilience -- building systems able to survive unexpected and devastating attacks -- is the best answer we have right now." ()
- The web-based version of blackmailing people who have done something embarrassing ()
- Little known fact, the second most used web server is something called Allegro RomPager ( )
- For most people in the US, the vast majority of entertainment time is still spent watching normal, live TV ()
- Odd similarities between distributed denial of service attacks and pollution. As Ed Felten writes, misconfigured DNS servers allow massive DDoS attacks, but it's hard to get people to fix it, because "the resulting harm falls mostly on people outside the organization." ( )
The EnviLOD project demonstrated how location-based searches, enabled and underpinned by Linked Open Data (LOD) and semantic technologies, can improve the retrieval of information. Although the semantic search tool developed through the EnviLOD project is not yet 'production-ready', it does demonstrate the benefits of this newly emerging technology. As such, it will be incorporated into the Envia 'labs' page of the Envia website, which is currently under development. Within Envia Labs, users of the regular Envia service will be able to experiment with and comment on tools that might eventually augment or be incorporated into the service, allowing the Envia project team to gauge their potential uptake by the user community.
We also worked on the automatic generation of semantically enriched metadata to accompany records within the Envia system. This aims to improve the discovery of information within the current Envia system by automatically generating keywords for inclusion in the article metadata, based on occurrences of terms from the GEMET, DBpedia, and GeoNames vocabularies. Work on a pipeline to incorporate this into the Envia system in a regular and sustainable manner is already under way.
One particularly important lesson learnt from this short-term project is that the availability of large amounts of content, open to text mining and experimentation, needs to be ensured from the very beginning of the project. In EnviLOD there were copyright issues with the majority of environmental science content at the British Library, which limited the experimental system to just over one thousand documents. Due to this limited content, users were not always able to judge how comprehensive or accurate the semantic search was, especially when compared against results offered by Google. Since the British Library is now planning to integrate the EnviLOD semantic enrichment tools within the advanced Envia Labs functionality, future work on this tool could potentially evaluate against this more comprehensive data, through the Envia system.
Another important lesson learnt from the research activities is that working with Linked Open Data is very challenging, not only in terms of data volumes and computational efficiency, but also in terms of data noise and robustness. In terms of noise, an initial evaluation of the DBpedia-based semantic enrichment pipeline revealed that relevant entity candidates were not included initially, because in the ontology they were classified as owl:Thing, whereas we were considering instances of specific sub-classes (e.g. Person, Place). There are over 1 million unclassified instances in the current DBpedia snapshot. In terms of computational efficiency, we had to introduce memory-based caches and efficient data indexing, in order to make the entity linking and disambiguation algorithm sufficiently efficient to process data in near real-time. Lastly, deploying the semantic enrichment on a server, e.g. at Sheffield or at the British Library, is far from trivial, since both OWLIM and our algorithms require large amounts of RAM and computational power. Parallelising the computation to more than three threads is an open challenge, due to the difficulties experienced with parallelising OWLIM. Ontotext are currently working on cloud-based, scalable deployments, so future projects would be able to solve the scalability issue effectively.
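As an illustration of the kind of memory-based caching mentioned above, here is a minimal sketch of my own (not the EnviLOD code; the endpoint and query shape are assumptions) that memoizes candidate-entity lookups per surface form, so repeated mentions avoid repeated remote queries:

from functools import lru_cache
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"  # public endpoint, used here only for illustration

@lru_cache(maxsize=100_000)
def candidate_uris(surface_form):
    # Memoized lookup: DBpedia URIs whose English label matches the mention.
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setQuery("""
        SELECT DISTINCT ?uri WHERE {
          ?uri rdfs:label "%s"@en .
        } LIMIT 50
    """ % surface_form)
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return tuple(r["uri"]["value"] for r in rows)

# The second call for the same mention is answered from the in-memory cache.
print(candidate_uris("Sheffield"))
print(candidate_uris("Sheffield"))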
Lastly, the quantitative evaluation of our DBpedia-based semantic enrichment pipeline was far from trivial. It required us to manually annotate a gold standard corpus of environmental science content (100 documents were annotated with disambiguated named entities). However, releasing these to other researchers has proven practically impossible, due to the copyright and licensing restrictions imposed by the content publishers on the British Library. In a related project, we have now developed a web-based entity annotation interface, based on CrowdFlower. This will enable future projects to create gold standards more easily, based on copyright-free content. Ultimately, during development we made use of available news and similar corpora created by TAC-KBP 2009 and 2010, which we used for algorithm development and testing in EnviLOD, prior to the final quantitative evaluation on the copyrighted BL content. So even though the aims of the project were achieved and a useful running pilot system was created, publishing the results in scientific journals has been hampered by these content issues.
In conclusion, we fully support the findings of the JISC report on text mining that copyright exemption for text mining research is necessary, in order to fully unlock the benefits of text mining to scientific research.
The beautifully crafted app Connected China [reuters.com], designed by Fathom Information Design, tracks and visualizes the people, institutions and relationships that form China's elite power structure.
The application was specifically designed for the iPad and is based on data, text, photos and videos from the international news agency Reuters.
Three visualizations stand out: "Social Power", which represents the power base of Xi Jinping through family connections; "Institutional Power", which reflects the power of the Politburo Standing Committee on every level of the government; and finally, "Career Comparison", which tracks the career paths of the seven members of the Politburo Standing Committee.
More information about this app is available here.
Chesapeake Bay Grasses [chesapeakebay.net], designed by Stamen Design, is an interactive map that tracks a quite exotic subject: the changes in the underwater grasses of Chesapeake Bay, the largest estuary in the United States.
Accordingly, the map reveals how the fluctuations in water temperature, salinity and turbidity correlate to grass abundance, as dominant species ebb and flow and grass beds shrink and expand over a period of 30 years.
More information about this project is also available here.
At its core, PoolParty is built upon SKOS, W3C's standard for defining controlled vocabularies like taxonomies and thesauri. However, the latest release, 3.2.2, of the well-known thesaurus software offers a highly flexible RDF schema editor to introduce either widely accepted schemas like FOAF or SIOC, or even individual ones customized to one's own needs.
“This extension of PoolParty offers new options for our clients to create highly expressive knowledge graphs. Custom schemas can also be used to make links between differing enterprise vocabularies. On the other hand, we have taken care not to overload the PoolParty user interface with unwanted complexity”, says Helmut Nagy, COO of the Semantic Web Company.
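As a rough illustration of what mixing SKOS with a custom schema looks like at the RDF level — a sketch of mine using Python's rdflib, with invented example URIs, not actual PoolParty output — the same resource can be described both as a skos:Concept and with FOAF properties:

from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, SKOS, FOAF

EX = Namespace("http://example.org/thesaurus/")

g = Graph()
concept = EX["ada-lovelace"]

# Standard SKOS statements, as in any thesaurus
g.add((concept, RDF.type, SKOS.Concept))
g.add((concept, SKOS.prefLabel, Literal("Ada Lovelace", lang="en")))
g.add((concept, SKOS.broader, EX["mathematicians"]))

# Custom schema: the same resource also carries FOAF properties
g.add((concept, RDF.type, FOAF.Person))
g.add((concept, FOAF.homepage, URIRef("http://example.org/ada")))

print(g.serialize(format="turtle"))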
Watch this video to get an impression how this new feature works:
In addition to “Custom Schemas”, PoolParty Thesaurus Server is now integrated with Virtuoso Universal Server. Thesaurus managers can ‘deploy’ stable versions of their knowledge graphs into a Virtuoso RDF store. Virtuoso is well-known for its high performance even when complex queries are made across different (named) graphs.
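To give an idea of the kind of cross-graph query this enables — my own sketch with an invented graph layout and Virtuoso's default local endpoint, not a PoolParty example — one SPARQL query can join a deployed thesaurus graph with a separate mappings graph:

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://localhost:8890/sparql")  # Virtuoso's default endpoint
sparql.setQuery("""
    PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
    PREFIX owl:  <http://www.w3.org/2002/07/owl#>
    SELECT ?concept ?label ?sameAs WHERE {
      GRAPH <http://example.org/thesaurus/v1> {   # a deployed, stable thesaurus version
        ?concept skos:prefLabel ?label .
      }
      GRAPH <http://example.org/mappings> {       # links out to other vocabularies
        ?concept owl:sameAs ?sameAs .
      }
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["label"]["value"], "->", row["sameAs"]["value"])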
The following video shows a short demo of this brand-new feature, which opens up completely new options for big data solutions based on enterprise linked data integration:
To get a complete overview of all the new features of PoolParty Thesaurus Server 3.2.2, please take a look at the release notes.
- To engage actively with environmental science researchers and other key stakeholders, in order to derive requirements and evaluate project results.
- To develop tools for efficient LOD-based semantic enrichment of unstructured content.
- To create and evaluate intuitive user interface methods that hide the complexities of the SPARQL semantic search language.
- To use British Library’s Envia tool as a case study in using LOD vocabularies for enhanced information discovery and management.
Objective 1: Stakeholder Engagement, Requirements Capture, and Evaluation
In order to demonstrate the value of shared LOD vocabularies to different applications, information types, and audiences, we focused on use cases related to research on flooding and climate change in the UK. We captured the requirements of relevant audiences and groups via a web-based questionnaire, which collected some actual search queries alongside user input on the kinds of searches they require. We engaged researchers, practitioners, and information managers, in order to assess how LOD vocabularies might support their needs. This also motivated our choice of different information types, including full-text content, metadata, and LOD datasets. The main user requirement was support for location-based searches, e.g. flooding near Sheffield, or flooding on rivers flowing through Gloucestershire. In addition, users emphasized their need for an intuitive semantic search UI.
A new British Library information discovery tool for environmental science, Envia, was used as a starting point to test the use of semantics towards enhancing information discovery and management. Envia is particularly suited as a test case for these purposes, as it features a mixed corpus of content, including datasets, journal articles, and grey literature, with accompanying metadata records. Envia also enabled us to examine the value of semantic enrichment for information managers. Environmental consultants at HR Wallingford collaborated as domain experts, providing feedback on how the semantic work undertaken in EnviLOD supported their work as environmental science practitioners and innovators.
During the project, stakeholder engagement was ongoing through the project website, blog, Twitter presence, published reports, and joint meetings. In particular, user input and feedback was sought during the design of the semantic search user interface, in order to ensure that it meets user needs. The interface was implemented in three iterations:
- The British Library team participated in the design meeting and provided feedback on the first implementation. This helped the Sheffield team to adjust and simplify the interface.
- Following this, the semantic search UI was demonstrated during a lunchtime workshop and EnviLOD presentation. At the end, environmental science researchers were given the opportunity to try the interface and provide us with structured feedback (via a written questionnaire) and a user-led discussion. This early evaluation helped us further refine the user interface design and remove confusing elements.
- Lastly, much wider stakeholder feedback was solicited during a user outreach and evaluation workshop organised at the British Library. There were 25 participants at the event, which enabled us to gather very detailed feedback and suggestions for minor improvements. Overall, the majority of users stated that semantic search would be very useful for information discovery and that they would use the system if it were deployed in production.
Objective 2: LOD-Based Semantic Enrichment of Unstructured Content
Semantic annotation is the process of tying semantic models, such as ontologies, and scientific articles together. It may be characterised as the dynamic semantic enrichment of unstructured and semi-structured documents, and the linking of these to relevant domain ontologies/knowledge bases.
The focus of our work was on implementing a LOD-based semantic enrichment algorithm and applying it to metadata and full-text documents from Envia. A trial web service is now available.
As part of this work, we evaluated the coverage and accuracy of relevant general-purpose LOD datasets (namely GeoNames and DBpedia) when applied to data and content from our domain. The results showed that GeoNames is a rich source of knowledge about locations (e.g. NUTS administrative regions, latitude, longitude, parent country); however, it is not suitable on its own as a primary source for knowledge enrichment, due to its very high level of detail and the resulting location ambiguity (e.g. it contains names of farms). DBpedia, on the other hand, is much more balanced, including also knowledge about people, organisations, products, and other entities. Therefore DBpedia was chosen as the primary LOD resource for semantic enrichment. Specifically for locations, we identified their equivalent entries in GeoNames and enriched the text content with additional metadata from there.
Tools for LOD-based geo-location disambiguation, date and measurement recognition and normalisation were implemented and tested. In some detail, the first step is to identify all candidate instance URIs from DBpedia, which are mentioned in a given document. This phase is designed to maximise recall, in order to ensure that more relevant documents can be returned at search time. The second step is entity disambiguation, which is carried out on the basis of string, semantic, and contextual similarity, coupled with a corpus frequency metric. The algorithm was developed on a general purpose, shared news-like corpus and evaluated on environmental science papers and metadata records from the British Library.
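To make the two-step process concrete, here is a simplified sketch of my own, with invented weights and inputs (not the EnviLOD implementation), of disambiguation by combining string, contextual, and corpus-frequency evidence:

from difflib import SequenceMatcher

def string_sim(mention, label):
    # String similarity between the mention and a candidate's label.
    return SequenceMatcher(None, mention.lower(), label.lower()).ratio()

def context_sim(doc_words, abstract_words):
    # Contextual similarity: word overlap between the document and the
    # candidate's description (Jaccard, for illustration).
    if not doc_words or not abstract_words:
        return 0.0
    return len(doc_words & abstract_words) / len(doc_words | abstract_words)

def disambiguate(mention, candidates, doc_words):
    # candidates: list of (uri, label, abstract_words, corpus_freq),
    # where corpus_freq is a normalised corpus frequency in [0, 1].
    def score(c):
        uri, label, abstract_words, freq = c
        return (0.4 * string_sim(mention, label)
                + 0.4 * context_sim(doc_words, abstract_words)
                + 0.2 * freq)
    return max(candidates, key=score)[0]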
Objective 3: User Interface for Semantic Search
The semantic search interface is shown below and can be tried online too:
There is a keyword search field, complemented with optional semantic search constraints, through a set of inter-dependent drop-down lists. In the first list, Location allows users to search for mentions of locations; Date – of dates; Document – for specifying constraints on document-level attributes, etc.
More than one semantic constraint can be added, through the plus button, which inserts a new row underneath the current row of constraints.
If a Location is chosen as a semantic constraint, then, if required, further constraints can be specified by choosing an appropriate property constraint. Population allows users to pose restrictions on the population number of the locations that are being searched for. Similar numeric constraints can be imposed on the latitude, longitude, and population density attribute values.
Restrictions can also be imposed on a location's name or its country code, i.e. which country it belongs to. When "is" is chosen, the location name must be exactly as specified (e.g. Oxford), whereas "contains" provides sub-string matching (e.g. Oxfordshire). In the example below, the user is searching for documents mentioning locations whose name contains Oxford. When the search is executed, this returns documents mentioning Oxford explicitly, but also documents mentioning Oxfordshire and other locations in Oxfordshire (e.g. Wytham Woods, Banbury).
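For a sense of how such a constraint can be hidden from the user, here is a hypothetical sketch of the translation step; the query shape, the ex: namespace, and the function are all my own inventions, and the actual EnviLOD query generation may differ:

def location_name_constraint(value, mode="contains"):
    # Translate the UI's "is"/"contains" choice into a SPARQL filter.
    if mode == "is":
        flt = 'FILTER (str(?name) = "%s")' % value
    else:
        flt = 'FILTER (CONTAINS(str(?name), "%s"))' % value
    return """
    SELECT DISTINCT ?doc WHERE {
      ?doc ex:mentions ?loc .
      ?loc a ex:Location ;
           rdfs:label ?name .
      %s
    }""" % flt

# Searching for locations whose name contains "Oxford"
print(location_name_constraint("Oxford"))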
Objective 4: Use of Envia as a Testbed
Envia provided us with readily available content and a testbed for experimenting with the semantic enrichment methods. Sparsely populated metadata records were enriched with environmental science terms and location and organisation entities, as well as with additional metadata imported from GeoNames and DBpedia.
The British Library will launch a public beta of Envia in May 2013, where EnviLOD-enriched content will be included as an experimental option, complementing the traditional full-text search in Envia. Over time, this will give access to user query logs and allow the iterative identification and improvement of the quality of the semantic enrichment and search algorithms.
All our technical objectives have now been completed and we are ready to deploy the semantic enrichment pipeline within the Envia system, as well as carry out further improvements and experiments with the EnviLOD semantic search UI. We are looking forward to taking this work further in the future, implementing the ideas which we received during the user evaluation workshop.
We have now completed the analysis of the user feedback received during two evaluation sessions, carried out towards the end of the 7-month-long EnviLOD project:
- The first smaller user evaluation session was held internally at HR Wallingford and was used as an initial test of our semantic search user interface (UI). Following this, some minor adjustments were made to the UI.
- We held a much larger user workshop at the British Library at the end of January 2013. The core of our findings are derived from this second, much larger user evaluation.
Overall, workshop participants found the EnviLOD semantic search UI (see a screen shot at the end of this post) easy to learn and use. In some detail, 87.5% of users disagreed or strongly disagreed with the statement that the UI is unnecessarily complex. Similarly, 81.25% agreed or strongly agreed that the EnviLOD UI is easy to use. 93.75% of participants also felt they can use the system without needing to learn more about it first. With respect to quality, we asked whether the results returned by the semantic search UI made sense to the users, where 75% agreed or strongly agreed with this statement.
We also carried out a task completion analysis, comparing semantic search versus keyword search. For details on task definitions, success rates, and more in-depth questionnaire results, please see our WP2 User Feedback Report.
The group discussions and the three written feedback questions on the user feedback forms allowed us to elicit a number of small, easy-to-implement changes to the UI, which we hope will lead to improved usability in the future. We are planning to implement these during follow-up research in the next 1-2 months and then carry out a second user-based evaluation. This time we will include users recruited online, who will not be shown a demonstration of the semantic search UI in advance.
In addition, we also elicited a number of more challenging ideas for future improvements, which cannot easily be addressed within the scope of short, informal follow-up work. The most substantial of these include a natural language interface, map-based visualisations, and support for user feedback on search results and search query refinement. These are all valuable possible extensions to this work (see Damljanovic et al., 2013, for some preliminary work we have already done on a natural language interface). However, fully addressing them would require a much longer and larger project of two or three years.
Lastly, a known limitation of this evaluation, which could have impacted our results, came from the limited amount of content we could index in the experimental system. This was due to copyright issues with the majority of environmental science content at the British Library. Due to this limited content, participants were not always able to judge how comprehensive or accurate the semantic search was, especially when compared against results offered by Google. Since the British Library is now planning to integrate the EnviLOD semantic enrichment tools within the advanced Envia Labs functionality, future work on this tool could potentially evaluate against this more comprehensive data, through the Envia system.