If you are interested, please email msrnycrsvp at microsoft.com and say “I want to come” so we can get a count of attendees for refreshments.
With the rise of linked data and the semantic web, concepts and terms like ‘ontology’, ‘vocabulary’, ‘thesaurus’ or ‘taxonomy’ are being picked up frequently by information managers, search engine specialists and data engineers to describe ‘knowledge models’ in general. In many cases the terms are used without any specific meaning, which leads many people to a basic question:
What are the differences between a taxonomy, a thesaurus, an ontology and a knowledge graph?
This article aims to bring light into this discussion by guiding you through an example which starts off from a taxonomy, introduces an ontology and finally exposes a knowledge graph (linked data graph) to be used as the basis for semantic applications.
1. Taxonomies and thesauri
Taxonomies and thesauri are closely related kinds of controlled vocabularies used to describe relations between concepts and their labels, including synonyms, most often in various languages. Such structures can be used as a basis for domain-specific entity extraction or text categorization services. Here is an example of a taxonomy about the Apollo programme, created with PoolParty Thesaurus Server:
The nodes of a taxonomy represent various types of ‘things’ (so-called ‘resources’): the topmost level (orange) is the root node of the taxonomy, purple nodes are so-called ‘concept schemes’, followed by ‘top concepts’ (dark green) and ordinary ‘concepts’ (light green). In 2009 the W3C introduced the Simple Knowledge Organization System (SKOS) as a standard for the creation and publication of taxonomies and thesauri.

The SKOS ontology comprises only a few classes and properties. The most important types of resources are Concept, ConceptScheme and Collection. The hierarchical relations between concepts are ‘broader’ and its inverse ‘narrower’. Thesauri most often also cover non-hierarchical relations between concepts, like the symmetric property ‘related’. Every concept has at least one ‘preferred label’ and can have numerous synonyms (‘alternative labels’). Whereas a taxonomy can be envisaged as a tree, thesauri most often have polyhierarchies: a concept can be the child node of more than one parent. By including polyhierarchical and also non-hierarchical relations between concepts, a thesaurus should be envisaged as a network (graph) of nodes rather than a simple tree.
2. Ontologies

Ontologies are perceived as complex in contrast to the rather simple taxonomies and thesauri. The limitations of taxonomies and SKOS-based vocabularies in general become obvious as soon as one tries to describe a specific relation between two concepts: ‘Neil Armstrong’ is not just unspecifically ‘related’ to ‘Apollo 11’; he was the ‘commander of’ that particular Apollo mission. Therefore we have to extend the SKOS ontology by two classes (‘Astronaut’ and ‘Mission’) and the property ‘commander of’, which is the inverse of ‘commanded by’.
The SKOS concept with the preferred label ‘Buzz Aldrin’ has to be classified as an ‘Astronaut’ in order to be described by specific relations and attributes like ‘is lunar module pilot of’ or ‘birthDate’. The introduction of additional ontologies to expand the expressivity of SKOS-based vocabularies follows the ‘pay-as-you-go’ strategy of the linked data community. The PoolParty knowledge modelling approach suggests starting with SKOS and then extending this simple knowledge model with other knowledge graphs, ontologies, annotated documents and legacy data. This paradigm can be memorized as a rule: ‘Start SKOS, grow big’.
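The extension described above could be sketched in RDF (Turtle notation). This is a minimal illustration, not PoolParty’s actual schema; the `ex:` namespace and the local names are assumptions made for the example:

```turtle
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix ex:   <http://example.org/apollo#> .

# Two new classes extending the plain SKOS model
ex:Astronaut a owl:Class ; rdfs:subClassOf skos:Concept .
ex:Mission   a owl:Class ; rdfs:subClassOf skos:Concept .

# A specific relation instead of the unspecific skos:related
ex:commanderOf a owl:ObjectProperty ;
    rdfs:domain   ex:Astronaut ;
    rdfs:range    ex:Mission ;
    owl:inverseOf ex:commandedBy .

# The concepts themselves
ex:NeilArmstrong a ex:Astronaut ;
    skos:prefLabel "Neil Armstrong"@en ;
    ex:commanderOf ex:Apollo11 .

ex:Apollo11 a ex:Mission ;
    skos:prefLabel "Apollo 11"@en .
```

Because ‘commander of’ is declared as the inverse of ‘commanded by’, a reasoner can infer the second statement from the first without it being asserted.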
3. Knowledge Graphs
Knowledge graphs are all around (e.g. DBpedia, Freebase, etc.). Based on W3C’s Semantic Web Standards such graphs can be used to further enrich your SKOS knowledge models. In combination with an ontology, specific knowledge about a certain resource can be obtained with a simple SPARQL query. As an example, the fact that Neil Armstrong was born on August 5th, 1930 can be retrieved from DBpedia. Watch this YouTube video which demonstrates how ‘linked data harvesting’ works with PoolParty.
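For instance, the birth date mentioned above can be retrieved from DBpedia’s public SPARQL endpoint with a query along these lines (a sketch using the DBpedia ontology’s identifiers):

```sparql
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbr: <http://dbpedia.org/resource/>

SELECT ?birthDate
WHERE {
  dbr:Neil_Armstrong dbo:birthDate ?birthDate .
}
```

Running this against the endpoint returns the birth date as a typed literal, which can then be attached to the corresponding concept in your own knowledge model.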
Knowledge graphs can be envisaged as a network of all kinds of things which are relevant to a specific domain or to an organization. They are not limited to abstract concepts and relations but can also contain instances of things like documents and datasets.
Why should I transform my content and data into a large knowledge graph?
The answer is simple: to be able to run complex queries over the entirety of all kinds of information. By breaking up data silos, there is a high probability that query results become more valid.
With PoolParty Semantic Integrator, content and documents from SharePoint, Confluence, Drupal etc. can be transformed automatically to integrate them into enterprise knowledge graphs.
Taxonomies, thesauri, ontologies, linked data graphs including enterprise content and legacy data – all kinds of information can become part of an enterprise knowledge graph, which can be stored in a linked data warehouse. Based on technologies like Virtuoso, such data warehouses have the ability to serve as complex question answering systems with excellent performance and scalability.
In the early days of the semantic web, we constantly discussed whether taxonomies, ontologies or linked data graphs would be part of the solution. Again and again, discussions like ‘Did the current data-driven world kill ontologies?‘ are being led. My proposal is: try to combine all of them. Embrace every method which makes meaningful information out of data. Stop denouncing communities which don’t follow one or another aspect of the semantic web (e.g. reasoning or SKOS). Let’s put the pieces together – together!
The last several years have seen phenomenal growth in machine learning, such that this earlier post from 2007 is understated. Machine learning jobs aren’t just growing on trees, they are growing everywhere. The core dynamic is a digitizing world, which makes people who know how to use data effectively a very hot commodity. At present, anyone reasonably familiar with some machine learning tools and a master’s level of education can get a good job at many companies, while PhD students coming out sometimes have bidding wars and many professors have created startups.
Despite this, hiring in good research positions can be challenging. A good research position is one where you can:
- Spend the majority of your time working on research questions that interest you.
- Work with other like-minded people.
- For several years.
I see these as critical—research is hard enough that you cannot expect to succeed without devoting the majority of your time. You cannot hope to succeed without personal interest. Other like-minded people are typically necessary in finding the solutions of the hardest problems. And, typically you must work for several years before seeing significant success. There are exceptions to everything, but these criteria are the working norm of successful research I see.
The set of good research positions is expanding, but at a much slower pace than the many applied scientist types of positions. This makes good sense as the pool of people able to do interesting research grows only slowly, and anyone funding this should think quite hard before making the necessary expensive commitment for success.
But, with the above said, what makes a good candidate for a research position? People have many diverse preferences, so I can only speak for myself with any authority. There are several things I do and don’t look for.
- Something new. Any good candidate should have something worth teaching. For a PhD candidate, the subject of your research is deeply dependent on your advisor. It is not necessary that you do something different from your advisor’s research direction, but it is necessary that you own (and can speak authoritatively about) a significant advance.
- Something other than papers. It is quite possible to persist indefinitely in academia while only writing papers, but it does not show a real interest in what you are doing beyond survival. Why are you doing it? What is the purpose? Some people code. Some people solve particular applications. There are other things as well, but these make the difference.
- A difficult long-term goal. A goal suggests interest, but more importantly it makes research accumulate. Some people do research without a goal, solving whatever problems happen to pass by that they can solve. Very smart people can do well in research careers with a random walk amongst research problems. But people with a goal can have their research accumulate in a much stronger fashion than a random walk through research problems. I’m not an extremist here—solving off goal problems is fine and desirable, but having a long-term goal makes a long-term difference.
- A portfolio of coauthors. This shows that you are the sort of person able to, and interested in, working with other people, as is very often necessary for success. This can be particularly difficult for some PhD candidates whose advisors expect them to work exclusively with (or for) them. Summer internships are both a strong tradition and a great opportunity here.
- I rarely trust recommendations, because I find them very difficult to interpret. When the candidate selects the writers, the most interesting bit is who the writers are. Letters default positive, but the degree of default varies from writer to writer. Occasionally, a recommendation says something surprising, but do you trust the recommender’s judgement? In some cases yes, but in many cases you do not know the writer.
Meeting the above criteria within the context of a PhD is extraordinarily difficult. The good news is that you can “fail” with a job that is better in just about every way.
Anytime criteria are discussed, it’s worth asking: should you optimize for them? In another context, lines of code is a terrible metric to optimize when judging programmer productivity. Here, I believe optimizing for (1), (2), (3), and (4) is beneficial and worthwhile for PhD students.
While Google has been doing a great job of their front page animations (today's is very nice, illustrating how Brazil and The Netherlands are on their way to Russia for 2018), Bing appears to be far more attentive to actually answering questions about the competition. For example:
Compared to Bing's
Google's answer brings up some interesting news articles, but Bing brings up stats on the teams and even a prediction of who will win (Cortana, which is driving these predictions, has been doing a perfect job of predicting game outcomes).
- Crazy cool and the first time I've seen ultrasound used for device-to-device communication outside of research: "Chromecast will be able to pair without Wi-Fi, or even Bluetooth, via an unusual method: ultrasonic tones." ()
- A 3D printer that can print in "any weldable material" including titanium, aluminum, and stainless steel ()
- "You teach Baxter [an inexpensive industrial robot] how to do something by grabbing an arm and showing it what you want, sort of like how you would teach a child to paint" ()
- When trying to use the wisdom of the crowds, you're better off using only the best part of the crowd. ()
- "Americans now appear to trust internet news about as much as newspapers and television news ... not because confidence in internet news is rising, but because confidence in TV news and newspapers has plummeted over the years." ()
- "Microsoft is basically 'done' with Windows 8.x. Regardless of how usable or functional it is or isn't, it has become Microsoft's Vista 2.0 -- something from which Microsoft needs to distance itself." ()
- Google Flights now lets you see everywhere you can fly out of a city (including limiting to non-stops only) and how much it would cost (   )
- "Entering the fulfillment center in Phoenix feels like venturing into a realm where the machines, not the humans, are in charge ... The place radiates a non-human intelligence, an overarching brain dictating the most minute movements of everyone within its reach." ()
- Google's location history feature is both fascinating and frightening. If you own an Android device, go to location history, set it to 30 days, and see the detail on where you have been. While it's true that many have this kind of data, it may surprise you to see it all at once.
- "Vodafone, one of the world's largest mobile phone groups, has revealed the existence of secret wires that allow government agencies to listen to all conversations on its networks, saying they are widely used in some of the 29 countries in which it operates in Europe and beyond." ()
- Many "users actually do not attach any significant economic value to the security of their systems" ()
- "Ensuring that our patent system 'promotes the progress of science,' rather than impedes it, consistent with the constitutional mandate underlying our intellectual property system" ()
- Smartphones may have hit the limit on how much improvements to screen resolution matter, meaning they will have to compete on other features (like sensors or voice recognition) ()
- "Project Tango can see the world around it in 3D. This would allow developers to make augmented-reality apps that line up perfectly with the real world or make an app that can 3D scan an object or environment." ()
- The selling point of smartwatches is paying $200 to not have to pull your phone out of your pocket, and that might be a tough sell. ()
- "As programmers will tell you, the building part is often not the hardest part: It's figuring out what to build. 'Unless you can think about the ways computers can solve problems, you can't even know how to ask the questions that need to be answered'" ()
- "[No] lectures, discussion sections, midterms ... a pre-test for each subject area ... given a mentor with a graduate degree in the field ... [and] textbooks, tutorials, and other resources. Eventually, they're assessed on how well they understand the concepts." ()
- "A naked mole rat has never once been observed to develop cancer" ()
- Hilarious Colbert Report on the Hachette mess, particularly loved the bit on "Customers who enjoyed this also bought this" at 3:00 in the video ()
- Humor from The Onion: "We want $100 from you, so we’re just going to take it. As a cable subscriber, you really have no other option here" ()
- Humor from the Borowitz Report: "It never would have occurred to me that an enormous corporation with the ability to track over half a billion customers would ever exploit that advantage in any way." ()
Before going further, it appears that a Data Scientist should possess an awful lot of skills: statistics, programming, databases, presentation skills, and knowledge of data cleaning and transformations.
The skills you should ideally acquire are as follows:
1) Sound Statistical Understanding and Data Pre-Processing
2) Know the Pitfalls: You must be aware of the biases that could affect you as an analyst, as well as common mistakes made during statistical analysis
3) Understand how several Machine Learning / Statistical Techniques work.
4) Time Series Forecasting
5) Computer Programming (R, Java, Python, Scala)
6) Databases (SQL and NoSQL Databases)
7) Web Scraping (Apache Nutch, Scrapy, JSoup)
8) Text Data
Statistical Understanding: A good introductory book is Fundamental Statistics for the Behavioral Sciences by Howell, along with IBM SPSS for Introductory Statistics: Use and Interpretation and IBM SPSS for Intermediate Statistics by Morgan et al. Although all of these books (especially the latter two) lean heavily on IBM SPSS software, they provide a good introduction to key statistical concepts, and the books by Morgan et al. offer a methodology illustrated with a practical example: analyzing the High School and Beyond dataset.
Data Pre-Processing: I must reiterate the importance of thoroughly checking your data and identifying problems within it. Pre-processing guards against feeding erroneous data to a machine learning / statistical algorithm, and it also transforms the data so that an algorithm can extract and identify patterns more easily. Suggested books:
- Data Preparation for Data Mining by Dorian Pyle
- Mining Imperfect Data: Dealing with Contamination and Incomplete Records by Pearson
- Exploratory Data Mining and Data Cleaning by Johnson and Dasu
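As a minimal illustration of the kind of checks these books discuss, here is a sketch in Python. The data, the function name and the outlier rule (median absolute deviation with an assumed cutoff of 5 MADs) are all hypothetical choices you would tune per dataset:

```python
import statistics

def audit_column(values):
    """Count missing entries and flag outliers far from the median."""
    present = [v for v in values if v is not None]
    missing = len(values) - len(present)
    med = statistics.median(present)
    # Median absolute deviation: robust to the very outliers we want to find
    mad = statistics.median(abs(v - med) for v in present)
    outliers = [v for v in present if mad > 0 and abs(v - med) > 5 * mad]
    return {"missing": missing, "median": med, "outliers": outliers}

# Hypothetical sensor readings with one gap and one suspicious spike
readings = [10.2, 9.8, None, 10.5, 10.1, 9.9, 55.0, 10.3]
report = audit_column(readings)
print(report["missing"])   # 1
print(report["outliers"])  # [55.0]
```

Note that a naive 3-standard-deviations rule would miss the spike here, because the spike itself inflates the standard deviation; this is exactly the kind of pitfall the pre-processing literature warns about.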
Know the Pitfalls: There are many kinds of statistical misuse and biases that may affect your work, even when you are not consciously aware of them. This has happened to me on various occasions. In fact, this blog contains a couple of examples of statistical misuse, even though I have tried (and keep trying) to highlight limitations due to the nature of the data as much as I can. Big Data is another area where caution is warranted; for example, see Statistical Truisms in the Age of Big Data and The Hidden Biases of Big Data.
Some more examples:
- Quora question: What are common fallacies or mistakes made by beginners in Statistics / Machine Learning / Data Analysis?
- Identifying and Overcoming Common Data Mining Mistakes by SAS Institute
The following book is suggested:
- Common Errors in Statistics (and How to Avoid Them) by P. Good and J. Hardin
If you are into financial forecasting, I strongly suggest reading Evidence-Based Technical Analysis by David Aronson, which goes deep into how data-mining bias (and several other cognitive biases) may affect your analysis.
Understand how several Machine Learning / Statistical algorithms work: You must be able to understand the pros and cons of each algorithm. Does the algorithm you are about to try handle noise well? How does it scale? What kinds of optimizations can be performed? Which data transformations are necessary? Here is an example for fine-tuning regression SVMs:
Practical Selection of SVM Parameters and Noise Estimation for SVM Regression
Another book which deserves attention is Applied Predictive Modeling by Kuhn and Johnson, which also gives numerous examples of using the caret R package, which (among other things) has extensive parameter-optimization capabilities.
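The parameter-tuning idea behind both the SVM paper and caret can be sketched generically: evaluate every parameter combination on held-out data and keep the best. The sketch below uses a hypothetical `validation_score` function standing in for a real train-and-validate step (e.g. cross-validated accuracy of an SVM with cost C and kernel width gamma):

```python
from itertools import product

def validation_score(C, gamma):
    """Stand-in for training a model and scoring it on validation data.
    A real version would fit e.g. an SVM and return cross-validated accuracy."""
    return -(C - 1.0) ** 2 - (gamma - 0.1) ** 2

# The grid of candidate parameter values to search over
grid = {"C": [0.1, 1.0, 10.0], "gamma": [0.01, 0.1, 1.0]}

best_params, best_score = None, float("-inf")
for C, gamma in product(grid["C"], grid["gamma"]):
    score = validation_score(C, gamma)
    if score > best_score:
        best_params, best_score = {"C": C, "gamma": gamma}, score

print(best_params)  # {'C': 1.0, 'gamma': 0.1}
```

Exhaustive grid search is the simplest scheme; the caret package also supports smarter strategies such as random search, but the evaluate-and-compare loop is the same.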
When it comes to getting to know Machine Learning/ Statistical Algorithms I'd suggest the following books :
- Data Mining: Practical Machine Learning Tools and Techniques by Witten and Frank
- The Elements of Statistical Learning by Hastie, Tibshirani and Friedman
Time Series Forecasting: In many situations you may have to identify and predict trends from time series data. A very good introductory book is Forecasting: Principles and Practice by Hyndman and Athanasopoulos. Time Series Analysis and Its Applications: With R Examples by Shumway and Stoffer is another book with practical examples and R code, as the title suggests.
If you are more interested in time series forecasting, I would also suggest the ForeCA (Forecastable Component Analysis) R package written by Georg Goerg (working at Google at the time of writing), which tells you how forecastable a time series is (Ω = 0: white noise, therefore not forecastable; Ω = 100: a sinusoid, perfectly forecastable).
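As a trivial baseline of the kind these forecasting books start from, a moving-average forecast can be sketched in a few lines of Python (toy data and a made-up series; real work would use the R packages mentioned above):

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the window")
    return sum(series[-window:]) / window

# Hypothetical monthly sales figures
sales = [100, 102, 101, 105, 107, 106]
print(moving_average_forecast(sales))            # (105 + 107 + 106) / 3 = 106.0
print(moving_average_forecast(sales, window=2))  # (107 + 106) / 2 = 106.5
```

The window length trades responsiveness against smoothing: a short window reacts quickly to trend changes but is noisy, while a long window is stable but lags behind the trend.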
Computer Programming Knowledge: This is another essential skill. It allows you to use the many data science tools/APIs that mainly require Java and Python skills. Scala also appears to be becoming an important programming language for data science, and R knowledge is considered a must. Having prior programming knowledge gives you an edge when you wish to learn a new programming language. You should also constantly be looking at trends in programming-language requirements (see Finding the Right Skillset for Big Data Jobs). Currently, Java appears to be the most sought-after language, followed by Python and SQL. It is also useful to look at Google Trends, although, interestingly, "Python" is not available as a programming-language topic at the time of writing.
Database Knowledge: In my experience this is a very important skill to have. More often than not, the database administrators (or other IT engineers) who are supposed to extract data for you are simply too busy to do it. That means you must have the knowledge to connect to a database, optimize a query and perform the queries/transformations needed to get the data you want in the format you want.
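That self-service workflow can be sketched with Python's built-in sqlite3 module (the table, columns and index are hypothetical; a real setup would connect to your organization's database instead of an in-memory one):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a real database connection
cur = conn.cursor()

# A hypothetical sales table
cur.execute("CREATE TABLE sales (region TEXT, amount REAL)")
cur.executemany("INSERT INTO sales VALUES (?, ?)",
                [("north", 120.0), ("south", 80.0), ("north", 200.0)])

# An index speeds up queries that filter or group on region
cur.execute("CREATE INDEX idx_sales_region ON sales (region)")

# Aggregate the data into the shape the analysis actually needs
cur.execute("SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region")
rows = cur.fetchall()
print(rows)  # [('north', 320.0), ('south', 80.0)]
conn.close()
```

Doing the aggregation inside the database, rather than pulling raw rows and summing them in application code, is usually the single biggest optimization available to an analyst.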
Web Scraping: This is a useful skill to have. There are tons of useful data which you can access if you know how to write code that extracts information from the Web. You should get to know HTML elements and XPath. Tools such as Apache Nutch, Scrapy and JSoup (mentioned above) can be used for this purpose.
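A minimal sketch of the core task, extracting links from HTML, using only Python's standard-library parser (the page snippet is hypothetical; a real scraper built on the tools above would add HTTP fetching, politeness and error handling):

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href attribute of every anchor tag encountered."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

# Hypothetical page snippet; in practice this would come from an HTTP response
html = '<p>See <a href="/docs">the docs</a> and <a href="https://example.org">this site</a>.</p>'
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['/docs', 'https://example.org']
```

Libraries like Scrapy layer XPath or CSS selectors on top of this kind of parsing, which is why knowing the underlying HTML structure matters.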
Text Data: Text data contains valuable information: consumer opinions, sentiment and intentions, to name just a few. Information extraction and text analytics are important technologies that a data scientist should ideally know.
Information Extraction:
- The "tm" R package
The following books are suggested:
- Introduction to Information Retrieval by Manning, Raghavan and Schütze
- Handbook of Natural Language Processing by Indurkhya, Damerau (Editors)
- The Text Mining Handbook: Advanced Approaches in Analyzing Unstructured Data by Feldman and Sanger
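The first step in most text-analytics pipelines, turning raw text into term counts, can be sketched in a few lines of Python (a toy example with an assumed mini stop-word list; the "tm" package and the books above cover stemming and weighting schemes such as tf-idf):

```python
import re
from collections import Counter

# A tiny assumed stop-word list; real systems use much larger ones
STOP_WORDS = {"the", "a", "and", "is", "to", "of"}

def term_counts(text):
    """Lowercase, tokenize into word-like runs, and drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

review = "The battery is great and the screen is great to look at"
counts = term_counts(review)
print(counts.most_common(2))  # [('great', 2), ('battery', 1)]
```

These counts are the raw material for everything from sentiment scoring to document classification; the quality of this step largely determines the quality of what follows.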
Finally, here are some books that should not be missed by any data scientist:
- Data Mining and Statistics for Decision Making by Stéphane Tufféry (A personal favorite)
- Introduction to Data Mining by Tan, Steinbach, Kumar
- Applied Predictive Modeling by Kuhn and Johnson
- Data Mining with R - Learning with Case Studies by Torgo
- Principles of Data Mining by Bramer
This year’s ICML had several papers which I want to read through more carefully and understand better.
- Chun-Liang Li, Hsuan-Tien Lin, Condensed Filter Tree for Cost-Sensitive Multi-Label Classification. Several tricks accumulate to give a new approach for addressing cost sensitive multilabel classification.
- Nikos Karampatziakis and Paul Mineiro, Discriminative Features via Generalized Eigenvectors. An efficient, effective eigenvalue solution for supervised learning yields compelling nonlinear performance on several datasets.
- Nir Ailon, Zohar Karnin, Thorsten Joachims, Reducing Dueling Bandits to Cardinal Bandits. An effective method for reducing dueling bandits to normal bandits that extends to contextual situations.
- Pedro Pinheiro, Ronan Collobert, Recurrent Convolutional Neural Networks for Scene Labeling. Image parsing remains a challenge, and this is plausibly a step forward.
- Cicero Dos Santos, Bianca Zadrozny, Learning Character-level Representations for Part-of-Speech Tagging. Word morphology is clearly useful information, and yet almost all ML-for-NLP applications ignore it or hard-code it (by stemming).
- Alekh Agarwal, Daniel Hsu, Satyen Kale, John Langford, Lihong Li, Robert Schapire, Taming the Monster: A Fast and Simple Algorithm for Contextual Bandits. Statistically efficient interactive learning is now computationally feasible. I wish this one had been done in time for the NIPS tutorial.
- David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, Martin Riedmiller, Deterministic Policy Gradient Algorithms. A reduction in variance from working out the deterministic limit of policy gradient makes policy gradient approaches look much more attractive.
Edit: added one that I forgot.
The key reason for this is information: I expect everyone participating in ICML has some baseline interest in how ICML is doing. Everyone involved has personal anecdotal information, but we all understand that a few examples can be highly misleading.
Aside from satisfying everyone’s joint curiosity, I believe this could improve ICML itself. Consider reviewing, for example. Every program chair comes in with ideas for how to make reviewing better. Some succeed, but nearly all are forgotten by the next round of program chairs. Making survey information available will help quantify success and correlate it with design decisions.
The key question to ask for this is “who?” The reason why surveys don’t happen more often is that it has been the responsibility of program chairs who are typically badly overloaded. I believe we should address this by shifting the responsibility to a multiyear position, similar to or the same as a webmaster. This may imply a small cost to the community (<$1/participant) for someone’s time to do and record the survey, but I believe it’s a worthwhile cost.
I plan to bring this up with the IMLS board in Beijing, but would like to invite any comments or thoughts.
- Fun data: "How to tell someone's age when all you know is her name" ()
- "The possibility of proper tricorder technology in the future, scanning a bit of someone's blood and telling you if they have any diseases or anomalous genetic conditions" ()
- Will self-driving vehicles appear first in trucking? ()
- "Apple's moves into the world of fashion and wearable computing" ()
- "Few people try to or want to use tablets like laptops" ( )
- "While managers do indeed add value to a company, there’s no particular reason to believe that they add more value to a company than the people who report to them ... [You want] an organization where fairly-compensated people work together as a team, rather than trying to work out the best way to make money for themselves at the expense of their colleagues." ()
- "Each meeting ... spawns even more meetings ... The solution ... reduce default meeting length from 60 to 30 minutes ... limit meetings to seven or fewer participants ... agendas with clear objectives ... materials ... distributed in advance .. on-time start ... early ending, especially if the meeting is going nowhere ... remove ... unnecessary supervisors." ()
- Fun article on the history of the modern office: "The cubicle was actually intended to be this liberating design, and it basically became perverted" ()
- "We were wrong about the first-time shoppers. They did mind registering. They resented having to register when they encountered the page. As one shopper told us, 'I'm not here to enter into a relationship. I just want to buy something.'" ()
- Private investment in broadband infrastructure is actually dropping in the US ()
- "Not only are packets being dropped, but all those not being dropped are also subject to delay. ... They are deliberately harming the service they deliver to their paying customers ... Shouldn't a broadband consumer network with near monopoly control over their customers be expected, if not obligated, to deliver a better experience than this?" ()
- Fascinating data on cancer shows a surprising lack of linear relationship between aging and cancer ( )
- "A wayward spacecraft ISEE-3/ICE was returning to fly past Earth after many decades of wandering through space. It was still operational, and could potentially be sent on a new mission, but NASA no longer had the equipment to talk to it ... crowdfunding project ... commandeer the spacecraft ... awfully long shot ... They are now in command of the ISEE-3 spacecraft." ()
- I love the caption on this comic: "Somebody please do this and post it on YouTube so I can live vicariously through your awesomeness." ()
- Hilarious SMBC comic on privacy and technology ()
- Great SMBC comic: "Wanna play the Bayesian drinking game?" ()
- Hilarious John Oliver segment on net neutrality () directs people to FCC website to comment, crashing FCC website ()
- Very funny, from The Onion: "New Facebook Feature Scans Profile To Pinpoint Exactly When Things Went Wrong" ()
The reduction of greenhouse gas emissions is one of the big global challenges of the next decades. (Linked) open data on this multi-domain challenge is key to addressing the issues in policy, construction, energy efficiency, production and the like. Today, on World Environment Day 2014, a new (linked open) data initiative contributes to this effort: GBPN’s Data Endpoint for Building Energy Performance Scenarios.
GBPN (the Global Buildings Performance Network) provides the full dataset behind a recent global scenario analysis for saving energy in the building sector worldwide, projected from 2005 to 2050. The multidimensional dataset includes parameters such as housing types, building vintages and energy uses for various climate zones and regions, and it is freely available for full use and re-use as open data under the CC-BY 3.0 France license.
To make exploration easy, the Semantic Web Company has developed an interactive query/filtering tool which allows users to create graphs and tables by slicing this multidimensional data cube. Chosen results can be exported as open data in open formats (RDF and CSV) and can also be queried via a provided SPARQL endpoint (a semantic-web-based data API). A built-in query builder makes using, learning and understanding SPARQL easy, for advanced users as well as non-experts and beginners.
The LOD-based information and data system is part of the Semantic Web Company’s recent PoolParty Semantic Drupal developments and is based on OpenLink’s Virtuoso 7 QuadStore, holding and computing ~235 million triples; it also makes use of the RDF ETL tool UnifiedViews and of D2R Server for RDF conversion. The underlying GBPN ontology runs on PoolParty 4.2 and also serves a powerful domain-specific news aggregator realized with SWC’s sOnr webminer.
Together with other energy-efficiency-related Linked Open Data initiatives such as REEEP, NREL and BPIE, GBPN’s recent initiative is a contribution towards broader availability of data supporting action against global warming, as Dr. Peter Graham, Executive Director of GBPN, emphasized: “…data and modelling of building energy use has long been difficult or expensive to access – yet it is critical to policy development and investment in low-energy buildings. With the release of the BEPS open data model, GBPN are providing free access to the world’s best aggregated data analyses on building energy performance.”
The Linked Open Data (LOD) is modelled using the RDF Data Cube Vocabulary (a W3C recommendation), with 17 dimensions in the cube. In total there are 235 million triples available in RDF, including links to DBpedia and Geonames, linking the indicators (years, climate zones, regions and building types) as well as user scenarios.
You are given a problem (good examples:    ), go off and work on it in whatever programming language you like using whatever tools you like, and submit your answer (multiple submissions allowed). Simple, but surprisingly fun and interesting.
It's been around for a while (since 2006), and, though I've looked at it a few times, I only recently got addicted to it. It's not, as I first thought, just a series of interview-style coding questions, but a much more interesting set of deeper challenges in math that require programming to explore and solve. It's a great way to refresh on math and fun too.
Honestly, I can't say enough good things about it. I've blown hundreds of hours on some addictive video game before, addicted to the point that it occasionally interferes with work and sleep even, and this has the same feel. It's a great little educational tool and fun as well.
Definitely worth a look. Seems like it'd work for older teenagers too if you're looking for a summer project for a teen that already has some programming skill.
In 2012 Jem Rayfield released an insightful post about the BBC’s Linked Data strategy during the Olympic Games 2012. In this post he coined the term “Dynamic Semantic Publishing”, referring to
“the technology strategy the BBC Future Media department is using to evolve from a relational content model and static publishing framework towards a fully dynamic semantic publishing (DSP) architecture.”
According to Rayfield this approach is characterized by
“a technical architecture that combines a document/content store with a triple-store proves an excellent data and metadata persistence layer for the BBC Sport site and indeed future builds including BBC News mobile.”
The technological characteristics are further described as …
- A triple-store that provides a concise, accurate and clean implementation methodology for describing domain knowledge models.
- An RDF graph approach that provides ultimate modelling expressivity, with the added advantage of deductive reasoning.
- SPARQL to simplify domain queries, with the associated underlying RDF schema being more flexible than a corresponding SQL/RDBMS approach.
- A document/content store that provides schema flexibility; schema independent storage; versioning, and search and query facilities across atomic content objects.
- Combining a model expressed as RDF to reference content objects in a scalable document/content-store provides a persistence layer that uses the best of both technical approaches.
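The pattern described in the last bullet can be sketched in a few lines. This is a minimal illustration, not the BBC's actual architecture or code: the identifiers, predicates and content are all made up. The idea is that the triple store answers "which content is about X?" via the metadata graph, while the document store holds the content objects themselves.

```python
# Toy metadata graph (triple store) and content store (document store).
# All identifiers below are hypothetical, for illustration only.
triples = [
    ("bbc:article-1", "dc:subject", "dbpedia:Usain_Bolt"),
    ("bbc:article-2", "dc:subject", "dbpedia:Usain_Bolt"),
    ("dbpedia:Usain_Bolt", "rdf:type", "sport:Athlete"),
]

documents = {
    "bbc:article-1": "<p>Bolt wins the 100m final...</p>",
    "bbc:article-2": "<p>Bolt sets a new world record...</p>",
}

def query(s=None, p=None, o=None):
    """Match triples against a pattern; None acts as a wildcard."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Resolve all content about a given entity via the metadata graph,
# then fetch the content bodies from the document store.
about_bolt = [s for s, _, _ in query(p="dc:subject", o="dbpedia:Usain_Bolt")]
pages = [documents[cid] for cid in about_bolt]
```

In a real deployment the `query` function would be a SPARQL query against a triple store and `documents` a content repository; the point is only the division of labour between the two.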
So what are actually the benefits of Linked Data from a non-technical perspective?

Benefits of Linked (Meta)Data
Semantic interoperability is crucial in building cost-efficient IT systems that integrate numerous data sources. Since 2009 the Linked Data paradigm has emerged as a lightweight approach to improve data portability in federated IT systems. By building on Semantic Web standards, the Linked Data approach offers significant benefits compared to conventional data integration approaches. According to Auer, these are:
- De-referencability. IRIs are not just used for identifying entities, but since they can be used in the same way as URLs they also enable locating and retrieving resources describing and representing these entities on the Web.
- Coherence. When an RDF triple contains IRIs from different namespaces in subject and object position, this triple basically establishes a link between the entity identified by the subject (and described in the source dataset using namespace A) with the entity identified by the object (described in the target dataset using namespace B). Through these typed RDF links, data items are effectively interlinked.
- Integrability. Since all Linked Data sources share the RDF data model, which is based on a single mechanism for representing information, it is very easy to attain a syntactic and simple semantic integration of different Linked Data sets. A higher-level semantic integration can be achieved by employing schema and instance matching techniques and expressing found matches again as alignments of RDF vocabularies and ontologies in terms of additional triple facts.
- Timeliness. Publishing and updating Linked Data is relatively simple, thus facilitating timely availability. In addition, once a Linked Data source is updated it is straightforward to access and use the updated data source, since time-consuming and error-prone extraction, transformation and loading is not required.
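The "Coherence" and "Integrability" points above can be made concrete with a toy example. This is a sketch under simplifying assumptions (triples as plain tuples, shortened hypothetical IRIs): because both sources share the RDF triple model, syntactic integration is literally a set union, and a typed cross-namespace link then interlinks the entities.

```python
# Two independent "datasets" in different namespaces (exA:, exB: are made up).
source_a = {
    ("exA:Vienna", "rdfs:label", "Vienna"),
    ("exA:Vienna", "owl:sameAs", "exB:city-123"),  # typed cross-namespace link
}
source_b = {
    ("exB:city-123", "exB:population", "1900000"),
}

# Syntactic integration: one graph, one shared data model.
merged = source_a | source_b

def describe(entity, graph):
    """Collect facts about an entity, following owl:sameAs links one hop."""
    facts = {(p, o) for s, p, o in graph if s == entity}
    for p, o in set(facts):
        if p == "owl:sameAs":
            facts |= {(p2, o2) for s2, p2, o2 in graph if s2 == o}
    return facts

# Facts from both namespaces are now reachable from the one entity.
facts = describe("exA:Vienna", merged)
```

Higher-level semantic integration (schema and instance matching) would add further `owl:sameAs` or vocabulary-alignment triples to the same graph, as the text above notes.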
On top of these technological principles, Linked Data promises to improve the reusability and richness (in terms of depth and breadth) of content, thus adding significant value to the content value chain.

Linked Data in the Content Value Chain
According to Cisco, communication within electronic networks has become increasingly content-centric: for the period from 2011 to 2016, Cisco reports growth of 90% in video content, 76% in gaming content, 36% in VoIP and 36% in file sharing transmitted electronically. Hence it is legitimate to ask what role Linked Data takes in the content production process. Here we can distinguish five sequential steps: 1) content acquisition, 2) content editing, 3) content bundling, 4) content distribution and 5) content consumption. As illustrated in the figure below, Linked Data can contribute to each step by supporting the associated intrinsic production function.
- Content acquisition is mainly concerned with the collection, storage and integration of relevant information necessary to produce a content item. In the course of this process information is being pooled from internal or external sources for further processing.
- The editing process entails all necessary steps that deal with the semantic adaptation, interlinking and enrichment of data. Adaptation can be understood as a process in which acquired data is provided in a way that it can be re-used within editorial processes. Interlinking and enrichment are often performed via processes like annotation and/or referencing to enrich documents either by disambiguating of existing concepts or by providing background knowledge for deeper insights.
- The bundling process is mainly concerned with the contextualisation and personalisation of information products. It can be used to provide customized access to information and services i.e. by using metadata for the device-sensitive delivery of content, or to compile thematically relevant material into Landing Pages or Dossiers thus improving the navigability, findability and reuse of information.
- In a Linked Data environment the process of content distribution mainly deals with the provision of machine-readable and semantically interoperable (meta-)data via Application Programming Interfaces (APIs) or SPARQL Endpoints. These can be designed either to serve internal purposes so that data can be reused within controlled environments (i.e. within or between organizational units) or for external purposes so that data can be shared between anonymous users (i.e. as open SPARQL Endpoints on the Web).
- The last step in the content value chain deals with content consumption. This entails any means that enable a human user to search for and interact with content items in a pleasant and purposeful way. According to this view, this step mainly deals with end user applications that make use of Linked Data to provide access to content items (e.g. via search or recommendation engines) and generate deeper insights (e.g. by providing reasonable visualizations).
There is definitely a place for Linked Data in the content value chain, hence we can expect that Dynamic Semantic Publishing is here to stay. Linked Data can add significant value to the content production process and carries the potential to incrementally expand the business portfolio of publishers and other content-centric businesses. But the concrete added value is highly context-dependent and open to discussion. Technological feasibility is easily contradicted by strategic business considerations, a lack of cultural adaptability, legacy issues like dual licensing, technological path dependencies or simply a lack of resources. Nevertheless, Linked Data should be considered a fundamental principle in next generation content management, as it provides a radically new environment for value creation.
More about the topic – live
Linked Data in the content value chain is also one of the topics on the agenda of this year’s SEMANTiCS 2014. Listen to keynote speaker Sofia Angeletou and others to learn more about next generation content management.

References
 Auer, Sören (2011). Creating Knowledge Out of Interlinked Data. In: Proceedings of WIMS’11, May 25-27, 2011, p. 1-8
 Pellegrini, Tassilo (2012). Integrating Linked Data into the Content Value Chain: A Review of News-related Standards, Methodologies and Licensing Requirements. In: Presutti, Valentina; Pinto, Sofia S.; Sack, Harald; Pellegrini, Tassilo (2012). Proceedings of I-Semantics 2012. 8th International Conference on Semantic Systems. ACM International Conference Proceeding Series, p. 94-102
Quote: “The vision of semantic publishing in the BBC has shifted from supporting high profile events to connecting the BBC’s content around things that matter to the audience. To this end, we have increased the application of linked data to domains other than sports such as news, education and music with the intention that the content we produce can be reused and discovered through a multitude of channels.”
In her keynote, Sofia will outline the technological and cultural factors that have influenced the BBC’s adoption of linked data. The talk will reflect on the early assumptions the BBC made, their effects on the development of the platform, and the way the BBC is addressing them now.
A talk that people working in the media and publishing industry should not miss, and one of several highlights the completely revamped SEMANTiCS conference will provide.
The rationale behind mining business data directly from the business's own website is that the business has a clear economic motivation to ensure that the data is up to date. If you own a restaurant that changes location, and your website still publishes the former address, those potential customers who visit your site will not be enjoying your delicious offerings.
For the web mining proposition to work, it is important, first, to know that you have in your hand a genuine business website and, second, to have excellent extraction and inference technology to pull the required data from the HTML.
The first requirement can get pretty murky. There are sites that could easily be mistaken for a business site but which are, in fact, other types of legitimate sites (such as a blog with the contact information of the blogger). Unfortunately, there are also sites which are essentially fake store fronts for the actual business in question. The most obvious are those sites which are simply parked domains with some spam links on them. A domain parker might snap up the domain yourrestaurantseattle.com hoping that web surfers looking for yourrestaurant will land there and give them some clicks. An emerging new trend is a far more sophisticated site which (through some amount of templating but also specific editorial attention) aims to look like the actual site of the business. The motivation for these sites is the burgeoning third party restaurant delivery service industry - for which GrubHub might be the poster child.
Take, for example, the 1947 Tavern in Pittsburgh. This establishment's site is located on the web at http://www.1947tavern.com/. GrubHub, however, has set up a phasmid site at http://1947tavernpittsburgh.com/ which looks like a legitimate home page for the tavern.
There is, however, a modest amount of GrubHub visibility on this site, including the link to the menu as well as references to the tavern's status in the GrubHub universe.
Inspecting the domain's registration data in the whois database shows that it was actually registered by GrubHub. This brings up the following entry as the email contact:
Registrant Email: @GRUBHUB.COM (here http://www.whois.com/whois/1947tavernpittsburgh.com)
I can't find the email address ghwebsites at grubhub.com on the GrubHub website, but a site search on Google over the Web Analyzer domain (site:wa-com.com "ghwebsite at grubhub.com") produces 13,200 results. Dipping into these brings up further examples of fake sites made with the same template as that for the 1947 Tavern phasmid.
[In the above, I substitute ' at ' for the '@' to avoid Typepad's automated obfuscation of email addresses in posts.]
Here's another example: http://indiagardenmonroeville.com/.
I don't believe these sites are particularly malicious - most likely, they bring additional customers to the business even if it is through deception. They do, however, pose a problem for web mining systems. There is less pressure on GrubHub to keep the exact details of the business up to date. In addition, when GrubHub goes belly up, these sites will linger.
- Excellent history of Google's love-hate relationship with management (hint: it's mostly hate) ()
- Excellent BBC documentary on Amazon.com, plenty of fun historical tidbits, quite critical in parts, very well done ()
- Excellent charts on the history of wealth concisely summarizing three centuries ()
- Compelling example of virtual tourism ()
- Visually stunning math concepts which are easy to explain ()
- "The most interesting things are happening at the intersection of two fields" ()
- "The largest driver of Facebook’s mobile revenue is app-install ads ... largely purchased by free-to-play game publishers such as King (maker of Candy Crush Saga) and Big Fish Games (the Bejeweled series) ... to target the small percentage of players who will spend hundreds of dollars on in-app purchases." ( )
- "When you subtract out the value of Yahoo's stake in Alibaba, the rest of Yahoo is worthless. Indeed, it has negative worth." (  )
- "Newspaper print ad revenue has declined 73% in 15 years" ()
- "Microsoft is backtracking on practically every part of the Windows 8 interface that developers abhorred" (   )
- Survivor bias in perceptions of startup life and what might be closer to reality: "It's a decision to throw away a large chunk of your precious youth at a venture which is almost certain to fail" ( )
- "Many hospitals in the US still use Windows XP on workstations and healthcare devices" ()
- "It's no longer realistic to think that routers, DVRs, or other Internet-connected home appliances aren't worth an attacker's time ... poorly designed 'Internet of Things' devices ... [are] particularly easy to hack" ()
- Newegg exec on patent trolls: "Why those asshats continue to trade at ANY value, I do not know. The world would be a better place without them." ()
- I still don't understand why more tech companies don't provide free food ( )
- Humor: "The pain of being the only engineer in a business meeting" ()
Here’s a re-plotting of a graph in this 538 post. It looks at whether pilots speed up the flight when there’s a delay, and finds that this seems to be the case. This is averaged data for flights on several major transcontinental routes.
I’ve replotted the main graph as follows. The x-axis is departure delay. The y-axis is the total trip time — number of minutes since the scheduled departure time. For an on-time departure, the average flight is 5 hours, 44 minutes. The blue line shows what the total trip time would be if the delayed flight took that long. Gray lines are uncertainty (I think the CI due to averaging).
What seems to be going on is that the pilots are targeting a total trip time of 370-380 minutes or so. If the departure is delayed by only 10 minutes, the flight time is still the same, but delays in the 30-50 minute range see a faster flight time which makes up for some of the delay.
The original post plotted the y-axis as the delta against the expected travel time (delta against 5hr44min). It’s good at showing that the difference does really exist, but it’s harder to see the apparent “target travel time”.
Also, I wonder if the grand averaging approach — which averages totally different routes — is necessarily the best. It seems like the analysis might be better by adjusting for different expected times for different routes. The original post is also interested in comparing average flight times by different airlines. You might have to go to linear regression to do all this at once.
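The route-adjusted regression idea can be sketched as follows. The numbers here are synthetic, not 538's data: each route gets its own intercept (an indicator column) while a single slope on departure delay is shared, and a fitted slope below 1 would mean pilots recover part of the delay in the air.

```python
# Synthetic observations: (route, departure_delay_min, total_trip_min).
data = [
    ("JFK-LAX", 0, 344), ("JFK-LAX", 30, 362), ("JFK-LAX", 60, 390),
    ("EWR-SFO", 0, 360), ("EWR-SFO", 30, 378), ("EWR-SFO", 60, 406),
]

routes = sorted({r for r, _, _ in data})

# Design matrix: one indicator column per route, plus the delay column.
X = [[1.0 if r == rt else 0.0 for rt in routes] + [float(d)]
     for r, d, _ in data]
y = [float(t) for _, _, t in data]

def ols(X, y):
    """Solve the normal equations (X'X)b = X'y by Gaussian elimination."""
    n, k = len(X), len(X[0])
    A = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)]
         for a in range(k)]
    v = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    for c in range(k):  # forward elimination with partial pivoting
        piv = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[piv] = A[piv], A[c]
        v[c], v[piv] = v[piv], v[c]
        for r in range(c + 1, k):
            f = A[r][c] / A[c][c]
            for cc in range(c, k):
                A[r][cc] -= f * A[c][cc]
            v[r] -= f * v[c]
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (v[r] - sum(A[r][cc] * b[cc] for cc in range(r + 1, k))) / A[r][r]
    return b

coef = ols(X, y)
delay_slope = coef[-1]  # < 1: part of the departure delay is made up en route
```

With real data one would also inspect per-route slopes and uncertainty, but even this sketch separates the "different routes have different baseline times" effect from the "pilots fly faster when delayed" effect that the grand average conflates.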
While there is already an awful lot of text in the world, more and more is being produced everyday, usually in electronic form, and usually published on the internet. Given that we could never read all this text we rely on search engines, such as Google, to help us pinpoint useful or interesting documents. These search engines rely on two main things to find the documents we are interested in, the text and the links between the documents, but what if we could tell them what some of the text actually means?
In the newest version of the HTML specification (which is a work in progress usually referred to as HTML5) web pages can contain semantic information encoded as HTML Microdata. I'm not going to go into the details of how this works as there is already a number of great descriptions available, including this one.
HTML5 Microdata is, in essence, a way of embedding semantic information in a web page, but it doesn't tell a human or a machine what any of the information means, especially as different people could embed the same information using different identifiers or in different ways. What is needed is a common vocabulary that can be used to embed information about common concepts, and currently many of the major search engines have settled on using schema.org as the common vocabulary.
When I first heard about schema.org, back in 2011, I thought it was a great idea, and wrote some code that could be used to embed the output of a GATE application within a HTML page as microdata. Unfortunately the approach I adopted was, to put it bluntly, hacky. So while I had proved it was possible the code was left to rot in a dark corner of my SVN repository.
I was recently reminded of HTML5 microdata and schema.org in particular when one of my colleagues tweeted a link to this interesting article. In response I was daft enough to admit that I had some code that would allow people to automatically embed the relevant microdata into existing web pages. It wasn't long before a number of people made it clear that they would be interested in me finishing and releasing the code.
I'm currently in a hotel in London as I'm due to teach a two day GATE course starting tomorrow (if you want to learn all about GATE then you might be interested in our week long course to be held in Sheffield in June) and rather than watching TV or having a drink in the bar I thought I'd make a start on tidying up the code I started on almost three years ago.
Before we go any further I should point out that while the code works, the current interface isn't the most user friendly. As such I've not yet added it to the main GATE distribution. I'm hoping that those of you who give it a try can leave me feedback so I can finish cleaning things up and integrate it properly. Having said that, here is what I have so far...
I find that worked examples usually convey my ideas better than prose, so let's start with a simple HTML page:
<title>This is a schema.org test document</title>
<h1>This is a schema.org test document</h1>
Mark Greenwood works in Sheffield for the University of Sheffield.
He is currently in a Premier Inn in London, killing time by working on a GATE plugin to allow annotations to be embedded within a HTML document using HTML microdata and the schema.org model.
</html>

As you can see, this contains a number of obvious entities (people, organizations and locations) that could be described using the schema.org vocabulary (people, organizations and locations are just some of the schema.org concepts) and which would be found by simply running ANNIE over the document.
Once we have sensible annotations for such a document, probably from running ANNIE, and a mapping between the annotations and their features and the schema.org vocabulary, then it is fairly easy to produce a version of this HTML document with the annotations embedded as microdata. The current version of my code generates the following file:
<title>This is a schema.org test document</title>
<h1>This is a schema.org test document</h1>
<span itemscope="itemscope" itemtype="http://schema.org/Person"><meta content="male" itemprop="gender"/><meta content="Mark Greenwood" itemprop="name"/>Mark Greenwood</span> works in <span itemscope="itemscope" itemtype="http://schema.org/City"><meta content="Sheffield" itemprop="name"/>Sheffield</span> for the <span itemscope="itemscope" itemtype="http://schema.org/Organization"><meta content="University of Sheffield" itemprop="name"/>University of Sheffield</span>.
He is currently in a <span itemscope="itemscope" itemtype="http://schema.org/Organization"><meta content="Premier" itemprop="name"/>Premier</span> Inn in <span itemscope="itemscope" itemtype="http://schema.org/City"><meta content="London" itemprop="name"/>London</span>, killing time by working on a GATE plugin to allow <span itemscope="itemscope" itemtype="http://schema.org/Person"><meta content="female" itemprop="gender"/><meta content="ANNIE" itemprop="name"/>ANNIE</span> annotations to be embedded within a HTML document using HTML microdata and the schema.org model.
</html>

This works nicely and the embedded data can be extracted by the search engines, as demonstrated using the Google rich snippets tool.
As I said earlier while the code works, the current integration with the rest of GATE definitely needs improving. If you load the plugin (details below) then right clicking on a document will allow you to Export as HTML5 Microdata... but it won't allow you to customize the mapping between annotations and a vocabulary. Currently the ANNIE annotations are mapped to the schema.org vocabulary using a config file in the resources folder. If you want to change the mapping you have to change this file. In the future I plan to add some form of editor (or at least the ability to choose a different file) as well as the ability to export a corpus not just a single file.
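The core of the mapping step is simple enough to sketch. This is not the actual GATE plugin code, and the mapping format shown is illustrative rather than the plugin's real config file: given non-overlapping annotation spans over plain text and a map from annotation types to schema.org types, each span is wrapped in an `itemscope`/`itemtype` `<span>` with a `<meta itemprop="name">`, mirroring the output shown above.

```python
# Illustrative mapping: annotation type -> schema.org type.
schema_mapping = {
    "Person": "http://schema.org/Person",
    "Location": "http://schema.org/City",
    "Organization": "http://schema.org/Organization",
}

def embed_microdata(text, annotations):
    """annotations: list of (start, end, annotation_type), non-overlapping."""
    out, pos = [], 0
    for start, end, atype in sorted(annotations):
        span_text = text[start:end]
        out.append(text[pos:start])  # untouched text before the annotation
        out.append(
            '<span itemscope="itemscope" itemtype="%s">'
            '<meta content="%s" itemprop="name"/>%s</span>'
            % (schema_mapping[atype], span_text, span_text))
        pos = end
    out.append(text[pos:])  # trailing untouched text
    return "".join(out)

html = embed_microdata("Mark Greenwood works in Sheffield.",
                       [(0, 14, "Person"), (24, 33, "Location")])
```

The real plugin of course works from GATE annotations and their features rather than raw offsets, but the transformation it performs on the markup is essentially this one.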
So if you have got all the way to here then you probably want to get your hands on the current plugin, so here it is. Simply load it into GATE in the usual way and it will add the right-click menu option to documents (you'll need to use a nightly build of GATE, or a recent SVN checkout, as it uses the resource helpers that haven't yet made it into a release version).
Hopefully you'll find it useful, but please do let me know what you think, and if you have any suggestions for improvements, especially around the integration with the GATE Developer GUI.
The proposed scheme includes several discrete sets of categories called facets whose values can be combined to express concepts such as existing Physics and Astronomy Classification Scheme (PACS) codes, as well as new concepts that have not yet emerged, or have been difficult to express with the existing PACS.
PACS codes formed a single-hierarchy classification scheme, designed to assign the “one best” category that an item will be classified under. Classification schemes come from the need to physically locate objects in one dimension, for example in a library where a book will be shelved in one and only one location, among an ordered set of other books. Traditional journal tables of contents similarly place each article in a given issue in a specific location among an ordered set of other articles, certainly a necessary constraint with paper journals and still useful online as a comfortable and familiar context for readers.
However, the real world of concepts is multi-dimensional. In collapsing to one dimension, a classification scheme makes essentially arbitrary choices that have the effect of placing some related items close together while leaving other related items in very distant bins. It also has the effect of repeating the terms associated with the last dimension in many different contexts, leading to an appearance of significant redundancy and complexity in locating terms.
A faceted taxonomy attempts to identify each stand-alone concept through the term or terms commonly associated with it, and have it mean the same thing whenever used. Hierarchy in a taxonomy is useful to group related terms together; however, the intention is not to identify an item such as an article or book by a single concept, but rather to assign multiple concepts to represent the meaning. In that way, related items can be closely associated along multiple dimensions corresponding to each assigned concept. Where previously a single PACS code was used to indicate the research area, now two, three, or more of the new concepts may be needed (although often a single new concept will be sufficient). This requires a different mindset and approach in applying the new taxonomy compared to the way APS has been accustomed to working with PACS; however, it also enables significant new capabilities for publishing and working with all types of content including articles, papers and websites.
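The multi-dimensional association this enables can be sketched in a few lines. The facet and concept names below are invented for illustration and are not the actual APS scheme: each article is assigned several stand-alone concepts, and related items can then be found along any shared dimension rather than via one "best" classification code.

```python
# Articles tagged with multiple facet:concept assignments (all hypothetical).
articles = {
    "article-1": {"technique:neutron-scattering", "material:graphene"},
    "article-2": {"material:graphene", "phenomenon:superconductivity"},
    "article-3": {"technique:neutron-scattering", "material:iron-pnictide"},
}

def related(article_id):
    """Articles sharing at least one assigned concept, with the overlap."""
    concepts = articles[article_id]
    return {other: concepts & tags
            for other, tags in articles.items()
            if other != article_id and concepts & tags}

# article-1 is related to article-2 via the material facet and to
# article-3 via the technique facet -- two different dimensions.
rel = related("article-1")
```

Under a single-hierarchy scheme only one of these relationships would be visible; the others would land in distant bins, which is exactly the collapse-to-one-dimension problem described above.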
To build and maintain the faceted taxonomy, APS has acquired the PoolParty taxonomy management tool. PoolParty will enable APS editorial staff to create, retrieve, update and delete taxonomy term records. The tool will support the various thesaurus, knowledge organization system and ontology standards for concepts, relationships, alternate terms etc. It will also provide methods for:
- Associating taxonomy terms with content items, and storing that association in a content index record.
- Automated indexing to suggest taxonomy terms that should be associated with content items, and text mining to suggest terms to potentially be added to the taxonomy.
- Integrating taxonomy term look-up, browse and navigation in a selection user interface that, for example, authors and the general public could use.
- Implementing a feedback user interface allowing authors and the general public to suggest terms, record the source of the suggestion, and inform the user on the disposition of their suggestion.
Arthur Smith, project manager for the new APS taxonomy, notes: “PoolParty allows our subject matter experts to immediately visualize the layout of the taxonomy, to add new concepts, suggest alternatives, and to map out the relationships and mappings to other concept schemes that we need. While our project is still in an early stage, the software tool is already proving very useful.”
Taxonomy Strategies (www.taxonomystrategies.com) is an information management consultancy that specializes in applying taxonomies, metadata, automatic classification, and other information retrieval technologies to the needs of business and other organizations.
The American Physical Society (www.aps.org) is a non-profit membership organization working to advance and diffuse the knowledge of physics through its outstanding research journals, scientific meetings, and education, outreach, advocacy and international activities. APS represents over 50,000 members, including physicists in academia, national laboratories and industry in the United States and throughout the world. Society offices are located in College Park, MD (Headquarters), Ridge, NY, and Washington, DC.
The Simple Knowledge Organization System (SKOS) has become one of the ‘sweet spots’ in the linked data ecosystem in recent years. Especially when semantic web technologies are being adapted for the requirements of enterprises or public administration, SKOS has played a most central role to create knowledge graphs.
In this webinar, key people from the Semantic Web Company will describe why controlled vocabularies based on SKOS play a central role in a linked data strategy, and how SKOS can be enriched by ontologies and linked data to further improve semantic information management.
SKOS unfolds its potential at the intersection of three disciplines and their methods:
- library sciences: taxonomy and thesaurus management
- information sciences: knowledge engineering and ontology management
- computational linguistics: text mining and entity extraction
Linked Data based IT-architectures cover all three aspects and provide means for agile data, information, and knowledge management.
In this webinar, you will learn about the following questions and topics:
- How does SKOS build the foundation of enterprise knowledge graphs, and how can it be enriched by additional vocabularies and ontologies?
- How can knowledge graphs be used to build the backbone of metadata services in organisations?
- How can text mining be used to create high-quality taxonomies and thesauri?
- How can knowledge graphs be used for enterprise information integration?
Based on PoolParty Semantic Suite, you will see several live demos of end-user applications based on linked data and of PoolParty’s latest release which provides outstanding facilities for professional linked data management, including taxonomy, thesaurus and ontology management.
Register here: https://www4.gotomeeting.com/register/404918583