Semantic Software Lab
Concordia University
Montréal, Canada

Blogroll

Extracting Insights from Consumer Reviews

Life Analytics Blog - Mon, 2015-02-23 05:53
Here is one more example of how we can extract insights from consumer reviews. This time we will use reviews given for several supplement brands of Omega-3 fish oil.
For this example we analyze 4,018 reviews from consumers who bought Omega-3 supplements. Keep in mind that in most cases each product review has an associated rating (usually given as 1-5 stars) which signifies the overall satisfaction of the consumer. Therefore, after collecting the reviews and ratings we have a file with the following entries per row:
[Text of Review, Rating]

The fact that a customer also gives a score can be especially helpful, because we can identify the words and phrases that differentiate positive experiences (i.e. those with 5-star ratings) from the negative ones (we assume that any review with a rating of 4 stars or less is negative). So, for example, positive reviews may contain mostly words and phrases such as "Great", "Happy" and "Will buy again", whereas negative reviews may contain words and phrases such as "Never buying again", "not happy" or "damaged".
The tools used for this example are NLTK and Python. The code simply reads the reviews and their associated ratings and creates a matrix with the same representation as the file it read.
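
The loading code itself is not shown in the post; a minimal sketch (the file name omega3_reviews.csv, its two-column layout of review text and star rating, and the absence of a header row are all assumptions) could look like this:

    import csv

    reviews = []  # list of (review_text, rating) pairs
    with open("omega3_reviews.csv", newline="", encoding="utf-8") as f:
        for text, rating in csv.reader(f):   # assumes two columns and no header row
            reviews.append((text, int(rating)))

    positive = [t for t, r in reviews if r == 5]   # 5-star reviews
    negative = [t for t, r in reviews if r <= 4]   # 4 stars or less, per the rule above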

Next, we want to identify which insights we can extract from this representation. For example (a rough code sketch for the first two items follows this list):

- Identify which words commonly occur in 5-star reviews
- Identify which words commonly occur in reviews with a rating of 4 stars or lower
- Identify potentially interesting phrases and words
- Extract term co-occurrences
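
As a rough sketch of the first two items, continuing from the loading snippet above (it assumes NLTK with its punkt and stopwords data installed, and deliberately keeps signal words such as "however" and "not" out of the stopword list, for reasons discussed further below):

    import nltk
    from nltk.corpus import stopwords
    from nltk.tokenize import word_tokenize

    # Keep "however" and "not": they carry sentiment signal (see the discussion below).
    stop = set(stopwords.words("english")) - {"however", "not"}

    def frequent_terms(texts, n=20):
        # Most common non-stopword alphabetic tokens across a list of review texts.
        tokens = [w.lower() for t in texts for w in word_tokenize(t)
                  if w.isalpha() and w.lower() not in stop]
        return nltk.FreqDist(tokens).most_common(n)

    print("Negative reviews:", frequent_terms(negative))
    print("Positive reviews:", frequent_terms(positive))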

We start with terms occurring more frequently in negative reviews for Omega-3 supplements. Here is what we found:

[Image: most frequent terms in negative reviews]

So it appears that people tend to give negative reviews when the taste (and possibly after-taste) is not quite right. A lot of people complain about a fishy odor. Notice also that the 3rd term is "sure", which we can assume originates from customers saying that they are not sure whether the product works or not (notice also that the 4th term is "yet"). Some more terms to consider:

however
rancid
krill (an oil that is an alternative product to Omega-3 supplementation)
soy
stick


Now let's look at the terms associated with positive reviews:

[Image: most frequent terms in positive reviews]

"great" and "excellent" are terms that were expected to be found in positive reviews. Some terms to consider are:

price
quality
brain
triglycerides
cholesterol

We move on to identifying potentially interesting terms and phrases. Here is a screenshot from the software that I used:

[Screenshot: potentially interesting terms and phrases identified by the software]

I added a red rectangle wherever sensitive information (such as company names) appears, which for the purposes of this post is not relevant (but it certainly is relevant in a different setting).

We immediately see some interesting mentions, for example: heavy metal poisoning, upset stomach incidents, cognitive function, joint pains, panic attacks, reasonably priced items, postpartum depression, allergic reactions, speedy delivery and soft gels that stick together.

Recall that in a previous example we found that the term "however" occurs frequently within negative reviews. Some analysts may have chosen to treat this term as a stopword, which in this case would be a serious mistake. The reason is that the term "however" very often points to the reason a product or service is not receiving a perfect rating, and vice-versa. Therefore, if a data scientist had chosen to exclude this term from the analysis (stopwords are typically removed from the text), potentially interesting insights would never have surfaced.

Ideally, we would like to know the context that occurs after the term "however" whenever it appears within a negative review. That will help us focus on the occurrences of "however" with negative sentiment. To do this, we only take into account reviews containing the term "however" and having a rating of 3 stars or less. It appears that the most common terms occurring after "however" were fishy odor and after-taste. In other words, a fishy odor is what keeps customers from giving a 5-star rating.
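
A rough way to do this in code, continuing from the snippets above (the five-token window after "however" is an arbitrary choice):

    # Collect what follows "however" in reviews rated 3 stars or less.
    trailing = []
    for text, rating in reviews:
        if rating <= 3:
            tokens = [w.lower() for w in word_tokenize(text)]
            for i, w in enumerate(tokens):
                if w == "however":
                    trailing.extend(t for t in tokens[i + 1:i + 6] if t.isalpha())

    print(nltk.FreqDist(trailing).most_common(15))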

On the other hand, phrases such as "highly recommend" are interesting because we may use co-occurrence analysis to see which terms co-occur with a highly recommended product.
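
As a simple first pass at such a co-occurrence analysis (again continuing from the snippets above), one could restrict the corpus to reviews containing the phrase and reuse the frequency helper defined earlier:

    # Terms that co-occur with the phrase "highly recommend".
    recommending = [t for t, r in reviews if "highly recommend" in t.lower()]
    print(frequent_terms(recommending))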

Of course this is by no means the end of what we can do. To extract even better insights we would have to spend significantly more time on proper pre-processing, use information extraction, and apply several other techniques to analyze text data in novel and potentially interesting ways.


Categories: Blogroll

SEMANTiCS2015: Calls for Research & Innovation Papers, Industry Presentations and Poster/Demos are now open!

Semantic Web Company - Fri, 2015-02-20 05:00

The SEMANTiCS2015 conference comes back this year, in its 11th edition, to Vienna, Austria, where it all started in 2005!

The conference takes place from 15-17 September 2015 (the main conference will be on the 16th and 17th of September, with several back-to-back workshops & events on the 15th) at the University of Economics – see all information: http://semantics.cc/.

We are happy to announce the SEMANTiCS open calls as follows. All information on the calls can also be found on the SEMANTiCS2015 website here: http://semantics.cc/open-calls

Call for Research & Innovation Papers

The Research & Innovation track at SEMANTiCS welcomes the submission of papers on novel scientific research and/or innovations relevant to the topics of the conference. Submissions must be original and must not have been submitted for publication elsewhere. Papers should follow the ACM ICPS guidelines for formatting (http://www.acm.org/sigs/publications/proceedings-templates) and must not exceed 8 pages in length for full papers and 4 pages for short papers, including references and optional appendices.

Abstract Submission Deadline: May 22, 2015
Paper Submission Deadline: May 29, 2015
Notification of Acceptance: July 10, 2015
Camera-Ready Paper: July 24, 2015
Details: http://bit.ly/semantics15-research

Call for Industry & Use Case Presentations

To address the needs and interests of industry, SEMANTiCS presents enterprise solutions that deal with semantic processing of data and/or information in areas like Linked Data, Data Publishing, Semantic Search, Recommendation Services, Sentiment Detection, Search Engine Add-Ons, Thesaurus and/or Ontology Management, Text Mining, Data Mining and any related fields. All submissions should have a strong focus on real-world applications beyond the prototypical status and demonstrate the power of semantic systems!

Submission Deadline: July 1, 2015
Notification of Acceptance: July 20, 2015
Presentation Ready: August 15, 2015
Details: http://bit.ly/semantics15-industry

Call for Posters and Demos

The Posters & Demonstrations Track invites innovative work in progress, late-breaking research and innovation results, and smaller contributions (including pieces of code) in all fields related to the broadly understood Semantic Web. The informal setting of the Posters & Demonstrations Track encourages participants to present innovations to business users and find new partners or clients.  In addition to the business stream, SEMANTiCS 2015 welcomes developer-oriented posters and demos to the new technical stream.

Submission Deadline: June 17, 2015
Notification of Acceptance: July 10, 2015
Camera-Ready Paper: August 01, 2015
Details: http://bit.ly/semantics15-poster

We are looking forward to receiving your submissions for SEMANTiCS2015 and to seeing you in Vienna in the autumn!

Categories: Blogroll

Data to Value & Semantic Web Company agree partnership to bring cutting edge Semantic Management to Financial Services clients

Semantic Web Company - Wed, 2015-02-18 08:04

The partnership aims to change the way organisations, particularly within Financial Services, manage the semantics embedded in their data landscapes. This will offer several core benefits to existing and prospective clients including locating, contextualising and understanding the meaning and content of Information faster and at a considerably lower cost. The partnership will achieve this through combining the latest Information Management and Semantic techniques including:

  • Text Mining, Tagging, Entity Definition & Extraction.
  • Business Glossary, Data Dictionary & Data Governance techniques.
  • Taxonomy, Data Model and Ontology development.
  • Linked Data & Semantic Web analyses.
  • Data Profiling, Mining & Discovery.

This includes improving regulatory compliance in areas such as BCBS, enabling new investment research and client reporting techniques as well as general efficiency drivers such as faster integration of mergers and acquisitions. As part of the partnership, Data to Value Ltd. will offer solution services and training in PoolParty product offerings, including ontology development and data modeling services.

Nigel Higgs, Managing Director of Data to Value, notes: “this is an exciting collaboration between two firms which are pushing the boundaries in the way Data, Information and Semantics are managed by business stakeholders. We spend a great deal of time helping organisations at a grass-roots level pragmatically adopt the latest Information Management techniques. We see this partnership as an excellent way for us to help organisations take realistic steps towards adopting the latest semantic techniques.”

Andreas Blumauer, CEO of Semantic Web Company adds, “The consortium of our two companies offers a unique bundle, which consists of a world-class semantic platform and a team of experts who know exactly how Semantics can help to increase the efficiency and reliability of knowledge intensive business processes in the financial industry.”

Categories: Blogroll

The NIPS experiment

Machine Learning Blog - Wed, 2015-01-07 15:38

Corinna Cortes and Neil Lawrence ran the NIPS experiment where 1/10th of papers submitted to NIPS went through the NIPS review process twice, and then the accept/reject decision was compared. This was a great experiment, so kudos to NIPS for being willing to do it and to Corinna & Neil for doing it.

The 26% disagreement rate presented at the conference understates the meaning in my opinion, given the 22% acceptance rate. The immediate implication is that between 1/2 and 2/3 of papers accepted at NIPS would have been rejected if reviewed a second time. For analysis details and discussion about that, see here.

Let’s give P(reject in 2nd review | accept 1st review) a name: arbitrariness. For NIPS 2014, arbitrariness was ~60%. Given such a stark number, the primary question is “what does it mean?”

Does it mean there is no signal in the accept/reject decision? Clearly not—a purely random decision would have arbitrariness of ~78%. It is however quite notable that 60% is much closer to 78% than 0%.
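
One back-of-the-envelope way to see where these figures come from (a reading of the numbers above, assuming the ~26% disagreement is split symmetrically between the two committees): with acceptance rate a ≈ 0.22 and disagreement rate d ≈ 0.26, roughly d/2 ≈ 0.13 of all submissions are accepted by the first committee but rejected by the second, so P(reject in 2nd review | accept in 1st review) ≈ (d/2)/a ≈ 0.13/0.22 ≈ 0.6. A purely random committee rejects independently of the first decision, so its arbitrariness is simply 1 - a ≈ 0.78.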

Does it mean that the NIPS accept/reject decision is unfair? Not necessarily. If a pure random number generator made the accept/reject decision, it would be ‘fair’ in the same sense that a lottery is fair, and have an arbitrariness of ~78%.

Does it mean that the NIPS accept/reject decision could be unfair? The numbers give no judgement here. It is however a natural fallacy to imagine that random judgements derived from people imply unfairness, so I would encourage people to withhold judgement on this question for now.

Is an arbitrariness of 0% the goal? Achieving 0% arbitrariness is easy: just choose all papers with an md5sum that ends in 00 (in binary). Clearly, there is something more to be desired from a reviewing process.

Perhaps this means we should decrease the acceptance rate? Maybe, but this makes sense only if you believe that arbitrariness is good, as it will almost surely increase the arbitrariness. In the extreme case where only one paper is accepted, the odds of it being rejected on re-review are near 100%.

Perhaps this means we should increase the acceptance rate? If all papers submitted were accepted, the arbitrariness would be 0, but as mentioned above arbitrariness 0 is not the goal.

Perhaps this means that NIPS is a very broad conference with substantial disagreement by reviewers (and attendees) about what is important? Maybe. This even seems plausible to me, given anecdotal personal experience. Perhaps small highly-focused conferences have a smaller arbitrariness?

Perhaps this means that researchers submit themselves to an arbitrary process for historical reasons? The arbitrariness is clear, but the reason less so. A mostly-arbitrary review process may be helpful in the sense that it gives authors a painful-but-useful opportunity to debug the easy ways to misinterpret their work. It may also be helpful in that it perfectly rejects the bottom 20% of papers which are actively wrong, and hence harmful to the process of developing knowledge. None of these reasons are confirmed of course.

Is it possible to do better? I believe the answer is “yes”, but it should be understood as a fundamentally difficult problem. Every program chair who cares tries to tweak the reviewing process to be better, and there have been many smart program chairs who tried hard. Why isn’t it better? There are strong nonvisible constraints on the reviewers’ time and attention.

What does it mean? In the end, I think it means two things of real importance.

  1. The result of the process is mostly arbitrary. As an author, I found rejects of good papers very hard to swallow, especially when the reviews were nonsensical. Learning to accept that the process has a strong element of arbitrariness helped me deal with that. Now there is proof, so new authors need not be so discouraged.
  2. CMT now has a tool for measuring arbitrariness that can be widely used by other conferences. Joelle and I changed ICML 2012 in various ways. Many of these appeared beneficial and some stuck, but others did not. In the long run, it’s the things which stick that matter. Being able to measure the review process in a more powerful way might be beneficial in getting good review practices to stick.

Other commentary from Lance, Bert, and Yisong.

Edit: Cross-posted on CACM.

Categories: Blogroll

Stamp of Approval

Data Mining Blog - Sat, 2015-01-03 00:31

After getting a hint of this a few months ago, I've finally tracked down an image of a stamp that will be released this year to celebrate the invention of the World Wide Web using an image I created.

Here's the source.

Think of me as you lick 'em.

Categories: Blogroll

Vowpal Wabbit 7.8 at NIPS

Machine Learning Blog - Sat, 2014-12-06 21:46

I just created Vowpal Wabbit 7.8, and we are planning to have an increasingly less heretical followup tutorial during the non-“ski break” at the NIPS Optimization workshop. Please join us if interested.

I always feel like things are going slowly, but in the last year there have been many changes overall. Notes for 7.7 are here. Since then, there are several areas of improvement as well as generalized bug fixes and refactoring.

  1. Learning to Search: Hal completely rewrote the learning to search system, enough that the numbers here are looking obsolete. Kai-Wei has also created several advanced applications for entity-relation and dependency parsing which are promising.
  2. Languages: Hal also created a good python library, which includes call-backs for learning to search. You can now develop advanced structured prediction solutions in a nice language. Jonathan Morra also contributed an initial Java interface.
  3. Exploration: The contextual bandit subsystem now allows evaluation of an arbitrary policy, and an exploration library is now factored out into an independent library (principally by Luong with help from Sid and Sarah). This is critical for real applications because randomization must happen at the point of decision.
  4. Reductions: The learning reductions subsystem has continued to mature, although the perfectionist in me is still dissatisfied. As a consequence, it’s now pretty easy to program new reductions, and the efficiency of these reductions has generally improved. Several new ones are cooking.
  5. Online Learning: Alekh added an online SVM implementation based on LaSVM. This is known to parallelize well via the para-active approach.

This project has grown quite a bit—there are about 30 different people contributing to VW since the last release, and there is now a VW meetup (December 8th!) in the bay area that I wish I could attend.

Categories: Blogroll

Allreduce (or MPI) vs. Parameter server approaches

Machine Learning Blog - Fri, 2014-11-28 18:01

In the last 7 years or so there has been quite a bit of work on parallel machine learning approaches, enough that I felt like a summary might be helpful both for myself and others. In each case, I put in the earliest known citation. If I missed something please comment.

One basic dividing line between parallel approaches is single-machine vs. multi-machine. Multi-machine approaches offer the potential for much greater improvements than single-machine approaches, but generally suffer from a lower bandwidth between components of the parallelized process.

Amongst single machine approaches, GPU-based learning is the dominant form of parallelism. For many algorithms, this can provide an easy 10x speedup, with the limits being programming (GPUs are special), the amount of GPU RAM (12GB for a K40), the bandwidth to the GPU interface, and your algorithms needing care as new architectures come out. I’m not sure who first started using GPUs for machine learning.

Another important characteristic of parallel approaches is deterministic vs. nondeterministic. When you run the same algorithm twice, do you always get the same result? Leon Bottou tells me that he thinks reproducibility is worth a factor of 2. I personally would rate it somewhat higher, just because debugging is such an intrinsic part of using machine learning algorithms and the debuggability of nondeterministic algorithms is greatly impaired.

  1. MPI gradient aggregation (See here (2007).) Accumulate gradient statistics in parallel and use a good solver to find a good solution. There are two weaknesses here:
    1. Batch solvers are slow compared to online gradient descent approaches, at least for the first pass.
    2. Large datasets typically do not sit in MPI clusters. There are good reasons for this—MPI clusters are typically not designed for heavy data work.
  2. Map-Reduce statistical query algorithms. The first paper (2007) of this sort was single machine, but it obviously applied to map-reduce clusters of the time starting the Mahout project. This addressed the second problem of the MPI approach, but not the first (batch algorithms are slow), and created a new problem (iteration and communication are slow in a map-reduce setting).
  3. Parameter averaging. (see here (2010)). Use online learning algorithms and then average parameter values. This dealt with both of the drawbacks of the MPI approach as applied to convex learning, but is limited to convex(ish) approaches and may take a long time to converge on datasets where a second order optimization is needed. Iteration in a map-reduce paradigm remains awkward/slow in many cases.
  4. Graph-based approaches. (see here (2010)). Learning algorithms that are represented by graphs can be partitioned across compute nodes and then operated on with parallel algorithms. This allows models larger than the state of a single machine. This addresses many learning algorithms that can be represented this way, but there are many that cannot be effectively represented this way as well.
  5. Parameter server approaches. (see here (2010)). This is distinct from graph-based approaches in that parameter storage and updating is broken out as a primary component. This allows models larger than the state of a single machine. Parameter server approaches require nondeterminism to be performant. There has been quite a bit of follow-up work on parameter server approaches including shockingly inefficient systems (2012) and more efficient systems (2014), although they remain less efficient than GPUs.
  6. Allreduce approaches. (see here (2011)) Allreduce is an MPI primitive which allows normal sequential code to work in parallel, implying very low programming overhead (a toy sketch follows this list). This allows parameter averaging, gradient aggregation, and iteration. The fundamental drawbacks are poor performance under misbalanced loads and difficulty with models that exceed working memory in size. A refined version of this approach has been used for speech recognition (2014).
  7. GPU+MPI approaches. (see here (2013)) GPUs are good and MPI is good, so GPU+MPI should be good. It is good, although there are caveats related to the phenomenal amount of computation a GPU provides compared to the communication available, even with a good interconnect. See the speech recognition paper above for a remediation.
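
As a toy illustration of the "normal sequential code" point, here is a parameter-averaging sketch using mpi4py and NumPy (an assumed setup for illustration, not VW's implementation; train_on_local_shard is a hypothetical stand-in for whatever sequential learner runs on each worker's data shard):

    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD

    def train_on_local_shard():
        # Stand-in for a sequential learner running on this worker's data shard.
        return np.random.rand(10)

    local_weights = train_on_local_shard()
    averaged = np.empty_like(local_weights)
    comm.Allreduce(local_weights, averaged, op=MPI.SUM)  # elementwise sum across workers
    averaged /= comm.Get_size()                          # turn the sum into an average

Each worker keeps its ordinary sequential training loop; launched with something like mpirun -n 8, only the single collective call is parallel-aware.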

Most of these papers are about planting a flag rather than determining what the best approach to parallelization is. This makes determining how to parallelize learning algorithms rather unclear. My present approach remains case-based.

  1. Don’t do it for the sake of parallelization. Have some other real goal in mind that justifies the effort. Parallelization is both simple and subtle which makes it unrewarding unless you really need it. Strongly consider the programming complexity of approaches if you decide to proceed.
  2. If you are locked into a particular piece of hardware or cluster software, then you don’t have much choice—make the best of it. For some people this is an MPI cluster, Hadoop, or an SQL cluster.
  3. If your data can easily be copied onto a single machine, then a GPU based approach seems like a good idea. GPU programming is nontrivial, but many people have done it at this point.
  4. If your data is of a multimachine scale you must do some form of cluster parallelism.
    1. Graph-based approaches can be the right answer when your graph is not too deeply interconnected.
    2. Allreduce-based approaches appear effective and easy to use in many other cases. I wish every cluster came with an allreduce library.
      1. If you are parsing limited (i.e. for linear representations) then a CPU cluster is fine.
      2. If you are compute limited, then a cluster of GPUs seems the way to go.

The above leaves out parameter server approaches, which is controversial since a huge amount of effort has been invested in parameter server approaches and reasonable people can disagree. The caveats that matter for me are:

  1. It might be that the right way to parallelize in a cluster has nothing to do with the right way to parallelize on a single machine, but this seems implausible.
  2. Success/effort appears relatively low. It’s surprising that you can effectively compete with mature parameter server approaches on compute heavy tasks using a single GPU.

Note that I’m not claiming parameter servers are useless—I think they could be effective if applied in situations where you cannot prebalance the compute load of parallel elements. But the extent to which this applies in a datacenter computation seems both manageable and a flaw of the datacenter that will be reduced with time.

Categories: Blogroll

Automatic Semantic Tagging for Drupal CMS launched

Semantic Web Company - Fri, 2014-11-28 09:30

REEEP [1] and CTCN [2] have recently launched Climate Tagger, a new tool to automatically scan, label, sort and catalogue datasets and document collections. Climate Tagger now incorporates a Drupal Module for automatic annotation of Drupal content nodes. Climate Tagger addresses knowledge-driven organizations in the climate and development arenas, providing automated functionality to streamline, catalogue and link their Climate Compatible Development data and information resources.

Climate Tagger for Drupal is a simple, FREE and easy-to-use way to integrate the well-known Reegle Tagging API [3], originally developed in 2011 with the support of CDKN [4] (now part of the Climate Tagger suite as the Climate Tagger API), into any web site based on the Drupal Content Management System [5]. Climate Tagger is backed by the expansive Climate Compatible Development Thesaurus, developed by experts in multiple fields and continuously updated to remain current (explore the thesaurus at http://www.reegle.info/glossary). The thesaurus is available in English, French, Spanish, German and Portuguese, and can connect content on different portals published in these different languages.

Climate Tagger for Drupal can be fine-tuned to individual (and existing) configuration of any Drupal 7 installation by:

  • determining which content types and fields will be automatically tagged
  • scheduling “batch jobs” for automatic updating (also for already existing contents; where the option is available to re-tag all content or only tag with new concepts found via a thesaurus expansion / update)
  • automatically limiting and managing volumes of tag results based on individually chosen scoring thresholds
  • blending with manual tagging


“Climate Tagger [6] brings together the semantic power of Semantic Web Company’s PoolParty Semantic Suite [7] with the domain expertise of REEEP and CTCN, resulting in an automatic annotation module for Drupal 7 with an accuracy never seen before” states Martin Kaltenböck, Managing Partner of Semantic Web Company [8], which acts as the technology provider behind the module.

“Climate Tagger is the result of a shared commitment to breaking down the ‘information silos’ that exist in the climate compatible development community, and to providing concrete solutions that can be implemented right now, anywhere,” said REEEP Director General Martin Hiller. “Together with CTCN and SWC we have laid the foundations for a system that can be continuously improved and expanded to bring new sectors, systems and organizations into the climate knowledge community.”

For the Open Data and Linked Open Data communities, a Climate Tagger plugin for CKAN [9] has also been published. It was developed by NREL [10] with the support of the CTCN and harnesses the same taxonomy and expert-vetted thesaurus behind the Climate Tagger, helping connect open data to climate compatible content through the simultaneous use of these tools.

REEEP Director General Martin Hiller and CTCN Director Jukka Uosukainen will be talking about Climate Tagger at the COP20 side event hosted by the Climate Knowledge Brokers Group in Lima [11], Peru, on Monday, December 1st at 4:45pm.

Further reading and downloads

About REEEP:

REEEP invests in clean energy markets in developing countries to lower CO2 emissions and build prosperity. Based on a strategic portfolio of high-impact projects, REEEP works to generate energy access, improve lives and economic opportunities, build sustainable markets, and combat climate change.

REEEP understands market change from a practice, policy and financial perspective. We monitor, evaluate and learn from our portfolio to understand opportunities and barriers to success within markets. These insights then influence policy, increase public and private investment, and inform our portfolio strategy to build scale within and replication across markets. REEEP is committed to open access to knowledge to support entrepreneurship, innovation and policy improvements to empower market shifts across the developing world.

About the CTCN

The Climate Technology Centre & Network facilitates the transfer of climate technologies by providing technical assistance, improving access to technology knowledge, and fostering collaboration among climate technology stakeholders. The CTCN is the operational arm of the UNFCCC Technology Mechanism and is hosted by the United Nations Environment Programme (UNEP) in collaboration with the United Nations Industrial Development Organization (UNIDO) and 11 independent, regional organizations with expertise in climate technologies.

About Semantic Web Company

Semantic Web Company (SWC, http://www.semantic-web.at) is a technology provider headquartered in Vienna (Austria). SWC supports organizations from all industrial sectors worldwide to improve their information and data management. Their products have outstanding capabilities to extract meaning from structured and unstructured data by making use of linked data technologies.

Categories: Blogroll

Introducing the Linked Data Business Cube

Semantic Web Company - Fri, 2014-11-28 07:09

With the increasing availability of semantic data on the World Wide Web and its reutilization for commercial purposes, questions arise about the economic value of interlinked data and the business models that can be built on top of it. The Linked Data Business Cube provides a systematic approach to conceptualizing business models for Linked Data assets. Similar to an OLAP Cube, the Linked Data Business Cube provides an integrated view on stakeholders (x-axis), revenue models (y-axis) and Linked Data assets (z-axis), thus allowing one to systematically investigate the specificities of various Linked Data business models.

 

Mapping Revenue Models to Linked Data Assets

By mapping revenue models to Linked Data assets we can modify the Linked Data Business Cube as illustrated in the figure below.

The figure indicates that with increasing business value of a resource, the opportunities to derive direct revenues rise. Assets that are easily substitutable generate little incentive for direct revenues but can be used to trigger indirect revenues. This basically applies to instance data and metadata. On the other hand, assets that are unique and difficult to imitate or substitute, i.e. in terms of the competence and investments necessary to provide the service, carry the highest potential for direct revenues. This applies to assets like content, services and technology. Generally speaking, the higher the value proposition of an asset – in terms of added value – the higher the willingness to pay.

Ontologies seem to function as a “mediating layer” between “low-incentive assets” and “high-incentive assets”. This means that ontologies as a precondition for the provision and utilization of Linked Data can be capitalized in a variety of ways, depending on the business strategy of the Linked Data provider.

It is important to note that each revenue model has specific merits and flaws and requires certain preconditions to work properly. Additionally they often occur in combination as they are functionally complementary.

Mapping Revenue Models to Stakeholders

A Linked Data ecosystem is usually comprised of several stakeholders that engage in the value creation process. The cube can help us to elaborate the most reasonable business model for each stakeholder.

Summing up, Linked Data generates new business opportunities, but the commercialization of Linked Data is very context specific. Revenue models change in accordance with the various assets involved and the stakeholders who make use of them. Knowing these circumstances is crucial to establishing successful business models, but doing so requires a holistic and interconnected understanding of the value creation process and of the specific benefits and limitations Linked Data generates at each step of the value chain.

Read more: Asset Creation and Commercialization of Interlinked Data

Categories: Blogroll

Blogger Stole My Thumbnails

Code from an English Coffee Drinker - Thu, 2014-11-13 05:19
I really hoped I was done writing posts about how Blogger was messing with our blogs, but unfortunately here I am writing another post. It seems that something in the way images are added to our posts has changed, and this in turn means that Blogger no longer generates and includes a thumbnail for each post in the feeds produced for a blog. Whilst this doesn't affect the look of your blog, it will make it look less interesting when viewed in a feed reader or blog list on another blog, as only the title of the posts will appear.

The problem appears to be that when you upload and embed an image, for some reason Blogger is omitting the http: from the beginning of the image URL. This means that all the image URLs now start with just //. This is perfectly valid (it's defined in RFC 1808) and is often referred to as a scheme relative URL. What this means is that the URL is resolved relative to the scheme (http, https, ftp, file, etc.) of the page in which it is embedded.

I'm guessing this change means that Blogger is looking at switching some blogs to be served over https instead of http. Leaving the scheme off the image URL means that the image will be served using the same scheme as the page, regardless of what that is. The problem though is that the code that Blogger uses to generate the post feeds seems to ignore images whose URLs don't start with http:, meaning that no thumbnails are generated.

For now the easy fix is to manually add the http: to the image URLs (or at least to the URL of the image you want to use as the thumbnail for the post). Of course it would be better if Blogger just fixed their code to spot these URLs properly and include them in the post feed.
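
If many posts are affected, the manual edit could in principle be scripted; here is a rough sketch in Python (the regex-on-HTML approach and forcing the http: scheme are assumptions based on the workaround above, not a Blogger-provided fix):

    import re

    def fix_scheme_relative_images(post_html):
        # Prepend "http:" to img src URLs that start with "//" so the feed
        # generator picks them up as thumbnails again.
        return re.sub(r'(<img[^>]+src=")//', r'\1http://', post_html)

    print(fix_scheme_relative_images('<img src="//example.blogspot.com/photo.png" />'))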

Updated 18th November 2014: It looks as if this problem has been resolved as I've just tried posting on one of my other blogs and the http: bit is back.
Categories: Blogroll

The Longform Manifesto

Data Mining Blog - Fri, 2014-09-26 00:37

Sometimes a title for a blog post suggests itself to me that seems so self-contained that it takes real effort to actually write the post ('Machine Intelligence, not Machine Learning is the Next Big Thing' is another in this line). The idea behind the (or a) Longform Manifesto is as follows. I have become aware of late of the sense of deterioration that is associated with the mobile 'revolution' and the info-snacking, casual-gaming and interrupt-driven lifestyle that it has entailed. The behaviours are perfectly illustrated in this scene from Portlandia:

 

With a daughter who has now come of technological age (she has a cell phone) it has become important to me to remind myself what content consumption was like before this mobile mess appeared.

We read books, we watched movies, we listened to music. But, of course, we haven't stopped doing that. Rather, we have started all this other stuff, and the problem is that this is influencing how we approach longform content. I find myself watching bits of movies, or listening to bits of music or reading parts of essays.

The Longform Manifesto, through the definition of longform content and the discipline and commitment needed to consume it as it was meant to be consumed, helps to dilute and remove the behaviour-degrading influence of mobile technology. Someone should write it.

Categories: Blogroll

Visualizing Publicly Available US Government Data Online

Information aesthetics - Fri, 2014-09-19 03:11


Brightpoint Consulting recently released a small collection of interactive visualizations based on open, publicly available data from the US government. Characterized by a rather organic graphic design style and color palette, each visualization makes a socially and politically relevant dataset easily accessible.

The custom chord diagram titled Political Influence [brightpointinc.com] highlights the monetary contributions made by the top Political Action Committees (PACs) for the 2012 congressional election cycle, for the House of Representatives and the Senate.

The hierarchical browser 2013 Federal Budget [brightpointinc.com] reveals the major flows of spending in the US government, at the federal, state, and local level, such as the relationship of spending between education and defense.

The circular flow chart United States Trade Deficit [brightpointinc.com] shows the US trade deficit over the last 11 years by month. The United States sells goods to the countries at the top, while, vice versa, the countries at the bottom sell goods to the US. The dollar amount in the middle represents the cumulative deficit over this period of time.

Categories: Blogroll

The Disappearing Planet: Comparing the Extinction Rates of Animals

Information aesthetics - Thu, 2014-09-18 15:05


The subtly designed A Disappearing Planet [propublica.org] by freelance data journalist Anna Flagg reveals the extinction rates of animals, driven by a variety of human-caused effects, including climate change, habitat destruction and species displacement.

Divided into mammals, reptiles, amphibians and birds, the interactive bar graph allows users to browse horizontally through the vast amount of species by order and family, and vertically by genus.

Species at risk are highlighted in red, so that dense clusters denote related families (e.g. bears, parrots, turtles) that are especially threatened over the next 100 years.

Categories: Blogroll

Scottish Independence : Bing Predicts 'No'

Data Mining Blog - Thu, 2014-09-18 11:58

Bing's prediction team has a feature live on the site right now that predicts Scotland will not become an independent nation as a result of today's referendum.

Categories: Blogroll

GitHut: the Universe of Programming Languages across GitHub

Information aesthetics - Fri, 2014-09-12 10:37


GitHut [githut.info], developed by Microsoft data visualization designer Carlo Zapponi, is an interactive small multiples visualization revealing the complexity of the wide range of programming languages used across the repositories hosted on GitHub.

GitHub is a web-based repository service which offers the distributed revision control and source code management (SCM) functionality of Git, enjoying more than 3 million users.

Accordingly, by representing the distribution and frequency of programming languages, one can observe the continuous quest for better ways to solve problems, to facilitate collaboration between people and to reuse the effort of others.

Programming languages are ranked by various parameters, ranging from the number of active repositories to new pushes, forks or issues. The data can be filtered over discrete moments in time, while evolutions can be explored by a collection of timelines.

Categories: Blogroll

Web 2: But Wait, There's More (And More....) - Best Program Ever. Period.

Searchblog - Thu, 2011-10-13 12:20
I appreciate all you Searchblog readers out there who are getting tired of my relentless Web 2 Summit postings. And I know I said my post about Reid Hoffman was the last of its kind. And it was, sort of. Truth is, there are a number of other interviews happening... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview Reid Hoffman, Founder, LinkedIn (And Win Free Tix to Web 2)

Searchblog - Wed, 2011-10-12 11:22
Our final interview at Web 2 is Reid Hoffman, co-founder of LinkedIn and legendary Valley investor. Hoffman is now at Greylock Partners, but his investment roots go way back. A founding board member of PayPal, Hoffman has invested in Facebook, Flickr, Ning, Zynga, and many more. As he wears (at... (Go to Searchblog Main)
Categories: Blogroll

Help Me Interview the Founders of Quora (And Win Free Tix to Web 2)

Searchblog - Tue, 2011-10-11 12:54
Next up on the list of interesting folks I'm speaking with at Web 2 are Charlie Cheever and Adam D'Angelo, the founders of Quora. Cheever and D'Angelo enjoy (or suffer from) Facebook alumni pixie dust - they left the social giant to create Quora in 2009. It grew quickly after... (Go to Searchblog Main)
Categories: Blogroll