Sequence Data Mining is a type of Analysis which aims at extracting patterns from sequences of Events. We can also see Sequence Data Mining as Associations Discovery Analysis with a Temporal Element.
Sequence Data Mining has many potential applications (Web Page Analytics, Complaint Events, Business Processes) but here we will show an application in Health. I believe that this type of Analysis will become even more important as wearable technology is used more widely and therefore more Data of this kind is generated.
Consider the following hypothetical scenario :
A 30-year-old Male patient complains of several symptoms which -for simplicity- we will name Symptom1, Symptom2, Symptom3, etc.
His Doctor tries to identify what is going on; the patient undergoes all the necessary Blood work, which reveals no problems. After thorough evaluation the Doctor believes that his patient suffers from Chronic Fatigue Syndrome. Under the Doctor's supervision the patient will record his symptoms along with the different supplements he takes to understand more about his condition. Several events (e.g. a Visit to the Gym, a stressful Event) will also be taken into consideration to see if any patterns emerge.
- How can we easily record Data for the scenario above?
- Can we extract sequences of events that occur more frequently than mere chance?
- Can we identify which sequences of Events / Food / Medication may potentially lead to specific Symptoms or to a lack of Symptoms?
Looking at the problem through the eyes of a Data Scientist, we have:
A series of Events that happen during a day : A Stressful event, A sedentary day, Cardio workouts, Weight Lifting, Abrupt Weather Deterioration, etc
A Number of Symptoms : Headaches, "Brain Fog", Mood problems, Insomnia, Arthralgia, etc.
Let's begin with Data Collection. We first suggest that the patient use an Android app called MyLogsPro (or some other equivalent application) to easily input information as it happens:
So if the patient feels a specific Symptom he will press the relevant Symptom button on his mobile device. The same applies for any events that have happened and any Food or Medication taken. As the day passes we have the following data collected :
The snapshot shows what happened starting on the 20th of August 2014: our patient logged the intake of Medication (at 08:22 AM) and/or Supplements upon waking up, then a Food entry was added at 08:47. At 11:06 the patient had a Symptom, immediately reached for his phone and pressed the relevant Symptom (Symptom No 4) button.
After many days of Data Collection we decide that it's time to analyze this information. We export the data from the application as a .csv file which looks as follows:
We will use KNIME to read the csv file, change the contents of the entries accordingly so that an Algorithm can read the events and then perform Sequence Data Mining. We have the following layout :
The File Reader reads the .csv file. During the Pre-processing block (shown in yellow), a String Manipulation node removes the colon (:) from the time field (e.g. 12:10 becomes 1210), the Sorter sorts the data by date and then by time, and a Java snippet uses the replaceAll() function to remove all leading zeros from the Time field (e.g. 0010 becomes 10).
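For readers without KNIME at hand, the same pre-processing can be sketched in a few lines of Python (the function name and the guard for midnight are my own additions, not part of the original workflow):

```python
import re

def preprocess_time(t: str) -> str:
    """Mimic the KNIME pre-processing: drop the colon, then strip leading zeros."""
    t = t.replace(":", "")      # "12:10" -> "1210"
    t = re.sub(r"^0+", "", t)   # "0010"  -> "10"
    return t if t else "0"      # guard: "00:00" would otherwise collapse to ""

print(preprocess_time("12:10"))  # -> 1210
print(preprocess_time("00:10"))  # -> 10
```

Sorting by date and then by the resulting numeric time field then gives the algorithm events in chronological order.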
The R Snippet loads the cSPADE Algorithm and then uses it to extract sequential patterns.
After executing the stream we get the following output :
The information consists of two outputs: the first is a list of sequences along with their support, and the second contains the output from rule induction, which gives us two more useful metrics (namely the lift and the confidence of each rule).
We immediately notice an interesting entry on the first output :
and on the second output we see that this particular rule has a lift of 1.4 and 0.8 confidence.
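For reference, all three metrics can be derived from simple counts over the sequence database; here is a minimal sketch (the counts below are hypothetical, chosen only to reproduce a confidence of 0.8 and a lift of about 1.4):

```python
def rule_metrics(n_total, n_antecedent, n_consequent, n_both):
    """Support, confidence and lift for a rule A => B over n_total sequences."""
    support = n_both / n_total                       # P(A and B)
    confidence = n_both / n_antecedent               # P(B | A)
    lift = confidence / (n_consequent / n_total)     # P(B | A) / P(B)
    return support, confidence, lift

# Hypothetical counts for a rule such as <Event3> => <Symptom4>
s, c, l = rule_metrics(n_total=100, n_antecedent=20, n_consequent=57, n_both=16)
print(round(s, 2), round(c, 2), round(l, 2))  # 0.16 0.8 1.4
```

A lift above 1 means the consequent occurs more often after the antecedent than its base rate would suggest.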
However, as Data Scientists we should always double-check the extracted knowledge and must be aware of pitfalls. Let's see some examples (list not exhaustive) :
1) The algorithm does not account for time as it should : As an example, consider the following entries :
We assume that Medication1 is taken by mouth and needs 60 minutes to be properly dissolved, and that these entries occur frequently enough in that order in our data set. Even though the algorithm might show a statistically significant pattern, it is not logical to hypothesize that Medication1 could be related to Symptom2. The Analyst should first examine these entries to see what proportion of the records has a time difference of -say- at least 60 minutes.
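That sanity check is easy to automate. A hedged sketch follows; the timestamps and the simple "pair each Medication1 with the next Symptom2" logic are illustrative assumptions, not part of the original workflow:

```python
from datetime import datetime

# Hypothetical log entries: (timestamp, entry)
log = [
    ("2014-08-20 08:22", "Medication1"),
    ("2014-08-20 08:47", "Symptom2"),
    ("2014-08-21 09:00", "Medication1"),
    ("2014-08-21 10:30", "Symptom2"),
]

def gap_minutes(log, antecedent, consequent):
    """Minutes elapsed between each antecedent and the next consequent entry."""
    gaps, pending = [], None
    for ts, entry in log:
        t = datetime.strptime(ts, "%Y-%m-%d %H:%M")
        if entry == antecedent:
            pending = t
        elif entry == consequent and pending is not None:
            gaps.append((t - pending).total_seconds() / 60)
            pending = None
    return gaps

gaps = gap_minutes(log, "Medication1", "Symptom2")
plausible = sum(g >= 60 for g in gaps) / len(gaps)
print(gaps, plausible)  # [25.0, 90.0] 0.5
```

Only the pairs with a plausible gap (here, at least 60 minutes) should count as evidence for the rule.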
Apart from the example shown above we must consider the opposite effect. Consider this entry :
In other words: is it possible for a Medication taken in the morning to generate a Symptom 12 hours later?
2) The algorithm is not able to account for the compounding effect of a Medication. For example, the patient might have low levels of Taurine, and for this level to be replenished, a certain number of days of Taurine supplementation is needed. The algorithm cannot account for this possibility.
3) The patient should also input entries of "No Symptoms". It is not clear, however, when this should be done (e.g. at the end of each day? assess every 6 hours and add 2 entries accordingly?)
However, this does not mean that a Sequence Mining algorithm should not be used under these circumstances. This technique can generate several potentially interesting hypotheses which Doctors and/or Researchers may wish to pursue further.
I just attended CODE. The people interested in digital experimentation have very diverse backgrounds, encompassing theory, machine learning, social science, economics, and industry, so this seems like a good subject for a new conference. I hope it continues.
I found several talks interesting.
- Eytan Bakshy talked about PlanOut, which is a language/platform for flexibly specifying experiments.
- Ron Kohavi talked about EXP, which is a heavily used A/B testing platform.
- Susan Athey talked about long term vs short term metrics which seems both important to address, a constant problem, and not yet systematically solved.
There was a panel about the ongoing Facebook experimentation controversy. The issue here is complex. My understanding is that Facebook users have some expected ownership of the content they create, and hence aren’t comfortable with the content being used in unexpected ways. On the other hand, experimentation is so necessary to the functioning of all large modern internet sites that banning it or slowing down the process by a factor of a million (as some advocated) would badly degrade the future of these sites in practice.
My belief is that what’s lacking is education and trust. W.r.t. education, people need to understand that experimentation is unavoidable when trying to figure out how to optimize an enormously complex system, as there is just no other way to systematically make 1000 right decisions as is necessary for basic things like choosing the best homepage/search result/etc… W.r.t. trust, companies are not particularly good at creating trust in general, but finding the right mechanism for doing so seems critical. I would point out Vanguard as a company that managed to successfully create trust by design.
- 12% of Harvard is enrolled in CS 50: "In pretty much every area of study, computational methods and computational thinking are going to be important to the future" ()
- Excellent "What If?" nicely shows the value of back-of-the-envelope calculations and re-thinking what exactly it is you want to do ()
- The US has almost no competition, only local monopolies, for high speed internet ( )
- You can't take two large, dysfunctional, underperforming organizations, mash them together, and somehow make diamonds. When you take two big messes and put them together, you just get a bigger mess. ()
- "Yahoo was started nearly 20 years ago as a directory of websites ... At the end of 2014, we will retire the Yahoo Directory." ( )
- Investors think that Yahoo is essentially worthless ()
- "At a moment when excitement about the future of robotics seems to have reached an all-time high (just ask Google and Amazon), Microsoft has given up on robots" ()
- "Firing a bunch of tremendously smart and creative people seems misguided. But hey—at least they own Minecraft!" ()
- "Macs still work basically the same way they did a decade ago, but iPhones and iPads have an interface that's specifically designed for multi-touch screens" ( )
- On the difficulty of doing startups ( )
- "Be glad some other sucker is fueling the venture capital fire" ()
- "Just how antiquated the U.S. payments system has become" ()
- Is everyone grabbing money from online donations to charities? Visa's charge fee on charities is only 1.35%, but the lowest online payment system for charities charges 2.2% and most charge much more than that. ()
- "For most people, the risk of data loss is greater than the risk of data theft" ()
- Password recovery "security questions should go away altogether. They're so dangerous that many security experts recommend filling in random gibberish instead of real answers" ()
- Brilliantly done, free, open source, web-based puzzle game with wonderfully dark humor about ubiquitous surveillance ()
- How Udacity does those cool transparent hands in its videos ()
- There's just a bit of interference when you move your hand above the phone, just enough interference to detect gestures without using any additional power or sensors ( )
- Small, low power wireless devices powered by very small fluctuations in temperature ( )
- Cute intuitive interface for transferring data between PC and mobile ( )
- "Federal funding for biomedical research [down 20%] ... forcing some people out of science altogether" ()
- Another fun example of virtual tourism ()
- Ig Nobel Prizes: "Dogs prefer to align themselves to the Earth's north-south magnetic field while urinating and defecating" ()
- Xkcd: "In CS, it can be hard to explain the difference between the easy and the virtually impossible" ( )
- Dilbert: "That process sounds like a steaming pile of stupidity that will beat itself to death in a few years" ()
- Dilbert on one way to do job interviews ()
- The Onion: "Startup Very Casual About Dress Code, Benefits" ()
- Hilarious South Park episode, "Go Fund Yourself", makes fun of startups ()
Sometimes a title for a blog post suggests itself to me which seems so self-contained that it takes real effort to actually write the post ('Machine Intelligence, not Machine Learning is the Next Big Thing' is another in this line). The idea behind the (or a) Longform Manifesto is as follows. I have become aware of late of the sense of deterioration associated with the mobile 'revolution' and the info snacking, casual gaming and interrupt-driven lifestyle that it has entailed. These behaviours are perfectly illustrated in this scene from Portlandia:
With a daughter who has now come of technological age (she has a cell phone) it has become important to me to remind myself what content consumption was like before this mobile mess appeared.
We read books, we watched movies, we listened to music. But, of course, we haven't stopped doing that. Rather, we have started all this other stuff, and the problem is that this is influencing how we approach longform content. I find myself watching bits of movies, or listening to bits of music or reading parts of essays.
The Longform Manifesto, through the definition of longform content and the discipline and commitment needed to consume it as it was meant to be consumed, helps to dilute and remove the behaviour-degrading influence of mobile technology. Someone should write it.
This news report is correct, the Microsoft Research Silicon Valley center has been cut. The New York lab has not been directly affected although obviously cross-lab collaborations are impacted, and we sympathize deeply with those involved. Most of the rest of MSR is not directly affected.
I’m not privy to the rationale behind the decision, but in my opinion there are some very strong people in the various groups (Algorithms, Architecture, Distributed Systems, Privacy, Software tools, Web Search), and I expect offers have started raining on them. In my experience, this is harrowing in the short term, yet I know that most of my previous colleagues ended up happier after the troubles hit Yahoo! Research 2 1/2 years ago.
Brightpoint Consulting recently released a small collection of interactive visualizations based on open, publicly available data from the US government. Characterized by a rather organic graphic design style and color palette, each visualization makes a socially and politically relevant dataset easily accessible.
The custom chord diagram titled Political Influence [brightpointinc.com] highlights the monetary contributions made by the top Political Action Committees (PAC) for the 2012 congressional election cycle, for the House of Representatives and the Senate.
The hierarchical browser 2013 Federal Budget [brightpointinc.com] reveals the major flows of spending in the US government, at the federal, state, and local level, such as the relationship of spending between education and defense.
The circular flow chart United States Trade Deficit [brightpointinc.com] shows the US Trade Deficit over the last 11 years by month. The United States sells goods to the countries at the top, while, vice versa, the countries at the bottom sell goods to the US. The dollar amount in the middle represents the cumulative deficit over this period of time.
The subtly designed A Disappearing Planet [propublica.org] by freelance data journalist Anna Flagg reveals the extinction rates of animals, caused by a variety of human-caused effects, including climate change, habitat destruction and species displacement.
Divided into mammals, reptiles, amphibians and birds, the interactive bar graph allows users to browse horizontally through the vast amount of species by order and family, and vertically by genus.
Species at risk are highlighted in red, so that dense clusters denote related families (e.g. bears, parrots, turtles) that are especially threatened over the next 100 years.
Bing's prediction team has a feature live on the site right now that predicts Scotland will not become an independent nation as a result of today's referendum.
GitHut [githut.info], developed by Microsoft data visualization designer Carlo Zapponi, is an interactive small multiples visualization revealing the complexity of the wide range of programming languages used across the repositories hosted on GitHub.
GitHub is a web-based repository service which offers the distributed revision control and source code management (SCM) functionality of Git, enjoying more than 3 million users.
Accordingly, by representing the distribution and frequency of programming languages, one can observe the continuous quest for better ways to solve problems, to facilitate collaboration between people and to reuse the effort of others.
Programming languages are ranked by various parameters, ranging from the number of active repositories to new pushes, forks or issues. The data can be filtered over discrete moments in time, while evolutions can be explored by a collection of timelines.
Visualize Pi [tumblr.com] is a mural project that aimed to use popular mathematics to connect Brooklyn students to the community with a visualization of Pi. It was funded by a successful Kickstarter project as proposed by visual artist Ellie Balk, The Green School students, staff and Assistant Principal Nathan Affield.
The mural seems to consist of different parts. A reflective line graph, reminiscent of a sound wave, represents the number Pi (3.14159...) by way of colors that are coded by the sequence of the prime numbers found in Pi (2,3,5,7), as well as height.
Additionally, a golden spiral was drawn based on the Fibonacci Sequence, as an exploration of the relationship between the golden ratio and Pi. The number Pi was represented in a color-coded graph within the golden spiral. In this, the numbers are seen as color blocks that vary in size proportionately within the shrinking space of the spiral, representing the 'shape' of Pi.
"By focusing on the single, transcendental concept of Pi across courses, the mathematics department plans to not only deepen student understanding of shape and irrational number, but more importantly, connect these foundational mental schema for students while dealing with the concrete issues of neighborhood beautification and how proportion can inform aesthetic which can in turn improve quality of life."
Via @mariuswatz .
Next to its expressive aesthetic, the interactive features allow users to highlight individual nodes and their direct connections to others, as well as filter between the kinds of possible relationships, such as "hate", "strained", "good" or "love".
Reminds me a bit of Mapping the Relationships between the Artists who Invented Abstraction.
The right way to do personalization is to prove you're useful first. Personalization is just a tool. If a new tool doesn't work better than the old tool, it's useless. There's no reason to use personalized education unless it works better than unpersonalized education. A tool needs to be useful.
Teachers are already overworked and, after having been burned too many times on supposedly exciting new technologies that fail to help, correctly are cynical about tech startups coming in and demanding something of them. If some tech startup isn't helping a teacher get something done they need to get done, it's a bad tool and it's useless.
Parents are leery of companies that say they only want to help, and of what corporations are doing with the data they have on their children, correctly so given all the marketing abuses that have happened in the past.
Kids don't want more boring busywork to do -- they get enough of that already -- and don't see why anything this company is talking about helps them or is useful to them.
If a company wants to succeed in personalized education, it should:
- Be useful, noticeably raise test scores
- Not require additional busy work
- Be optional
- Have no marketing whatsoever, only use data to help
I would like to see a company use the existing standardized tests required by several states, analyze the incorrect answers to identify concepts a student is not understanding, and then print short worksheets targeting only those missed concepts for teachers to hand out to each student. The worksheets would be free and arrive in teachers' mailboxes. If a teacher doesn't want to hand them out, that's not a problem, but test scores go up for the classrooms where the teachers do hand them out. So, even if most teachers don't hand them out at first and most students throw them away at first, over time, more and more teachers will start handing them out and more and more students will do them, since the worksheets only help those who do them.
In examples like this, a startup could set itself up from the beginning to run large-scale experiments, showing different problems to different students and learning what raises test scores, what designs and lesson lengths cause students to stop, which concepts are important and which matter less, what can be taught easily this way and what cannot, what people enjoy, and what works.
When a company comes in and says, "Give us your data, teachers, parents, and kids, and do all this work. Maybe we'll boost your test scores for you later," they're being arrogant and tone-deaf. Everyone responds, "I don't believe you. How about you prove you're useful first? I'm busy. Do something for me or go away." And they're right to do so.
There likely is a way to do personalized education that everyone would embrace. But that way probably requires proving you're useful first. After all, personalization is just a tool.
The SEMANTiCS conference celebrated its 10th anniversary this September in Leipzig. And this year's edition opened a new age for the Semantic Web in Europe – a marketplace for the next generation of semantic technologies was born.
As Phil Archer stated in his keynote, the Semantic Web is now mature, and academia and industry can be proud of the achievements so far. Exactly that fact provided the thread for the conference: real-world use cases demonstrated by industry representatives, new and already running applied projects presented by the leading consortia in the field, and a vibrant academic community showing the next ideas and developments. So this year's SEMANTiCS conference brought together the European Community in Semantic Web Technology – both from academia and industry.
- Papers and Presentations: 45 (50% of them industry talks)
- Posters: 10 (out of 22)
- A marketplace with 11 permanent booths
- Presented Vocabularies at the 1st Vocabulary Carnival: 24
- Attendance: 225
- Geographic Coverage: 21 countries
This year’s SEMANTiCS was co-located and connected with a couple of other related events, like the German ISKO, the Multilingual Linked Open Data for Enterprises (MLODE 2014) and the 2nd DBpedia Community Meeting 2014. These wisely connected gatherings brought people together and allowed transdisciplinary exchange.
Recapitulatory speaking: This SEMANTiCS has opened up new sights on Semantic Technologies, when it comes to
- industry use
- problem solving capacity
- next generation development
- knowledge about top companies, institutes and people in the sector
- Save the date for SEMANTiCS 2015: 15th – 17th of September 2015, Vienna
- SEMANTiCS 2014 – picture gallery: Flickr
Visits [v.isits.in] automatically visualizes personal location histories, trips and travels by aggregating one's geotagged Flickr collection with a Google Maps history. Developed by Alice Thudt, Dominikus Baur and Prof. Sheelagh Carpendale, the map runs locally in the browser, so no sensitive data is uploaded to external servers.
The timeline visualization goes beyond the classical pin representation, which tend to overlap and are relatively hard to read. Instead, the data is shown as 'map-timelines', a combination of maps with a timeline that convey location histories as sequences of maps: the bigger the map, the longer the stay. This way, the temporal sequence is clear, as the trip starts with the map on the left and continues towards the right.
A place slider allows adjusting the map granularity, ranging from street level to country level.
- The overwhelming majority of smartphone users set up their phone once, then barely ever download a new app again ( )
- Cool and successful use of speculative execution in cloud computing for games, trading off extra CPU and bandwidth for the ability to hide network latency ()
- Infrared vision on your phone ( )
- How easy is it to get people to memorize hard-to-crack random 56-bit passwords, equivalent to about 12 random letters or 6 words? ( )
- Desalination needs warm water, data centers need to be cooled, why not put them together? Clever idea. ()
- It's easy to overhype this, but it's still pretty cool, transmitting data (0 and 1 bits) directly brain-to-brain without implants (using magnetic stimulation of the brain and EEG reading of the brain, both from the surface of the scalp) with relatively low error rates (5-15%). Data rates are extremely low at 2-3 bits/minute, but it's still interesting that it's possible at all. ()
- Xiaomi's remarkable iPhone clone ()
- Has Amazon sold less than 35k Fire phones? ( )
- Facebook publishes a paper which details how its ad targeting works and suggests they will be doing more personalization in the future ( )
- "Having a multiyear project with no checks along the way and the promise of one big outcome is not a highly successful approach, in or outside government" ( )
- More evidence patent trolls cause real harm. Trolled firms "dramatically reduce R&D spending". ()
- "Using nothing more than a laptop ... [they could] alter the normal timing pattern of the [traffic] lights, turning all the lights along a given route green, for instance, or freezing an intersection with all reds" ()
- Interesting data visualization showing how CD took over in music sales, then got replaced by downloads, all over the last two decades or so ()
- Neat charts on how the strike zone expands on 3 ball counts and contracts on 2 strike counts ()
- Cute SMBC comic on "What is the fastest animal?" ()
- Great SMBC comic on job interviews ()
Movies are shown as unique nodes, while their influences are depicted as directed edges. The color gradient from blue to red that originates in the 1980s denotes the era of postmodern cinema, the era in which movies tend to adapt and combine references from other movies.
Although the visualizations look rather minimalistic at first sight, their interactive features are quite sophisticated and the resulting insights are naturally interesting. Therefore, do not miss the explanatory movie below.
Via @albertocairo .
A World of Terror [periscopic.com] by Periscopic shows the reach, frequency and impact of about 25 terrorism groups around the world.
The visualization consists of 25 smartly organized pixel plots that are displayed as ordered small multiples. Ranging from Al-Qa'ida and the Taliban to less known organizations like Boko Haram, the plots reveal which ones are more deadly, are more recently active, or have been historically more active. In addition, all data can be filtered over time.
The data is based on the Global Terrorism Database (GTD), the most comprehensive and open-source collection of terrorism data available.
Described as an "atmospheric puzzle action game with a mindset of its own", its visual style is based entirely on the world of infographics. In essence, the concept of infographics seems to work as a gameplay environment not just because of its pretty aesthetics, but also because of its natural interaction with (visual) data.
Consequently, in Metrico, each action is quantified and explicitly shown, such as the number of times an avatar needs to jump up and down or shoots a projectile. Metrico's goal is thus similar to most infographics: enticing users to make sense of a complex system.
A Model of Breast Cancer Causation [cabreastcancer.org], designed by 'do good with data' visualization studio Periscopic, illustrates many of the factors that can lead to breast cancer and how they may interact with each other.
The interactive circos graph is meant to demonstrate the complexity of breast cancer causation, both to educate the general public and possibly to stimulate new scientific research in this direction. Users can explore the different influencing factors by domain, predicted correlation strength, as well as the quality of the data evidence behind them.
Since we already know at what angle people hold their face when taking a selfie in different cities, we now also know how they sleep differently: Which Cities Get the Most Sleep? [wsj.com] by interactive graphics editor Stuart A. Thompson of the Wall Street Journal compares the sleeping habits of citizens of different cities.
On the topic of sleep, Jawbone also just released an interesting graph revealing how the recent Napa earthquake affected the sleep of local residents [jawbone.com]. Indeed, the distance to the epicenter seems to correlate to the number of people who awoke, and the time it took for them to get back to sleep.
As the visualizations are based on a vast dataset released by Jawbone, the makers of a digitized wristband that tracks motion and sleep behavior, the data is not necessarily representative of the general population.