Semantic Software Lab
Concordia University
Montréal, Canada

Feed aggregator

15 days and Counting…

Data Mining Blog - Mon, 2009-01-05 23:30

That’s right – only 15 days until January 20th, which means only 16 days until the submission deadline for ICWSM 2009! Need a reason for going? Take a look at the lineup of invited speakers:

  • Jon Kleinberg, Cornell University, USA
      Jon Kleinberg is a Professor of computer science at Cornell University. His research focuses on issues at the interface of networks and information, with an emphasis on the social and information networks that underpin the Web and other on-line media. He is a member of the National Academy of Engineering and the American Academy of Arts and Sciences, and was a recipient of the 2005 MacArthur "Genius Award".
  • Lillian Lee, Cornell University, USA
      Lillian Lee is an Associate Professor in the Department of Computer Science at Cornell. She is well-known for her work in sentiment analysis, and has recently co-authored the book "Opinion mining and sentiment analysis". She has also authored influential work in many other aspects of statistical natural language processing, including in the areas of distributional similarity and natural language generation.
  • Duncan Watts, Columbia University, USA and Yahoo! Research
      Duncan J. Watts is a professor of sociology at Columbia University. His research focuses on the structure and evolution of social networks, the origins and consequences of social influence, and the nature of distributed "social" search. Among his many published works, he is particularly known for his 1998 paper with Steven Strogatz in which the two presented a mathematical theory of the small world phenomenon.

You could also take a look at the program from last year, but why do that when you can watch all the presentations online via VideoLectures.net, including invited talks from Brad Fitzpatrick, Marc Smith and David Sifry?

Categories: Blogroll

Biggest Day Ever

Searchblog - Mon, 2009-01-05 23:27
Yesterday's (well, technically, Sunday's) predictions 2009 post drew the single largest crowd ever to Searchblog in its five-plus year history. More than 50,000 people came to visit it, thanks in large part to Twitter and Digg. I'm honored and pleased as punch, and it really makes me want to... (Go to Searchblog Main: http://battellemedia.com)
Categories: Blogroll

Kindle

Data Mining Blog - Mon, 2009-01-05 16:25

This blog is now available on Amazon’s Kindle. To celebrate, here’s a graph illustrating Oprah’s impact on attention around the product.

To subscribe, use the button on the left!
Categories: Blogroll

Emotions, Beliefs and Analytics

Life Analytics Blog - Mon, 2009-01-05 11:34
When I first came across Data Mining and Machine Learning in 1997, I had no idea of the kinds of applications this field could have. As time passes, the knowledge available to a data/text miner becomes more and more a serious business... actually, a very serious one.

Not long ago I saw a presentation in which a map of emotions from the web was created in real time by aggregating specific keywords from blogs and forum posts. Twistori (http://twistori.com/) is an example of such an application. Now, let's take this idea one step further.

Twitter (http://twitter.com/) is a "social messaging utility" in which users describe what they are doing, or what they are feeling or thinking, right now. Users can send "tweets" even through SMS messages. The way these messages are written is an ideal format for text mining: short phrases that summarize what a user wants to say are a text miner's paradise.

It is logical to assume that text mining and information extraction techniques will become more important, since more data will be generated in the future. It is only a matter of time until the next "killer app" like Facebook, YouTube and Twitter appears. Data/text miners will be able to identify common "thought clusters" of people.

Now, consider the following example: by visiting this link (http://search.twitter.com/search?max_id=1080629354&page=1&q=%22i+don%27t+want+to%22) you will get a list of people who have written the phrase "I don't want to..." in their "tweets".

Once this textual information is captured, preprocessed and then analyzed through clustering analysis, we could end up with the following clusters of "I don't want-ers":

- The cluster of users who do not want to work again/tomorrow/today (18.5%)
- The cluster of users who do not want to go to sleep (6%)
- The cluster of users who do not want to hurt someone (4.2%)

What is also interesting is the ability to quantify the proportion of cases belonging to each cluster relative to the total number of tweets. As shown in the example above, the most frequently occurring thought comes from people who do not feel like working.

In the same way, one could perform this type of analysis for:

"I believe..."
"I wish I..."
"I want to buy..."

Essentially, what we are talking about is the extraction of the values, hopes and beliefs of hundreds of thousands, or even millions, of users... and in descending order. Once a first run is performed and clusters are extracted, one could run this process again every month and see the trends of those clusters over time. It would also be interesting to see how these thought clusters change after specific world events.

For some people, such as marketers and social researchers, this information is invaluable, provided that the results are accurate enough. Others might feel that such an analysis is bad practice. Of course, there are companies that already capture brand sentiment across the web: Crimson Hexagon (http://www.crimsonhexagon.com/home/) and Twitrratr (http://twitrratr.com/) are just two examples.

This post is the first in a series discussing the application of analytics to capture the thoughts that, as we speak, exist on the Web. We will go through ways that one could explore this information, and more specifically we will look at:

  • How clustering can group people's values, beliefs and emotions.
  • Why ontologies and Natural Language Processing are needed for better results.
  • How classification analysis might give us knowledge of the common characteristics of various 'categories' of users.
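
The counting step described in this post can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the author's method: the sample tweets are invented, the names `TWEETS`, `PHRASE` and `thought_shares` are hypothetical, and grouping by the first word after the phrase is a crude stand-in for real clustering analysis.

```python
import re
from collections import Counter

# Invented sample tweets; real data would come from a Twitter search.
TWEETS = [
    "I don't want to work tomorrow",
    "i don't want to go to sleep yet",
    "Ugh, I don't want to work today",
    "I don't want to hurt anyone",
    "I don't want to work again",
]

# Capture the first word that follows the phrase of interest.
PHRASE = re.compile(r"i don'?t want to (\w+)", re.IGNORECASE)


def thought_shares(tweets):
    """Group tweets by the word after the phrase; return (group, share) pairs.

    Pairs are sorted by descending share, mirroring the "descending order"
    ranking of thought clusters described in the post.
    """
    keys = []
    for tweet in tweets:
        match = PHRASE.search(tweet)
        if match:
            keys.append(match.group(1).lower())
    counts = Counter(keys)
    total = sum(counts.values())
    return [(key, count / total) for key, count in counts.most_common()]


shares = thought_shares(TWEETS)
for key, share in shares:
    print(f"{key}: {share:.1%}")
```

Rerunning the same function on each month's tweets would give the trend over time that the post mentions. A real analysis would need proper preprocessing and a clustering algorithm rather than a single-word key.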
Categories: Blogroll

Crystal Ball Gazing at the Coming Year in Tech Law

Michael Geist's Blog - Mon, 2009-01-05 04:55
Technology law and policy is notoriously unpredictable, and crystal ball gazing in Canada this year is particularly challenging given the current political and economic uncertainty. With that caveat, my weekly technology law column (Toronto Star version: http://www.thestar.com/sciencetech/article/561554, homepage version: http://www.michaelgeist.ca/content/view/3594/159/) provides my best guess for the coming months, including the following:

January. The Copyright Board of Canada releases its much-anticipated decision on the copyright royalties payable by primary and secondary schools across Canada. The board reduces the fees based on the Supreme Court of Canada’s liberal interpretation of fair dealing, Canada's version of fair use. At the end of the month, the government's budget includes the expected stimulus package for the auto and forestry sectors, but there is little for the culture and technology sectors.

February. The Canadian Radio-television and Telecommunications Commission kicks off a busy year with its new media hearings. The positions are by now well known - cultural groups seek the creation of a new ISP levy and increased regulation of Internet-based broadcasting, while most broadcasters and telecommunications companies support the status quo.

March. Secret negotiations on the Anti-Counterfeiting Trade Agreement resume in Morocco. Calls for greater transparency fall on deaf ears as the U.S., Japan, and South Korea urge participants to keep the treaty under wraps and to conclude the draft treaty by year-end.

April. The U.S. Trade Representative releases its annual Special 301 Report on the status of global intellectual property laws. Canada once again finds itself in good company as it (along with dozens of other countries) is criticized for failing to pass new copyright reform legislation.

May. The Privacy Commissioner of Canada releases high-profile privacy findings involving deep packet inspection by Canadian Internet service providers and the privacy policies of leading social networks. The Commissioner identifies several key concerns and urges the companies to alter their practices.

June. Industry Minister Tony Clement introduces new copyright reform legislation. The bill is modeled after the failed Bill C-61; however, it includes several key changes that respond to some of the criticisms that emerged last year. The month is a busy one for Clement, who expresses frustration with the state of the Canadian wireless market and the slow transition to digital television by Canadian broadcasters.

July. The CRTC holds several days of hearings on network neutrality. Telecommunications companies argue that reasonable network management practices should not be regulated and that traffic shaping should be permitted. A broad coalition of consumer groups, independent ISPs, technology companies, and cultural groups urges the Commission to establish net neutrality ground rules.

August. Several Canadian provinces follow the Ontario and British Columbia lead by tabling legislation to allow for the creation of "enhanced" driver's licences that embed RFID technologies. The proposals raise the ire of privacy and security experts across the country. Canadian provinces also continue the push toward universal high-speed Internet access services, with the majority of provinces unveiling plans to guarantee access to rural communities.

September. The federal government follows through on its 2008 campaign promise by introducing anti-spam legislation. The bill extends beyond spam by targeting phishing, spyware, and other Internet harms. The one-year anniversary of the do-not-call list also arrives - millions of Canadians have by now registered their numbers, yet the volume of unwanted telemarketing calls has barely declined.

October. The CRTC releases its new media decision, which rejects the ISP tax proposal, but opens the door to Canadian content requirements for Internet-based broadcasting by regulated broadcasters. Meanwhile, Industry Canada releases a consultation paper on its next round of spectrum auctions.

November. Canadians head back to voting booths for the fourth federal election in six years. The election call kills the copyright and anti-spam bills, which do not advance through the Parliamentary process.

December. The CRTC releases its net neutrality decision, ruling that Canadian ISPs should provide subscribers with more transparent disclosures about their network management practices, but the Commission stops short of blocking Internet throttling. The decision stands in stark contrast to developments in the U.S., where Congress passes a net neutrality bill backed by President Obama. Meanwhile, ACTA negotiators use the year's fourth negotiation session to conclude a draft treaty, setting the stage for a major battle over the treaty in 2010.
Categories: Blogroll

Public Domain Day 2009

Michael Geist's Blog - Sun, 2009-01-04 16:24
Wallace McLean offers (http://www.xanga.com/publicdomain/687985634/public-domain-day-2009.html) his annual public domain day list of authors whose work entered into the public domain in Canada on January 1st.
Categories: Blogroll

New Brunswick To Implement Province-Wide Broadband Initiative

Michael Geist's Blog - Sun, 2009-01-04 16:22
The Province of New Brunswick has announced (http://www.cbc.ca/technology/story/2008/12/24/nb-rural-internet.html) plans to provide province-wide broadband connectivity within the next 18 months.
Categories: Blogroll

Canada Border Services Opening Lawyer's Mail

Michael Geist's Blog - Sun, 2009-01-04 16:20
Cyndee Todgham Cherniak reports (http://tradelawyersblog.com/blog/archive/2008/december/article/beware-canada-border-services-agency-will-read-lawyers-mail/?tx_ttnews%5Bday%5D=30&cHash=91c1fc7674) that Canada Border Services opened mail addressed to her office at Lang Michener, raising significant privacy and client confidentiality concerns.
Categories: Blogroll

CPCC and Best Buy Resolve Private Copying Dispute

Michael Geist's Blog - Sun, 2009-01-04 16:19
The Canadian Private Copying Collective and retailer Best Buy have resolved (http://www.cpcc.ca/english/pdf/CPCC-Press-Release-re-CPCC-and-Best-Buy-Canada-Ltd-resolve-dispute-20081219.pdf) a dispute over private copying levies, with the CPCC receiving nearly $1 million in compensation.
Categories: Blogroll

Predictions 2009

Searchblog - Sun, 2009-01-04 15:43
Related: 2008 Predictions, 2008 How I Did, 2007 Predictions, 2007 How I Did, 2006 Predictions, 2006 How I Did, 2005 Predictions, 2005 How I Did, 2004 Predictions, 2004 How I Did.
In each of the past five years I've written a predictions post - usually at year's end or... (Go to Searchblog Main: http://battellemedia.com)
Categories: Blogroll

Pythagoras of Oz

Data Mining Blog - Fri, 2009-01-02 23:27

This winter has had plenty of the Wizard of Oz. First we read the book, then we watched the film, then we saw the Seattle Children’s Theatre production (we couldn’t buy the DVD as it appears to be largely unavailable). Most mathematicians are familiar with the big blooper at the end of the film in which the Scarecrow, having been given his ‘brains’, affects an intellectual tone and asserts what was intended to be Pythagoras’ theorem thus:

"The sum of the square roots of any two sides of an isosceles triangle is equal to the square root of the remaining side."

The SCT’s production (which was excellent, BTW), repeated this line verbatim. Is that good or bad?

I’d like it to be the case that the writers of the script intentionally slipped that in to make the acquisition of ‘brains’ simply via the proxy of a ‘diploma’ even more ironic, but I think that would be clutching at straws. Of course, in the book, the Scarecrow’s head is removed, scooped out and filled with bran and pins (to make him sharp – get it?).
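
For readers who want to see why the line is a blooper, here is a small Python check. It is an illustrative sketch only: the function names and the two sample triangles are made up for this example. Squaring the Scarecrow's claim sqrt(a) + sqrt(b) = sqrt(c) gives c = a + b + 2*sqrt(a*b), which is larger than a + b, so the claim contradicts the triangle inequality and cannot hold for any triangle, isosceles or otherwise.

```python
import math
from itertools import permutations


def scarecrow_claim(a, b, c):
    """The film's line: sqrt(a) + sqrt(b) equals sqrt(c)."""
    return math.isclose(math.sqrt(a) + math.sqrt(b), math.sqrt(c))


def pythagoras(a, b, c):
    """The real theorem for a right triangle with hypotenuse c."""
    return math.isclose(a * a + b * b, c * c)


right_triangle = (3, 4, 5)        # sample right triangle
isosceles_triangle = (5, 5, 6)    # sample isosceles triangle

pythagoras_holds = pythagoras(*right_triangle)

# "Any two sides": try every assignment of the sides to a, b and c.
scarecrow_holds = any(
    scarecrow_claim(a, b, c) for a, b, c in permutations(isosceles_triangle)
)

print("Pythagoras on 3-4-5:", pythagoras_holds)
print("Scarecrow on 5-5-6:", scarecrow_holds)
```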

Categories: Blogroll

Daniel Tunkelang idealizes Twitter

Text Technologies Blog - Fri, 2009-01-02 22:33

Daniel Tunkelang has a couple of recent posts decrying what amounts to, at least in his eyes, the abuse of Twitter. (My word, not his.)   For example, he writes in criticism of Loic LeMeur:

Twitter is a communication platform, not a marketing platform, and there’s a subtle difference.

But I’d disagree that there’s a bright line separating the two.  In particular, I think most business blogs serve or should serve as both, in no small part because the areas of marketing and communication overlap heavily. And in my opinion Twitter (microblogging) and ordinary blogging aren’t that far apart.

Earlier this evening I posted praise of the BI expert Twitter community — of which Daniel is indeed a member — even while admitting that unlike other members, I “follow” too many Twitterers to actually keep up with their posts.  Daniel refers to following patterns like mine as an attention Ponzi scheme, on the theory that people are following so many others in the pretense of paying attention to them, hoping to get real attention in return.

The first problem with that clever phrase is that Daniel is misusing the term “Ponzi scheme” to refer to an unrelated type of fraud. More seriously, it seems to assume that the only legitimate use of Twitter — or more precisely of following people on Twitter — is for full community engagement.  I dispute that assumption.  While I don’t follow tweetstreams in real time very often, I do occasionally dip in when I’m in the mood. And when I do, I prune my followee list for my own purposes.

I really wish the Twitter experience could be better filtered, into more manageable groups of people, topics, etc. But I’m not aware of any adequate software that does the job.  (Tweetdeck is horrific, or at least was when I regrettably tried to use it, in that you can’t temporarily close a group without losing all the entries in it forever.) In the meantime, there is a multitude of worthwhile ways to use Twitter.

Categories: Blogroll

Enterprise IT experts on Twitter

Text Technologies Blog - Fri, 2009-01-02 22:12

It was my birthday yesterday (New Year’s Day), and I remarked on Twitter that I seemed to be getting more automated greetings from message boards and the like than I was getting from real people.* Naturally, a number of folks set out to redress the imbalance :), specifically J A di Paolantonio, Rob Paller, Neil Raden, Claudia Imhoff, Gareth Horton, Donald Farmer, IdaRose Sylvester, and Seth Grimes.

*In retrospect that was a silly comment, made soon after midnight while humans were generally either partying or asleep. But it’s the set-up for the rest of this post.

Sheer self-indulgence aside — “Happy Birthday To Me!!” — I see something blogworthy in that. Indeed, it reflects the emergence over the past 6 months or so of one particular Twitter community. Takeaways include:

1. The responders weren’t a randomly selected subset from among those of my 1304 Twitter followers online when I tweeted. Every person who responded is an industry analyst, a BI expert, or both.

Yes Virginia, there are some enterprise IT folks on Twitter.

2. Members of the community seem to follow each other’s tweetstreams in their entirety. Many of their tweets are in direct reply to or otherwise inspired by each other. Indeed, based on the timing, I suspect a lot more folks were inspired by Neil Raden’s message to me than by my original post.

3. Unlike me, these other folks seem to keep their followee lists small enough to engage with. Following 100 or so people is not uncommon. By way of contrast, I follow 1682 people, which means that despite considerable care about who I follow, I wind up almost never actually checking what the tweetstream contains. (Instead, I usually just tweet something and react to the @replies.)

I no doubt like the charming Claudia Imhoff at least as well as she likes me. Even so, if there were a group of tweets about her birthday, I might well miss it — especially at first — just because I follow too many people to keep up.   More on that point in another post (coming soon).

4. Twitter is really just another venue for the evolution of an already-extant community. The independent BI analysts tend to travel as a pack anyway, to venues such as TDWI and Teradata Partners conferences, or to local get-togethers they hold in Colorado.

5. But Twitter does help that community evolve. I’ve really been brought into the club via Twitter. For example, the conversations that led to my teaching at the next TDWI Conference grew out of an email from Wayne Eckerson to the effect “Hi. I follow you on Twitter, and generally read your stuff. Can you help with a particular hardcore DBMS technology question I’ve run into?”

6. Twitter connections are useful. Twitter has made it easier for me to have offline conversations with Claudia, Wayne et al. My user-focused consulting services will be much richer for that.

Six months ago I felt that Twitter was dominated by the “new-age” tech folks — search engine optimizers, podcasters, social media consultants, Web 2.0 gurus and the like. But in one particular enterprise area — business intelligence — traditional IT folks are active as well. Perhaps similar ones will emerge in other areas of IT as well.

Categories: Blogroll

Danny on Microsoft

Searchblog - Wed, 2008-12-31 15:24
The man makes some very good points.... (Go to Searchblog Main: http://battellemedia.com)
Categories: Blogroll

15,000 Words You Might Have Missed

Searchblog - Wed, 2008-12-31 13:34
One of my readers noted that I've written a lot of off-blog stuff, and I'm rather proud of it. And I've noted (in my "How did I do 2008" post) that I did not really make the progress I wish I had on my book. But working with partners... (Go to Searchblog Main: http://battellemedia.com)
Categories: Blogroll

Danny Sullivan on Microsoft's Live Search

Greg Linden's Blog - Tue, 2008-12-30 19:13
Danny Sullivan wrote up a version of a talk he gave at Microsoft in June 2008 in his recent post, "Tough Love for Microsoft Search" (http://searchengineland.com/tough-love-for-microsoft-search-15968).

Danny Sullivan is an insightful writer, long-time watcher of the search industry, and founder of Search Engine Watch, Search Engine Land, and the popular Search Engine Strategies (SES) conference. His thoughts are well worth reading.

[Found via Todd Bishop (http://www.techflash.com/microsoft/Why_not_Microsoftcom_for_search36891274.html)]
Categories: Blogroll

Predictions 08: How Did I Do?

Searchblog - Tue, 2008-12-30 15:47
Related: 2007 Predictions, 2007 How I Did, 2006 Predictions, 2006 How I Did, 2005 Predictions, 2005 How I Did, 2004 Predictions, 2004 How I Did.
Reading over my predictions for 2008, I was struck with one thing: It wasn't a list. It was more of a narrative, making decoding... (Go to Searchblog Main: http://battellemedia.com)
Categories: Blogroll

Considering consistency at Amazon

Greg Linden's Blog - Mon, 2008-12-29 21:23
Amazon CTO Werner Vogels posted a copy of his recent ACM Queue article, "Eventually Consistent - Revisited" (http://www.allthingsdistributed.com/2008/12/eventually_consistent.html). It is a nice overview of the trade-offs in large-scale distributed databases and focuses on availability and consistency.

An extended excerpt:

    Database systems of the late '70s ... [tried] to achieve distribution transparency -- that is, to the user of the system it appears as if there is only one system instead of a number of collaborating systems. Many systems during this time took the approach that it was better to fail the complete system than to break this transparency.

    In the mid-'90s, with the rise of larger Internet systems ... people began to consider the idea that availability was perhaps the most important property ... but they were struggling with what it should be traded off against. Eric Brewer ... presented the CAP theorem, which states that of three properties of shared-data systems -- data consistency, system availability, and tolerance to network partition -- only two can be achieved at any given time .... Relaxing consistency will allow the system to remain highly available under the partitionable conditions, whereas making consistency a priority means that under certain conditions the system will not be available.

    If the system emphasizes consistency, the developer has to deal with the fact that the system may not be available to take, for example, a write ... If the system emphasizes availability, it may always accept the write, but under certain conditions a read will not reflect the result of a recently completed write ... There is a range of applications that can handle slightly stale data, and they are served well under this model.

    [In] weak consistency ... The system does not guarantee that subsequent accesses will return the updated value. Eventual consistency ... is a specific form of weak consistency [where] the storage system guarantees that if no new updates are made to the object, eventually all accesses will return the last updated value ... The most popular system that implements eventual consistency is DNS (Domain Name System).

    [In] read-your-writes [eventual] consistency ... [a] process ... after it has updated a data item, always accesses the updated value ... Session [eventual] consistency ... is a practical version of [read-your-writes consistency] ... where ... as long as [a] session exists, the system guarantees read-your-writes consistency. If the session terminates because of a certain failure scenario, a new session needs to be created and the guarantees do not overlap the sessions.

As Werner points out, session consistency is good enough for many web applications. When I make a change to the database, I should see it on subsequent reads, but anyone else who looks often does not need to see the latest value right away. And most apps are happy if this promise is violated in rare cases as long as we acknowledge it explicitly by terminating the session; that way, the app can establish a new session and either decide to wait for eventual consistency of any past written data or take the risk of a consistency violation.

Session consistency also has the advantage of being easy to implement. As long as a client reads and writes from the same replica in the cluster for the duration of the session, you have session consistency. In the event that node goes down, you terminate the session and force the client to start a new session on a replica that is up.

Werner did not talk about it, but some implementations of session consistency can cause headaches if a lot of clients are doing updates to the same data and care what the previous values were. The simplest example is a counter where two clients with sessions on different replicas both try to increment a value i and end up with i+1 in the database rather than i+2. However, there are ways to deal with this kind of data. For example, just for the data that needs it, we can use multiversioning while sending writes to all replicas or forcing all read-write sessions to the same replica. Moreover, a surprisingly vast amount of application data does not have this issue because there is only one writer, there are only inserts and deletes, not updates, or the updates do not depend on previous values.

Please see also Werner's older post, "Amazon's Dynamo" (http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html), which, in the full version of their SOSP 2007 paper at the bottom of his post, describes the data storage system that apparently is behind Amazon S3 and Amazon's shopping cart.
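
The counter problem described above can be reproduced with a toy simulation. This is a minimal sketch under stated assumptions, not how Dynamo or any Amazon system works: `Replica`, `Session` and `replicate` are hypothetical names, there are only two in-memory replicas, and replication is a naive last-write-wins copy.

```python
class Replica:
    """A toy replica: a dict plus a log of writes not yet sent to peers."""

    def __init__(self):
        self.data = {}
        self.pending = []

    def read(self, key):
        return self.data.get(key, 0)

    def write(self, key, value):
        self.data[key] = value
        self.pending.append((key, value))


def replicate(source, target):
    """Copy pending writes to the target; the incoming value overwrites."""
    for key, value in source.pending:
        target.data[key] = value
    source.pending.clear()


class Session:
    """Pins a client to one replica, so the client reads its own writes."""

    def __init__(self, replica):
        self.replica = replica

    def increment(self, key):
        value = self.replica.read(key) + 1
        self.replica.write(key, value)
        return value


a, b = Replica(), Replica()
alice, bob = Session(a), Session(b)

alice.increment("hits")  # replica a now holds 1
bob.increment("hits")    # replica b also holds 1; neither saw the other

# Within her session, alice sees her own write.
own_view = alice.replica.read("hits")

# After replication in both directions, one increment has been lost.
replicate(a, b)
replicate(b, a)
final = (a.read("hits"), b.read("hits"))
print(own_view, final)
```

Each client sees its own write, which is all that session consistency promises, yet two increments leave the counter at 1 on both replicas. Routing both read-write sessions to the same replica, one of the remedies mentioned in the post, would avoid the lost update in this sketch.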
Categories: Blogroll

Where “semantic” technology is or isn’t important

Text Technologies Blog - Mon, 2008-12-29 19:59

At Lynda Moulton’s behest, I spoke a couple of times recently on the subject of where “semantic” technology is or isn’t likely to be important.  One was at the Gilbane conference in early December.  The slides were based on my previously posted deck for a June talk I gave on a text analytics market overview. The actual Gilbane slides may be found here.

My opinions about the applicability of semantic technology include:

  • The big bucks in web search are for “transactional” web search, and semantics isn’t the issue there. (Slides 3-4)
  • When UIs finally go beyond the simple search box — e.g. to clusters/facets or to voice — semantics should have a role to play. (Slide 5)
  • Public-facing site search depends — more than any other area of text analytics — on hand-tagging. (Slide 7)
  • “Enterprise” search that searches specialized external databases could benefit from semantic technologies. (Slide
  • True enterprise search could benefit from semantic technologies in multiple ways, but has other problems as well. (Slides 10-11)
  • Semantics — specifically extraction — is central to custom publishing. (Slide 12 — upon review I regret using the word “sophisticated”)
  • Semantics is central to text mining. (Slide 18)
  • Semantics could play a big role in all sorts of exciting future developments. (Slide 19)

So what would your list be like?

Categories: Blogroll