Mining for Memes

Jon Udell has responded to my question about whether his approach to meme tracking could show if the increased reporting of identity breaches is leading to desensitization or to increased watchfulness:

Bruce Schneier wonders if the ongoing reports of identity loss are creating a boy-who-cried-wolf situation. Are people starting to tune this stuff out? And will that result in less pressure for reform?

Kim Cameron wonders whether or not the boy really is crying wolf:

Bruce's concept of an attenuation effect is pretty interesting. But I'm not sure it's true. I really get the feeling that the public is gaining a consciousness of these issues. That is a really big deal. The increased consciousness – and thus interest – may counteract attenuation. It would be interesting to see our friend Jon Udell do one of his meme studies to see if the attenuation is really happening. I'll ask him if it's possible.

What Kim is referring to is this posting about the ACLU Pizza screencast, which lots of people had seen before he had. It's flattering to be considered some kind of meme-mining expert, but that's hardly the case. All I did was chart Bloglines and references to a single URL.

A variant of this approach has been around for a long time: mining the Usenet for occurrences of keywords. Via Nat Torkington's post on PHP's 10th anniversary I found this “memegraph” from Broward Horne, who's evidently been doing meme mining for a while.

These techniques are useful, but they only scratch the surface. I can imagine a methodology that uses correlated bundles of URLs and keywords. It would deliver historical views of references to the URLs, and of occurrences of the keywords, across the Usenet, the blogosphere, and the online Old Media, as well as across segmented slices of these: left/right, corporate/citizen, and so on.
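To make the idea of a "bundle" a little more concrete, here is a minimal Python sketch. It assumes the mentions have already been collected somehow; the `Mention` and `Bundle` types, the `monthly_counts` function, the source labels, and the sample data are all hypothetical names invented for illustration, not part of any real service. It simply counts, per source and per month, the items that either link to one of the bundle's URLs or contain one of its keywords.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Mention:
    """One dated item from some source (hypothetical record layout)."""
    day: date
    source: str                    # e.g. "usenet", "blogs", "old_media"
    text: str
    links: tuple[str, ...] = ()    # URLs the item refers to


@dataclass(frozen=True)
class Bundle:
    """A correlated set of URLs and lowercase keywords to track together."""
    urls: frozenset[str]
    keywords: frozenset[str]

    def matches(self, mention: Mention) -> bool:
        # A mention counts if it links to a tracked URL
        # or contains a tracked keyword (case-insensitive substring match).
        lowered = mention.text.lower()
        return (any(link in self.urls for link in mention.links)
                or any(word in lowered for word in self.keywords))


def monthly_counts(mentions, bundle):
    """Count matching mentions, keyed by (source, 'YYYY-MM')."""
    counts = Counter()
    for mention in mentions:
        if bundle.matches(mention):
            counts[(mention.source, mention.day.strftime("%Y-%m"))] += 1
    return counts


# Made-up sample data, purely for illustration.
sample = [
    Mention(date(2005, 5, 3), "blogs", "Another identity breach reported"),
    Mention(date(2005, 5, 9), "usenet", "see this", ("http://example.com/pizza",)),
    Mention(date(2005, 6, 1), "blogs", "Unrelated post about PHP"),
    Mention(date(2005, 6, 2), "blogs", "Identity breach fatigue?"),
]
bundle = Bundle(urls=frozenset({"http://example.com/pizza"}),
                keywords=frozenset({"identity breach"}))
counts = monthly_counts(sample, bundle)

for (source, month), n in sorted(counts.items()):
    print(source, month, n)
```

Slicing by left/right or corporate/citizen would amount to adding another label to each mention and to the counting key. The hard part, gathering the mentions in the first place, is deliberately left out of this sketch.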

When you attempt this kind of thing, as I sometimes have, you pretty quickly run into a wall. Creating these bundles and slices is a speculative and iterative game. But when you're playing the game with web crawlers and screen scrapers, it's tedious. Each iteration takes a long time and requires you to abuse your data sources.

What you'd really like to do is query the web's aggregation engines in a structured, high-volume way. When I've mentioned this before, the pushback has always been: “Why should they offer such services for free?” And my answer has been: “They shouldn't. Offering metered versions of such services is a huge business opportunity.”

In some cases, I'm told, these mining services are available on a partner basis. But they've yet to emerge into the mainstream, and I'd love to see that happen. It would unleash a flood of creative trend analysis. It would also be a fascinating study in the economics of web services. What kinds of queries can feasibly be offered? How can the quantity or resolution of results be tuned for tiered pricing? What kinds of queries can't be released into the wild because they're so strategic that they'd erode competitive advantage?

Meanwhile, of course, if I were a Microsoft architect or developer trying to understand trends affecting my technology or product, I'd hope that my company's own aggregation engine would support me with the kinds of data mining I'm envisioning. I wonder if it does?

I like Jon's no-nonsense approach. We should be treating aggregation engines as sensors to be monitored, in a kind of real-time process-control sense.

He's right. I would be able to do a better job with the toolset he describes. And his questions about Microsoft's aggregation engine are good ones. Time for me to go off and think.

Published by

Kim Cameron

Work on identity.