Data Pollution and Separation of Context

Stefan Brands has posted one of the best argued and most important comments yet on the issue of identity correlation, the phenomenon giving rise to the Fourth Law of Identity.

By way of background, this was part of a conversation taking place in an ID Gang discussion group hosted by Berkman Law School. Our friend Drummond Reed posted a comment which, although perfectly innocent in its intent, sent me into Tasmanian Devil Mode.

‘Ever since I saw the shocking powers of modern correlation technology – it only takes 2 to 3 pieces of MANY kinds of perfectly innocent data (e.g., zip code and income) to uniquely identify a person with a 99+% statistical accuracy – I realized that privacy-through-obscurity was hopeless. Which means privacy-through-accountability is the only option.’

Accountability is indeed important, but not in any way a substitute for technological protections of privacy. Thus, although Drummond is a big supporter of the Fourth Law and context-specific identifiers, I felt it was necessary to underline the key importance of the distinction between probabilistic and determinate correlation. So I wrote:

‘The “modern correlation technology” argument made by Drummond easily leads to the wrong conclusions. The zip code plus income example is typical, and gets my goat because it leads some to say “you can be identified with a few pieces of information, so it doesn’t really matter if correlation handles exist.”

‘In Drummond’s example, how accurately has the income been expressed, and what is the size of the zipcode? … “Modern correlation technology” is based on approximations and fuzzy calculation and is very expensive relative to using “database keys”. It is appropriate to *keep it that way* and make it *more expensive still*.’
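To make the distinction concrete, here is a small illustrative sketch (my own, with entirely made-up figures) of why the precision of the underlying data matters. It builds a hypothetical population and counts how many people match a given zip code and an approximate income: when the income is only coarsely known, the result is a fuzzy guess; when it is known precisely, it starts to behave like a database key.

```python
# Illustrative sketch only: how a couple of coarse quasi-identifiers narrow a
# candidate pool, and how the precision of each attribute drives the result.
# Population size, zip codes, and income range are all hypothetical.
from random import Random

rng = Random(42)
zips = [f"98{d:03d}" for d in range(100)]

# Hypothetical population: 100,000 people spread over 100 zip codes.
population = [
    {"zip": rng.choice(zips),
     "income": rng.randint(20_000, 200_000)}
    for _ in range(100_000)
]

def candidates(zip_code, income, income_tolerance):
    """People matching the zip code exactly and the income only approximately."""
    return [p for p in population
            if p["zip"] == zip_code
            and abs(p["income"] - income) <= income_tolerance]

target = population[0]
for tol in (50_000, 5_000, 500, 50):
    matches = candidates(target["zip"], target["income"], tol)
    print(f"income known to within ±{tol:>6,}: {len(matches):>4} candidates")
```

The trend is the point: the coarser the attributes, the larger and fuzzier the candidate set, and the more expensive it is to turn a probabilistic guess into a determinate identification.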

Stefan then entered the discussion, extending our consideration of the problem of fuzzy calculation (of correlation) to include the inevitability of correlation errors.

Many readers will understand this because they know that to rationalize their identity infrastructure, enterprises have had to go through the well-known pain of doing what Craig Burton and I, over a decade ago, described as the “identity join”. This was the process of determining how the identifiers used in disparate computer and directory systems throughout the enterprise mapped to each other.

Performing this join accurately usually proved laborious – even though that join represented a trivial problem compared to one at the scale of the Internet as a whole. Further, enterprise administrators had many advantages over those trying to employ “modern correlation techniques”. Besides dealing with a relatively small population, they enjoyed unlimited access to data and identifying information, flexible tools, and the ability to ask the data subjects to collaborate! It was still expensive to get everything right.
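For readers who never lived through one of these projects, here is a minimal sketch (hypothetical systems and names, not drawn from any real deployment) of the fuzzy half of the identity join: matching records from two directories on name similarity yields candidate links rather than guaranteed ones, and a human still has to vet each candidate.

```python
# A minimal sketch of the fuzzy side of the "identity join": the systems,
# names, and threshold below are hypothetical.
from difflib import SequenceMatcher

hr_system = [
    {"emp_id": "E100", "name": "Jonathan A. Smith"},
    {"emp_id": "E101", "name": "Maria Garcia"},
]

mail_system = [
    {"login": "jsmith",  "display_name": "Jon A. Smith"},
    {"login": "mgarcia", "display_name": "Maria Garcia"},
    {"login": "jsmith2", "display_name": "John Smith"},
]

def name_similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Every pairing above the threshold is only a *candidate* link that still needs
# human review - this is the probabilistic, error-prone case, not a key join.
THRESHOLD = 0.7
for hr in hr_system:
    for mail in mail_system:
        score = name_similarity(hr["name"], mail["display_name"])
        if score >= THRESHOLD:
            print(f'{hr["emp_id"]} <-> {mail["login"]}  (similarity {score:.2f})')
```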

Stefan goes on to present an abstracted (mathematical) model which could be the basis for an economics of the phenomena described. If there isn't a ground-breaking paper waiting to be written about identity economics, I'll eat my hat.

Here is Stefan's contribution – one which I think is crucial (I've added some emphasis):

In support of Kim's defense of information privacy, here is another observation: there is a world of difference for organizations between

  1. a link between different user identifiers that is 100% guaranteed and
  2. a link that is only suspected (e.g., is Jon A. Smith really the same person as Jonathan A. Smith?).

Consider this. When organizations link up user accounts (also known as records, files, dossiers, etc.) that are indexed by different user identifiers and they have no guarantee of the correctness of the linkage, the aggregated information in the “super-account” may well become completely worthless to them and may even become a liability.

Even a 0.1% error probability may in many cases be intolerable. Imagine the consequences of hooking up the wrong health-care or crime-related information on a per-user basis and making a medical or criminal-justice decision on that basis.

Depending on the business of the organization, there may be a significant cost associated with acting on wrong information, not only from a liability perspective, but also from a goodwill, security, or resource cost perspective.

The more user accounts are linked up into one aggregated “super-account”, the higher the error probability. We are dealing with a geometric distribution here. In an abstracted model, if the probability of success in matching two user/account identifiers is p, then the probability that n user identifiers that are hooked up contain at least one error (i.e., they do NOT all pertain to the same person/entity — a case of “data pollution”) is 1 – p^(n-1). To appreciate how fast this error rate goes up when linking more and more user accounts, check out this site (requires Java). More sophisticated statistics can be applied directly out of the textbooks of econometricians.
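To see how quickly this compounds, here is a quick sketch (my addition, not part of Stefan's note) that simply evaluates 1 – p^(n-1) for a few hypothetical values of the per-link success probability p and the number of linked identifiers n:

```python
# Evaluating the abstracted model: if each pairwise link between two
# identifiers is correct with probability p, a chain of n identifiers is fully
# correct with probability p**(n-1), so the chance of at least one bad link
# (data pollution) is 1 - p**(n-1). The p and n values below are hypothetical.
for p in (0.999, 0.99, 0.95):
    for n in (2, 5, 10, 50, 100):
        error = 1 - p ** (n - 1)
        print(f"p = {p:<5}  n = {n:>3}  P(at least one bad link) = {error:.3f}")
```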

Now, on the other hand, imagine a world where each user has only one user identifier that is the basis for all his or her interactions with organizations and other users; no more error probability, no more data pollution when hooking up user accounts, regardless of how many! The strongest possible guarantee that different user identifiers (serving as account indices) really pertain to the same person occurs, of course, when user identifiers are “certified” by a trusted issuer; a national (or world…) ID chipcard with three-factor security would be the ultimate linking/profiling tool for organizations that naively believe that aggregating personal information across domains does not come with major security risks of its own.

In short, there is a major difference, from the perspective of organizational value, between being able to correlate with absolute infallibility and merely being able to guess with high success probability which user identifiers / account indices relate to the same user.

PS For civil liberties arguments in favor of avoiding correlation handles where not strictly needed, see for instance here and here.

I think it would be good to put together a brief paper that explores the problem of obtaining accuracy in doing the identity join, combining the experience gathered from metadirectory deployments with Stefan's mathematical explanation.

