Probabilistic versus Determinate Linking

For those following the discussion on probabilistic versus determinate linking, it might be worth rewinding for a minute to consider the Fourth Law of Identity.

In presenting the fourth law, I agreed that traditional omnidirectional identifiers, by which I mean identifiers known to all, were appropriate for public contexts.

Here are some examples of what I meant by public contexts:

  • A stable well-known identifier is essential for MSDN, AOL, my bank, or even my Identity Blog. It is beneficial for such public identifiers to stay constant. I want readers to share information about www.identityblog.com. The more easily they can tell each other about the pieces on this and related websites, the better. These are all public things.
  • A well-known identifier is similarly appropriate for a “hot spot” in a shopping center. The hot spot is obviously “there” and a fixed wireless beacon is a helpful part of its presence. Otherwise I might end up exposing payment information to the wrong parties.
  • A well-known identifier is appropriate for a vending machine supporting digital payment. Again, the identifier would just be an extension of its physical presence.
  • A well-known identifier (an email address is a typical example) could be appropriate for a public role, like my role as architect of identity at Microsoft.
  • I could also employ a well-known identifier associated with a protective service offering more granular control. (For example, I use the i-name =Kim.Cameron to protect myself from spam – and it works really well.)

But in defining the fourth law I also argued that omnidirectional identifiers were not sufficient. In the parts of our lives where we act as private individuals, we should have access to a technology which prevents collusion over our identities except under our strict control. To achieve this, there are two approaches to the use of identifiers:

  • Avoid identifiers of any kind. This means (network addresses and information content aside – both separate discussions) that interaction contexts are completely disconnected – whether separated by points in time, or by the identity of the partner.
  • Use unidirectional identifiers, meaning identifiers known only to a single partner – so that an interaction context can be maintained with that partner over time, yet remain disconnected from interactions with other partners. I might subsequently choose to share some unidirectional identifiers between two (or more) partners – if they give me the right incentives. But because I am initially the only one who knows all the identifiers, collusion between my partners is impossible without my knowledge.
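As a concrete sketch of how the second approach might work (a simplified illustration of my own, not a specification – the function and names here are assumptions): deriving each partner-facing identifier from a user-held secret and the partner's name yields values that are stable for each partner, but that two partners cannot use as a shared correlation handle.

```python
import hmac
import hashlib

def unidirectional_id(master_secret: bytes, partner: str) -> str:
    """Derive an identifier known only to one partner.

    Each partner sees a different, stable value. Without the user's
    master secret, two partners cannot tell whether their respective
    identifiers refer to the same person.
    """
    digest = hmac.new(master_secret, partner.encode("utf-8"),
                      hashlib.sha256).hexdigest()
    return digest[:16]  # truncated for readability

secret = b"user-held-master-secret"
id_for_shop = unidirectional_id(secret, "shop.example")
id_for_bank = unidirectional_id(secret, "bank.example")

assert id_for_shop != id_for_bank                                 # no shared handle
assert id_for_shop == unidirectional_id(secret, "shop.example")   # stable over time
```

Sharing an identifier with a second partner – under the user's control – is then just a matter of the user revealing that one derived value, without compromising any of the others.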

Why would you want to separate interaction contexts this way?

To prevent partners with whom you have shared information of one kind from amalgamating it with information collected about you by other partners, in order to create a “super-dossier” across different aspects of your life. (If this seems improbable, click here, then read this.)

Solove and others have explained that there are outfits which even today attempt to discover the correlations between our profiles at different organizations or sites, which then assemble super-dossiers and sell them, even to government buyers. If such correlations are possible, why does it still make sense to insist on unidirectional identifiers?

I think there are several reasons.

The first is that if we want people to trust the emerging identity metasystem, we need to give them the ability to predict and intuit how it will behave.

Users can easily understand that if they give the same telephone number or email address to two different parties, those parties can use it to correlate their records. This happens in the so-called “real world” as well.

But if users release no identifying information whatsoever, a system which still sets up invisible correlation handles would really be failing them. If this sounds like an unlikely technical outcome, remember that this is precisely what happens in the typical use of client X.509 certificates. Even PGP is subject to this problem (and worse, reveals the membership of one's entire circle of trust).

But there is another reason – namely, that correlation handles virtually eliminate the cost of discovering correlations, while providing 100% accuracy. We know that if correlation has a significant cost, then there must be a significant and provable cost benefit to justify it. Conversely, if it comes for free, then super-dossiers come for free, and their proliferation – completely outside of the user's control – is more or less inevitable.

I would see this proliferation as catastrophic – partly because people don't want to live in a virtual world where they feel like characters in a Kafka novel; and partly because there is great likelihood it would ultimately bring about rejection of the underlying identity system by many of those most essential to its success – the opinion leaders, those who think deeply about the implications of things, those who innovate and create, those who affect public opinion.

By providing alternatives to the use of correlation handles, we not only increase the cost of discovering correlations, but also reduce the probability that any attempted correlation is correct. This, in turn, implies further hidden cost as misinformation turns into liability. These costs and liabilities combine to discourage commercial super-dossiers constructed without the permission and participation of the individual. Given a prohibitive cost model for super-dossier activity, other, less alienating means of developing real relations with customers are likely to be more cost-effective and beneficial all around.

That's really the background to yesterday's discussion about “Data Pollution and Separation of Context”. And since that posting, a number of comments have been made that are helpful in breaking through to a better understanding of how to think about and explain these complex issues.

Tom Gordon’s contribution rang very true with me, and sounds a warning about the effect Data Pollution and false correlation will have on customers in general:

I have had one large company in the UK use incorrect information when trying to contact me about services I was purchasing from them. However, since they had previously contacted me successfully (and another department in the same company had telephoned me a few days beforehand), it appears they deliberately chose to use outdated information so they would have a failed contact record.

The interesting thing is they used information that was 3 years old, even though the department in question had sent me a letter (to the correct address) a few weeks before.

Certainly that company hadn't cleaned up its customer identity data! The symptoms described often appear when previously independent entities have been brought together under a common umbrella through reorganization, including mergers and acquisitions. The same customer appears in multiple unrelated computer systems, and it's difficult to unify them. Metadirectory helps in this regard. But getting it right depends on what we call the “identity join”. How do we know two accounts refer to the same customer? And how do we keep from making mistakes when figuring this out?
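To make the “identity join” question concrete, here is a toy sketch of attribute-based matching (an illustration only – real metadirectory products use far richer matching rules, and the field names, weights, and sample records below are my own assumptions):

```python
def match_score(a: dict, b: dict) -> float:
    """Crude identity-join heuristic: weight agreement on a few attributes."""
    weights = {"email": 0.5, "phone": 0.3, "postcode": 0.2}
    score = 0.0
    for field, weight in weights.items():
        if a.get(field) and a.get(field) == b.get(field):
            score += weight
    return score

# Hypothetical records for the same customer in two unrelated systems:
crm = {"email": "t.gordon@example.co.uk", "phone": "0117 496 0000", "postcode": "BS1 4DJ"}
billing = {"email": "t.gordon@example.co.uk", "phone": None, "postcode": "BS1 4DJ"}

# A high score suggests the same customer, but anything short of certainty
# is a guess - which is exactly where join mistakes come from.
print(match_score(crm, billing))  # → 0.7
```

The point of the sketch is the gap between a score and the truth: the higher the stakes, the more the join needs confirmation rather than inference alone.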

On this subject, Felipe Conill writes:

In my current job I am leading an effort with the goal of presenting data to customers from separate databases that identify the customer differently (different customer identifiers).

One of the many challenges we have to address to solve this problem is the challenge of using the identity of the customer when he logs in from a browser (where the identity datasource is reachable from the internet – which in itself presents all kinds of security risks) to query other data sources, giving them information specific to their company.

The risk in doing this is that you don't want to show customer A the bill of customer B. To ensure this does not happen we need to have a mapping table to match who gets to see what.

Instead of doing this messy solution we are putting a common identifier for the customer in all of the datasources where there is customer data like Stefan suggests. Basically bypassing having to do an “identity join” to solve the problem.

I need to stop Felipe for a moment. Perhaps it's just a “vocabulary thing” – I see how a SQL aficionado, for example, might take the word “join” in a much more restrictive sense – but in the vocabulary Craig Burton and I developed, you are not avoiding doing an “identity join” at all. You are performing an identity join, and then using that to push a common identifier into all your systems as a way to represent it permanently.

This makes sense since your customers presumably want a single relationship with your company. I know I have been frustrated for years by the fact that my bank, for example, has still not “gotten it together” to give me a single login to all the services I use there.

Now the question becomes one of how you do the identity join. You probably have enough data to make a very well informed guess about what account should be joined to what. But if the data is important enough, you likely need to ask the user to verify your conclusions. For example, I have seen systems where “modern correlation technology” is used to propose that various accounts might belong to a given user, but which still ask the customer to demonstrate his ability to access them before information is merged.
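That propose-then-verify flow can be sketched very simply (a hypothetical illustration – the account names and the challenge mechanism are my own assumptions, standing in for “demonstrate your ability to access the account”):

```python
import secrets

def propose_merge(candidate_accounts: list) -> dict:
    """Issue a one-time challenge per candidate account.

    The customer proves control of each account (e.g. by logging in
    and echoing back the challenge) before any records are merged.
    """
    return {acct: secrets.token_hex(8) for acct in candidate_accounts}

def confirm_merge(challenges: dict, responses: dict) -> list:
    """Merge only the accounts whose challenge was answered correctly."""
    return [acct for acct, token in challenges.items()
            if responses.get(acct) == token]

challenges = propose_merge(["billing-00417", "support-99213"])

# Suppose the customer demonstrates access to the billing account only:
responses = {"billing-00417": challenges["billing-00417"]}
print(confirm_merge(challenges, responses))  # → ['billing-00417']
```

The correlation technology proposes; the customer disposes. Accounts the customer cannot (or will not) prove control of simply stay unjoined.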

In the real world you would never get to do this, since the same entity does not control all information. You have governments, foreign entities and business competitors that would never agree to use the same identifier for someone.

This is true, but let's suppose they did. Would the user want or accept this? If not, then it is my view – and Stefan's mathematical argument makes this point superbly – that it is virtually impossible to know accurately what to join to what.

I agree with Kim that privacy-through-obscurity can be achieved by technological protections of privacy. Correlation errors are inevitable if the scope is big, and in a lot of cases intolerable. By the way Kim, congrats on this blog. Reading from people at your level really stimulates thinking!

Thank you for those kind words – your point of view stimulates my thinking as well.

Simon Chen then says:

I agree with Stefan's analysis regarding how the error probability associated with linking identity accounts could become intolerable. However, I disagree with the alternative vision he's proposing: “Now, on the other hand, imagine a world where each user has only one user identifier that is the basis of all his or her interactions…”

I do not feel that all the users in the world can ever agree to use a single identifier representing him/her. The Internet is simply too diverse and mutable for something like this. Therefore, I believe that we can never avoid having to map identities between organizations.

I agree totally with Simon's point, but have to clear things up on Stefan's behalf: he would never propose use of a single identifier across contexts. In fact, he later added this clarifying remark:

Regarding Simon Chen's comment, perhaps it was not sufficiently clear that my paragraph “Now, on the other hand, imagine a world where each user has only one user identifier …” was intended ironically. In fact, my own work in the past 14 years is all about technically achieving, among other things, the user-controlled approach that Simon outlines. See, for instance, here and here.

So this is good – we are all on track both for separation of contexts and multiple identifiers, but coming at it from slightly different points of view. Simon continues:

There are fundamental error probabilities associated with identity linkage, but I also believe these errors can be made manageable with new trust models and infrastructure.

I think Stefan made the assumption that the service providers are responsible for updating identity mappings, which can lead to major data integrity problems.

While this is true in the current world, why not delegate this responsibility in the future to the users who own the identity information in the first place?

Now imagine, in the spirit of the First Law of Identity, a user with his identity information distributed across multiple service providers, where the providers are part of a trust network with established business relationships. The user can use a personal identity management interface to update how his/her identity information can be mapped and shared with his service providers, and these user-driven changes can be propagated across the trust network through this interface.

The personal identity management interface can be hosted by any service providers in the trust network, and it simply represents a gateway for the user to tap into the trust network and manage his/her identity. In this model, the user can control the number of different service providers (or contexts) that can store identity information about him and how his identity information can be shared across contexts.

Yes, much of this thinking is along the same lines as what is intended for InfoCards – not to imply that they represent a “silver bullet”.


Published by

Kim Cameron

Work on identity.