Linkage – Page 7 – Kim Cameron's Identity Weblog

Beyond maximal disclosure tokens

I concluded my last piece on linkability and identity technology by explaining that the probability of collusions between Relying Parties (RPs) CAN be greatly reduced by using SAML tokens rather than X.509 certificates. To provide an example of why this is so, I printed out the content of one of the X.509 certificates I use at work, and here's what it contained:

Version	V3
Serial Number	13 9b 3c fc 00 03 00 19 c6 e2
Signature Algorithm	sha1RSA
Issuer	CN = IDA Enterprise CA 1
Valid From	Friday, February 23, 2007 8:15:27 PM
Valid To	Saturday, February 23, 2008 8:15:27 PM
Subject	CN = Kim Cameron OU = Users DC = IDA DC = Microsoft DC = com
Public Key	25 15 e3 c4 4e d6 ca 38 fe fb d1 41 9f ee 50 05 dd e0 15 dc d6 2a c3 cc 98 53 7e 9e b4 c7 a5 4c a7 64 56 66 1e 3d 36 4a 11 72 0a eb cf c9 d2 6c 1f 2e b2 2a 67 4f 07 52 70 36 f2 89 ec 98 09 bd 61 39 b1 52 07 48 9d 36 90 9c 7d de 61 61 76 12 5e 19 a5 36 e2 11 ea 14 45 b1 ba 12 e3 e2 d5 67 81 d1 1f bb 04 b1 cc 52 c2 e5 3e df 09 3d 2b a5
Subject Key Identifier	35 4d 46 4a 13 c1 ae 81 3b b8 b5 f4 86 bb 2a c0 58 d7 ad 92
Enhanced Key Usage	Client Authentication (1.3.6.1.5.5.7.3.2)
Subject Alternative Name	Other Name – Principal Name=kc@microsoft.com
Thumbprint	b9 c6 4a 1a d9 87 f1 cb 34 6c 92 50 20 1b 51 51 87 d5 a8 ee

Everything shown is released every time I use the certificate – which is basically every time I go to a site that asks either for “any old certificate” or for a certificate from my certificate authority. (As far as I know, the information is offered up before verifying that the site isn't evil). You can see that there is a lot of information leakage. X.509 certificates were designed before the privacy implications (to both individuals and their institutions) were well understood.

Beyond leaking potentially unnecessary information (like my email address), each of the fields shown in yellow is a correlation key that links my identity in one transaction to that in another – either within a single site or across multiple sites. Put another way, each yellow field is a handle that can be used to correlate my profiles. It's nice to have so MANY potential handles available, isn't it? Choosing between serial number, subject DN, public key, key identifier, alternative name and thumbprint is pretty exhausting, but any of them will work when trying to build a super-dossier.

I call this a “maximal disclosure token” because the same information is released everywhere you go, whether it is required or not. Further, it includes not one, but a whole set of omnidirectional identifiers (see law 4).

SAML tokens represent a step forward in this regard because, being constructed at the time of usage, they only need to contain information relevant to a given transaction. With protocols like the redirect protocol described here, the identity provider knows which relying party a user is visiting.

The Liberty Alliance has been forward-thinking enough to use this knowledge to avoid leaking omnidirectional handles to relying parties, through what it calls pseudonynms. For example, “persistent” and “transient” pseudonyms can be put in the tokens by the identity provider, rather than omnidirectional identifiers, and the subject key can be different for every invocation (or skipped altogether).

As a result, while the identity provider knows more about the sites visited by its users, and about the information of relevance to those sites, the ability of the sites to create cross-site profiles without the participation of the identity provider is greatly reduced. SAML does not employ maximal disclosure tokens. So in the threat diagram shown at the right I've removed the RP/RP collusion threat, which now pales in comparison to the other two.

As we will see, this does NOT mean the SAML protocol uses minimal disclosure tokens, and the many intricate issues involved are treated in a balanced way by Stefan Brands here. One very interesting argument he makes is that the relying party (he calls it “service provider or SP), actually suffers a decrease in control relative to the identity provider (IP) in these redirection protocols, while the IP gains power at the expense of the RP. For example, if Liberty pseudonyms are used, the IP will know all the customers employing a given RP, while the RP will have no direct relationship with them. I look forward to finding out, perhaps over a drink with someone who was present, how these technology proposals aligned with various business models as they were being elaborated.

To see how a SAML token compares with an X.509 certificate, consider this example:

You'll see there is an assertionID, which is different for every token that is minted. Typically it would not link a user across transactions, either within a given site or across multiple sites. There is also a “name identifier”. In this case it is a public identifier. In others it might be a pseudonym or “unidirectional identifier” recognized only by one site. It might even be a transient identifier that will only be used once.

Then there are the attributes – hopefully not all the possible attributes, but just the ones that are necessary for a given transaction to occur.

Putting all of this together, the result is an identity provider which has a great deal of visibility into and control over what is revealed where, but more protection against cross-site linking if it handles the release of attributes on a need-to-know basis.

Linkage in “redirect” protocols like SAML

Moving on from certificates in our examination of identity technology and linkability, we'll look next at the redirection protocols – SAML, WS-Federation and OpenID. These work as shown in the following diagram. Let's take SAML as our example.

In step 1, the user goes to a relying party and requests a resource using an http “GET”. Assuming the relying party wants proof of identity, it returns 2), an http “redirect” that contains a “Location” header. This header will necessarily include the URL of the identity provider (IP), and a bunch of goop in the URL query string that encodes the SAML request.

For example, the redirect might look something like this:

HTTP/1.1 302 Object Moved Date: 21 Jan 2004 07:00:49 GMT Location: https://ServiceProvider.com/SAML/SLO/Browser?SAMLRequest=fVFdS8MwFH0f7D% 2BUvGdNsq62oSsIQyhMESc%2B%2BJYlmRbWpObeyvz3puv2IMjyFM7HPedyK1DdsZdb%........2F% 50sl9lU6RV2Dp0vsLIy7NM7YU82r9B90PrvCf85W%2FwL8zSVQzAEAAA%3D% 3D&RelayState=0043bfc1bc45110dae17004005b13a2b&SigAlg=http%3A%2F% 2Fwww.w3.org%2F200%2F09%2Fxmldsig%23rsasha1& Signature=NOTAREALSIGNATUREBUTTHEREALONEWOULDGOHERE Content-Type: text/html; charset=iso-8859-1

The user's browser receives the redirect and then behaves as a good browser should, doing the GET at the URL represented by the Location header, as shown in 3).

The question of how the relying party knows which identity provider URL to use is open ended. In a portal scenario, the address might be hard wired, pointing to the portal's identity provider. Or in OpenID, the user manually enters information that can be used to figure out the URL of the identity provider (see the associated dangers).

The next question is, “How does the identity provider return the response to the relying party?” As you might guess, the same redirection mechanism is used again in 4), but this time the identity provider fills out the Location header with the URL of the relying party, and the goop is the identity information required by the RP. As shown in 5), the browser responds to this redirection information by obediently posting back to the relying party.

Note that all of this can occur without the user being aware that anything has happened or having to take any action. For example, the user might have a cookie that identifies her to her identity provider. Then if she is sent through steps 2) to 4), she will likely see nothing but a little flicker in her status bar as different addresses flash by. (This is why I often compare redirection to a world where, when you enter a store to buy something, the sales clerk reaches into your pocket, pulls out your wallet and debits your credit card without you knowing what is going on — trust us…)

Since the identity provider is tasked with telling the browser where to send the response, it MUST know what relying party you are visiting. Because it fabricates the returned identity token, it MUST know all the contents of that token.

So, returning to the axes for linkability that we set up in Evolving Technology for Better Privacy, we see that from an identity point of view, the identity provider “sees all” – without the requirement for any collusion. Knowing each other's identity, the relying party and the identity provider can, in the absence of appropriate policy and suitable auditing, exchange any information they want, either through the redirection channel, or through a “back channel” that dispenses with the user and her browser altogether.

In fact all versions of SAML include an “artifact” binding intended to facilitate this. The intention of this mechanism is that only a “handle” need be exchanged through the browser redirection channel, with the assumption that the IP and RP can then hook up and use the handle to “collaborate” about the user without her participation.

In considering the use cases for which SAML was designed, it is important to remember that redirection was not originally designed to put the “user at the center”, but rather was “intended for cases in which the SAML requester and responder need to communicate using an HTTP user agent… for example, if the communicating parties do not share a direct path of communication.” In other words, an IP/RP collaboration use case.

As Paul Masden reminded us in a recent comment, SAML 2.0 introduced a new element called RelayState that provides another means for synchronizing or exchanging information between the identity provider and the relying party; again, this demonstrates the great amount of trust a user must place in a SAML identity provider.

There are other SAML bindings that vary slightly from the redirect binding described above (for example, there is an HTTP POST binding that gets around the payload size limitations involved with the redirected GET, as Pat Paterson has pointed out). But nothing changes in terms of the big picture. In general, we can say that the redirection protocols promote much greater visibility of the IP on the RPs than was the case with X.509.

I certainly do not see this as all bad. It can be useful in many cases – for example when you would like your financial institution to verify the identity of a commercial site before you release funds to it. But the important point is this: the protocol pattern is only appropriate for a certain set of use cases, reminding us why we need to move towards a multi-technology metasystem.

It is possible to use the same SAML payloads in more privacy-protecting ways by using a different wire protocol and putting more intelligence and control on the client. This is the case for CardSpace in non-auditing mode, and Conor Cahor points out that SAML's Enhanced Client or Proxy (ECP) Profile has similar goals. Privacy is one of the important reasons why evolving towards an “active client” has advantages.

You might ask why, given the greater visibility of IP on RP, I didn't put the redirection protocols at the extreme left of my identity technology privacy spectrum. The reason is that the probability of RP/RP collusion CAN be greatly reduced when compared to X.509 certificates, as I will show next.

Revealing patterns when there is no need to do so

Irving Reid of Controlled Flight into Terrain has come up with exactly the kind of use case I wanted to see when I was thinking about Paul Madsen's points:

Kim Cameron responds to Paul Madsen responding to Kim Cameron, and I wonder what it is about Canadians and identityâ€¦

But I have to admit that I have not personally been that interested in the use case of presenting â€œmanaged assertionsâ€ to amnesiac web sites. In other words, I think the cases where you would want a managed identity provider for completely amnesiac interactions are fairly few and far between. (If someone wants to turn me around me in this regard Iâ€™m wide open.)

Shibboleth, in particular, has a very clear requirement for this use case. FERPA requires that educational institutions disclose the least possible information about students, staff and faculty to their partners. The example I heard, back in the early days of SAML, was of an institution that had a contract with an on-line case law research provider such that anyone affiliated with the law school at that institution could look up cases.

In this case, the â€œmanaged identity providerâ€ (representing the educational institution) needs to assert that the person visiting right now is affiliated with the law school. However, the provider has no need to know anything more than that, and therefore the institution has a responsibility under FERPA to not give the provider any extra information. â€œThe person looking up Case X right now is the same person who looked up Case Y last weekâ€ is one of the pieces of information the institution shouldnâ€™t share with the provider.

Put this way it is obvious that it breaks the law of minimal disclosure to reveal that “the person looking up Case X right now is the same person who looked up Case Y last weekâ€ when there is no need to do so.

I initially didn't see that a pseudonymous link between Case X and Case Y would leak very much information. But on reflection, in the competitive world of academic research, these linkages could benefit an observer by revealing patterns the observer would not otherwise be aware of. He might not know whose research he was observing, but might nonetheless cobble a paper together faster than the original researcher, beating him in terms of publication date.

I'll include this example in discussing some of the collusion issues raised by various identity technologies.

Colluding with yourself

Further to Dave Kearn's article, here is the complete text of Paul Masden's comment:

Kim Cameron introduces a nice diagram into his series exploring linkability & correlation in different identity systems.

Kim categorizes correlation as either ‘IP sees all’, ‘RP/RP collusion’, or ‘RP/IP collusion’, depending on which two entities can ‘talk’ about the user.

A meaningful distinction for RP/RP collusion that Kim omits (at least in the diagram and in his discussion of X.509) is ‘temporal self-correlation’, i.e. that in which the same RP is able to correlate the same user's visits occurring over time.

Were an IDP to use transient (as opposed to persistent pseudonymous) identifiers within a SAML assertion each time it asserted to a RP, then not only would RP's be unable to collude with each other (based on that identifier), they'd be unable to collude with themselves (the past or future themselves).

I was working on a diagram comparable to Kim's, but got lost in the additional axis for representing time (e.g. ‘what the provider knows and when they learned it’ when considering collusion potential).

Separately, Kim will surely acknowledge at some point (or already has) that these identity systems, with their varying degrees of inhibiting correlation & subsequent collusion, will all be deployed in an environment that, by default, does not support the same degree of obfuscation. Not to say that designing identity systems to inhibit correlation isn't important & valuable for privacy, just that there is little point in deploying such a system without addressing the other vulnerabilities (like a masked bank robber writing his ‘hand over the money’ note on a monogrammed pad).

First, I love Paul's comment that he “got lost in the additional axis”, since there are many potential axes – some of which have taken me to the steps of purgatory. Perhaps we can collect them into a joint set of diagrams since the various axes are interesting in different ways.

Second, I want everyone to understand that I do not see correlation as being something which is in itself bad. It depends on the context, on what we are trying to achieve. When writing my blog, I want everyone to know it is “me again”, for better or for worse. But as I always say, I would like to be able to use my search engine and read my newspaper without fear that some profile of me, the real-world Kim Cameron, would be assembled and shared.

The one statement Paul makes that I don't agree with is this:

Were an IDP to use transient (as opposed to persistent pseudonymous) identifiers within a SAML assertion each time it asserted to a RP, then not only would RP's be unable to collude with each other (based on that identifier), they'd be unable to collude with themselves (the past or future themselves).

I've been through this thinking myself.

Suppose we got rid of the user identifier completely, and just kept the assertion ID that identifies a given SAML token (must be unique across time and space – totally transient). If the relying party received such a token and colluded with the identity provider, the assertionID could be used to tie the profile at the relying party to the person who authenticated and got the token in the first place. So it doesn't really prevent linking once you try to handle the problem of collusion.

No masks in the grocery store

Dave Kearns discusses the first part of my examination of the relation between identity technologies and linking, beginning with a reference to Paul Madsen:

Paul Madsen comments on Kim Cameron's first post in a series he's about to do on privacy and collusion in on-line identity-based transactions. He notes:

“A meaningful distinction for RP/RP collusion that Kim omits (at least in the diagram and in his discussion of X.509) is ‘temporal self-correlation’, i.e. that in which the same RP is able to correlate the same user's visits occurring over time.“

and concludes:

“Not to say that designing identity systems to inhibit correlation isn't important & valuable for privacy, just that there is little point in deploying such a system without addressing the other vulnerabilities (like a masked bank robber writing his ‘hand over the money’ note on a monogrammed pad).“

Paul makes some good points. Rereading my post I tweaked it slightly to make it somewhat clearer that correlating the same user's visits occuring over time is one possible aspect of linking.

But I have to admit that I have not personally been that interested in the use case of presenting “managed assertions” to amnesiac web sites. In other words, I think the cases where you would want a managed identity provider for completely amnesiac interactions are fairly few and far between. (If someone wants to turn me around me in this regard I'm wide open.) To me the interesting use cases have been those of pseudonymous identity – sites that respond to you over time, but are not linked to a natural person. This isn't to say that whatever architecture we come out with can simply ignore use cases people think are important.

Dave continues:

I'd like to add that Kim's posting seems to fall into what I call on-line fallacy #1 – the on-line experience must be better in some way than the “real world” experience, as defined by some non-consumer “expert”. This first surfaced for me in discussions about electronic voting (see Rock the Net Vote), where I concluded “The bottom line is that computerized voting machines – even those running Microsoft operating systems [Dave, mais vous Ãªtes trop mÃ©chant! – Kim]- are more secure and more reliable than any other ‘secret ballot’ vote tabulation method we've used in the past.”

When I re-visit a store, I expect to be recognized. I hope that the clerk will remember me and my preferences (and not have to ask “plastic or paper?” every single blasted time!). Customers like to be recognized when they return to the store. We appreciate it when we go to the saloon where “everybody knows your name” and the bartender presents you with a glass of “the usual” without you having to ask. And there is nothing wrong with that! It's what most people want. Fallacy #2 is that most Jeremiahs (those weeping, wailing, and tooth-gnashing doomsayers who wish to stop technology in it's tracks) think that what they want is what everyone should want, and would want if the hoi-polloi were only educated enough. (and people think I'm elitist! 🙂

I do wish that all those “anonymity advocates” would start trying to anonymize themselves in the physical world, too. So here's a test – next time you need to visit your bank, wear a mask. Be anonymous. But tell your lawyer to stand by the phone…

Dave, I think you are really bringing up an important issue here. But beyond the following brief comment, I would like to refrain from the discussion until I finish the technical exploration. I ask you to go with me on the idea that there are cases where you want to be treated like you are in your local pub, and there are cases where you don't. The whole world is not a pub – as much as that might have some advantages, like beer.

In the physical world we do leave impressions of the kind you describe. But in the digital world they can all be assembled and integrated automatically and communicated intercontinentally to forces unknown to you in a way that is just impossible in the physical world. There is absolutely no precedent for digital physics. We need to temper your proposed fallacies with this reality.

I'm trying to do a dispassionate examination of how the different identity technologies relate to linking, without making value judgements about use cases.

That done, let's see if we can agree on some of the digital physics versus physical reality issues.

Evolving technology for better privacy

Let's continue to explore linking, and of how it relates to CardSpace, identity protocols, token formats and cryptography.

I've summarized a number of thoughts in the following diagram, which contrasts the linking threats posed by a number of technology combinations. The diagram presents these technologies on an ordinal scale ranging from the most dangerous to the least – along the axis of linkage prevention.

X.509 with OCSP

Let's begin with Public Key Infrastructure (PKI) technology employed with X.509 user certificates.

Here the user has a key only she can “exercise”, and some Certificate Authority (CA) mints a long-lived certificate binding her key to her name, organization, country and the like. When the user visits a relying party who trusts the CA, she presents the certificate and exercises the key – typically by using it to sign a challenge created by the relying party.

In many cases, in addition to binding the key to attributes, the CA exists with the explicit mission of linking it to a “natural person” (in the sense of an identifiable person in the physical world). However, for now we'll leave the meaning of assertions aside and look only at how the technology itself impacts privacy.

Since the user presents the same long-lived certificate to every relying party who trusts the CA, the certificate and the information in it link the user accross sessions on one web site, and between one web site and another. Any two web sites obtaining this kind of certificate can compare notes and determine that the same user has visited each of them. This allows linkage of their profiles into a super-dossier (possibly including a super-dossier of a natural person).

What is good about X.509 is that if a relying party does not collude, the CA has no visibility onto the fact that a given user has visited it (we will see that in some other systems such visibility is unavoidable). But a relying party could at any point decide to collude with the CA (assuming the CA actually accepts such information, which may be a breach of policy). This might result in the transfer of information in either direction beyond that contained in the certificate itself.

So in the diagram, I express this through two risks of collusion. The first is between any two relying parties who receive the same certificate. The second is between any relying party and the certificate authority. In esssence, then, all participating parties can collude with any other party, so this represents one of the worst possible technology alternatives if privacy is your goal.

In light of this it makes sense that X.509 has been successful as a technology for public entities like corporate web sites, where correlation is actually a good thing, but not for individual identification where privacy is part of the equation.

(Continues tomorrow…).

Keys, signatures and linkability

Stefan Brands is contributing to the discussion of traceability, inkability and selective disclosure with a series of posts over at identity corner. He is one of the world's key innovators in the cryptography of unlinkability, so his participation is especially interesting.

Consider a user who self-generates several identity claims at different occassions, say â€œI am 25 years of ageâ€, â€œI am maleâ€, and â€œI am a citizen of Canadaâ€. The userâ€™s software packages these assertions into identity claims by means of attribute type/value pairs; for instance, claim 1 is encoded as â€œage = 25â€, claim 2 is â€œgender = 0â€, and claim 3 is â€œcitizenship = 1â€. Clearly, relying parties that receive these identity claims cannot trace them to their userâ€™s identity (whether that be represented in the form of a birth name, an SSN, or another identifier) by analyzing the presented claims; self-generated claims are untraceable. Similarly, they cannot decide whether or not different claims are presented by the same or by different users; self-generated claims are unlinkable.

Note that these two privacy properties (which are different but, as we will see in the next paragraph, complementary) hold â€œunconditionally;â€ no amount of computing power will enable relying parties to trace or link by analyzing incoming identity-data flows, not even if relying parties collude (indeed, they may be the same entity).

Now, consider the same self-generated identity claims, but this time their user â€œself-protectsâ€ them by means of a self-generated cryptographic key pair (e.g., a random RSA private key and its corresponding public key). The user digitally signs the identity claims with his private key; for example, claim 1 as presented to a relying party looks like â€œage = 25; PublicKey = 37AC986Bâ€¦; Signature = 21A4A5B6â€¦â€. Clearly, these self-protected claims are as untraceable as their unprotected cousins in the previous paragraph. Are they unlinkable? Well, that depends:

If the user applies the same key pair to all claims, then the public key that is present in the presented messages will be the same; thus, all presented identity claims are linkable. As a result, a relying party that receives all three claims over time knows that it is dealing with a 25-year old Canadian male. As the user over time presents more linkable claims, this may indirectly lead to traceability; for example, the relying party may be able to infer the userâ€™s birth name once the user presents a linkable identity claim that states the postal code of his home address.

If the user applies a different self-generated key pair to each identity claim, the three presented claims are as unlinkable and untraceable as in the example where no cryptographic data was appended. Note that this solution does notforce unlinkability and untraceability: in cases where the user should be identified, the user can simply provide a claim that specifies his name: â€œname=Jon Smithâ€ or â€œSSN-identifier=945278476â€, for instance. Similarly, to make self-generated identity claims linkable, an additional common attribute value can be encoded

This is a clear way to introduce the notion of how keys and signatures affect tracability and linkability of claims. However there is more to consider. Even if the user applies a different self-generated key pair for each of the three attributes discussed above, if the three attributes are transfered in a single transaction, they are still linked. The transaction itself links the attribute assertions. Convenyance of multiple claims is a very common case.

Similarly, if Stefan's three attributes are released during what can be considered to be the same session, they are linked, again regardless of the cryptography. And if they are released within a given time window from the same transport (IP) address, they should be considered linked too.

While cryptography is one factor contributing to linkability, we need to look at the protocol patterns and visibility they render possible as well. I'll be starting to do that in my next posting.

Linkage and identification

Inspired by some of Ben Laurie's recent postings, I want to continue exploring the issues of privacy and linkability (see related pieces here and here).

I have explained that CardSpace is a way of selecting and transferring a relevant digital identity – not a crypto system; and that the privacy characteristics involved depend on the nature of the transaction and the identity provider being used within CardSpace – not on CardSpace itself. I ended my last piece this way:

The question now becomes that of how identity providers behave. Given that suddenly they have no visibility onto the relying party, is linkability still possible?

But before zeroing in on specific technologies, I want to drill into two issues. First is the meaning of “identification”; and second, the meaning of “linkability” and its related concept of “traceability”.

Having done this will allow us to describe different types of linkage, and set up our look at how different cryptographic approaches and transactional architectures relate to them.

Identification

There has been much discussion of identification (which, for those new to this world, is not at all the same as digital identity). I would like to take up the definitions used in the EU Data Protection Directive, which have been nicely summarized here, but add a few precisions. First, we need to broaden the definition of “indirect identification” by dropping the requirement for unique attributes – as long as you end up with unambiguous identification. Second, we need to distinguish between identification as a technical phenomenon and personal identification.

This leads to the following taxonomy:

Personal data:
- any piece of information regarding an identified or identifiable natural person.
Direct Personal Identification:
- establishing that an entity is a specific natural person through use of basic personal data (e.g., name, address, etc.), plus a personal number, a widely known pseudo-identity, a biometric characteristic such as a fingerprint, PD, etc.
Indirect Personal Identification:
- establishing that an entity is a specific natural person through other characteristics or attributes or a combination of both – in other words, to assemble “sufficiently identifying” information
Personal Non-Identification:
- assumed if the amount and the nature of the indirectly identifying data are such that identification of the individual as a natural person is only possible with the application of disproportionate effort, or through the assistance of a third party outside the power and authority of the person responsible…

Translating to the vocabulary we often use in the software industry, direct personal identification is done through a unique personal identifier assigned to a natural person. Indirect personal identification occurs when enough claims are released – unique or not – that linkage to a natural person can be accomplished. If linkage to a natural person is not possible, you have personal non-identification. We have added the word “personal” to each of these definitions so we could withstand the paradox that when pseudonyms are used, unique identifiers may in fact lead to personal non-identification…

The notion of “disproportionate effort” is an important one. The basic idea is useful, with the proviso that when one controls computerized systems end-to-end one may accomplish very complicated tasks, computations and correlations very easily – and this does not in itself constitute “disproportionate effort”.

Linkability

If you search for “linkability”, you will find that about half the hits refer to the characteristics that make people want to link to your web site. That's NOT what's being discussed here.

Instead, we're talking about being able to link one transaction to another.

The first time I heard the word used this way was in reference to the E-Cash systems of the eighties. With physical cash, you can walk into a store and buy something with one coin, later buy something else with another coin, and be assured there is no linkage between the two transactions that is caused by the coins themselves.

This quality is hard to achieve with electronic payments. Think of how a credit card or debit card or bank account works. Use the same credit card for two transactions and you create an electronic trail that connects them together.

E-Cash was proposed as a means of getting characteristics similar to those of the physical world when dealing with electronic transactions. Non-linkability was the concept introduced to describe this. Over time it has become a key concept of privacy research, which models all identity transactions as involving similar basic issues.

Linkability is closely related to traceability. By traceability people are talking about being able to follow a transaction through all its phases by collecting transaction information and having some way of identifying the transaction payload as it moves through the system.

Traceability is often explicitly sought. For example, with credit card purchases, there is a transaction identifier which ties the same event together across the computer systems of the participating banks, clearing house and merchant. This is certainly considered “a feature.” There are other, subtler, sometimes unintended, ways of achieving traceability (timestamps and the like).

Once you can link two transactions, many different outcomes may result. Two transactions conveying direct personal identification might be linked. Or, a transaction initially characterized by personal non-identification may suddenly become subject to indirect personal identification.

To further facilitate the discussion, I think we should distinguish various types of linking:

Intra-transaction linking is the product of traceability, and provides visibility between the claims issuer, the user presenting the claims, and the relying party (for example, credit card transaction number).
Single-site transaction linking associates a number of transactions at a single site with a data subject. The phrase “data subject” is used to clarify that no linking is implied between the transactions and any “natural person”.
Multi-site transaction linking associates linked transactions at one site with those at another site.
Natural person linking associates a data subject with a natural person.

Next time I will use these ideas to help explain how specific crypto systems and protocol approaches impact privacy.