Google's Ben Laurie has a revealing link to correspondence published by the National Auditing Office relating to HMRC's recent identity disaster.
He also explains that the practice of publishing “redacted texts” is itself outmoded in light of the kinds of statistical attacks that can now be mounted. He concludes that, “those who are entrusted with our data have absolutely no idea of the threats it faces, nor the countermeasures one should take to avoid those threats.”
In the wake of the HMRC disaster (nicely summarised by Kim Cameron), the National Audit Office has published scans of correspondence relating to the lost data.
First of all, it's notable that everyone concerned seems to be far more concerned about cost than about privacy. But an interesting question arises in relation to the redactions made to protect the “innocent”. Once more, NAO and HMRC have shown their lack of competence in these matters…
A few years ago it was a popular pastime to recover redacted data from such documents, using a variety of techniques, from the hilarious cut'n'paste attacks (where the redacted data had not been removed, merely covered over with black graphics) to the much more interesting typography related attacks. The way these work is by working backwards from the way that computers typeset. For each font, there are lookup tables that show exactly how wide each character is, and also modifications for particular pairs of characters (for example, “fe” often has less of a gap between the characters than would be indicated by the widths of the two letters alone). This means that if you can accurately measure the width of some text it is possible to deduce which characters must have made up the text (and often what order those characters must appear in). Obviously this isn't guaranteed to give a single result, but often gives a very small number of possibilities, which can then be further reduced by other evidence (such as grammar or spelling).
It seems HMRC and NAO are entirely ignorant of these attacks, since they have left themselves wide open to them. For example, on page 5 of the PDF, take the first line “From: redacted (Benefits and Credits)”. We can easily measure the gap between “:” and “(“, which must span a space, one or more words (presumably names) and another space. From this measurement we can probably make a good shortlist of possible names.
Even more promising is line 3, “cc: redacted@…”. In this case the space between the : and the @ must be filled by characters that make a legal email address and contain no spaces. Another target is the second line of the letter itself “redacted has passed this over to me for my views”. Here we can measure the gap between the left hand margin and the first character of “has” – and fit into that space a capital letter and some other letters, no spaces. Should be pretty easy to recover that name.
And so on.
This clearly demonstrates that those who are entrusted with our data have absolutely no idea of the threats it faces, nor the countermeasures one should take to avoid those threats.