What Could Google Do With the Data It's Collected?

Niraj Chokshi has published a piece in The Atlantic where he grapples admirably with the issues related to Google's collection and use of device fingerprints (technically called MAC Addresses).  It is important and encouraging to have journalists like Niraj taking the time to explore these complex issues.  

But I have to say that such an exploration is really hard right now. 

Whether on purpose or by accident, the Google PR machine is still handing out contradictory messages.  In particular, the description in Google's Refresher FAQ titled “How does this location database work?” is currently completely different from (read: the opposite of) what its public relations people are telling journalists like Nitaj.  I think reestablishing credibility around location services requires the messages to be made consistent so they can be verified by data protection authorities.

Here are some excerpts from the piece – annotated with some comments by me.  [Read the whole article here.] 

The Wi-Fi data Google collected in over 30 countries could be more revealing than initially thought…

Google's CEO Eric Schmidt has said the information was hardly useful and that the company had done nothing with it. The search giant has also been ordered (or sought) to destroy the data. According to their own blog post, Google logged three things from wireless networks within range of their vans: snippets of unencrypted data; the names of available wireless networks; and a unique identifier associated with devices like wireless routers. Google blamed the collection on a rogue bit of code that was never removed after it had been inserted by an engineer during testing.

[The statement about rogue code is an example of the PR ambiguity Nitaj and other journalists must deal with.  Google blogs don't actually blame the collection of unique identifiers on rogue code, although they seem crafted to leave people with that impression.  Spokesmen only blame rogue code for the collection of unencrypted data content (e.g. email messages.) – Kim]

Each of the three types of data Google recorded has its uses, but it's that last one, the unique identifier, that could be valuable to a company of Google's scale. That ID is known as the media access control (MAC) address and it is included — unencrypted, by design — in any transfer, blogger Joe Mansfield explains.

Google says it only downloaded unencrypted data packets, which could contain information about the sites users visited. Those packets also include the MAC address of both the sending and receiving devices — the laptop and router, for example.

[Another contradiction: Google PR says it “only” collected unencrypted data packets, but Google's GStumbler report  says its cars did collect and record the MAC addresses from encrypted data frames as well. – Kim]

A company as large as Google could develop profiles of individuals based on their mobile device MAC addresses, argues Mansfield:

Get enough data points over a couple of months or years and the database will certainly contain many repeat detections of mobile MAC addresses at many different locations, with a decent chance of being able to identify a home or work address to go with it.

Now, to be fair, we don't know whether Google actually scrubbed the packets it collected for MAC addresses and the company's statements indicate they did not. [Yet the GStumbler report says ALL MAC addresses were recorded – Kim].  The search giant even said it “cannot identify an individual from the location data Google collects via its Street View cars.”  Add a step, however, and Google could deduce an individual from the location data, argues Avi Bar-Zeev, an employee of Microsoft, a Google competitor.

[Google] could (opposite of cannot) yield your identity if you've used Google's services or otherwise revealed it to them in association with your IP address (which would be the public IP of your router in most cases, visible to web servers during routine queries like HTTP GET). If Google remembered that connection (and why not, if they remember your search history?), they now have your likely home address and identity at the same time. Whether they actually do this or not is unclear to me, since they say they can't do A but surely they could do B if they wanted to.

Theoretically, Google could use the MAC address for a mobile device — an iPod, a laptop, etc. — to build profiles of an individual's activity. (It's unclear whether they did and Google has indicated that they have not.) But there's also value in the MAC addresses of wireless routers.

Once a router has been associated with a real-world location, it becomes useful as a reference point. The Boston company Skyhook Wireless, for example, has long maintained a database of MAC addresses, collected in a (slightly) less-intrusive way. Skyhook is the primary wireless positioning system used by Apple's iPhone and iPod Touch. (See a map of their U.S. coverage here.) When your iPod Touch wants to retrieve the current location, it shares the MAC addresses of nearby routers with Skyhook which pings its database to figure out where you are.

Google Latitude, which lets users share their current location, has at least 3 million active users and works in a similar way. When a user decides to share his location with any Google service on a non-GPS device, he sends all visible MAC addresses in the vicinity to the search giant, according to the company's own description of how its location services works.

[Update: Google's own “refresher FAQ” states that a user of its geo-location services, such as Latitude, sends all MAC addresses “currently visible to the device” to Google, but a spokesman said the service only collects the MAC addresses of routers. That FAQ statment is the basis of the following argument.]

This is disturbing, argues blogger Kim Cameron (also a Microsoft employee), because it could mean the company is getting not only router addresses, but also the MAC addresses of devices such as laptops and iPods. If you are sitting next to a Google Latitude user who shares his location, Google could know the address and location of your device even though you didn't opt in. That could then be compared with all other logged instances of your MAC address to develop a profile of where the device is and has been.

Google denies using the information it collected and, if the company is telling the truth, then only data from unencrypted networks was intercepted anyway, so you have less to worry about if your home wireless network is password-protected. (It's still not totally clear whether only router MAC addresses were collected. Google said it collected the information for devices “like a WiFi router.”) Whether it did or did not collect or use this information isn't clear, but Google, like many of its competitors, has a strong incentive to get this kind of location data.

[Again, and I really do feel for Niraj, the PR leaves the impression that if you have passwords and encryption turned on you have nothing to worry about, but Googles’ GStumbler report says that passwords and encryption did not prevent the collection of the MAC addresses of phones and laptops from homes and businesses. – Kim]

I really tuned in to these contradictory messages when a reader first alerted me to Niraj's article.   It looked like this:

My comments earned their strike-throughs when a Google spokesman assured the Atlantic “the Service only collects the MAC addresses of routers.”  I pointed out that my statement was actually based on Google's own FAQ, and it was their FAQ (“How does this location database work?”) – rather than my comments – that deserved to be corrected.  After verifying that this was true, Niraj agreed to remove the strikethrough.

How can anyone be expected to get this story right given the contradictions in what Google says it has done?

In light of this, I would like to see Google issue a revision to its “Refresher FAQ” that currently reads:

The “list of MAC addresses which are currently visible to the device” would include the addresses of nearby phones and laptops.  Since Google PR has assured Niraj that “the service only collects the MAC addresses of routers”, the right thing to do would be to correct the FAQ so it reads:

  • “The user’s device sends a request to the Google location server with the list of MAC addresses found in Beacon Frames announcing a Network Access Point SSID and excluding the addresses of end user devices like WiFi enabled phones and laptops.”

This would at least reassure us that Google has not delivered software with the ability to track non-subscribers and this could be verified by data protection authorities.  We could then limit our concerns to what we need to do to ensure that no such software is ever deployed in the future.

 

National Strategy for Trusted Identities in Cyberspace

Friday saw what I think is a historic post by Howard Schmidt on The Whitehouse Blog:

“Today, I am pleased to announce the latest step in moving our Nation forward in securing our cyberspace with the release of the draft National Strategy for Trusted Identities in Cyberspace (NSTIC).  This first draft of NSTIC was developed in collaboration with key government agencies, business leaders and privacy advocates. What has emerged is a blueprint to reduce cybersecurity vulnerabilities and improve online privacy protections through the use of trusted digital identities. “

I say the current draft is historic because of the grasp of identity issues it achieves

At the core of the document is a recognition that we need a solution supporting privacy-enhancing technologies and built by harnessing a user-centric Identity Ecosystem offering citizens and private enterprise plenty of choice.  

Finally we have before us a proposal that can move society forward in  protecting individual privacy and simultaneously create a secure and trustworthy infrastructure with enough protections to be resistant to insider attacks.  

Further, the work appears to have support from multiple government agencies – the Department of Homeland Security was a key partner in its creation. 

Here are the guiding principles (beginning page 8):

  • Identity solutions will be secure and resilient
  • Identity solutions will be interoperable
  • Identity solutions will be privacy enhancing and voluntary for the public
  • Identity solutions will be cost-effective and easy to use

Let's start with the final “s” on the word “solutions” – a major achievement.  The authors understand society needs a spectrum of approaches suitable for different use cases but fitting within a common interoperable framework – what I and others have called an identity metasystem. 

The report embraces the need for anonymous access as well as that for strong identification.  It stands firmly in favor of minimal disclosure.  The authors call out the requirement that solutions be privacy enhancing and voluntary for the public, rather than attempting to ram something bureaucratic down peoples’ throats.  And they are fully cognisant of the practicality and usability requirements for the initiative to be successful.  A few years ago I would not have believed this kind of progress would be possible.

Nor is the report just a theoretical treatment devoid of concrete proposals.  The section on “Commitment to Action” includes:

  • Designate a federal agency to lead the public/private sector efforts to advance the vision
  • Develop a shared, comprehensive public/private sector implementation plan
  • Accelerate the expansion of government services, pilots and policies that align with the identity ecosystem
  • Work to implement enhanced privacy protections
  • Coordinate the development and refinement of risk management and interoperability standards
  • Address liability concerns of service providers and individuals
  • Perform outreach and awareness across all stakeholders
  • Continue collaborating in international efforts
  • Identify other means to drive adoption

Readers should dive into the report – it is in a draft stage and “Public ideas and recommendations to further refine this Strategy are encouraged.”  

A number of people and organizations in the identity world have participated in getting this right, working closely with policy thinkers and those leading this initiative in government.  I don't hesitate to say that congratulations are due all round for getting this effort off to such a good start.

We can expect suggestions to be made strengthening various aspects of the report – mainly in terms of making it more internally consistent.  

For example, the report contains good vignettes about minimal disclosure and the use of claims to gain access to resources.  Yet it also retains the traditional notion that authentication is dependent on identification.  What is meant by identification?  Many will assume it means “unique identification” in the old-fashioned sense of associating someone with an identifier.  That doesn't jive with the notion of minimal disclosure present throughout the report.  Why? For many purposes association with an identifier is over-identification or unhelpful, and a simple proof of some set of claims would suffice to control access.  

But these refinements can be made fairly easily.  The real challenge will be to actually live up to the guiding principles as we move from high level statements to a widely deployed system – making it truly secure, resilient and privacy enhancing.  These are guiding principles we can use to measure our success and help select between alternatives.

 

The Consumerist says “Apple is Watching”

A reader has pointed me to this article in The Consumerist (“Shoppers bite back”) about Apple's new privacy policy


Schmegga

Apple updated its privacy policy today, with an important, and dare we say creepy new paragraph about location information. If you agree to the changes, (which you must do in order to download anything via the iTunes store) you agree to let Apple collect store and share “precise location data, including the real-time geographic location of your Apple computer or device.”

Apple says that the data is “collected anonymously in a form that does not personally identify you,” but for some reason we don't find this very comforting at all. [Good instinct ! – Kim]. There appears to be no way to opt-out of this data collection without giving up the ability to download apps.

Here's the full text [Emphasis is mine – Kim]:

Location-Based Services

“To provide location-based services on Apple products, Apple and our partners and licensees may collect, use, and share precise location data, including the real-time geographic location of your Apple computer or device. This location data is collected anonymously in a form that does not personally identify you and is used by Apple and our partners and licensees to provide and improve location-based products and services. For example, we may share geographic location with application providers when you opt in to their location services.

Some location-based services offered by Apple, such as the MobileMe “Find My iPhone” feature, require your personal information for the feature to work. “

I wonder how The Consumerist will feel when it figures out how this change ties in to the new world-wide databases linking device identifiers and home addresses?

The consumerist piece is dated June 21, 2010 9:50 PM, and seems to confirm that the change in policy has only been made public since Google's WiFi shenanigans have been discovered by data protection authorities… The point about “no opt out” is very important too.

Apple giving out your iPhone fingerprints and location

I went to the Apple App store a few days ago to download a new iPhone application.  I expected that this would be as straightforward as it had been in the past: choose a title, click on pay, and presto – a new application becomes available.

No such luck.  Apple had changed it's privacy policy, and I was taken to the screen at right,  To proceed I had to “read and accept the new Terms and Conditions”.  I pressed OK and up came page 1 of a new 45 page “privacy” policy.

I would assume “normal people” would say “uncle” and “click approve” around page 3.  But in light of what is happening in the industry around location services I kept reading the tiny, unsearchable, unzoomable print.

And there – on page 37 – you come to “the news”.  Apple's new “privacy” policy reveals that if you use Apple products Apple can disclose your device fingerprints and location to whomever it chooses and for whatever purpose:

Collection and Use of Non-Personal Information

We also collect non-personal information – data in a form that does not permit direct association with any specific individual. We may collect, use, transfer, and disclose non-personal information for any purpose. The following are some examples of non-personal information that we collect and how we may use it:

  • We may collect information such as occupation, language, zip code, area code, unique device identifier, location, and the time zone where an Apple product is used so that we can better understand customer behavior and improve our products, services, and advertising.

No “direct association with any specific individual…”

Maintaining that a personal device fingerprint has “no direct association with any specific individual” is unbelievably specious in 2010 – and even more ludicrous than it used to be now that Google and others have collected the information to build giant centralized databases linking phone MAC addresses to house addresses.  And – big surprise – my iPhone, at least, came bundled with Google's location service.

The irony here is a bit fantastic.  I was, after all, using an “iPhone”.  I assume Apple's lawyers are aware there is an “I” in the word “iPhone”.  We're not talking here about a piece of shared communal property that might be picked up by anyone in the village.  An iPhone is carried around by its owner.  If a link is established between the owner's natural identity and the device (as Google's databases have done), its “unique device identifier” becomes a digital fingerprint for the person using it. 

Apple's statements constitute more disappointing doubletalk that is suspiciously well-aligned with the statements in Google's now-infamous WiFi FAQ.  Checking with the “Wayback machine” (which is of course not guaranteed to be accurate or up to date) the last change recorded in Apple's privacy policy seems to have been made in April 2008.  It contained no reference to device identifiers or location services. 

 

ID used to save “waggle dance”

MSN reports on a fascinating use of tracking:

Bees are being fitted with tiny radio ID tags to monitor their movements as part of research into whether pesticides could be giving the insects brain disorders, scientists have revealed

The study is examining concerns that pesticides could be damaging bees’ abilities to gather food, navigate and even perform their famous “waggle dance” through which they tell other bees where nectar can be found.

I can't help wondering if wearing an antenna twice one's size might also throw off one's “waggle dance”? There is too the question of how this particular bee gets back into its hive to be tracked another day.  But I leave those questions to the researchers.

 

Digital copiers – a privacy and security timebomb

Everyone involved with software and services should watch this remarkable investigative report by CBS News and think about what it teaches us.

Nearly every digital copier built since 2002 contains a hard drive storing an image of every document copied, scanned, or emailed by the machine.  Because of this, the report shows, an office staple has turned into a digital time-bomb packed with highly-personal or sensitive data.  To quote the narrator, “If you're in the identity theft business it seems this would be a pot of gold.”

In the video, the investigators purchase some used machines and then John Juntunen of Digital Copier Security shows them what is still stored on them when they are resold.  As he says, “The type of information we see on these machines with the social security numbers, birth certificates, bank records, income tax forms… would be very valuable.”   He's been trying to warn people about the potential risk, but “Nobody wants to step up and say, ‘we see the problem, and we need to solve it.'”

The results obtained by the investigators in their random sample are stunning, turning up:

  • detailed domestic violence complaints;
  • a list of wanted sex offenders;
  • a list of targets in a major drug raid;
  • design plans for a building near Ground Zero in Manhattan;
  • 95 pages of pay stubs with names, addresses and social security numbers;
  • $40,000 in copied checks; and
  • 300 pages of individual medical records including everything from drug prescriptions, to blood test results, to a cancer diagnosis.

Why are these records sitting around on the hard disk in the first place?  Why aren't they deleted once the copy has been completed or within some minimal time?  If they are kept for audit purposes, why aren't they encrypted for the auditor? 

Is this “rainy-day data collection?” Gee, we have a hard disk, why don't we keep the scans around – they might come in useful sometime. 

It becomes clear that addressing privacy and security threats was never a concern in designing these machines – which are actually computer systems.  This was an example of “privacy chernoble by design”.  Of course I'm speaking not only about individual privacy, but that of the organizations using the machines as well.   The report makes it obvious that digital copiers, or anything else that collects or remembers information, must be designed based on the Law of Minimal Disclosure.

This story also casts an interesting light on what the French are calling “le droit à l'oubli” – the right to have things forgotten.   Most discussions I've seen call for this principle to be applied on the Internet.  But as the digital world collides with the molecular one, we will see the need to build information lifetimes into all digital systems, including smart systems in our environment.  The current and very serious problems with copiers should be seen as profoundly instructive in this regard.

[Thanks to Francis Shanahan for heads up] 

Harvesting phone and laptop fingerprints for its database

In The core of the matter at hand I gave the example of someone attending a conference while subscribed to a geo-location service.  I argued that the subscriber's cell phone would pick up all the MAC addresses (which serve as digital fingerprints) of nearby phones and laptops and send them in to the centralized database service, which would look them up and potentially use the harvested addresses to further increase its knowledge of people's behavior – for example, generating a list of those attending the conference.

A reader wrote to express disbelief that the MAC addresses of non-subscribers would be collected by a company like Google.  So I close this series on WiFi device identifiers with this quote from what Google calls its “refresher FAQ” (emphasis in the quote below is mine):  

How does this location database work?

Google location based services using WiFi access point data work as follows:

  • The user’s device sends a request to the Google location server with a list of MAC addresses which are currently visible to the device;
  • The location server compares the MAC addresses seen by the user’s device with its list of known MAC addresses, and identifies associated geocoded locations (i.e. latitude / longitude);
  • The location server then uses the geocoded locations associated with visible MAC address to triangulate the approximate location of the user;
  • and this approximate location is geocoded and sent back to the user’s device.

So certainly the MAC addresses of all nearby phones and laptops are sent in to the geo-location server – not simply the MAC addresses of wireless access points that are broadcasting SSIDs.  And this is significant from a technical point of view.

Why not edit out the MAC addresses you don't need prior to transmission, reducing transmission size, cost and the amount of work that the central database server must do? Clearly, it was considered useful to collect all the phone fingerprints – including those of non-subscribers.  Of course Google's  WiFi cars also collect the same fingerprints – while driving past peoples’ homes.  So it is clearly possible for their system to match the fingerprints of non-subscribers to their home locations, and thus to their natural identities. 

Is this matching of non-subscribers being done today?  I have no idea.  But Google has put in place all the machinery to do it and pays a premium to operate its geolocation service so as to gather this information.  Further, if allowed to mature, the market for the extra intelligence collected about our behaviors will be immense.

So there is nothing unlikely about the scenario I describe.   I have now examined all the issues I wanted to bring to light and I'll move on to other matters for a while.

 

Trip down memory lane

Joe Mansfield's comment that Bluetooth “doesn’t appear to be all that bad from a privacy leakage perspective” left me rummaging through memory lane – awakening memories that may help explain why I now believe that world-wide databases of MAC addresses constitute a central socio-technical problem of our time.

I was taken back to an unforgettable experience I had in 2005 while working on the Laws of Identity.  I had finished the Fourth Law and understood theoretically why technical systems should use “unidirectional identifiers” (meaning identifiers limited to a defined context) rather than “universal identifiers” (things like social security numbers) unless the goal was to be completely public.  But there is a difference between understanding something theoretically and right in the gut.

Rather than retell the story, here is what I wrote on my blog in Just a few scanning machines on Tuesday 6 September 2005:

Since I seem to be on the subject of Bluetooth again, I want to tell you about an experience I had recently that put a gnarly visceral edge on my opposition to technologies that serve as tracking beacons for us as private individuals.

I was having lunch in San Diego with Paul Trevithick, Stefan Brands and Mary Rundle. Everyone knows Paul for his work with Social Physics and the Berkman identity wiki; Stefan is a tremendously innovative privacy cryptographer; and Mary is pushing the envelope on cyber law with Berkman and Stanford.

Suddenly Mary recalled the closing plenary at the Computers, Freedom and PrivacyPanopticon Conference” in Seattle.

She referred off-handedly to “the presentation where they flashed a slide tracking your whereabouts throughout the conference using your Bluetooth phone.”

Essentially I was flabbergasted. I had missed the final plenary, and had no idea this had happened.

MAC Name Room Time Talk
Kim Cameron Mobile
00:09:2D:02:9A:68
Grand I (G1) Wed 09:32 09:32 ????
Grand Crescent (gc) Wed 09:35 09:35 Adware and Privacy: Finding a Common Ground
Grand I (G1) Wed 09:37 09:37 ????
Grand Crescent (gc) Wed 09:41 09:42 Adware and Privacy: Finding a Common Ground
Grand I (G1) Wed 09:46 09:47 ????
Grand III (g3) Wed 10:18 10:30 Intelligent Video Surveillance
Baker (ol) Wed 10:33 10:42 Reforming E-mail and Digital Telephonic Privacy
Grand III (g3) Wed 10:47 10:48 Intelligent Video Surveillance
Grand Crescent (gc) Wed 11:25 11:26 Adware and Privacy: Finding a Common Ground
Grand III (g3) Wed 11:46 12:22 Intelligent Video Surveillance
5th Avenue (5a) Wed 12:33 12:55 ????
Grand III (g3) Wed 13:08 14:34 Plenary: Government CPOs: Are they worth fighting for?

Of course, to some extent I'm a public figure when it comes to identity matters, and tracking my participation at a privacy conference is, I suspect, fair game. Or at any rate, it's good theatre, and drives home the message of the Fourth Law, which makes the point that private individuals must not be subjected – without their knowledge or against their will – to technologies that create tracking beacons.

A picture named kim_cameron.JPGLater Mary introduced me to Paul Holman from The Shmoo Group. He was the person who had put this presentation together, and given our mutual friends I don't doubt his motives. In fact, I look forward to meeting him in person.

He told me:

“I take it you missed our quick presentation, but essentially, we just put Bluetooth scanning machines in a few of the conference rooms and had them log the devices they saw. This was a pretty unsophisticated exercise, showing only devices in discoverable mode. To get them all would be a lot more work. You could do the same kind of thing just monitoring for cell phones or WiFi devices or whatever. We were trying to illustrate a crude version of what will be possible with RFIDs.”

The Bluetooth tracking was tied in to the conference session titles, and by clicking on a link you could see the information represented graphically – including my escape to a conference center window so I could take a phone call.

Anyway, I think I have had a foretaste of how people will feel when networks of billboards and posters start tracking their locations and behaviors. They won't like it one bit. They'll push back.

A foretaste indeed

One of my readers wrote to say I should turn my Bluetooth broadcast off, and I responded:

You’re right, and I have turned it off. Which bothers me. Because I like some of the convenience I used to enjoy.

So I write about this because I’d rather leave my Bluetooth phone enabled, interacting only with devices run by entities I’ve told it to cooperate with.

We have a lot of work to do to get things to this point. I see our work on identity as being directed to that end, at least in part.

We need to be able to easily express and select the relationships we want to participate in – and avoid – as cyberspace progressively penetrates the world of physical things.

The problems of Bluetooth all exist in current Wifi too. My portable computer broadcasts another tracking beacon. I’m not picking on Bluetooth versus other technologies. Incredibly, they all need to be fixed. They’re all misdesigned.

If anything has shocked me while working on the Laws of Identity, it has been the discovery of how naive we’ve been in the design of these systems to date – a product of our failure to understand the Fourth Law of Identity. The potential for abuse of these systems is collosal – enterprises like the UK’s Filter are just the most benign tip of an ugly iceberg.

For everyone’s sake I try to refrain from filling in what the underside of this iceberg might look like

Google's Street View group, which has been assembling a massive central registry of WiFi MAC addresses, has definitely crawled out from under this iceberg, and the project is more sinister than any I imagined only a few years ago.

But so as not to leave everyone feeling completely depressed, all the dreams of Billboards that recognize you from your Bluetooth phone have now been abandoned by Bluetooth manufacturers, and the specification has been greatly improved in light of the criticism it received.  Let's hope that geo-location providers, and Google in particular, see the same light, and assure us they will no longer collect or store the MAC address of any device unless that collection is approved by the subscriber.

What does a MAC address tell you?

Joe Mansfield at Peccavi has published a nice, clear and abridged explanation of the issues I've been discussing over the last few weeks.  

But before doing that he makes an important and novel point about why regulation may be useful even if it can't “prevent all abuses”:

I’d discounted the payload snooping issue as a distraction because I’d believed (and still do) that it was almost certainly an unfortunate error. I’d then made the point that a legal barrier to a technical problem was insufficient to prevent the bad guys doing bad things but I used that as an excuse to ignore the problem – small scale abuses of this sort of thing are not good but systematic large scale abuses “benefit” from network scaling effects. You might not be able to prevent small scale\illegal abuse through legal means but just because you can’t does not mean that you can’t control large scale abuses this way. The benefits and dangers inherent in this data become exponentially worse as the scale of the database that contains it increases. Large scale means companies and companies react to regulation by being much more careful about what they do. If a technology that is already out there has major privacy issues the regulatory approach is the only way to keep a lid on the problem while the technologists argue about how to fix the bits. Even if we assume that the law was OK about companies creating Geo-location databases using WiFi SSID\MAC mapping, effective regulation would have made the additional mistake made by Google (assuming it was a mistake) much less likely.

Next he explains how WiFi works as a layered protocol in which MAC addresses are exposed despite encryption and SSID suppression:

Now the obvious question is should scanning for identifiers that are broadcast openly by all WiFi radio signals be acceptable and legal?

802.11 WiFi signals are pretty complex things – Wikipedia has a brief overview here for those who want to see the alphabet soup of standards involved. Despite the range of encoding\modulation schemes and the number of frequency bands and channels almost all 802.11 devices revert to a couple of basic communication modes. This makes it easy for devices to connect to each other, and it’s what makes public WiFi hotspots practical. However it also makes configuring a device to monitor WiFi traffic trivially easy – the hardware does all the heavy lifting and the standards don’t really do anything to stop it happening. An important feature of WiFi is that, even though the payload encryption standards can now be pretty robust, the data link layer is not protected from snooping. This means that the content (my Google searches, the video clip I’m streaming down from Youtube etc) can be pretty well kept away from prying eyes but, at what the Ethernet folks call layer 2, the logical structures called frames that carry your encrypted data transmit some control data in the open.

So even with WPA2’s thorough key management and AES encryption your WiFi traffic still contains quite a bit of chatter that isn’t hidden away. The really critical thing for me is that the layer 2 addresses, the Media Access Control (MAC) addresses, of the sender and receiver (generally your PC\Phone’s WiFi adaptor and your Access Point) for each frame are always visible. And remember that MAC addresses are globally unique identifiers by design. Individual WiFi networks are defined by another identifier, the Service Set Identifier or SSID – when you set up your home WiFi AP and call the network “MyWLAN” you are choosing an SSID. SSID’s are very important, you can’t connect to a wireless LAN without knowing the relevant SSID, but they are not secure even though they can be sort of hidden they are never protected and can always be seen by someone just watching your wireless traffic. Interestingly SSID’s are not globally unique – there’s generally no real issue so long as my chosen SSID doesn’t match that of another network that’s relatively close by.

So SSID’s are possibly visible but MAC addresses are definitely visible, and MAC addresses are unique. While driving along a street or sitting in a coffee shop, hotel lobby or conference room your WiFi adaptor will see dozens if not hundreds of WiFi packets all of which will contain globally unique MAC addresses. It is possible to hack some WiFi hardware to change the MAC address but that practice is rare. Your PC has a couple (one for the wired Ethernet adaptor which isn’t important here, and usually one for WiFi these days), your Wii\PS3\XBox-360 has one, so does your Nintendo DS, iPhone, PSP … you get the picture. Another feature of MAC addresses is that it is very easy to differentiate between the MAC address of a Linksys Access Point, an iPhone and a Nintendo DS – Network protocol analyzers have been doing that trick for decades.

So the systematic scanners out there (Google, Navizon, Skyhook and the rest) can drive around or recruit volunteers and gather location data and build databases of unique identifiers, device types, timestamps, signal strengths and possibly other data. The simplest (and most) benign use of that would be to pull out the ID’s of devices that are known to be fixed to one place (Access Points say) and use that for enabling Geo-location.

Joe then looks at what it means to start collecting and analyzing the MAC addresses of mobile devices.

It’s not a big leap to also track the MAC addresses that are more mobile. Get enough data points over a couple of months or years and the database will certainly contain many repeat detections of mobile MAC addresses at many different locations, with a decent chance of being able to identify a home or work address to go with it. Kim Cameron describes the start of this cascade effect in his most recent post, mapping the attendees at a conference to home addresses even when they’ve never consented to any such tracking is not going to be hard if you’ve gone to the trouble of scanning every street in every city in the country. With a minor bit of further analysis the same techniques could be used to get a good idea of the travel or shopping habits of almost everyone sitting in an airport departure lounge or the home addresses of everyone participating in a Stop The War protest.

And remember that even though you can only effectively use WiFi to send and receive data over a range of a few 10’s to maybe a 100m you can detect and read WiFi signals easily from 100’s to 1000’s of metres away without any special equipment.

The plans to blanket London with “Free WiFi” start to sound quite disturbing when you think about those possibilities.

To answer my own title question – MAC addresses can tell far more about you than you think and keeping databases of where and when they’ve been seen can be extremely dangerous in terms of privacy.

Finally, he compares WiFi to Bluetooth:

Bluetooth is a slightly different animal. It’s also a short range radio standard for data communications but it was developed from the ground up to replace wires and the folks building the standard got a lot of stuff right. It doesn’t appear to be all that bad from a privacy leakage perspective – when implemented correctly nothing is sent in clear text (the entire frame is encoded, not just the payload) and the frequency hopping RF behaviour makes it much harder to casually snoop on specific conversations. Bluetooth devices have a Bluetooth Device ID that is very like a MAC address (48 bits), with a manufacturer ID that enables broad classification of devices if the ID can be discovered but most Bluetooth devices keep that hidden most of the time by defaulting to a “not visible” mode even when Bluetooth is enabled. When actively communicating (paired) all data is encrypted so the device ID’s are not visible to a third party. Almost all modern Bluetooth devices only allow themselves to remain openly visible in this way for a short period of time before they revert to a safer non broadcasting mode. The main weakness is that when devices are set to “visible” the unique identifiers and other data can be scanned remotely and used in just the same way as scanned WiFi MAC addresses. That’s not to say that Bluetooth doesn’t have its share of security problems but they made an attempt to get some of the fundamentals right. It does also show that there is a practical way to approach the wireless privacy challenge which is good to see.

All in all a very nice explanation of the issues involved here.   The only thing I would add is that the early versions of Bluetooth had few of the privacy-respecting behaviors present in the recent specifications.  The consortium has really worked to clean up its act and we should all congratulate it.  This came about because privacy concerns came to be perceived as an adoption blocker. 

Does the non-content trump the content?

In my previous post I referred to an interesting Wired story in which former U.S. federal prosecutor Paul Ohm says Google “likely” breached a U.S. federal criminal statute by intercepting the metadata and address information on residential and business WiFi networks.  The statute refers to a “pen register” – an electronic device that records all numbers dialed from a particular telephone line.  Wikipedia tells us the term has come to include any device or program that performs similar functions to an original pen register, including programs monitoring Internet communications.”  The story continues:

“I think it’s likely they committed a criminal misdemeanor of the Pen Register and Trap and Traces Device Act,” said Ohm, a prosecutor from 2001 to 2005 in the Justice Department’s Computer Crime and Intellectual Property Section. “For every packet they intercepted, not only did they get the content, they also have your IP address and destination IP address that they intercepted. The e-mail message from you to somebody else, the ‘to’ and ‘from’ line is also intercepted.”

“This is a huge irony, that this might come down to the non-content they acquired,” (.pdf) said Ohm, a professor at the University of Colorado School of Law.

I understand how people unacquainted with the emerging role of identity in the Internet can see this as an irony – a kind of side-effect – whereas in reality Google's plan to establish a vast centralized database of device identifiers has much longer-term consequences than the misappropriation of content.  Metadata is no less important than other data –  and “addresses” being referred to are really device identifiers clearly associated with individual users, much like the telephone numbers to which the statute applies.  Given the similarity to issues that arose with pre-Internet communication, we should perhaps not be surprised that there may already be regulation in place that prevents “registering” of the identifiers.

The Wired article continues:

Google said it was a coding error that led it to sniff as much as 600 gigabytes of data across dozens of countries as it was snapping photos for its Street View project. The data likely included webpages users visited and pieces of e-mail, video and document files…

The pen register act described by Ohm, which he said is rarely prosecuted, is usually thought of in terms of preventing unauthorized monitoring of outbound and inbound telephone numbers.

Violations are a misdemeanor and cannot be prosecuted by private lawyers in civil court, Ohm said. He said the act requires that Google “knew, or should have known” of the activity in question.

Google denies any wrongdoing.

In fact, Google knew about the collection of MAC addresses, and has never said otherwise or stated that their collection of these addresses was done accidently.  In fact they have been careful to never state explicitly that their collection was limited to Wireless Access Points.  The Gstumbler report makes it clear they were parsing and recording both the source and destination MAC addresses in all the WiFi frames they intercepted. 

The Wired article explains:

As far as a criminal court goes, it is not considered wiretapping “to intercept or access an electronic communication made through an electronic communication system that is configured so that such electronic communication is readily accessible to the general public.”

It is not known how many non-password-protected Wi-Fi networks there are in the United States.

What makes this especially interesting is the fact that it is not possible to configure a WiFi network so that the MAC addresses are hidden.  Use of passwords protects the communication content carried by the network, but does not protect the MAC addresses.  Configuring the WIreless Access Point not to broadcast an SSID does not prevent eavesdropping on MAC addresses either.   Yet we can hardly say the metadata is readily accessible to the general public, since it cannot be detected except acquiring and using very specialized programs. 

Wired draws the conclusion that,  “The U.S. courts have not clearly addressed the issue involved in the Google flap.”