Dave Kearns, Jackson Shaw, Dave Olds and I had a good old time talking with Felix Gaehtgens about the “identity bus”. I had a real “aha” during the interview while I was talking with Dave about why synchronization and replication are an important part of the bus. I realized part of the disconnect we've been having derives from the differing “big problems” each of us finds himself confronted with.
As infrastructure people, one of our main goals is to get over our “information chaos” headaches… These have become even worse as the requirements of audit and compliance have matured. Storing information in one authoritative place (and one only) seems to be a way to get around these problems. We can then retrieve the information through web service queries and drastically reduce complexity…
What does this worldview make of application developers who don't want to make their queries across the network? Well, there must be something wrong with them… They aren't hip to good computing practices… Eventually they will understand the error of their ways and “come around”…
But the truth is that the world of query looks different from the point of view of an application developer.
Let's suppose an application wants to know the name corresponding to an email address. It can issue a query to a remote web service or LDAP directory and get an answer back immediately. All is well and accords with our ideal view.
But the questions application developers want to answer aren't always of the simple “do a remote search in one place” variety.
Sometimes an application needs to do complex searches involving information “mastered” in multiple locations. I'll make up a very simple “two location” example to demonstrate the issue:
“What purchases of computers were made by employees who have been at the company for less than two years?”
Here we have to query “all the purchases of computers” from the purchasing system, and “all employees hired within the last two years” from the HR system, and find the intersection.
Although the intersection might only represent a few records, performing this query remotely and bringing down each result set is very expensive. No doubt many computers have been purchased in a large company, and a lot of people are likely to have been hired in the last two years. If an application has to perform this type of query with great efficiency and within a controlled response time, the remote query approach of retrieving all the information from many systems and working out the intersection may be totally impractical.
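To make the cost concrete, here is a minimal Python sketch of the remote approach. Everything in it is an assumption made for illustration: `fetch_computer_purchases` and `fetch_recent_hires` are hypothetical stand-ins for calls to the purchasing and HR systems, and the data is synthetic. The point is only that the application has to pull down both full result sets before it can work out the intersection itself.

```python
from datetime import date, timedelta

TODAY = date(2008, 1, 1)  # fixed date so the sketch is deterministic


def fetch_computer_purchases():
    """Hypothetical remote call to the purchasing system.

    Returns every computer purchase as (purchase_id, employee_id).
    """
    return [(p, p % 5000) for p in range(20000)]


def fetch_recent_hires(today=TODAY, years=2):
    """Hypothetical remote call to the HR system.

    Returns the ids of employees hired within the last `years` years.
    In this synthetic data, employee e was hired e days before `today`.
    """
    cutoff = today - timedelta(days=365 * years)
    return {e for e in range(5000) if today - timedelta(days=e) > cutoff}


def purchases_by_recent_hires():
    # Both result sets must cross the network in full...
    purchases = fetch_computer_purchases()
    recent = fetch_recent_hires()
    # ...before the application can compute the intersection locally.
    matches = [(pid, emp) for pid, emp in purchases if emp in recent]
    transferred = len(purchases) + len(recent)
    return matches, transferred


matches, transferred = purchases_by_recent_hires()
print(len(matches), "matching rows;", transferred, "rows pulled over the network")
```

Even in this toy version, the number of rows transferred is several times the number of rows the application actually wanted, and neither remote system can use knowledge of the other's data to narrow what it sends.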
Compare this to what happens if all the information necessary to respond to a query is present locally in a single database. I just do a “join” across the tables, and the SQL engine understands exactly how to optimize the query so the result involves little computing power and “even less time”. Indexes are used and distributions of values well understood: many thousands of really smart people have been working on these optimizations in many companies for the last 40 years.
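As a rough sketch of the local alternative, the same question can be put to a single database as one join. The example below uses Python's built-in `sqlite3` module with an in-memory database; the table names, column names, indexes and sample rows are all assumptions chosen for illustration, not a description of any real purchasing or HR schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical local schema holding synchronized copies of HR and purchasing data.
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, hire_date TEXT);
CREATE TABLE purchases (id INTEGER PRIMARY KEY, employee_id INTEGER, item TEXT);
CREATE INDEX idx_employees_hire_date ON employees(hire_date);
CREATE INDEX idx_purchases_item ON purchases(item, employee_id);
""")

conn.executemany("INSERT INTO employees VALUES (?, ?, ?)", [
    (1, "Alice", "2007-06-01"),
    (2, "Bob", "2001-03-15"),
    (3, "Carol", "2006-11-20"),
])
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", [
    (10, 1, "computer"),
    (11, 2, "computer"),
    (12, 3, "desk"),
])

# One join: the engine decides how to use the indexes and in what order
# to combine the tables; the application only states what it wants.
rows = conn.execute("""
    SELECT p.id, e.name
    FROM purchases AS p
    JOIN employees AS e ON e.id = p.employee_id
    WHERE p.item = 'computer'
      AND e.hire_date >= date(?, '-2 years')
    ORDER BY p.id
""", ("2008-01-01",)).fetchall()

print(rows)
```

Only the matching rows ever reach the application, and the filtering on both sides happens inside the query engine rather than in application code.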
So, to summarize, distributed databases (or queries done through distributed services) are not appropriate for all purposes. Doing certain queries in a distributed fashion works, while in other cases it leads to unacceptable performance.
The result is that many application developers “don't want to go there” – at least some of the time. Yet their applications must be part of the identity fabric. That is why the identity metasystem has to include application databases populated through synchronization and business rules.
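What “populated through synchronization and business rules” might look like in miniature is sketched below, again in Python with `sqlite3`. This is purely illustrative: `fetch_hr_changes` is a hypothetical feed from the authoritative HR system, and the rule that only active employees are kept locally is an assumed business rule, not one taken from any particular product.

```python
import sqlite3


def fetch_hr_changes():
    """Hypothetical feed of changed records from the authoritative HR system."""
    return [
        {"id": 1, "name": "Alice", "hire_date": "2007-06-01", "status": "active"},
        {"id": 2, "name": "Bob", "hire_date": "2001-03-15", "status": "terminated"},
    ]


def sync_employees(conn, changes):
    """Apply upstream changes to the application's local copy.

    Assumed business rule: only active employees are kept locally.
    """
    for rec in changes:
        if rec["status"] == "active":
            # Insert the record, or overwrite the local copy if it already exists.
            conn.execute(
                "INSERT OR REPLACE INTO employees (id, name, hire_date) VALUES (?, ?, ?)",
                (rec["id"], rec["name"], rec["hire_date"]),
            )
        else:
            conn.execute("DELETE FROM employees WHERE id = ?", (rec["id"],))
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, hire_date TEXT)")
conn.execute("INSERT INTO employees VALUES (2, 'Bob', '2001-03-15')")  # stale local row

sync_employees(conn, fetch_hr_changes())
local = conn.execute("SELECT id, name FROM employees ORDER BY id").fetchall()
print(local)
```

The HR system remains the authoritative source; the application's table is a derived copy that exists so that joins like the one discussed earlier can run locally.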
On another note, I recommend the interview with Dave Kearns on the importance of context to identity.