More on distributed query

Dave Kearns responded to my post on the Identity Bus with Getting More Violent All the Time (note to the Rating Board: he's talking about violent agreement… which is really rough):

What Kim fails to note… is that a well designed virtual directory (see Radiant Logic's offering, for example) will allow you to do a SQL query to the virtual tables! You get the best of both: up to date data (today's new hires and purchases included) with the speed of an SQL join. And all without having to replicate or synchronize the data. I'm happy, the application is happy – and Kim should be happy too. We are in violent agreement about what the process should look like at the 40,000 foot level and only disagree about the size and shape of the paths – or, more likely, whether they should be concrete or asphalt.

Neil Macehiter answers by making an important distinction that I didn't emphasize enough:

But the issue is not with the language you use to perform the query: it's where the data is located. If you have data in separate physical databases then it's necessary to pull the data from the separate sources and join them locally. So, in Kim's example, if you have 5000 employees and have sold 10000 computers then you need to pull down the 15000 records over the network and perform the join locally (unless you have an incredibly smart distributed query optimiser which works across heterogeneous data stores). This is going to be more expensive than if the computer order and employee data are colocated.
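
To make the cost concrete, here is a minimal sketch of the client-side join Neil describes; the stores, record shapes and counts below are illustrative stand-ins, not anyone's actual schema:

```python
# Stand-ins for what the client would have to pull over the network from two
# separate stores; names and shapes here are invented for illustration.
employees = [{"id": i, "name": f"employee-{i}"} for i in range(5000)]          # from the HR store
orders = [{"employee_id": i % 5000, "item": "laptop"} for i in range(10000)]   # from the order system

# With the data colocated, one server-side SQL statement does all of this work:
#   SELECT e.name, o.item FROM employees e JOIN orders o ON o.employee_id = e.id
# With separate physical stores, all 15,000 rows cross the network and the
# join runs in the application instead:
names_by_id = {e["id"]: e["name"] for e in employees}
joined = [{"name": names_by_id[o["employee_id"]], "item": o["item"]} for o in orders]
print(len(joined))  # 10000 rows, joined client-side
```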

Clayton Donley, who is the Senior Director of Development for Oracle Identity Management, understands exactly what I'm trying to get at and puts it well in this piece:

Dave Kearns has followed up on Kim Cameron's posting from Friday.

  1. Kim says that sometimes you need to copy data in order to join it with other data
  2. Dave says the same thing, except indicates that you wouldn't copy the data but just use “certain virtual directory functionality”

Actually, in #2, that functionality would likely be persistent cache, which if you look under the covers is exactly the same as a meta-directory in that it will copy data locally. In fact, the data may even be stored (again!) in a relational database (SQLServer in the Radiant Logic example he provides).
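
As a generic illustration of that point (not Radiant Logic's or anyone else's code; SQLite simply stands in for whatever relational store sits underneath), a persistent cache still amounts to copying source data into a local database:

```python
import sqlite3

def refresh_persistent_cache(source_rows, db_path="vd_cache.db"):
    """source_rows: (id, name, email) tuples pulled from the backing directories/databases."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS person (id TEXT PRIMARY KEY, name TEXT, email TEXT)")
    # Populating the cache means copying the source data locally -- the same
    # data movement a meta-directory performs, whatever the feature is called.
    conn.executemany("INSERT OR REPLACE INTO person VALUES (?, ?, ?)", source_rows)
    conn.commit()
    conn.close()

refresh_persistent_cache([("42", "Alice Example", "alice@example.com")])
```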

Let's use laser focus and only look at Kim's example of joining purchase orders with user identity.

Let's face it. Most applications aren't designed to go to one database when you're dealing solely with transactional data and another database when you're dealing with a combination of transactional data and identities.

If we model this through the virtual directory and indicate that every time an application joins purchase orders and identities it does so (even via SQL instead of LDAP) through the virtual directory, you've now said the following:

  1. You're okay with re-modelling all of these data relationships in a virtual directory — even those representing purchase order information.
  2. You're okay with moving a lot of identity AND transactional information into a virtual directory's local database.
  3. You're okay with making this environment scalable and available for those applications.

Unfortunately, this doesn't really hold up. There are a lot more issues, but even after just these first three (or even the first one) you begin to realize that while virtual directory makes sense for identity, it may not make sense as the ONLY way to get identity. I think the same thing goes for an identity hub that ONLY thinks in terms of virtualization.
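
To see what accepting those three points implies, here is a hypothetical sketch, with invented connection, schema and column names, of an application issuing the combined join against the virtual directory's SQL endpoint:

```python
def monthly_purchases_by_employee(virtual_dir_db, month):
    # For this single SQL join to work against the virtual directory, the
    # purchase-order relationships must be re-modelled there (point 1), the
    # transactional rows must land in its local store or cache (point 2), and
    # that store must be scaled and kept available for the application (point 3).
    return virtual_dir_db.execute(
        """
        SELECT p.display_name, SUM(o.amount)
        FROM orders o JOIN person p ON p.employee_id = o.employee_id
        WHERE o.month = ?
        GROUP BY p.display_name
        """,
        (month,),
    ).fetchall()
```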

The real solution here is a combination of virtualization with more standardized publish/subscribe for delivery of changes. This gets us away from this ad-hoc change discovery that makes meta-directories miserable, while ensuring that the data gets where it needs to go for transactions within an application.

I discourage people from thinking that metadirectory implies “ad-hoc change discovery”. That's a defect of various metadirectory implementations, not a characteristic of the technology or architecture. As soon as applications understand they are PART OF a wider distributed fabric, they could propagate changes using a publication pattern that retains the closed-loop verification of self-converging metadirectory.
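
Here is a minimal sketch of such a pattern; the event shape and subscriber interface are assumptions for illustration, not a specification. The publisher doesn't fire and forget: it reads each subscriber back and re-publishes until they have actually converged on the change.

```python
class Subscriber:
    """Toy consumer of change events with an assumed apply/read-back interface."""
    def __init__(self):
        self.view = {}

    def apply(self, change):
        self.view[change["attribute"]] = change["value"]

    def read_back(self, attribute):
        return self.view.get(attribute)


def publish_with_verification(change, subscribers, max_rounds=3):
    """Publish a change and keep the loop closed: verify convergence, retry if needed."""
    for _ in range(max_rounds):
        for s in subscribers:
            s.apply(change)
        # Closed-loop check: every subscriber must reflect the published value
        # before the change is considered delivered.
        if all(s.read_back(change["attribute"]) == change["value"] for s in subscribers):
            return True
    return False


hr_app, badge_system = Subscriber(), Subscriber()
print(publish_with_verification({"attribute": "title", "value": "Manager"}, [hr_app, badge_system]))  # True
```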

Published by Kim Cameron

One thought on “More on distributed query”

  1. If you want to see all of this in action you should talk to us about our XDI Implementation. MyXDI is an Abstract Data Access Engine that has many of the qualities that you have described and then some…

    We have optimized indexing across heterogeneous systems as well as optimized “bulkGet” that ensures the minimum number of queries are executed based on the pointer set retrieved from the index.
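
    In general terms, that index-then-bulk-fetch pattern looks something like the simplified sketch below; the lookup and fetch_many interfaces are invented for illustration and are not the actual engine APIs:

    ```python
    from collections import defaultdict

    def bulk_get(index, stores, search_term):
        # 1. Consult the local index for pointers: (store_name, record_id) pairs.
        pointers = index.lookup(search_term)
        # 2. Group the pointers by backing store...
        grouped = defaultdict(list)
        for store_name, record_id in pointers:
            grouped[store_name].append(record_id)
        # 3. ...so each heterogeneous store is hit with one bulk fetch rather
        #    than one query per record.
        results = []
        for store_name, ids in grouped.items():
            results.extend(stores[store_name].fetch_many(ids))
        return results
    ```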

    We have implemented the Higgins IDAS framework so that data sources can be added in an “open standards based framework”. We have extended the Higgins APIs to include some things that we consider “necessary” for enterprise deployment of such a system. You can use our extensions or not.

    One of the “necessary extensions” that we have added becomes clear when you look beyond “getting” information, into “updating” information. We feel it is necessary, and have therefore implemented, multi-phase commit transactional interfaces so that you can do an update across heterogeneous systems and end up in a deterministic state.
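
    In outline, a two-phase commit across heterogeneous stores works like the generic sketch below; the prepare/commit/rollback participant interface is invented for illustration rather than taken from the actual product:

    ```python
    class InMemoryStore:
        """Toy participant exposing an invented prepare/commit/rollback interface."""
        def __init__(self):
            self.data, self.pending = {}, None

        def prepare(self, change):
            self.pending = change           # stage the change; vote yes
            return True

        def commit(self, change):
            self.data[change["key"]] = change["value"]
            self.pending = None

        def rollback(self, change):
            self.pending = None


    def two_phase_commit(participants, change):
        prepared = []
        for p in participants:              # phase 1: every store votes
            if p.prepare(change):
                prepared.append(p)
            else:                           # any "no" vote aborts the whole update
                for q in prepared:
                    q.rollback(change)
                return False
        for p in participants:              # phase 2: unanimous yes, commit everywhere
            p.commit(change)
        return True


    print(two_phase_commit([InMemoryStore(), InMemoryStore()], {"key": "email", "value": "andy@example.com"}))  # True
    ```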

    We have semantic/system mapping built into the core of the engine.

    And you know the best thing is that we have a demo of this running with all of your technologies working together…

    Our cross-platform C/C++ engine (currently deployed in .NET) uses (Oracle's) BerkeleyDB in the back end for the persistence of indexes and caches. Our demo, which runs on top of the engine, demonstrates access to distributed heterogeneous data stores and then exposes the “virtually combined” dataset as… a UI, an hCard, OpenSocial (our whole app is an OpenSocial container; we will also support the OpenSocial REST APIs), a Managed InfoCard (ID-WSF People Service and LDAP coming very soon).

    And the fun thing is that the engine is built with identity at its core. Access to data is based strictly on a “people have rights too” architecture. Currently we use OAuth and/or ID-WSF for the user permissioning that establishes either long-lived or short-lived Link Contracts.

    I could go on, but if you want to know more you can check out my blog at http://xditao.blogspot.com or contact me via my iPage at http://xri.net/=andy (where you can log in using either your InfoCard, your OpenID or an email address).
