Dave Kearns has expanded further on his view of distributed data, metadirectory and virtual directory. It seems like some of our disagreement is a matter of terminology. Dave grudgingly admits (poor Linus and his blanket!) that application developers should be permitted to use databases:
The application database (for those who cling to it like Linus and his blanket) now can serve two purposes – one to subscribe to virtual directory data and one to publish!
The question becomes whether we need more than publish / subscribe relationships between services. I think we do. It is this higher level (meta level) of service and information that I call metadirectory.
Let's make it clear that I see metadirectory as an evolving thing.
- First generation metadirectory dealt exclusively with managing applications that had been conceived without reference to each other – or to any common framework. (In truth, this is still an issue – see Jeff Bohren's recent posting called “Which is better, Phillips or Flat-head?”)
- Second generation metadirectory has an additional focus: providing the framework by which next-generation applications can become part of the distributed data infrastructure. This includes publishing and subscription. But that isn't enough. Other applications need ways to find it, name it, and so on.
A real distributed information architecture requires services that join objects across contexts, arbitrate truth, advertise schema possibilities and provide the grid through which virtual directory queries can be dispatched.
These services are what I call metadirectory – the framework for distributed storage. One may choose to call the queries in this framework “virtual directory”. But such “virtual directory” requires a “real” framework.
Dave suggests we read a piece called “The second wave: Linking identities to contexts” by Michel Prompt (CEO of Radiant Logic). It is good and I recommend it to everyone. It raises many issues that are worth thinking about:
If for each application, we can find the unique identifier associated with a person, and we can speak the application-specific protocol (LDAP, RDBMS, API, Web services, etc.) then we can retrieve a specific identity profile associated with that person when we need it. Knowing an identifier and its associated protocol is sufficient to access a specific definition of an identity.
Common access alone, however, is not correlation. It will not tell us that UserId A is in fact EmployeeId 235, and that both underlying profiles are aspects of the identity of Person Y.
Some correlation mechanism thus needs to be deployed, based possibly on matching some common attributes for each profile. If no rules can be produced, then the matching must be done manually, a painstaking process but in many cases unavoidable for at least a subset of the identity data.
Michel has started to talk about the metadata needed to create a framework for distributed query. Some service needs to know that “UserId A is in fact EmployeeId 235”. That is clearly glue that creates a “directory of directories”. Michel might call it a “directory of contexts”, but I don't think the difference is substantive.
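To make the correlation step concrete, here is a minimal sketch in Python. It assumes two hypothetical sources (an LDAP-style directory and an HR database) that happen to share a `mail` attribute. The names `JoinEntry`, `correlate` and `normalize` are invented for illustration and do not come from Michel's piece or from any product. Profiles that no rule can match are set aside for the manual process he describes.

```python
from dataclasses import dataclass, field


@dataclass
class JoinEntry:
    """One person, plus the local identifier used in each source (hypothetical)."""
    global_id: str
    local_ids: dict = field(default_factory=dict)  # source name -> local identifier


def normalize(value):
    """Make attribute values comparable across sources."""
    return value.strip().lower()


def correlate(ldap_profiles, hr_profiles, match_attr="mail"):
    """Join profiles from two sources on one shared attribute.

    Returns (joined, unmatched). `unmatched` holds the LDAP profiles that
    no rule could place, i.e. the ones left for manual matching.
    """
    hr_by_attr = {
        normalize(p[match_attr]): p for p in hr_profiles if p.get(match_attr)
    }
    joined, unmatched = [], []
    for profile in ldap_profiles:
        hr = hr_by_attr.get(normalize(profile.get(match_attr, "")))
        if hr is None:
            unmatched.append(profile)
            continue
        entry = JoinEntry(global_id=f"person-{len(joined) + 1}")
        entry.local_ids["ldap"] = profile["UserId"]
        entry.local_ids["hr"] = hr["EmployeeId"]
        joined.append(entry)
    return joined, unmatched


# Assumed sample data: UserId A and EmployeeId 235 are the same person.
ldap = [
    {"UserId": "A", "mail": "Y@example.com"},
    {"UserId": "B"},  # no mail attribute, so the rule cannot apply
]
hr = [{"EmployeeId": 235, "mail": "y@example.com "}]

joined, unmatched = correlate(ldap, hr)
```

The list of `JoinEntry` records is the "glue": it stores nothing about the person except which local identifier stands for them in each context.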
A directory of directories: metadirectory
By defining such a process we can create a “hub” where each person has a “global identifier” associated with the corresponding “local” source identifiers (e.g. UserId A, EmployeeId 235, etc.) If this virtual hub has the capability to write back to each source, we can use it to manage the account/identity life-cycle for each source. And when we need any specific aspect of an identity, we can retrieve it dynamically using the Identity Hub pointer.
Hmmm. Michel calls it a “hub”, not a metadirectory. But it is the same thing.
Since our Identity Hub is stripped down to the minimum information required, the amount of synchronization and data transformation (complex tasks by definition) is reduced to the strict minimum. Only the different (local) references for components of a given identity are stored or synchronized. When we need a specific aspect of identity, we can retrieve it dynamically using the Identity Hub pointer, and the common virtual access layer.
If data transformation is a complex task, it is because there are different ways of representing data in the distributed system. If that's the case, the problem doesn't go away with a virtual directory – it gets worse! The application that calls into a first data source gets its representation, and if it then calls into a second data source, it gets a second representation. The application is now on its own to figure out what is what. Far from simplifying matters, this means the complex transforms need to be done in more locations.
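A small sketch of the representation problem, in Python. The two records and their field names are assumptions made up for illustration: the same person as an HR database and an LDAP-style directory might each describe them. Every application that reads both sources directly has to carry its own copy of functions like `from_hr` and `from_ldap`.

```python
# Hypothetical raw records for the same person from two sources.
hr_record = {"surname": "Smith", "given_name": "Pat", "dept_code": "0042"}
ldap_record = {"cn": "Pat Smith", "departmentNumber": "42"}


def from_hr(record):
    """Map the assumed HR representation onto a common shape."""
    return {
        "display_name": f"{record['given_name']} {record['surname']}",
        "department": int(record["dept_code"]),
    }


def from_ldap(record):
    """Map the assumed directory representation onto the same common shape."""
    return {
        "display_name": record["cn"],
        "department": int(record["departmentNumber"]),
    }


# The raw records disagree on names and formats; the transformed ones agree.
common_from_hr = from_hr(hr_record)
common_from_ldap = from_ldap(ldap_record)
```

Whether these transforms run once in a shared service or again in every calling application is exactly the design choice at issue.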
In terms of synchronization, the proposal made by Michel and Dave is good for some use cases but not right for others. Again, we need to support a spectrum of choices.
You don't always want to synchronize a common identifier. Especially when working with identity data that is in danger of breach and insider attack, it is a better strategy to use different identifiers in different systems, so knowledge of the “joining glue” is required in order to assemble information across contexts (for example, personal information and financial information).
And sometimes, you want to synchronize more than just an identifier.
A conversation like this needs real examples. In most enterprises, the Human Resources Database is the authoritative source for information on employees. We want our email address books and mail stores and message transfer agents to be up to date with the latest HR information.
According to the argument being made by Dave and Michel, all our address books and all our mail switches and mail boxes should be sending each query directly into the “authoritative” human resources database.
But everyone with any experience in the enterprise knows the people who run the HR databases WILL NOT go for this. They don't want all the technical systems of the enterprise hitting on their systems in real time with every possible query.
My point here is that it will be necessary to offload information from the HR system to other systems. No one can look seriously at these issues without admitting that SOME synchronization is required (which admittedly should be real time). On the other hand, we don't want parallel unrelated architectures.
So we are led to the conclusion that we need a spectrum of synchronization and remote access capabilities. We should be able to use policy to define what information is stored where, and how to get to information that is not stored locally – that is, to combine metadirectory and virtual directory functionality.
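One way to picture such a policy is sketched below in Python. Everything here is an assumption for illustration: the `AttributePolicy` and `Resolver` classes, the `"sync"` and `"virtual"` mode names, and the in-memory `HR_DATA` standing in for a real HR system. Attributes marked `"sync"` are copied out of the source ahead of time; attributes marked `"virtual"` are fetched from the source on each request.

```python
from dataclasses import dataclass


@dataclass
class AttributePolicy:
    mode: str    # "sync" = keep a local copy, "virtual" = query the source on demand
    source: str  # name of the authoritative source


class Resolver:
    """Hypothetical resolver that applies a per-attribute storage policy."""

    def __init__(self, policies, sources):
        self.policies = policies  # attribute name -> AttributePolicy
        self.sources = sources    # source name -> callable(person_id, attr)
        self.local_store = {}     # (person_id, attr) -> synchronized value
        self.remote_calls = 0     # how often an authoritative source was hit

    def _fetch(self, person_id, attr):
        self.remote_calls += 1
        lookup = self.sources[self.policies[attr].source]
        return lookup(person_id, attr)

    def synchronize(self, person_id):
        """Copy every "sync" attribute into the local store."""
        for attr, policy in self.policies.items():
            if policy.mode == "sync":
                self.local_store[(person_id, attr)] = self._fetch(person_id, attr)

    def get(self, person_id, attr):
        """Serve "sync" attributes locally and "virtual" ones from the source."""
        if self.policies[attr].mode == "sync":
            return self.local_store[(person_id, attr)]
        return self._fetch(person_id, attr)


# Assumed stand-in for the HR database.
HR_DATA = {"235": {"mail": "pat@example.com", "salary": 50000}}


def hr_lookup(person_id, attr):
    return HR_DATA[person_id][attr]


policies = {
    "mail": AttributePolicy(mode="sync", source="hr"),
    "salary": AttributePolicy(mode="virtual", source="hr"),
}
resolver = Resolver(policies, {"hr": hr_lookup})
resolver.synchronize("235")

mail_first = resolver.get("235", "mail")   # served from the local copy
mail_again = resolver.get("235", "mail")   # still no call to HR
salary = resolver.get("235", "salary")     # goes to HR each time
```

In this sketch the address-book style lookups never touch HR after synchronization, while the sensitive attribute stays only in its authoritative source. A real deployment would also need change notification to keep the synchronized copies fresh, which is left out here.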