Friday, September 24, 2010

McCreary's Law

I am just returning from a three-day conference on Enterprise Information Management. I was very pleasantly surprised to meet many other smart and warm independent consultants who do metadata and EIM consulting. More posts about what they taught me will be coming.

But these conferences force me to answer the same question over and over again: what do I do, and why don't I use traditional RDBMS systems to store metadata? What I need is a strong statement that explains the core principle behind the need for agility in metadata management. So in that context, I would like to suggest a law of agile software development. To give it a label, I will just use McCreary's Law for now.

McCreary's Law states the following:

The agility of any software project is inversely proportional to the square of the number of data transformations in the developer stack.
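
Written as a formula (the symbols are my shorthand, not part of the law itself), with A for agility and T for the number of data transformations in the stack:

    A \propto \frac{1}{T^2}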

The number of data transformations in the traditional three-tier architecture (web to objects to RDBMS to objects and back to the web) is four.

The number of data transformations in the traditional three-tier architecture plus XML web services is six, since web services are usually created by one transformation from objects to XML and one from XML back to objects.

The number of data transformations in XRX is approximately one half. One could argue that XML to HTML is only a half transformation, since the tag names differ but the tree structure is preserved. And if you are not changing the order of elements, any XML can be styled directly with CSS, which is not really a transformation at all. Counting XRX as one half rather than zero also gets around the problem of infinite developer agility when the denominator is zero.
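
Plugging these counts into the formula (dropping the proportionality constant) gives a rough sense of the relative agility of each stack:

    A_{\text{3-tier}} = \frac{1}{4^2} = \frac{1}{16} \qquad
    A_{\text{3-tier+WS}} = \frac{1}{6^2} = \frac{1}{36} \qquad
    A_{\text{XRX}} = \frac{1}{(1/2)^2} = 4

By this admittedly crude measure, an XRX stack is 64 times more agile than a plain three-tier stack and 144 times more agile than three-tier with XML web services.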

This formula is not perfect, and I am sure there will be modifications to it, like adding a constant for training and IT resistance to change, but I hope people can use it to predict or explain their current development agility.

Thanks to everyone I met at the EIM conference who listened to my rants and has not dismissed me as a complete metadata radical.

Sunday, September 12, 2010

XLucene

For the past two years I have been quietly observing the growth of something called Structured Search. This is the use of document structure to aid in search and retrieval of documents or sub-documents in large document collections.

I have also seen a common solution to this problem: a combination of a native XML database and the Lucene keyword indexing library. But the people doing this are doing far more than simple keyword search, so calling this new approach simply “Lucene” keyword search does not really describe what is going on. So I am proposing a new meme: XLucene.

I define XLucene as using a combination of document structure and keyword searches together to create very precise search ranking or search hit scores. The key is that the design must be able to store each branch in the document tree as a separate subdocument.
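
As a concrete sketch, here is roughly what such a query looks like using eXist-db's Lucene-based full-text extension functions. The collection path and element names below are placeholders for your own document structure, and the example assumes a Lucene index has been configured on the title element in the collection's collection.xconf:

    (: find sections whose titles match the keyword query,
       ranked by their Lucene relevance score :)
    for $section in collection('/db/articles')//section[ft:query(title, 'metadata registry')]
    order by ft:score($section) descending
    return $section

The structural part of the query (which elements to search) and the keyword part (the Lucene query string) combine in a single expression, which is exactly the problem-solution pair described above.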

The key factor in widespread adoption is that almost anyone who can identify the tag names in their document structures will soon be able to create their own XLucene search and retrieval systems.

What first got me interested in “Structured Retrieval” was the book “Introduction to Information Retrieval” by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. In this book they present the following table:

                      RDB search        unstructured retrieval   structured retrieval
objects               records           unstructured documents   trees with text at leaves
model                 relational model  vector space & others    ?
main data structure   table             inverted index           ?
queries               SQL               free text queries        ?
The book follows the table with this quote: “There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries.”

I think the jury has returned its verdict: XQuery wins! Here is how I would fill in the structured retrieval column of that table:

Objects: trees with text at leaves
Model: hierarchical documents
Main data structure: trees and inverted indexes tied to node-ids
Queries: W3C XQuery with full-text extensions

The design of using variable-length node-ids as the document identifiers was done by Wolfgang Meier of the eXist-db.org project, and many people are using this design today. I will be following up this posting with case studies of how people are applying it in several areas: the government sector, library metadata, and document search and retrieval.
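
To illustrate the idea (the numbering below is schematic, not eXist's exact on-disk encoding), every node in the document tree gets a hierarchical id that encodes its full ancestor path:

    <book>                      <!-- node-id 1     -->
      <title>...</title>        <!-- node-id 1.1   -->
      <chapter>                 <!-- node-id 1.2   -->
        <section>...</section>  <!-- node-id 1.2.1 -->
      </chapter>
    </book>

Because an inverted-index entry points at a node-id such as 1.2.1, a keyword hit can be combined with structural conditions (for example, only hits inside sections of chapter 1.2) without re-walking the tree.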

Since so many people are using this same problem-solution pair, it deserves, like all good patterns, its own name. XLucene (with no embedded dash) can be picked up by search engines, and we can all start to share design experience and solutions. This is why the Open Source model really is superior to closed systems: knowledge just travels faster once the memes are created.

I will also be presenting more details on the general topic of Structured Search and XLucene at the Enterprise Information Management Conference in Toronto next Tuesday, Sept 21st.