Sunday, September 12, 2010

XLucene

For the past two years I have been quietly observing the growth of something called Structured Search. This is the use of document structure to aid in search and retrieval of documents or sub-documents in large document collections.

I have also seen a common solution to this problem: combining a native XML database with the Lucene keyword indexing library. But people doing this are doing far more than simple keyword search, so calling this new approach “Lucene” keyword search does not really describe what is going on. So I am proposing a new meme: XLucene.

I define XLucene as using a combination of document structure and keyword searches together to create very precise search ranking or search hit scores. The key is that the design must be able to store each branch in the document tree as a separate subdocument.
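To make the idea of combining structure with keywords concrete, here is a minimal Python sketch. It is only an illustration, not how Lucene or any XML database actually scores hits: the tag names, the boost values in TAG_BOOST and the function score_document are all assumptions I made up for the example. Each element is treated as its own small subdocument, and a keyword hit counts for more when it occurs inside a more important tag.

```python
import xml.etree.ElementTree as ET

# Hypothetical per-tag boosts; a real system would tune these per collection.
TAG_BOOST = {"title": 3.0, "abstract": 2.0, "para": 1.0}
DEFAULT_BOOST = 0.5  # assumed weight for any tag not listed above

def score_document(xml_text, term):
    """Sum boosted term counts over every element, treating each as a subdocument."""
    root = ET.fromstring(xml_text)
    term = term.lower()
    total = 0.0
    for elem in root.iter():
        # Use only the element's own leading text, so a hit is credited to one branch.
        words = (elem.text or "").lower().split()
        total += TAG_BOOST.get(elem.tag, DEFAULT_BOOST) * words.count(term)
    return total

doc_a = "<doc><title>Lucene search</title><para>Notes on indexing.</para></doc>"
doc_b = "<doc><title>Notes</title><para>We tried lucene once.</para></doc>"

print(score_document(doc_a, "lucene"))  # hit in the title
print(score_document(doc_b, "lucene"))  # hit in a paragraph
```

Both documents contain the keyword exactly once, but the one with the hit in its title scores higher. That difference is the whole point of using structure in the ranking.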

The key factor in widespread adoption is that almost anyone who can identify the tag names in their document structures will soon be able to create their own XLucene search and retrieval systems.

What first got me interested in “Structured Retrieval” was the book “Introduction to Information Retrieval” by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. In this book they present the following table:

                     RDB search        unstructured retrieval   structured retrieval
objects              records           unstructured documents   trees with text at leaves
model                relational model  vector space & others    ?
main data structure  table             inverted index           ?
queries              SQL               free text queries        ?

The following quote follows this table: “There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries.”

I think that the jury has returned its verdict. XQuery wins! Here is the final column of that table, with the question marks filled in:

Objects: Trees with text at leaves
Model: Hierarchical documents
Main-data-structure: trees and inverted indexes tied to node-ids
Queries: W3C XQuery with fulltext extensions
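As a rough illustration of what “inverted indexes tied to node-ids” can mean, here is a small Python sketch. It assumes dotted, variable-length ids such as 1.2.1 that spell out the path from the root; this is a simplification for the example and not the actual encoding or API of any particular database. The names build_index and in_subtree are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

def build_index(xml_text):
    """Map each lowercase word to the set of node ids whose own text contains it."""
    index = defaultdict(set)

    def walk(elem, node_id):
        for word in (elem.text or "").lower().split():
            index[word].add(node_id)
        # Children are numbered from 1 and appended to the parent's id.
        for pos, child in enumerate(elem, start=1):
            walk(child, f"{node_id}.{pos}")

    walk(ET.fromstring(xml_text), "1")
    return index

def in_subtree(node_id, ancestor_id):
    """True if node_id is ancestor_id itself or lies below it in the tree."""
    return node_id == ancestor_id or node_id.startswith(ancestor_id + ".")

xml_text = (
    "<book>"
    "<chapter><title>Search</title><para>structured search</para></chapter>"
    "<chapter><para>other text</para></chapter>"
    "</book>"
)
index = build_index(xml_text)

# Keyword hits restricted to the first chapter (node id 1.1).
hits = sorted(n for n in index["search"] if in_subtree(n, "1.1"))
print(hits)
```

Because each id encodes the path to its node, restricting a keyword search to one branch of the document becomes a simple prefix test on the ids returned by the index.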

The design of using variable-length node-ids as the document identifiers was done by Wolfgang Meier of the eXist-db.org project. Many people are now using this design. I will be following up this posting with some case studies of how people are using it in several areas, including the government sector, library metadata, and document search and retrieval.

So many people are using this same problem-solution pair that, like all good patterns, it deserves its own name. XLucene (with no embedded dash) can be picked up by search engines and we can all start to share design experience and solutions. This is why the Open Source model really is superior to closed systems. Knowledge just travels faster once the memes are created.

I will also be presenting more details on the general topic of Structured Search and XLucene at the Enterprise Information Management Conference in Toronto next Tuesday, Sept 21st.

1 comment:

Unknown said...

Hi Dan !!!
Great note about the Structured model. Ya Xquery Won in that task .. Your statement is exactly correct .. Keep it up...

By
Rajamani marimuthu
XRX Team(Lead)