Thursday, November 15, 2007

Full-Text Search Standards for XQuery

I have been using the eXist (http://www.exist-db.org) native XML database for almost a year now, and I am constantly impressed by the power and depth of the eXist community. XQuery is a great way to search XML documents, and most of my XSLTs that manage metadata are now being transferred to XQuery. Although there are still some places where XSLT is superior, XQuery is much easier to use when I am building RESTful web services for metadata. My XForms are frequently generated directly from XQuery programs.

It is interesting to note that eXist is now dominating all the Google search results for "XML database". Leading the pack and pulling away!

From my days doing IT strategy I recall a great deal of analysis done by Adele Goldberg and Kenneth S. Rubin and documented in their book Succeeding With Objects: Decision Frameworks for Project Management. This excellent book is based on over 50 case studies on reuse and offers top-notch analysis of the role of trust in any high-reuse culture.

One of the key points in the Goldberg/Rubin analysis was the set of factors that enable people to reuse assets. In their case the assets were programming objects. In my case I am concerned with the reuse of data objects, from simple data types that are part of the XML Schema standards to registered data elements in a metadata registry. Registered data elements have an approval process and have semantics that span a group, project, program or enterprise.

But the pattern that recurs is that people need to be able to quickly find assets before they can reuse them. One of the main points in my work is that you cannot reuse what you cannot find. The corollary is that the longer it takes to find an asset, the greater the temptation to recreate it. That is where the search functions of a metadata registry need to come in. They must be fast and accurate.

The quality of a metadata registry also depends on strategies for informing users when two data elements have the same semantics and are potentially duplicates. To find such duplicates we need semantic nearness searches for data element definitions.
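As a rough illustration of what a nearness search over definitions might look like, here is a minimal Python sketch. It is not how eXist or any particular registry does it: it assumes a simple bag-of-words cosine similarity, and the element names, definitions, stop-word list and 0.8 threshold are all invented for the example.

```python
import math
import re
from collections import Counter

# Hypothetical data element definitions, invented for illustration only.
DEFINITIONS = {
    "PersonBirthDate": "The date on which a person was born.",
    "IndividualDateOfBirth": "The calendar date a person was born.",
    "OrganizationName": "The legal name of an organization.",
}

# A tiny, assumed stop-word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "the", "of", "on", "which", "was", "is"}


def term_counts(text):
    """Lowercase the text, split it into words, and count the non-stop words."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)


def cosine_similarity(a, b):
    """Cosine of the angle between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def likely_duplicates(definitions, threshold=0.8):
    """Return (name, name, score) for pairs scoring at or above the threshold."""
    vectors = {name: term_counts(text) for name, text in definitions.items()}
    names = sorted(vectors)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            score = cosine_similarity(vectors[first], vectors[second])
            if score >= threshold:
                pairs.append((first, second, score))
    # Highest-scoring pairs first, so a reviewer sees the strongest candidates.
    return sorted(pairs, key=lambda p: p[2], reverse=True)


if __name__ == "__main__":
    for first, second, score in likely_duplicates(DEFINITIONS):
        print(f"{first} ~ {second}: {score:.2f}")
```

With these sample definitions the two birth-date elements are flagged as a candidate pair and the organization name is not. Word counting like this only catches definitions that share vocabulary; real text mining would add stemming, synonyms and term weighting.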

Today I am providing simple substring searches. But I know that text mining can provide semantic nearness searches. I was glad to see that the World Wide Web Consortium (W3C) is working on full-text search standards for XQuery. The W3C standard (http://www.w3.org/TR/xquery-full-text/) has now reached the working draft last-call stage, and I hope that we soon start to see implementations that take advantage of these features. I know that Google's support of this project for eXist (see http://exist.sourceforge.net/soc-ft.html) is a very good sign.
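To make the gap between substring search and word-based search concrete, here is a small Python sketch. It is only an illustration of the general idea, not the W3C full-text syntax and not eXist's implementation; the function names and the sample definition are made up.

```python
import re


def substring_match(definition, query):
    """Plain case-insensitive substring test."""
    return query.lower() in definition.lower()


def all_words_match(definition, query):
    """True when every query word appears as a whole word, in any order."""
    words = set(re.findall(r"[a-z0-9]+", definition.lower()))
    query_words = re.findall(r"[a-z0-9]+", query.lower())
    return all(w in words for w in query_words)


if __name__ == "__main__":
    # Hypothetical data element definition, invented for this example.
    definition = "The date on which a person was born."
    for query in ("son", "person date"):
        print(query, substring_match(definition, query), all_words_match(definition, query))
```

The substring search finds "son" inside "person", which is a false hit, and misses "person date" because the words are not adjacent in that order. The word-based search gets both cases right, which is the kind of behavior I am hoping a full-text standard will give us without hand-written code.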

I was also fortunate to work with Gary Berosik on the Advocate Agents project. Gary works for Thomson/West Publishing, which has a large group of people doing text mining. Gary introduced me to the book Text Mining Application Programming by Manu Konchady. This book has excellent sections on the decomposition of free-form text into parts of speech for information extraction, clustering, categorization and search.

I hope that eventually the W3C will approve the full-text extensions to XQuery and that they will make it into the eXist system. Based on this link, http://exist.sourceforge.net/soc-ft.html, I am hoping that it will happen within the next few years.