Friday, September 24, 2010

McCreary's Law

I am just returning from a three-day conference on Enterprise Information Management. I was pleasantly surprised to meet many other smart and warm independent consultants who do metadata and EIM consulting. More posts about what they taught me are coming.

But these conferences require me to answer the same questions over and over again: What do I do, and why don't I use traditional RDBMS systems to store metadata? What I need is a strong statement that explains the core principle behind the need for agility in metadata management. So in that context, I would like to suggest a law of agile software development. To give it a label, I will just use McCreary's Law for now.

McCreary's Law states the following:

The agility of any software project is inversely proportional to the square of the number of data transformations in the developer stack.
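
Written as a formula (just my own shorthand for the statement above; A stands for agility, T for the number of data transformations, and k is an unspecified constant):

    % A: development agility, T: data transformations in the stack,
    % k: an unspecified constant covering everything else
    A = \frac{k}{T^{2}}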

The number of data transformations in the traditional three-tier architecture (web to objects to RDBMS to objects and back to the web) is four.

The number of data transformations in the traditional three-tier architecture plus XML web services is six, since web services are usually created with one transform from objects to XML and one from XML to objects.

The number of data transformations in XRX is approximately zero: you could argue that XML to HTML is a 1/2 transformation, since only the tag names change. Of course, if you are not changing the order, any XML can be styled directly with CSS, which is not really a transformation at all. Counting a 1/2 transformation also gets around the problem of infinite developer agility when the denominator is zero.
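
Plugging those counts into the formula gives a rough feel for the claim (again, only a sketch):

    A_{\text{3-tier}} = \frac{k}{4^{2}} = \frac{k}{16} \qquad
    A_{\text{3-tier+WS}} = \frac{k}{6^{2}} = \frac{k}{36} \qquad
    A_{\text{XRX}} = \frac{k}{(1/2)^{2}} = 4k

So by this law, adding XML web services to a three-tier stack costs you more than half of your agility (16 versus 36 in the denominator), while XRX comes out roughly 64 times more agile than plain three-tier (4k versus k/16).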

This formula is not perfect, and I am sure there will be modifications to it, like adding a constant for training and IT resistance to change, but I hope people can use it to predict or explain their current development agility.

Thanks to everyone I met at the EIM conference who listened to my rants and did not dismiss me as a complete metadata radical.

Sunday, September 12, 2010

XLucene

For the past two years I have been quietly observing the growth of something called Structured Search. This is the use of document structure to aid in search and retrieval of documents or sub-documents in large document collections.

I have also seen a common solution to this problem: a combination of a native XML database and the Lucene keyword indexing library. But the people doing this are doing far more than simple keyword search, so calling this new approach simply “Lucene” keyword search does not really describe what is going on. So I am proposing a new meme: XLucene.

I define XLucene as combining document structure and keyword searches to create very precise search rankings or search hit scores. The key is that the design must be able to store each branch in the document tree as a separate subdocument.

The key factor in widespread adoption is that almost anyone who can identify the tag names in their document structures will soon be able to create their own XLucene search and retrieval systems.
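
To make this concrete, here is a minimal sketch of the kind of query I have in mind, written against eXist-db, which exposes its Lucene index through the ft:query() and ft:score() functions. The collection path, element names and search term below are all hypothetical.

    xquery version "1.0";
    (: A structure-aware keyword search: only title elements are searched,
       and hits are ranked by their Lucene relevance score.
       Collection and element names are hypothetical. :)
    declare namespace ft = "http://exist-db.org/xquery/lucene";

    for $title in collection('/db/reports')//section/title[ft:query(., 'metadata')]
    let $score := ft:score($title)
    order by $score descending
    return
        <hit score="{$score}">{ $title/text() }</hit>

The point is that the document structure (only title elements are searched) and the keyword relevance score participate in the same query.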

What first got me interested in “structured retrieval” was the book “Introduction to Information Retrieval” by Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze. In this book they present the following table:

RDB search: objects are records; the model is the relational model; the main data structure is the table; queries are SQL.
Unstructured retrieval: objects are unstructured documents; the model is the vector space model (among others); the main data structure is the inverted index; queries are free text queries.
Structured retrieval: objects are trees with text at leaves; the model, main data structure and query language are left as open questions.
The book follows this table with the quote: “There is no consensus yet as to which methods work best for structured retrieval although many researchers believe that XQuery will become the standard for structured queries.”

I think the jury has returned its verdict: XQuery wins! Here is the structured retrieval column of that table filled in:

Objects: Trees with text at leaves
Model: Hierarchical documents
Main data structure: trees and inverted indexes tied to node-ids
Queries: W3C XQuery with full-text extensions

The design of using variable-length node-ids as the document identifiers was done by Wolfgang Meier of the eXist-db.org project, and many people are using this design today. I will be following up this posting with case studies of how people are using it in several areas, including the government sector, library metadata, and document search and retrieval.

Since so many people are using this same problem-solution pair, it deserves its own name, like all good patterns do. XLucene (with no embedded dash) can be picked up by search engines, and we can all start to share design experience and solutions. This is why the open source model really is superior to closed systems: knowledge just travels faster once the memes are created.

I will also be presenting more details on the general topic of Structured Search and XLucene at the Enterprise Information Management Conference in Toronto next Tuesday, Sept 21st.

Monday, August 30, 2010

What Makes a Good Solution Architect?

A recruiter friend of mine was looking for a senior "Solution Architect" and needed some help setting up a screening interview. Here is what I told him.

I look for solid experience with a diversity of solutions. There are just too many people who think that HTML/objects and RDBMS systems are the only tools in an architect's toolkit.

Here is what I suggested the interview questions would be like:

Consider the following application architectures:

  1. Ruby on Rails on a relational database
  2. XRX on a native XML database
  3. A Java Client on J2EE app server using JMS
  4. A jQuery client on an OLAP database

A senior "Solution Architect" should be able to compare the pros and cons of each system and describe how they have used each of these architectures to build solutions and give organizations a competitive advantage. I also drill down into their understanding of search and retrieval and natural language processing.

As a bonus question, I would ask how each architecture would impact an organization's business strategy and to what extent business units could be empowered to build their own applications with minimal training and little or no need for IT involvement.

Most good solution architects should have experience with three out of these four solution architectures, but very few have the strong MBA-type background needed to understand how solution architectures impact business strategy. If you find anyone who has built systems with all four and has an MBA, you have a serious senior solution architect. Hire them!

Friday, July 02, 2010

Impressions of SemTech 2010

Two weeks ago I attended my fifth Semantic Technology conference in San Francisco. It was a great conference, and I plan to keep attending in the future! My biggest dilemma was which session to attend. At times there were as many as eight concurrent sessions going on.

There were a few big trends that I spotted.

The use of RDFa to annotate web pages with semantically precise elements was clearly a big trend. Jay Myer of BestBuy described how BestBuy's sales went up 30% when they added RDFa tags to their product pages. Although many search engines are not transparent about their use of RDFa tags in page rankings, Jay's results should make it clear that this strategy works.

Jay told a great story about how hard it was to find a fridge that met his specific criteria (black, a specific size, etc.) using a keyword-based search engine. The semantic web will change all of this!

The big factor here is Martin Hepp's GoodRelations ontology for products and services. This can really bring the Semantic Web to the masses. Finally, there is a way to encode the hours of operation for your store so that you can use Siri to ask, "What sushi restaurants are open at 10pm near here?"

There were also a lot of sessions on LinkedData, and specifically on LinkedData in government. Jim Hendler and his students at RPI are scraping every data set on data.gov, converting them all to RDF and storing them in a huge triple store.

Jim also provided one of the best sound-bites from the conference:

“Get AWAY from the Table”

This goes far beyond the NOSQL movement that we are seeing. It gets to the core of the problems with innovation in many organizations.

Jim is a consultant to the US data.gov transparency process and a key advocate of open linked data standards. It is interesting to see a friendly rivalry between Jim and Tim Berners-Lee, who is a consultant for the UK data transparency movement. Both are using RDF to convert data, but each is taking a slightly different strategic approach.

With that quote, Jim was referring to the fact that many organizations over-use the relational model and need to understand that there are many alternatives, especially when doing mashups of data sets from many sources.

There were also many presentations on natural language processing and finding the true “meaning” of words within unstructured and semi-structured data sets.

I always attend Brian Sletten's session on REST. Many people do not understand the relationship between REST, URIs and the semantic web stack. Joe Wicentowski's work at the US Department of State on the URL rewriting framework within eXist has really reinforced how this can be done easily in a single XQuery module.
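
For readers who have not seen it, here is a rough sketch of what such a URL rewriting controller (controller.xql in eXist) can look like; the application paths below are hypothetical and the real modules are certainly more involved.

    xquery version "1.0";
    (: Rough sketch of an eXist-db URL rewriting controller.
       Application paths are hypothetical. :)
    declare namespace exist = "http://exist.sourceforge.net/NS/exist";

    declare variable $exist:path external;
    declare variable $exist:resource external;

    if ($exist:path = ('', '/')) then
        (: send bare requests to a default page :)
        <dispatch xmlns="http://exist.sourceforge.net/NS/exist">
            <redirect url="index.xq"/>
        </dispatch>
    else if (ends-with($exist:resource, '.xq')) then
        (: forward clean URLs to the matching XQuery module :)
        <dispatch xmlns="http://exist.sourceforge.net/NS/exist">
            <forward url="{concat('/db/apps/demo/modules/', $exist:resource)}"/>
        </dispatch>
    else
        (: let eXist serve everything else (CSS, images) as usual :)
        <ignore xmlns="http://exist.sourceforge.net/NS/exist"/>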

I presented a 3.5 hour tutorial session on entity extraction. Since it was scheduled for the last session of the last day of the conference, I thought the attendance would be very low. But the room was packed and most of the people stayed until the very end.

It was wonderful to finally meet Marie Wallace and DJ McCloskey from the LanguageWare team at IBM. Their support of the Apache UIMA standards will be a great step forward for the creation of interoperable language analysis pipelines. Marie and DJ introduced me to the Millennium restaurant just a few blocks from the hotel. My wife Ann and I went there over the weekend and we now own one of their cookbooks.

By the way, Google now also supports recipes in the rich snippets in its search results, so you will soon be able to find "only recipes that take under 30 minutes" in a Google search.

There were also many sessions on the need for good taxonomies and ontologies in the enterprise. The use of SKOS for controlled vocabularies was a very hot topic, and there were dozens of presentations on how OWL is being used to capture and exchange business rules.

It was also really great to finally meet Jeni Tennison after reading her books and following her tweets for a long time. She is deeply involved in the federal data transparency projects in the UK. Her use of RDF is setting new best practices for the entire community.

It was also great to catch up with Mark Birbeck, one of the chief architects of the Ubiquity XForms libraries. I am looking forward to trying out some of the new tools based on his backplane JavaScript libraries.

The people from Facebook also gave a presentation on how site owners can add a few lines to each web page to give users a “like” button, using a variation of RDF. Many people were a little disappointed to hear that Facebook will not be using namespaces in their interfaces; they felt that most HTML coders could only handle a single namespace. Many people agreed with their findings and thought that, until 90% of HTML coders know what namespaces are, organizations like Facebook are stuck just adding new data to HTML meta elements.

One of the most interesting discussions I had was with the people from Cray. Apparently a large unnamed government agency has given Cray a very large contract to build customized ASIC (FPGA) chips to do graph analysis. Their ThreadStorm XMT architecture has 128 register sets and allows the CPUs to get continuous feeds of graph queries without waiting for memory. Their claim is that the federal agency has told them they are getting a 100X improvement in complex graph queries. The challenge is that the API is currently a low-level C interface and they have not yet put a SPARQL compiler in front of the XMT, so no good SPARQL benchmarks are yet available to compare the results of this hardware with a typical triple store. But even if it is fast, the price would be in the six digits for one of the XMT systems, so only very large organizations and government agencies might use this unless it was provided as a service.

One of the other things I noticed was how many people are using OWL and OWL reasoners like Pellet as replacements for traditional rules engines from companies like FairIsaac. There are two reasons for this. The first is that many rules engine companies only talk about their engine performance, not about the need for precise semantics in the data elements. With OWL and the methods behind the semantic web stack, we attempt to put semantics higher in the requirements of a system and use pre-built and pre-tested ontologies.

The second big reason that people are using OWL is that the rules are created using open standards, which means you are not locked into any one vendor's rule format. The new W3C Rule Interchange Format (RIF) standards will also start to break down many of the barriers to the creation and exchange of large industry-vertical rule sets, so that each organization does not have to start from scratch with a rule base. This will be big for industries like insurance, where claims processing ontologies are just being created. Thanks to Kendall Clark for good information on this topic.

I also had a great time talking to many people in the publishing area. Seth Maislin gave a great presentation on taxonomies. Seth and Marlene Rockmore joined me for lunch at a nearby Indian restaurant. Marlene is an expert on taxonomy development. She was working on a book on taxonomies with O'Reilly that is on hold for now, but we hope to see one from her in the future.

It was also nice to see how many federal agencies in the intelligence community are now starting to use SPARQL. I saw a good presentation by Dennis Wisnosky of the US DoD about how they plan to use NIEM and semantic technologies to cut their integration costs. Yeah NIEM! Most people don't understand that NIEM is based on the RDF model; it just uses XML syntax to store the relationships. See my webcast on the SemanticUniverse.com web site for more details. Hopefully his slides will be available in the future. Many people were taking photos of the slides, since the DOD is not great at getting their slides out. Wonder why?

I also had lunch with several people from within the intelligence community and had great discussions about the pros and cons of pulling assertions out of NLP-annotated XML documents and the challenges relating to this. Keeping the links back to the original XML documents is critical for verifying the results of an assertion in context. Representing traceability, lineage, provenance and the time domain in RDF without causing a 10X growth in the number of triples is a difficult problem.

Finding a way to relate an RDF assertion back to a source document XML node ID (the fourth-column question) is a very difficult problem and one that needs close research with native XML and RDF systems. One of the things I learned about the eXist-Lucene integration done by Wolfgang Meier is that keeping the node-id as the document ID in Lucene allows non-programmers to configure customized search ranking rules based on the context of the keyword in a document. This is where highly customizable structured search rocks. I think this innovation needs to be added to RDF triple store projects. Keeping the XML node-id of an assertion with an RDF triple's context could be a great way to prevent RDF triple bloat when recording context.
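
As a sketch of what that kind of configuration looks like in eXist, a collection.xconf file declares which elements get a Lucene index and how heavily a match in each should be weighted; the element names and boost values below are hypothetical.

    <!-- Sketch of an eXist collection.xconf Lucene index configuration.
         Element names and boost values are hypothetical. -->
    <collection xmlns="http://exist-db.org/collection-config/1.0">
        <index>
            <lucene>
                <analyzer class="org.apache.lucene.analysis.standard.StandardAnalyzer"/>
                <text qname="title" boost="2.0"/>
                <text qname="abstract" boost="1.5"/>
                <text qname="para"/>
            </lucene>
        </index>
    </collection>

Because each indexed element keeps its node-id, a rule like "a hit in a title counts double" is a one-line change to this file rather than new code.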

Finally, I felt that David Wood and James Leigh's presentation on the Callimachus project had the most potential to have a "big impact" on the adoption of Semantic Web tools. This new open source project allows users to write a simple HTML template with special XML tags that bring triples directly into a web application. This has the potential to allow far more non-SPARQL programmers to integrate RDF data directly into a web page, much like XQuery does today. With some work it might be possible to auto-generate the XForms bind elements for form rules. This would be a huge win for integrating rules into web forms without needing to write JavaScript.

You can also see my tweets on the conference here: http://twitter.com/dmccreary

I hope to see more of you in San Francisco next year. Let me know if you are interested in co-presenting any papers!