Wednesday, October 02, 2013

Agile Transformation in the Post NoSQL Era

Over the past four years we have seen the NoSQL movement grow from a small "Meetup" in the Bay Area to a technology that touches all corners of the database world.  Each year NoSQL software becomes more capable, lower cost and easier to use.  Document stores, in particular, make object-relational mapping unnecessary, allowing anyone with good metadata (JSON and XML) to simply drag-and-drop their files into a centralized corporate data store.  There are still a few challenges left, for example using statistical analysis of inserts and record counts to optimize indexes, and putting good query languages (like JSONiq) on top of these data stores.  But by and large, these features are just polish on systems that are already optimized to scale and be highly available.  The hard work seems to have been done and we are now in a stage of refinement, not revolution.

We documented the emergence of the NoSQL database patterns in our book, Making Sense of NoSQL, which is now available through Manning Publications.  If you have read this book you know that NoSQL systems have a diverse set of architectural patterns and that different patterns apply to different problems.  Selecting the right database architecture is a complex process of carefully understanding the subtleties of requirements and weighing the alternatives.  Yet these systems do work well, and once they are set up and configured they make data persistence a straightforward process.

So what's next?

Now that saving data has shifted from a project in its own right to a smaller part of the application development effort, we see the skills needed to build applications starting to shift.  The need to model your data with ER modeling tools is shrinking.  The ability to write complex SQL joins is no longer required.  The next major skill set we would like to address is the movement toward agile transformation.  How do you get your data out of your database and how do you transform it into the many formats that your application needs?

We think the answer to this question is clear.  Organizations need to get better at transforming the data in their databases into other forms.  This is the shift in skill sets from persistence-centric to transformation-centric.  And it is not just software developers who need to be able to transform data.  Everyone on your team, including non-programmers, can play a role.  They all need skills to quickly transform data from one form to another.

We call this shift the movement toward Agile Transformation.  We hope to document how organizations are waking up to this new movement and to understand the tools and processes they are adopting to empower everyone on their team to quickly transform data.

Data extraction and transformation used to be a two-step process.  SQL developers might create a series of tabular reports.  These reports were then converted into the medium needed: HTML, XML, JSON, or even CSV files and other structures needed by other tools.  Now extraction and transformation can be done in a single step.  Query results are no longer restricted to tabular formats.
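As an illustration of the single-step approach (the collection path and element names below are hypothetical), one XQuery can both select the data and render it in the target format:

xquery version "1.0";
(: Sketch: query a document store and emit HTML directly,
   with no intermediate tabular report. :)
<table>
{
  for $order in collection('/db/sales/orders')/order
  order by xs:decimal($order/total) descending
  return
    <tr>
      <td>{string($order/customer)}</td>
      <td>{string($order/total)}</td>
    </tr>
}
</table>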

Strategies for Agile Transformation

Over the next few months (or perhaps years), we hope to document many of the ways that organizations are attacking the agility challenges.  Here are just a few strategies to get us started.

Single Source Canonical Data Models
If you are in the content management business you know that the concept of single-source publishing is central to your productivity.  Using a single format to store content gets around the many-to-many transform problems that can drag down a team's productivity.  We see the same principles applying to web applications in general.  Getting many data sources into a single format and then transforming that single format into many forms is the key to organizational productivity.  We call these models "canonical" since they are the standards that an organization can build publish/subscribe web services around.

Flexible Query Languages
If you have ever worked with tools like XQuery and JSONiq you know that they are among the most flexible query languages around.  These languages have benefited from years of work combining the best features of SQL, XSLT, XPath and dozens of other advanced query languages into a grammar designed to handle a wide variety of transformation use cases.
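As a small sketch of that flexibility (the document names and elements are hypothetical), a single FLWOR expression can filter, join, sort, and reshape data in one pass:

xquery version "1.0";
(: Sketch: join person and department documents and reshape the result. :)
for $person in doc('people.xml')//person
let $dept := doc('departments.xml')//department[@id = $person/@dept-id]
where $person/status = 'active'
order by $dept/name, $person/last-name
return
  <employee>
    <name>{concat($person/first-name, ' ', $person/last-name)}</name>
    <department>{string($dept/name)}</department>
  </employee>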

Reusable Transformation Libraries
One of the first things companies discover is that many transformations are similar and can benefit from reusable code.  Languages like SQL do offer a wide variety of non-portable stored procedures, yet most of these limit your ability to build reusable transformation functions and modules.  Modern languages need to be close enough to your data to understand how queries use indexes, but abstract enough to be reused in new applications.
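Here is a hedged sketch of what such a library might look like (the namespace and function are illustrative, not from any standard module):

xquery version "1.0";
module namespace xform = "http://example.com/xform";

(: Convert an element with simple children into a generic name/value list,
   so many reports can share one rendering function. :)
declare function xform:to-name-value($e as element()) as element(item)* {
  for $child in $e/*
  return
    <item>
      <name>{local-name($child)}</name>
      <value>{string($child)}</value>
    </item>
};

A report can then import the module (for example with import module namespace xform = "http://example.com/xform" at "xform.xqm";) and reuse the same reshaping logic instead of rewriting it in every query.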

Using Great Tools: IDEs and Report Writers
SQL GUI report writers were among the first tools to take the complexity out of transforming tabular data, and we need more tools like these to make NoSQL reporting accessible to non-programmers.  Some NoSQL products like HBase already have SQL-like query tools.  From our other blog posts you may know that we are big fans of the oXygen IDE for managing JSON and XML data.  oXygen makes learning to write XPath expressions easy even for the non-programmer, even when the data is complex.  These tools are complex in their own right, though, and require hands-on training if non-programmers are going to get the most out of them.
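For a sense of how little a non-programmer needs to write (the element names below are hypothetical), a useful report over a catalog document can often be a single XPath expression:

//book[publication-year > 2010]/title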

Simplicity for Non-Programmers
One of the core foundations of agility is not having all your transformations controlled by a small group of overworked developers.  We have learned that simple tools like GUI-based report writers or simple XPath templates can empower a non-programmer, with a bit of training, to build and maintain their own data transformations.  Not needing to understand database joins is a big step in empowerment.  Providing a good foundation library is another.  Setting up small but easy-to-use templates is another good strategy.  Building a search system to find the right library and templates also helps empower new staff and lowers the training burden on existing staff.  In general, we feel that if users "know their data" they should be given the tools to transform their data.

So what are the best practices for building an organization with agile transformation competency? We would love to know your ideas.  Please send us email or tweet us at @dmccreary on Twitter.

Monday, July 01, 2013

Rounding error in Java when converting strings to doubles

I came across a rounding error when I was running an XQuery.

Here is the error:

xquery version "1.0";
number(3.1) + number(3.2)

which returned:

        6.300000000000001

Not the expected value of 6.3.  Note that the XQuery number() function returns double-precision (xs:double) values, and since 3.1 and 3.2 have no exact binary floating-point representation, a small error like this is standard IEEE 754 behavior rather than a bug in any one engine.

Eric Bruchez sent me a note suggesting I cast the numbers to decimal:

xquery version "1.0";
xs:decimal(3.1) + xs:decimal(3.2)

and the problem goes away, because xs:decimal values are stored as decimals rather than binary doubles.  He also showed that he could reproduce the original behavior in other JVM languages like Scala.
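Another possible workaround, if you need to keep working with doubles rather than decimals, is to round the result to the precision you care about with the standard round-half-to-even() function (a sketch, not a general cure for binary floating point):

xquery version "1.0";
round-half-to-even(number(3.1) + number(3.2), 1)  (: serializes as 6.3 :)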

Let me know if anyone else has seen this problem before and has any other suggestions for a fix.

Thanks! - Dan

Tuesday, May 14, 2013

Analytical Reports for Technical Books

We are wrapping up our work on our book "Making Sense of NoSQL" and I have created a series of XQuery reports on the book that we have found useful.

Our book is stored in XML DocBook format, which, for those of you who have not used it, is ideal for technical books. DocBook contains elements for almost everything you need in a technical book, including chapters, sections, paragraphs, figures, glossary terms, bibliographic references, and index terms. In short, DocBook is a perfect fit for most technical books and can easily be extended. One key aspect of DocBook is that it is easy to transform into multiple formats such as HTML, PDF or ePub, and there are many open source transforms available for it. DocBook is at the heart of the movement toward single-source publishing for technical publications.

In addition to the standard DocBook transforms we also created a series of reports on the book, and I thought they might be of interest to others.  Here is a summary of some of these reports.

Chapter length report

When we started writing our book our goal was to produce a 310-page book with 12 chapters of approximately equal length.  But logically we wanted our first chapter to be a brief overview, and we found that the chapter describing the core NoSQL patterns had more content.

This is a horizontal bar chart that shows the length of each chapter. The goal is that all chapters be roughly the same length.
Chapter Length Report

You can see that one chapter in our book (chapter 4) is a bit longer than the rest.  Because we ran this report, we knew this up front and were prepared to defend our decision to our editors.
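As a hedged sketch of how such a report can be written (assuming DocBook 5 with its namespace, a book stored at book.xml, and a simplified whitespace-based word count):

xquery version "1.0";
declare namespace db = "http://docbook.org/ns/docbook";

(: Sketch: rough word count per chapter, longest chapters first. :)
for $chapter in doc('book.xml')//db:chapter
let $words := count(tokenize(string($chapter), '\s+')[. != ''])
order by $words descending
return <chapter title="{$chapter/db:title}" word-count="{$words}"/>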

The "Naughty words" report

We quickly found out that editors have some words that they don't want to see in a technical book. Words such as "just" or "very" should be used with great caution. There were also words that were not allowed at all ("vs.", "e.g.", "etc.") according to the style guide.  This report shows you how often you use these words in each chapter.
Report counting specific words in each chapter
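A rough sketch of how the word counting can be done (the word list is abbreviated, punctuation handling is simplified, and the DocBook assumptions are the same as in the chapter length sketch above):

xquery version "1.0";
declare namespace db = "http://docbook.org/ns/docbook";

let $naughty := ('just', 'very', 'etc.')
for $chapter in doc('book.xml')//db:chapter
return
  <chapter title="{$chapter/db:title}">
  {
    for $word in $naughty
    let $hits := count(tokenize(lower-case(string($chapter)), '\s+')[. = $word])
    return <word text="{$word}" count="{$hits}"/>
  }
  </chapter>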

Book metrics report

This is just a raw count of elements such as book parts, chapters, sections, figures, and tables.
Book Metrics Report

Chapter metrics report

For each chapter I created a detailed report of the content. As you are writing each chapter, other people can view your progress by running this report.  I tend to put in outlines first, then figures and then the actual text of each chapter.

List of figures and tables by chapter

These reports show a listing of figures and tables sorted by chapter and location within each chapter. They also show the caption for each figure and a thumbnail of the image. There are versions that show the type of figure (line art vs. bitmapped) and the sources of each figure.
List of Figures Report

Figure and table captions reports

Editors want to make sure a book is "browsable," which means that every other page has an interesting figure that people will see when they flip through the book.  These reports list figure and table captions sorted by caption length.  Captions that are too short will need further work.

Paragraph and sentence length histogram reports

There are guidelines on sentence and paragraph length that should be followed when writing technical books. This report shows the distribution of paragraph lengths in each chapter. You should work with your editor to find reasonable guidelines for your audience.

Paragraph Size Distribution Report

Table of contents reports

For reviewing the book structure it is always nice to have reports that list the book's structure by chapter, section, sub-section and sub-sub-section. These reports take several parameters that limit the depth of the table of contents and can also be modified to create mind maps using open source mind mapping tools.

Table of contents showing Parts, Chapters, Section 1 and Section 2

Once I wrote the basic structure of the table-of-contents report it was easy to create other output formats for different functions.  Here is an example of a GraphViz output.

Book Outline using GraphViz format.

This example used the DOTML markup format and an external transform built with Chris Wallace's graphviz XQuery module.

You can also convert the table of contents into a mind map file and open it in FreeMind or XMind.

Here is an example of the book rendered in XMind.
NoSQL Book MindMap

Glossary of terms reports

Our book focuses on the terminology used in the NoSQL movement. We try to create precise definitions of all the terms we use and discuss how their definitions vary across different NoSQL communities.  I use these reports to list where each term is first introduced and to make sure that we have a formal definition in the Glossary Appendix at the end of the book.

Glossary Term Report Showing Glossterm IDs and Definition Status

There are also other miscellaneous reports listing the introductory chapter quotes, lists of comments by reviewer (extracted from PDF using Apache Tika), and various tools to help us gauge the completeness of each chapter.

Report Strategy

We created a central XQuery book module with the common variables and functions, such as $book:chapters, which returned a sequence of all book chapters, and book:word-count($node), which returned a word count of a node.  I used the eXist-db database to store and execute the transforms.  Once the module was in place, the templates could be quickly customized for each report.  We also used the oXygen XML IDE extensively and we want to thank George Bina for his support of our project.
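As a minimal sketch (the namespace URI and storage path here are illustrative, not our actual ones), the module looked roughly like this:

xquery version "1.0";
module namespace book = "http://example.com/book";
declare namespace db = "http://docbook.org/ns/docbook";

(: All chapters of the book, in document order. :)
declare variable $book:chapters as element()* :=
   doc('/db/book/book.xml')//db:chapter;

(: Simple whitespace-based word count of any node. :)
declare function book:word-count($node as node()) as xs:integer {
   count(tokenize(string($node), '\s+')[. != ''])
};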

I think that DocBook, oXygen, XQuery and eXist are ideal tools for managing the book creation process.

Monday, May 06, 2013

Cognitive Bias in NoSQL System Selection


After attending the SATURN 2013 Conference I was introduced to the role of "cognitive bias" in software architecture decisions.

Here are some examples of cognitive bias I have seen applied to the world of NoSQL database selection.

Anchoring bias - the tendency to produce an estimate near a cue amount - "Our managers were expecting an RDBMS solution so that's what we gave them."

Availability heuristic - the tendency to estimate that what is easily remembered is more likely than that which is not - "I hear that NoSQL does not support ACID" or "I hear that XML is verbose."

Bandwagon effect - the tendency to do or believe what others do or believe - "Everyone else at this company and in our local area uses RDBMSs."

Confirmation bias - the tendency to seek out only information that supports one's preconceptions - "We only read posts from the Oracle|Microsoft|IBM groups."

Framing effect - the tendency to react to how information is framed, beyond its factual content - "We know of some NoSQL projects that failed."

Gambler's fallacy (aka sunk cost bias) - the failure to reset one's expectations based on one's current situation - "We already paid for our Oracle|Microsoft|IBM license, so why spend more money?"

Hindsight bias - the tendency to assess one's previous decisions as more efficacious than they were - "Our last five systems worked on RDBMS solutions."

Halo effect - the tendency to attribute unverified capabilities to a person or organization based on an observed capability - "Oracle|Microsoft|IBM sells billions of dollars of licenses each year; how could so many people be wrong?"

Representativeness heuristic - the tendency to judge something as belonging to a class based on a few salient characteristics - "Our accounting systems work on an RDBMS, so why not our product search?"

Thanks to everyone at CMU/SEI and the SATURN Conference for the exposure to these ideas.

Wednesday, May 01, 2013

How to pass oXygen XML IDE Editor variables into an XQuery script.

I am a big fan of the oXygen XML Integrated Development Environment.  It has many powerful features and I learn about new ones the more I use it.  I recently started a project that required me to run XQuery transforms on files on my hard drive using Saxon HE.  Since I normally run XQuery from within eXist-db, this was a new challenge for me.

I have many XML files and many different XQueries, and I wanted to run them directly from test cases in an oXygen project.  But unlike XSLT, XQuery does not have a concept of a "standard input" document; you need to specify a path to the document within the XQuery itself.  So how do you pass the file name from the project directly to Saxon?  Luckily, the folks at Syncro Soft (the authors of oXygen) thought about this.  You can pass oXygen "editor variables" as external parameters to your XQuery.

Here is a sample program that reads two external variables, one for the project directory and one for the current file name (with extension).

XQuery Program

xquery version "1.0";

(: Example of how to access external "Editor Variable" variables passed
  by oXygen. This one example gets us to a full file path within a
  project.  See:

http://www.oxygenxml.com/doc/ug-editor/topics/editor-variables.html 

  for other editor variables.
:)

(: oXygen project directory :)
declare variable $pd as xs:string external;

(: file name with extension :)
declare variable $cfne as xs:string external;

(: note the triple forward slashes before the file. :)
let $path := concat('file:///', $pd, '\', $cfne)

return
<results>
   <project-dir>{$pd}</project-dir>
   <file-name>{$cfne}</file-name>
   <file-path>{$path}</file-path>
   <doc-available>{doc-available($path)}</doc-available>
</results>

Note that we concatenate the project directory and file name to get the full path to the file, and we use the doc-available() function to verify that the file is there.

Next, we set up a transformation scenario and put the file above in the transform URL.  We then add the parameters to the transform.  oXygen parses your XQuery for all external variables and lists each of them for you.  For each variable, supply the matching editor variable using the ${pd} notation, where the dollar sign is outside the curly braces (the opposite of XQuery).

Here is a screen image of these parameters and their values.


Results

When we run this transform on any file on Windows we get the following:

<results>
   <project-dir>D:\ws\project\subdir</project-dir>
   <file-name>my-input-file.xml</file-name>
   <file-path>file:///D:\ws\project\subdir\my-input-file.xml</file-path>
   <doc-available>true</doc-available>
</results>

On Mac and UNIX systems the separators would be forward slashes, and since the project directory already begins with a slash you would concatenate the file:// prefix instead of file:///.
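If the same query needs to run against both Windows and UNIX-style paths, a small sketch like this (untested, and reusing the $pd and $cfne declarations from the query above) can build the URI portably:

(: Sketch: choose the separator and file: prefix based on the shape of $pd. :)
let $sep    := if (contains($pd, '\')) then '\' else '/'
let $prefix := if (starts-with($pd, '/')) then 'file://' else 'file:///'
let $path   := concat($prefix, $pd, $sep, $cfne)
return doc-available($path)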