Dan McCreary's Blog: May 2013

Tuesday, May 14, 2013

Analytical Reports for Technical Books

We are wrapping up work on our book "Making Sense of NoSQL," and I have created a series of XQuery reports on the book that we have found useful.

Our book is stored in XML DocBook format, which, for those of you who have not used it, is ideal for technical books. DocBook contains elements for almost everything you need in a technical book, including chapters, sections, paragraphs, figures, glossary terms, bibliographic references, and index terms. In short, DocBook is a good fit for most technical books, and it can easily be extended. One key aspect of DocBook is that it is easy to transform into multiple formats such as HTML, PDF, or ePub. There are many open source transforms available for DocBook. DocBook is at the heart of the movement toward single-source publishing for technical publications.

In addition to the standard DocBook transforms, we also created a series of reports on the book, and I thought they might be of interest to others. Here is a summary of some of these reports.

Chapter length report

When we started writing our book, our goal was to produce a 310-page book with 12 chapters of approximately equal length. But logically we wanted our first chapter to be a brief overview, and we found that the chapter that described the core NoSQL patterns had more content.

This is a horizontal bar chart that shows the length of each chapter. The goal is for all chapters to be roughly the same length.

Chapter Length Report

You can see that one chapter in our book (chapter 4) is a bit longer than the rest of the chapters. We knew this up front after running this report and were prepared to defend our decisions to our editors.

The "Naughty words" report

We quickly found out that editors have some words that they don't want to see in a technical book. Words such as "just" or "very" should be used with great caution. There were also words that were not allowed ("vs.", "e.g.", "etc.") according to the style guide. This report shows how often you use these words in each chapter.

Report counting specific words in each chapter
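A query along these lines might look like the sketch below. It is only an illustration: it assumes un-namespaced DocBook elements (chapter, title, para), it uses a small inline sample in place of the real book, and the names $book and $flagged are made up for this example.

```xquery
xquery version "1.0";

(: Hypothetical inline sample; a real report would load the book with doc(). :)
declare variable $book :=
  <book>
    <chapter>
      <title>One</title>
      <para>It is just a very small example.</para>
    </chapter>
    <chapter>
      <title>Two</title>
      <para>Key-value vs. document stores, etc.</para>
    </chapter>
  </book>;

(: Words the style guide discourages or forbids (taken from the text above). :)
declare variable $flagged := ('just', 'very', 'vs.', 'e.g.', 'etc.');

<word-report>{
  for $chapter at $n in $book/chapter
  (: Lower-case the paragraph text and split it on single spaces. :)
  let $text := lower-case(normalize-space(string-join($chapter//para, ' ')))
  (: Trim punctuation around each token but keep periods, so "etc." survives. :)
  let $tokens :=
    for $t in tokenize($text, ' ')
    return replace($t, '^[^a-z]+|[^a-z.]+$', '')
  return
    <chapter number="{$n}" title="{$chapter/title}">{
      for $word in $flagged
      (: Also match a plain word followed by a sentence-ending period. :)
      let $hits := count($tokens[. = $word or . = concat($word, '.')])
      where $hits > 0
      return <word text="{$word}" count="{$hits}"/>
    }</chapter>
}</word-report>
```

Only words that occur at least once are listed for a chapter, which keeps the report short when a chapter is clean.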

Book metrics report

This is a raw count of elements such as book parts, chapters, sections, figures, and tables.

Book Metrics Report

Chapter metrics report

For each chapter I created a detailed report of the content. As you are writing each chapter, other people can view your progress by running this report. I tend to put in outlines first, then figures and then the actual text of each chapter.

List of figures and tables by chapter

These reports show a listing of figures and tables sorted by chapter and by location in each chapter. They also show the caption for each figure and a thumbnail of the image. There are versions that show the type of figure (line art or bitmapped) and the source of each figure.

List of Figures Report

Figure and table captions reports

Editors want to make sure a book is "browsable," which means that every other page has an interesting figure that people will see when they flip through the book. These reports list the figure and table captions sorted by caption length. The captions that are too short will need further work.
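As a rough sketch of the idea, the query below sorts captions from shortest to longest. It assumes that the caption of a figure or table is its DocBook title element, and the inline sample (with simplified, empty figures and tables) stands in for the real book.

```xquery
xquery version "1.0";

(: Hypothetical inline sample; real figures and tables would also have content. :)
declare variable $book :=
  <book>
    <chapter>
      <title>One</title>
      <figure><title>Architecture</title></figure>
      <table><title>Comparison of key-value and document stores</title></table>
      <figure><title>How a request moves through the cluster</title></figure>
    </chapter>
  </book>;

<caption-report>{
  for $item in $book//(figure | table)
  let $caption := normalize-space(string($item/title))
  let $length := string-length($caption)
  (: Shortest captions first, since those are the ones that need work. :)
  order by $length ascending
  return <item type="{local-name($item)}" length="{$length}">{$caption}</item>
}</caption-report>
```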

Paragraph and sentence length histogram reports

There are guidelines for sentence and paragraph length that should be followed when writing technical books. This report shows the distribution of paragraph lengths in each chapter. You should work with your editor to find reasonable guidelines for your audience.

Paragraph Size Distribution Report
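One way to compute such a distribution in XQuery 1.0 is to assign each paragraph to a fixed-width bucket and count the paragraphs per bucket. This is a minimal sketch under assumptions: the bucket width of 5 words is arbitrary (chosen to suit the tiny sample), and $book, $bucket-size and local:word-count are names invented for the example.

```xquery
xquery version "1.0";

(: Hypothetical inline sample; a real report would load the book with doc(). :)
declare variable $book :=
  <book>
    <chapter>
      <title>One</title>
      <para>A short paragraph.</para>
      <para>Another short one.</para>
      <para>This paragraph is a little bit longer than the other two.</para>
    </chapter>
  </book>;

(: Width of each histogram bucket, in words; an arbitrary value for this sketch. :)
declare variable $bucket-size := 5;

(: Number of whitespace-separated words in a node's text. :)
declare function local:word-count($node as node()) as xs:integer {
  count(tokenize(normalize-space(string($node)), ' '))
};

<histogram bucket-size="{$bucket-size}">{
  for $chapter at $n in $book/chapter
  (: Bucket index for every paragraph in this chapter. :)
  let $buckets :=
    for $para in $chapter//para
    return local:word-count($para) idiv $bucket-size
  return
    <chapter number="{$n}">{
      for $b in distinct-values($buckets)
      let $from := xs:integer($b) * $bucket-size
      order by $b
      return
        <bucket from="{$from}" to="{$from + $bucket-size - 1}"
                paragraphs="{count($buckets[. = $b])}"/>
    }</chapter>
}</histogram>
```

The bucket elements could then be fed to whatever charting transform you prefer.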

Table of contents reports

For reviewing the book structure, it is always nice to have reports that list the book's structure by chapter, section, sub-section, and sub-sub-section. These reports come with several parameters that limit the depth of the table of contents, and they can also be modified to create mind maps using open source mind mapping tools.

Table of contents showing Parts, Chapters, Section 1 and Section 2
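The depth parameter lends itself to a small recursive function. The sketch below assumes un-namespaced DocBook with part, chapter, sect1, sect2 and sect3 elements; the function name local:toc, its parameters and the sample titles are all hypothetical.

```xquery
xquery version "1.0";

(: Hypothetical inline sample with placeholder titles. :)
declare variable $book :=
  <book>
    <part>
      <title>Part A</title>
      <chapter>
        <title>Chapter A</title>
        <sect1>
          <title>Section A.1</title>
          <sect2><title>Section A.1.1</title></sect2>
        </sect1>
      </chapter>
    </part>
  </book>;

(: Returns nested entry elements, stopping once $level exceeds $max-depth. :)
declare function local:toc(
  $parent as element(),
  $level as xs:integer,
  $max-depth as xs:integer
) as element(entry)* {
  if ($level > $max-depth) then ()
  else
    for $child in $parent/(part | chapter | sect1 | sect2 | sect3)
    return
      <entry level="{$level}" type="{local-name($child)}"
             title="{normalize-space(string($child/title))}">{
        local:toc($child, $level + 1, $max-depth)
      }</entry>
};

(: Limit the outline to three levels: part, chapter and sect1. :)
<toc>{ local:toc($book, 1, 3) }</toc>
```

Because the result is itself XML, the same nested entries can be restyled for other outputs without touching the recursion.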

Once I wrote the basic structure of the table-of-contents report it was easy to create other output formats for different functions. Here is an example of a GraphViz output.

Book Outline using GraphViz format.

This example used the DOTML markup format and an external transform using Chris Wallace's GraphViz XQuery module.

You can also convert the table of contents into a mind map file and open the file in FreeMind or XMind.

Here is an example of the book rendered in XMind.

NoSQL Book MindMap

Glossary of terms reports

Our book puts a focus on the terminology used in the NoSQL movement. We try to create precise definitions of all the terms we use and discuss the variations in definitions in different NoSQL communities. I use these reports to list each time a term is first introduced and make sure that we have a formal definition in the Glossary Appendix at the end of the book.

Glossary Term Report Showing Glossterm IDs and Definition Status

There are also other miscellaneous reports listing the introductory chapter quotes, lists of comments by reviewer (extracted from PDF using Apache Tika), and various tools to help us gauge the completeness of each chapter.

Report Strategy

We created a central XQuery book module that had all the common functions and variables, such as $book:chapters, which returned a sequence of all book chapters, and book:word-count($node), which returned a word count of a node. I used the eXist-db database to store and execute the transforms. After we created the module, the templates could be quickly customized for each report. We also used the oXygen XML IDE extensively, and we want to thank George Bina for his support of our project.
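To give a feel for the shape of such a module, here is a minimal sketch rather than our actual code. It is written as a single main module so it can run on its own; in a real project the declarations would sit in a library module that each report imports. The namespace URI, the $book:doc variable and the sample content are assumptions made for the example.

```xquery
xquery version "1.0";

(: Hypothetical namespace URI for the shared book functions. :)
declare namespace book = "http://example.com/ns/book";

(: Keep whitespace between sample elements, as it would be in a stored file. :)
declare boundary-space preserve;

(: Hypothetical inline sample; the real module would load the book with doc(). :)
declare variable $book:doc :=
  <book>
    <chapter>
      <title>Chapter A</title>
      <para>A brief overview of the topic.</para>
    </chapter>
    <chapter>
      <title>Chapter B</title>
      <para>A longer chapter with more detail.</para>
      <para>It has a second paragraph as well.</para>
    </chapter>
  </book>;

(: Sequence of all chapters in the book. :)
declare variable $book:chapters := $book:doc//chapter;

(: Number of whitespace-separated words in a node's text. :)
declare function book:word-count($node as node()?) as xs:integer {
  count(tokenize(normalize-space(string($node)), ' '))
};

(: A chapter length report then reduces to a few lines. :)
<chapter-lengths>{
  for $chapter at $n in $book:chapters
  return <chapter number="{$n}" words="{book:word-count($chapter)}"/>
}</chapter-lengths>
```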

I think that DocBook, oXygen, XQuery and eXist are ideal tools for managing the book creation process.

Monday, May 06, 2013

Cognitive Bias in NoSQL System Selection

At the SATURN 2013 Conference I was exposed to the role of "Cognitive Bias" in software architecture.

Here are some examples of cognitive bias I have seen as applied to the world of NoSQL database selection.

Anchoring bias - the tendency to produce an estimate near a cue amount - "Our managers were expecting an RDBMS solution so that’s what we gave them."

Availability heuristic - the tendency to estimate that what is easily remembered is more likely than that which is not - "I hear that NoSQL does not support ACID." or "I hear that XML is verbose."

Bandwagon effect - the tendency to do or believe what others do or believe - "Everyone else at this company and in our local area uses RDBMSs."

Confirmation bias - the tendency to seek out only that information that supports one's preconceptions - "We only read posts from the Oracle|Microsoft|IBM groups."

Framing effect - the tendency to react to how information is framed, beyond its factual content - "We know of some NoSQL projects that failed."

Sunk cost bias - the failure to reset one's expectations based on one's current situation - "We already paid for our Oracle|Microsoft|IBM license so why spend more money?"

Hindsight bias - the tendency to assess one's previous decisions as more efficacious than they were - "Our last five systems worked on RDBMS solutions."

Halo effect - the tendency to attribute unverified capabilities to a person based on an observed capability - "Oracle|Microsoft|IBM sells billions of dollars of licenses each year, how could so many people be wrong?"

Representativeness heuristic - the tendency to judge something as belonging to a class based on a few salient characteristics - "Our accounting systems work on RDBMS so why not our product search?"

Thanks to everyone at CMU/SEI and the SATURN Conference for the exposure to these ideas.

Wednesday, May 01, 2013

How to pass oXygen XML IDE Editor variables into an XQuery script.

I am a big fan of the oXygen XML Integrated Development Environment. It has many powerful features and I learn about new ones the more I use it. I recently started a project that required me to run XQuery transforms on files on my hard drive using Saxon HE. Since I normally run XQuery from within eXist-db, this was a new challenge for me.

I have many XML files and many different XQueries, and I wanted to run them directly from test cases in an oXygen project. But unlike XSLT, XQuery does not have a concept of "standard input". You need to specify a path to the document within the XQuery file itself. So how do you pass the file name from the project directly to Saxon? Luckily, the team at Syncro Soft (the authors of oXygen) thought about this. You can use oXygen "Editor Variables" as parameters to your XQuery.

Here is a sample program that reads two external variables, one for the project directory and one for the current file name (with extension).

XQuery Program

xquery version "1.0";

(: Example of how to access external "Editor Variable" variables passed
  by oXygen. This one example gets us to a full file path within a
  project.  See:
  http://www.oxygenxml.com/doc/ug-editor/topics/editor-variables.html
  for other editor variables.
:)

(: oXygen project directory :)
declare variable $pd as xs:string external;

(: file name with extension :)
declare variable $cfne as xs:string external;

(: note the triple forward slashes before the file. :)
let $path := concat('file:///', $pd, '\', $cfne)

return
<results>
   <project-dir>{$pd}</project-dir>
   <file-name>{$cfne}</file-name>
   <file-path>{$path}</file-path>
   <doc-available>{doc-available($path)}</doc-available>
</results>

Note that we put the project and file together to get the path to the file and use the doc-available() function to verify that the file is there.

Next, we set up a transformation scenario and put the file above in the transform URL. We have to add the parameters to the transform. oXygen parses your XQuery for all external variables and lists each one on its own line for you. For each variable, enter the editor variable of the same name as its value using the ${pd} notation, where the dollar sign is outside the curly braces (the opposite of XQuery).

Here is a screen image of these parameters and their values.

Results

When we run this transform on any file on Windows we get the following:

<results>
<project-dir>D:\ws\project\subdir</project-dir>
<file-name>my-input-file.xml</file-name>
<file-path>file:///D:\ws\project\subdir\my-input-file.xml</file-path>
<doc-available>true</doc-available>
</results>

On Mac and UNIX systems we would not use the file:/// prefix, and the separators would be forward slashes.
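If you want one query that works on both kinds of system, a possible alternative is to always build a file URI and normalize the separators. The helper below is a sketch under assumptions: local:file-uri is a made-up name, the sample paths are illustrative, and characters such as spaces are not percent-encoded. I have not verified it against oXygen on a Mac.

```xquery
xquery version "1.0";

(: Hypothetical helper: builds a file URI from a directory and a file name. :)
declare function local:file-uri($dir as xs:string, $name as xs:string) as xs:string {
  (: Normalize Windows separators to forward slashes. :)
  let $d := translate($dir, '\', '/')
  (: An absolute UNIX path already starts with "/", so two slashes suffice. :)
  let $prefix := if (starts-with($d, '/')) then 'file://' else 'file:///'
  return concat($prefix, $d, '/', $name)
};

<paths>
  <windows>{local:file-uri('D:\ws\project\subdir', 'my-input-file.xml')}</windows>
  <unix>{local:file-uri('/home/user/project', 'my-input-file.xml')}</unix>
</paths>
```

For the two sample calls this yields file:///D:/ws/project/subdir/my-input-file.xml and file:///home/user/project/my-input-file.xml. In the program above you would pass $pd and $cfne to the helper instead of the literal strings.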