“Distinct” Results from a Cloudant Search Index

Search indexes are a really nice feature of Cloudant.  I’d go so far as to say they’re essential: without them, many tasks most would consider routine database work would be much more complicated.  Let’s look at a simplified set of the data available at xbrl.mybluemix.net: IBM reporting some financial information over time:

{ "date": "03/31/2010", "company": "IBM", 
"concept":"Assets", "value": 100.00}
{ "date": "06/30/2010", "company": "IBM", 
"concept":"Assets", "value": 110.00}

...repeats...

{ "date": "12/31/2014", "company": "IBM", 
"concept":"Assets", "value": 200.00}
{ "date": "03/31/2010", "company": "IBM", 
"concept":"Current Assets", "value": 50.00}

...continues

Nothing too complicated so far.  Let’s create a search index covering the company and concept fields.  It would look like this:

function(doc){
	index("default", 
		doc.company, 
		{"store": true});
	index("concept", 
		doc.concept, 
		{"store": true});
}

So far everything looks good.  To populate the second auto-complete combo-box on the main page of xbrl.mybluemix.net, we need to search for all the concepts a company filed that contain a certain string. Let’s suppose we’re looking for all concepts that IBM filed containing the text “Assets”, limited to 10 results. The query URI would look something like this:

https://myaccount.cloudant.com/facts/_design/factsSearches/_search/nameSearch?q="IBM"%20AND%20concept:Assets&limit=10

The actual JSON for the results isn’t important yet; what’s important is that the response from this URI would contain the following results:

{ "date": "03/31/2010", "company": "IBM", 
"concept":"Assets", "value": 100.00}
{ "date": "06/30/2010", "company": "IBM", 
"concept":"Assets", "value": 110.00}

... repeats ...

{ "date": "06/30/2012", "company": "IBM", 
"concept":"Assets", "value": 150.00}

“Current Assets” is not there! Nor should it be: the search index did exactly what it was asked to do. What we actually wanted was to find the distinct concepts with “Assets” in them. In SQL terms we were looking for:

SELECT DISTINCT 
    concept 
FROM 
    facts 
WHERE 
    company='IBM' 
    AND
    concept LIKE '%Assets%'

A naive solution would be to increase the limit, say to 100, then simply process the list on the client or server, keeping only distinct results, before populating the combo-box.  This would work for our simple example, but suppose IBM reported its Assets 100 times, or 1000 times?  Cloudant imposes a maximum limit of 200 results.  If IBM had reported “Assets” 1000 times (admittedly a bit of a contrived example) we’d be out of luck.
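For reference, that client-side de-duplication step might look like this sketch (the `rows` array and the `doc.concept` shape stand in for a parsed search response; the function name is hypothetical):

```javascript
// Keep only the first occurrence of each concept from a list of
// search result rows. `rows` stands in for the parsed rows array
// of a Cloudant search response that includes the documents.
function distinctConcepts(rows) {
    var seen = {};
    var distinct = [];
    rows.forEach(function (row) {
        var concept = row.doc.concept;
        if (!seen[concept]) {
            seen[concept] = true;
            distinct.push(concept);
        }
    });
    return distinct;
}
```

This works only as long as every matching document fits inside a single response, which is exactly the limitation just described.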

Since we only need a list of distinct values of a single field (concept) we can use the counts facet. Simply rebuild the index like so:

function(doc){
	index("default", 
		doc.company, 
		{"store": true});
	index("concept", 
		doc.concept, 
		{"store": true, "facet": true});
}

Now the query URI looks like this:

https://myaccount.cloudant.com/facts/_design/factsSearches/_search/nameSearch?q="IBM"%20AND%20concept:Assets&limit=0&counts=%5B"concept"%5D

We use limit=0 since we aren’t really concerned with the actual documents, just the list of distinct values; counts=["concept"] gives us this. The JSON resulting from this query would look something like:

{"total_rows":17,"bookmark":"g2o","rows":[],"counts":{"concept":{"Assets":16,"Current Assets":1}}}

That final associative array is what we’re after: the distinct values of the concept field containing “Assets”, and how many times each value appears.
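Feeding the combo-box from that response is then just a matter of reading the keys of the counts object. A minimal sketch (the `body` argument and function name are hypothetical, standing in for the parsed JSON above):

```javascript
// Given a parsed counts-facet response, return the distinct values
// of the faceted field, sorted by how often each one appears.
function distinctValues(body, field) {
    var counts = body.counts[field];
    return Object.keys(counts).sort(function (a, b) {
        return counts[b] - counts[a];
    });
}
```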

Caveat: There is no “limit=x” option when working with the counts facet. So if the search were for, say, all documents where the concept contains “e”, the query could take a very long time and return an extremely long list of counts.


A Cloudant “Tip”: Separate Design Documents for Each View

This is not an official “best practice” or anything, just a habit I fell into while working with Cloudant and large data sets.

To very quickly review: there are no “queries” as such when working with a Cloudant database.  Instead, one can create secondary indexes (“views”) or search indexes against a database.  Both are functions written in JavaScript that are executed against every document in a database, with the results stored.  This makes getting results from large data sets blazing fast, as all possible results are pre-calculated, but building the index can be very slow if there is a lot of data or the index function is complex.
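As a concrete point of reference, a secondary index is defined by a map function like this minimal sketch (field names borrowed from the earlier posts; `emit` is supplied by the database at indexing time):

```javascript
// A minimal Cloudant/CouchDB map function: for every document that
// has a company field, emit the company as the key and the reported
// value as the value. The database runs this once per document and
// stores the emitted rows as the index. `emit` is provided by
// Cloudant when the index is built.
var map = function (doc) {
    if (doc.company) {
        emit(doc.company, doc.value);
    }
};
```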

Each view is part of a design document, and each design document can contain any number of views. As we can see from the documentation:

View indexes are rebuilt when a design document is updated. An update to any one of the views causes all the views in the document to be rebuilt.

That second sentence is the important bit. xbrl.mybluemix.net as of the time of this post has just under 70 million documents in its main database.  Building a simple secondary index against this data usually takes about 72 hours.  If you have an application depending on such a view it will either show very incomplete data, or no data at all, during this time.

There are other articles out there that say that each design document conceptually represents an “application” and each view a part of that application.  That sounds nice, but if you’ve got hundreds of gigabytes or terabytes of data it means that if you want to change one part of your application, or even just add to it, you’re going to bring the whole thing down for days.  Even if it’s just dev and not prod, that’s really annoying.

So whether or not it’s an official “best practice”, I’ve fallen into the habit of having as few views per design document as possible, usually one or two.  It may not be as pretty, but if I want to change one part of an application without taking down all the other parts it’s really nice.
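Concretely, instead of one design document holding every view, I end up with documents like this sketch (all names hypothetical); updating one design document then triggers a rebuild of only its own view:

```javascript
// Two design documents with one view each. Updating one of them
// rebuilds only its own view; the other index stays available
// while that happens. (All names are hypothetical.)
var byCompany = {
    _id: "_design/byCompany",
    views: {
        byCompany: {
            map: "function (doc) { emit(doc.company, doc.value); }"
        }
    }
};

var byConcept = {
    _id: "_design/byConcept",
    views: {
        byConcept: {
            map: "function (doc) { emit(doc.concept, doc.value); }"
        }
    }
};
```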


CamelCase Tokenizer For Cloudant Search Indexes

This post will be part of a series of things I learned while building xbrl.mybluemix.net.  Some things may sound very simple to Node.js or Cloudant gurus, but as I started with zero knowledge of either, these were things I had to learn the hard way.  Often, Google searches produced no meaningful or simple example solutions.  So here we go…

When creating a search index over a Cloudant database a number of analyzers are available for parsing/tokenizing full text data.  When building xbrl.mybluemix.net the field I wanted to search on was camel case strings like SalesRevenueNet.

There is no built-in camel case tokenizer in Cloudant.  It’s possible to build your own Lucene.Net tokenizer for camel case strings, but this is of no use when using Cloudant.  Luckily, the flexibility of defining your search index as a JavaScript function lets you tokenize pretty much any way you want: simply apply your own tokenizer when creating the index.

// e.g. unCamelCase("SalesRevenueNet") -> "Sales Revenue Net"
function unCamelCase(str){
    return str
        // insert a space between a lower-case and an upper-case letter
        .replace(/([a-z])([A-Z])/g, '$1 $2')
        // space before the last upper in a run of uppers followed by a lower
        .replace(/\b([A-Z]+)([A-Z])([a-z])/g, '$1 $2$3')
        // upper-case the first character
        .replace(/^./, function(str){ return str.toUpperCase(); });
}

function(doc){
	index("name", unCamelCase(doc.name), {"store": true});
}

It’s that easy: you don’t have to index a field that already exists, you can create one at “run time”. For those with keen eyes, I realize the above example doesn’t take numbers into account, but if that’s a concern it’s left as an exercise for the reader. Credit to this StackOverflow thread for providing a JavaScript example of a camel case tokenizer.


My Mistake or Tesla’s

Take a look at the following graph:

http://xbrl.mybluemix.net/show/5bb9826826809a63184d9438ff8b66b2

We can see right away that something weird may be going on: there are two values at exactly the same time period. DeferredTaxAssetsNet can’t be two different values at the same time, can it? Using the “Show XML” link on the two data points shows that Tesla’s 2012 10-K had the following:

<context id="eol_PE665410--1210-K0016_STD_0_20121231_0">
	<entity>
		<identifier scheme="http://www.sec.gov/CIK">
			0001318605
		</identifier>
	</entity>
	<period>
		<instant>
			2012-12-31
		</instant>
	</period>
</context>

<us-gaap:DeferredTaxAssetsNet 
contextRef="eol_PE665410--1210-K0016_STD_0_20121231_0" 
unitRef="iso4217_USD" 
decimals="-3" 
id="id_241467_82B22C37-AFA1-40BD-A569-D2FEA93C0951_1_8">
	387997000
</us-gaap:DeferredTaxAssetsNet>

Following on that, Tesla’s 2013 10-K had this instead:

<context id="eol_PE665410--1310-K0010_STD_0_20121231_0">
	<entity>
		<identifier scheme="http://www.sec.gov/CIK">
			0001318605
		</identifier>
	</entity>
	<period>
		<instant>
			2012-12-31
		</instant>
	</period>
</context>

<us-gaap:DeferredTaxAssetsNet 
contextRef="eol_PE665410--1310-K0010_STD_0_20121231_0" 
unitRef="iso4217_USD" 
decimals="-3" 
id="id_4522937_93BCB9A5-8733-4C68-82F8-C674BDB934EE_2_12">
16527000
</us-gaap:DeferredTaxAssetsNet>

It’s important to note that the first filing uses US-GAAP 2012, while the latter uses US-GAAP 2013. Is this the reason? Maybe DeferredTaxAssetsNet is calculated differently in the two systems? I’m not enough of an accountant to know. On the other hand maybe Tesla made some kind of mistake in reporting the value for the previous year when they were filing their 2013 10-K. On the other other hand maybe this data isn’t meant to be time contiguous and it doesn’t make sense to display it this way?

Anyone who knows please chime in.


XBRL on IBM Bluemix

Note:  This post makes reference to my employment activities at IBM, please see the Disclaimer.

Something I’ve had going in the background for a little while is using Node.js to stream all the XBRL filing data off the SEC website, store it in a MongoDB instance, and create a web UI to display it. I’ve processed every filing ever made locally, but hosting the data has proven a little more restrictive: I only have space for the Dow 30. I’m using IBM Bluemix to host the Node.js “UI” portion, and the database-building portion is running on a virtual machine that I use inside IBM.

I’ll post more about my experiences with Node.js, XBRL and Mongo a little later, but for now you can play with the app here:

http://xbrl.mybluemix.net


CDM Web

Note:  This post makes reference to my employment activities at IBM, please see the Disclaimer.

In my professional capacity I spend most of my time working in product management at IBM for a product called IBM Cognos Disclosure Management.  Long story short: it’s a Windows application that embeds Microsoft Office and allows users to collaborate on creating large documents by breaking the document down into “objects” with security and workflow.  It also connects to various data sources and can embed/refresh the data in said objects.  Since it integrates with desktop Office, there’s always been talk of somehow creating a browser version.  When Microsoft released OWA (Office Web Apps) Server, I thought I’d play around with it.

Note that OWA Server is a totally standalone, redistributable instance of the exact same technology that runs Office 365 (or whatever Microsoft is calling it these days).  Everything is hosted locally; no content flows through Microsoft’s servers at any point.

Step One: Installing OWA Server

This took a fair bit of time just to provision the environment.

  • Must be installed on Windows Server (I used 2012)
  • Server must be joined to a domain (no idea why)
  • Nothing else “server-y” may be installed like SQL Server.

Anyway, I used these two documents to get me there:

Plan Office Web Apps Server

Deploy Office Web Apps Server

Step Two: Open a Document

Not as easy as it sounds.  Actually installing the server doesn’t get you anything: there’s no URL to navigate to where you can just open a file.  You have to implement a WOPI (Web Application Open Platform Interface) web service to provide information about documents that OWA can actually open.  Fine, I used these two documents to get me most of the way there:

Introducing WOPI

Building an Office Web Apps (OWA) WOPI Host

If I remember rightly, this didn’t actually get me all the way there.  I had to do a fair bit of debugging on the OWA server; this article describes where the logs are.  They are very verbose, so I just searched for “error” and “exception”. Also, you’ll need to understand what ASP.NET Web API is.

Step Three: Integrate with CDM

I decided to write an MVC 4 web application that would simply talk to the CDM application server on its existing web service endpoints.  Since CDM already has a desktop client, I thought this should look like a mobile app; luckily, MVC 4 has some really nice and easy routes to building one.

The toughest part for me was coming to grips with MVC’s “fuzzy” routing.  I found this routing debugger insanely helpful.  Remember: the order in which you set up your routes matters!  That one caused me to tear my hair out for a few hours.

Step Four: Profit?

Here’s a quick video of the app in action, running inside Chrome emulating a Nexus 5.


Dynamic Typed Languages for Complex Object Models

A depressing amount of my professional life has revolved around XBRL.  For those too lazy to follow the link I’ll summarize by saying it’s an XML based way of describing and providing financial information.  Something that’s been commented on by many but actioned by few is that XBRL can represent multi-dimensional information, so OLAP is a natural storage mechanism.  For those who aren’t aware I currently work at IBM, and since IBM owns an OLAP engine called TM1, I thought I’d play around with whether or not storage of a complete XBRL instance and taxonomy in TM1 is possible.

In order to get started I needed an XBRL processor to load sample documents and attempt to create and populate corresponding concepts in TM1.  This is as far as I’ve gotten.  The most popular XBRL processor seems to be Arelle, an open-source implementation written in Python.  This was my first exposure to Python, and though I’m sure this has been said before, my initial impression is:

  • It would be really easy to bang out a small project in Python.
  • Attempting to work with someone else’s project, particularly in the context of a complex object model seems much more difficult.

The reason is that Python uses dynamic typing, meaning that auto-complete within an IDE is not available for anything other than static members.  For serious programming neophytes, the distinction looks like this: a Python IDE will auto-complete a statement like this (note this is not actual Python, just an example):

File.ReadContents("test.txt")

since ReadContents is a static method of the File class.  However, the second line of this code will be a problem:


myFile = File.Open("test.txt")

myFile.Append("blah")

Since myFile has a dynamic type, it doesn’t have any “methods” until run time.

This thread on Stack Overflow is rife with people claiming support for auto-complete, but as far as I can tell they are referring to base or imported modules, which are well known, so an IDE can come pre-loaded with some kind of knowledge of the “types” they provide.  Auto-complete on your own code (or someone else’s .py files, which is the critical thing here) seems to be missing.  So again we come back to:

  • Fast to write your own code, because you know your own types.
  • When working with someone else’s code directly it becomes much tougher because you have to actually inspect the .py files or read through an API reference.

This makes me wonder why Arelle went with Python.  I suspect they wanted to create XBRL tools first (quickly), and only later thought about exposing an object model.  Maybe I’m way off here, but it seems something like Java or C# would have been a better tool for creating a complex object model that is meant for others to use.


Farewell tedp.net

I used to maintain a blog/forum/wall type thing at www.tedp.net.  It started when I was an undergraduate at the University of Toronto Scarborough.  Each student was assigned something like 10 MB of web space to play with.  I used Perl and CGI to create a blog, and eventually created a very rudimentary user management system so that my classmates could post as well.  They could even submit a profile photo and attach pictures to their posts.  The page “went live” on March 21, 2002 at 1:25 p.m., according to the HTML I wrote at the time.  I suppose if I’d had any entrepreneurial ambition at the time I could have beaten Facebook to the punch.  I eventually moved it to paid hosting when 10 MB became a pretty impractical limit.

I decided recently to take the site down, mainly because it hasn’t been actively updated in years and I didn’t want to pay for the hosting anymore.  But I thought that I should maintain some record of it, as my classmates and I were pretty active in posting there for a few years.  I wrote a small C# program that would connect to the MySQL database that backed the site, download all the post information, and output an HTML document containing each post as it would have been rendered on the site.  A typical post looked something like this:

The HTML document I created was simply hundreds of these stacked on top of one another.  All good so far.  However, I also wanted to print the document to PDF, where page breaks suddenly became an issue.  Simply using Chrome’s “Save as PDF” functionality would result in documents that look like this:

[Screenshot: a PDF page break falling in the middle of a post]

Notice how the break occurs in the middle of one “post”.  My first stab at fixing this used the CSS page-break-after property on a div placed between each post, like so:

<div style="page-break-after: always;"></div>

This was slightly better, but now each post was on its own page. What I wanted was to suggest that a page break should occur there.  It turns out CSS doesn’t have a way to suggest a page break, but it does have a way to do the opposite: suggest against one.  I simply wrapped each post in a div with the page-break-inside property set as follows:

<div style="page-break-inside: avoid;">
...post goes here...
</div>

Result!

I also ran into some problems with Unicode characters not rendering properly. I think because in the early days I used to edit my posts in Word and then copy/paste them into the website, I ended up with many characters from the Windows character set (Windows-1252) in the posts.  They rendered with varying degrees of success but looked terrible in Chrome.  I had problems with left and right single and double quotes, long dashes and upside-down question marks.  The reason for the last one is that for a time there was a fad amongst my group of friends for using upside-down question marks instead of normal ones on instant messaging and websites like tedp.net.  Chris Sorensen started it all.  This is the code I used to strip them out:

// smart single and double quotes -> plain ASCII quotes
postText = postText.Replace("\u0091", "\u0027"); // left single quote  -> '
postText = postText.Replace("\u0092", "\u0027"); // right single quote -> '
postText = postText.Replace("\u0093", "\u0022"); // left double quote  -> "
postText = postText.Replace("\u0094", "\u0022"); // right double quote -> "

// en dash and em dash -> hyphen
postText = postText.Replace("\u0096", "\u002D");
postText = postText.Replace("\u0097", "\u002D");

// inverted question mark -> ?
postText = postText.Replace("\u00BF", "\u003F");

The full PDF can be found here.  It’s in chronological order, the opposite of how it would have been presented on the website, but that makes it easier to read as a document.  At first it’s just me; others join in after a few pages.
