Substitution Ciphers – Not Just for the Ancient Roman Military

Recall from a previous post that we were trying to get multiple “distinct” fields from a Cloudant search index.  We did this by concatenating the two fields during index creation.  Recall also that there was a serious drawback to this method: we were indexing both fields, rather than indexing one and getting both back.

This isn’t a problem if the two fields use strictly different character spaces.  For example, one contains only alphabetical characters and the other only numeric characters.  Even if this isn’t the case, we can force it to be so using a substitution cipher.  JSON documents are not limited to ASCII characters; the Unicode character space is quite large, and we can simply shift one of the fields to a totally different character set when creating the index.

Here’s how the search index would look (credit to Tim Severien’s cipher example on GitHub for the rotate function):

function rotateText(text, rotation) {
    // Surrogate pair limit
    var bound = 0x10000;

    // Force the rotation to be an integer within bounds, just to be safe
    rotation = parseInt(rotation, 10) % bound;

    // Might as well return the text if there's no change
    if(rotation === 0) return text;

    // Create string from character codes
    return String.fromCharCode.apply(null,
        // Turn string to character codes
        text.split('').map(function(v) {
            // Return current character code + rotation
            return (v.charCodeAt() + rotation + bound) % bound;
        })
    );
}

function(doc){
	index("companyName", rotateText(doc.identifier, 500) + ' ' + doc.name, {"store": true, "facet":true});
}

We’re creating the index on the concatenated string consisting of the ciphered identifier and the original name. This means the pair [‘0000051143’, ‘INTERNATIONAL BUSINESS MACHINES CORP’] will become the string “ȤȤȤȤȤȩȥȥȨȧ INTERNATIONAL BUSINESS MACHINES CORP”.

From here we continue as before, using faceting to return a list of distinct results. It’s then simply a matter of reversing the cipher to get the identifier back (in this example, rotating the text by -500). You can either make the very reasonable assumption that no one will enter characters from the ciphered character space into your search field, or disallow searches that contain them.
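
To sketch the round trip, here’s the rotate function applied in both directions (the function is repeated from above so the snippet runs standalone):

```javascript
function rotateText(text, rotation) {
    // Surrogate pair limit
    var bound = 0x10000;

    // Force the rotation to be an integer within bounds
    rotation = parseInt(rotation, 10) % bound;
    if (rotation === 0) return text;

    return String.fromCharCode.apply(null,
        text.split('').map(function(v) {
            return (v.charCodeAt() + rotation + bound) % bound;
        })
    );
}

// Cipher with +500, recover with -500
var ciphered = rotateText('0000051143', 500);   // "ȤȤȤȤȤȩȥȥȨȧ"
var recovered = rotateText(ciphered, -500);     // "0000051143"
```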

Financial Concept Text Cloud

I’ve been playing with the IBM RAVE engine a bit today. I tried applying it to the data at xbrl.mybluemix.net. Here’s a text cloud showing the 100 most reported financial concepts. What does this mean? Nearly all filers are likely to report certain concepts like Assets. Furthermore, a concept that is more highly dimensionalized has more data points.  For example, here’s General Motors’ breakout of StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest.  Very highly dimensionalized.

[Image: conceptCloud — text cloud of the 100 most reported financial concepts]

Unfortunately the query to generate this text cloud takes about 40 seconds to run, so it’s not feasible to have it live.

“We’ll Do It Live!” -or- Just Give Me Every Result

In the previous two posts we’ve looked at ways to get distinct results from a search index. Well, suppose neither of those approaches works for you and you’ve decided to group and/or sort yourself. As we discovered earlier, there is a limit of 200 results from any single call to a search index. All hope is not lost, though. Let’s look at the actual JSON we get back when performing a call against a search index URI:

{
	"total_rows":980,
	"bookmark":"g1AAAAGneJzLYWBgYMtgTmGQT0lKzi9KdUhJMjTUy0zKNTCw1EvOyS9NScwr0ctLLckBKmRKZEiS____f1YGk5uD0sugB0CxJAahPbYgA-TgBhjj1J-UANJRDzfifWwD2Ah28To0NxjhNCOPBUgyNAApoDHzIebMzYKYI_bxDapTzAgYswBizH6IMV9OHwAbw6-cS6yPIMYcgBhzH2LMw9UQY9inPCDNmAcQY6CB870aakx6dRYA32qFdg",
	"rows":
	[
		{
			"id":"0bd8dab855b66e643350625a33d79b00",
			"order":[9.491182327270508,925635],
			"fields":
			{
				"default":"0000789019",
				"conceptNameSplit":"Deferred Revenue Revenue Recognized"
			}
		}
		... continues ...
	]
}

Notice the first field in the result, “total_rows”. Cloudant knows how many results there are; however, it will only give them to you in batches of up to 200. Notice the second field in the result, “bookmark”. In order to get the next n results, simply call the same URI with a “bookmark=” parameter, using the bookmark provided with the previous n results. Using the miracles of Node.js, streaming, and recursion, we’ll look at a simple method to first stream every result, then get a list of n distinct results. All code is shown in CoffeeScript, since I like it better than plain JavaScript.
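The paging contract can be sketched as a tiny URL builder (the function name and parameters here are illustrative, not part of the Cloudant API):

```javascript
// Build the URL for one page of search results; pass the bookmark
// from the previous response to get the next page.
function nextPageUrl(searchUri, query, bookmark) {
    var url = searchUri + '?q=' + encodeURIComponent(query) + '&limit=200';
    if (bookmark) {
        url += '&bookmark=' + bookmark;
    }
    return url;
}

nextPageUrl('https://myaccount.cloudant.com/db/_design/d/_search/s',
            'name:Assets', 'g1AAAAGn');
// → 'https://myaccount.cloudant.com/db/_design/d/_search/s?q=name%3AAssets&limit=200&bookmark=g1AAAAGn'
```

Keep calling with each new bookmark until the bookmark stops changing; at that point there are no further results.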

First we’ll need a stream that can extract the bookmark value and store it in a pass-by-reference variable.

stream = require('stream')

#Buffer an incoming stream until a "bookmark" field is found and store the value.
class exports.BookmarkExtractorStream extends stream.Transform

  #Ensure that @bookmark is an object with a value field so that it will
  #be passed by reference (e.g. myBookmark = {value: 'blah'}).
  constructor: (@bookmark) ->
    @buffer = ""
    @found = false
    super

  _transform: (chunk, enc, next) ->

    #If we haven't found what we're looking for, keep appending the incoming chunk to our buffer.
    #The bookmark field appears near the beginning of the JSON anyway, so we shouldn't
    #have to store very much in our buffer.
    if not @found
      @buffer += chunk.toString('utf8')
      match = @buffer.match(/\"bookmark\"\:\"([^\"]+)\"/)

      #We found it; remember that and throw out the buffer
      if match?
        @found = true
        @bookmark.value = match[1]
        @buffer = null

    @push chunk
    next()
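
The bookmark-matching regex at the heart of this stream can be checked in isolation against a response prefix:

```javascript
// The first chunk of a search response is usually enough to contain
// the bookmark field, since it appears near the beginning of the JSON.
var sample = '{"total_rows":980,"bookmark":"g1AAAAGneJzL","rows":[';
var match = sample.match(/\"bookmark\"\:\"([^\"]+)\"/);
// match[1] === 'g1AAAAGneJzL'
```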

Next we’ll need a stream that takes incoming objects and outputs them as a JSON array. This presumes we’re sending the results somewhere, like a response stream, a file, or the console.

stream = require('stream')

class exports.JSONArrayTransformStream extends stream.Transform

  constructor: () ->
    @first = true
    super
      objectMode: true

  _transform: (chunk, enc, next) ->

    if (@first)
      @push('[')
      @first  = false
    else
      @push(',\n')

    @push(JSON.stringify(chunk))

    next()

  _flush: (next) ->
    if (@first)
      @push('[]')
    else
      @push(']')
    next()

Finally we’ll use a recursive function to execute each API call in sequence, using the bookmark from the previous call. The JSONStream package is very useful here.

request = require('request')
JSONStream = require('JSONStream')
# Adjust these paths to wherever the stream classes above are defined
{BookmarkExtractorStream} = require('./bookmark-extractor-stream')
{JSONArrayTransformStream} = require('./json-array-transform-stream')
objectJSONArrayStream = new JSONArrayTransformStream()

recursiveCloudantSearch = (designDocUri, query, bookmark, callback) ->

  readUrl = "#{designDocUri}?q=#{query}&limit=200"
  oldBookmarkValue = bookmark.value
  if oldBookmarkValue?
    readUrl += "&bookmark=#{oldBookmarkValue}"
  request({url: readUrl})
  .pipe(new BookmarkExtractorStream(bookmark))
  .pipe(JSONStream.parse('rows.*.fields')).on('finish', ->
    if (bookmark.value isnt oldBookmarkValue)
      recursiveCloudantSearch(designDocUri, query, bookmark, callback)
    else
      callback()
  )
  .pipe(objectJSONArrayStream)
  .pipe(<send the output somewhere>)

designDocUri = "http://myAccount.cloudant.com/myDb/_design/searches/_search/mySearch"
query = "query goes here"

recursiveCloudantSearch(designDocUri, query, {value:null}, ->
  console.log('done')
)

If we want distinct results, we’re going to need a new stream that accumulates results before outputting at the very end.

stream = require('stream')

class exports.DistinctingTransformStream extends stream.Transform

  #@keySelector: the name of a field in the object that we wish to find distinct values of
  #@limit: stop after finding this many distinct values
  constructor: (@keySelector, @limit) ->
    @myArray = {}
    super
      objectMode: true

  _transform: (chunk, enc, next) ->

    if (Object.keys(@myArray).length < @limit)
      @myArray[chunk[@keySelector]] = chunk[@keySelector]

    next()

  _flush: (next) ->

    for k,v of @myArray
      @push(v)
    next()
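
Stripped of the stream plumbing, the accumulate-then-flush logic is just this (a plain-JavaScript sketch with hypothetical names):

```javascript
// Keep the first `limit` distinct values of `key` seen in `rows`.
function distinctByKey(rows, key, limit) {
    var seen = {};
    rows.forEach(function(row) {
        // Once the limit is reached, further values are dropped,
        // mirroring the stream's behaviour.
        if (Object.keys(seen).length < limit) {
            seen[row[key]] = row[key];
        }
    });
    return Object.keys(seen).map(function(k) { return seen[k]; });
}

distinctByKey(
    [{concept: 'Assets'}, {concept: 'Assets'}, {concept: 'CurrentAssets'}],
    'concept', 100);
// → ['Assets', 'CurrentAssets']
```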

Our recursive function now looks like this:

request = require('request')
JSONStream = require('JSONStream')
# Adjust these paths to wherever the stream classes above are defined
{BookmarkExtractorStream} = require('./bookmark-extractor-stream')
{JSONArrayTransformStream} = require('./json-array-transform-stream')
{DistinctingTransformStream} = require('./distincting-transform-stream')
objectJSONArrayStream = new JSONArrayTransformStream()
distinctingTransformStream = new DistinctingTransformStream('<key property name>', 100)

recursiveCloudantSearch = (designDocUri, query, bookmark, callback) ->

  readUrl = "#{designDocUri}?q=#{query}&limit=200"
  oldBookmarkValue = bookmark.value
  if oldBookmarkValue?
    readUrl += "&bookmark=#{oldBookmarkValue}"
  request({url: readUrl})
  .pipe(new BookmarkExtractorStream(bookmark))
  .pipe(JSONStream.parse('rows.*.fields')).on('finish', ->
    if (bookmark.value isnt oldBookmarkValue)
      recursiveCloudantSearch(designDocUri, query, bookmark, callback)
    else
      callback()
  )
  .pipe(distinctingTransformStream)
  .pipe(objectJSONArrayStream)
  .pipe(<send the output somewhere>)

designDocUri = "http://myAccount.cloudant.com/myDb/_design/searches/_search/mySearch"
query = "query goes here"

recursiveCloudantSearch(designDocUri, query, {value:null}, ->
  console.log('done')
)

Multiple “Distinct” Fields from a Cloudant Search Index

In the previous post we were trying to use a search index to replicate the following SQL:

SELECT DISTINCT 
    concept 
FROM 
    facts 
WHERE 
    company='IBM' 
    AND
    concept LIKE '%Assets%'

We were able to do this thanks to faceting. Let’s now turn our attention to the other combo-box on xbrl.mybluemix.net. Here we need to search companies, not financial concepts. Why not just use faceting again? There’s a fly in the ointment: a company’s name may not be unique. What we’re interested in is a distinct list of id/name pairs. In SQL this would mean:

SELECT DISTINCT 
    identifier, 
    name 
FROM 
    companies 
WHERE 
    name LIKE '%International%'

Faceting won’t help here. It can give us a distinct list of identifiers and a distinct list of names, but not id/name pairs.

One solution is to simply concatenate the two fields when creating the index. Recall from this post that the fields being indexed need not actually exist; they can be created at run-time. Such a search index would look something like this:

function(doc){
    index("identifierNameConcat", 
        doc.identifier + ' :: ' + doc.name, 
        {"store": true, "facet": true});
}

Whatever separator we decide to use, it obviously must not appear in either the “id” or “name” field. Note the spaces in the separator: we don’t want “0000051143::International” to be a single token (we want “0000051143 :: International Business Machines”). We can now use a query URL like this:

http://account.cloudant.com/companies/_design/companiesSearchIndexes/_search/nameSearch?q=identifierNameConcat:International&counts=%5B"identifierNameConcat"%5D&limit=0
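
On the client side, each returned facet value can be split back into its id/name pair (splitPair is a hypothetical helper for illustration, not part of the index):

```javascript
// Split a stored "identifier :: name" value back into its parts.
function splitPair(facetValue) {
    var i = facetValue.indexOf(' :: ');
    return {
        identifier: facetValue.slice(0, i),
        name: facetValue.slice(i + 4)
    };
}

splitPair('0000051143 :: International Business Machines');
// → { identifier: '0000051143', name: 'International Business Machines' }
```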

There is a serious drawback with this method. The two fields need to be very different, ideally something like strictly numeric and strictly alphabetic. Since we are technically indexing both id and name, the results of our query will contain documents with the query string appearing in either. For example, searching for “000*” will return “0000051143 :: International Business Machines”. This is probably not what we intended. It goes the other way as well: if we were trying to find a company whose name started with “100”, we’d get back every company whose identifier begins with “100”. In SQL terms, what we actually ended up creating was something like this:

SELECT DISTINCT 
    identifier, 
    name 
FROM 
    companies 
WHERE
    identifier LIKE '%International%'
    OR
    name LIKE '%International%'

Oops. This is serious enough that this method is probably unusable in most cases. I present it here in case someone does find it both usable and useful.

“Distinct” Results from a Cloudant Search Index

Search indexes are a really nice feature of Cloudant.  I’d go so far as to say they’re essential, without them what many would consider routine database tasks would be much more complicated.  Let’s look at a simplified set of the data available at xbrl.mybluemix.net.  IBM reporting some financial information over time:

{ "date": "03/31/2010", "company": "IBM", 
"concept":"Assets", "value": 100.00}
{ "date": "06/30/2010", "company": "IBM", 
"concept":"Assets", "value": 110.00}

...repeats...

{ "date": "12/31/2014", "company": "IBM", 
"concept":"Assets", "value": 200.00}
{ "date": "03/31/2010", "company": "IBM", 
"concept":"Current Assets", "value": 50.00}

...continues

Nothing too complicated so far.  Let’s create a search index on the concept field; it would look like this:

function(doc){
	index("default", 
		doc.company, 
		{"store": true});
	index("concept", 
		doc.concept, 
		{"store": true});
}

So far everything looks good.  In order to populate the second auto-complete combo-box on the main page of xbrl.mybluemix.net we need to search for all the concepts that a company filed that contain a certain string. Let’s suppose we’re looking for all concepts that IBM filed containing the text “Assets”, limited to 10 results. The query URI would look something like this:

https://myaccount.cloudant.com/facts/_design/factsSearches/_search/nameSearch?q="IBM"%20AND%20concept:Assets&limit=10

The actual JSON for the results isn’t important yet, what’s important is the response from this URI would contain the following results:

{ "date": "03/31/2010", "company": "IBM", 
"concept":"Assets", "value": 100.00}
{ "date": "06/30/2010", "company": "IBM", 
"concept":"Assets", "value": 110.00}

... repeats ...

{ "date": "06/30/2012", "company": "IBM", 
"concept":"Assets", "value": 150.00}

“Current Assets” is not there! Nor should it be; the search index did exactly what it was asked to do. What we actually wanted was to find distinct concepts with “Assets” in them. In SQL terms we were looking for:

SELECT DISTINCT 
    concept 
FROM 
    facts 
WHERE 
    company='IBM' 
    AND
    concept LIKE '%Assets%'

A naive solution would be to increase the limit, say to 100, then simply process the list on the client or server, keeping only distinct results, before populating the combo-box.  This would work for our simple example, but suppose IBM reported its Assets 100 times, or 1000 times?  Cloudant imposes a maximum limit of 200 results.  If IBM had reported “Assets” 1000 times (note that this is a bit of a contrived example) we’d be out of luck.

Since we only need a list of distinct values of a single field (concept) we can use the counts facet. Simply rebuild the index like so:

function(doc){
	index("company", 
		doc.company, 
		{"store": true});
	index("concept", 
		doc.concept, 
		{"store": true, "facet": true});
}

Now the query URI looks like this:

https://myaccount.cloudant.com/facts/_design/factsSearches/_search/nameSearch?q="IBM"%20AND%20concept:Assets&limit=0&counts=%5B"concept"%5D

We use limit=0 since we aren’t really concerned with the actual documents, just the list of distinct values; counts=[“concept”] gives us this. The JSON resulting from this query would look something like:

{"total_rows":17,"bookmark":"g2o","rows":[],"counts":{"concept":{"Assets":16,"CurrentAssets":1}}}

That final associative array is what we’re after: a list of permutations of the concept field containing “Assets” and how many times each permutation appears.
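
Pulling the distinct values out of the counts object is then a one-liner; in this sketch the facet field is called concept, matching the index definition above:

```javascript
// The shape of the "counts" portion of the search response.
var response = {
    total_rows: 17,
    rows: [],
    counts: { concept: { 'Assets': 16, 'CurrentAssets': 1 } }
};

// The distinct values are just the keys of the inner object;
// the values are the occurrence counts.
var concepts = Object.keys(response.counts.concept);
// → ['Assets', 'CurrentAssets']
```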

Caveat: There is no “limit=x” option when working with the counts facet. So if the query were to be something like all documents where the concept contains “e” the query could take a very long time and return an extremely long list of counts.

A Cloudant “Tip”: Separate Design Documents for Each View

This is not an official “best practice” or anything, just a habit I fell into while working with Cloudant and large data sets.

To very quickly review, there are no “queries” as such when working with a Cloudant database.  Instead one can create secondary indexes (“views”) or search indexes against a database.  Both are functions written in JavaScript that are executed against every document in a database, with the results stored.  This makes getting results from large data sets blazing fast, as all possible results are pre-calculated, but creating the index can be very slow if there is a lot of data or the index function is complex.

Each view is part of a design document, and each design document can contain any number of views. As we can see from the documentation:

View indexes are rebuilt when a design document is updated. An update to any one of the views causes all the views in the document to be rebuilt.

That second sentence is the important bit. xbrl.mybluemix.net as of the time of this post has just under 70 million documents in its main database.  Building a simple secondary index against this data usually takes about 72 hours.  If you have an application depending on such a view it will either show very incomplete data, or no data at all, during this time.

There are other articles out there that say that each design document conceptually represents an “application” and each view a part of that application.  That sounds nice, but if you’ve got hundreds of gigabytes or terabytes of data it means that if you want to change one part of your application, or even just add to it, you’re going to bring the whole thing down for days.  Even if it’s just dev and not prod, that’s really annoying.

So whether or not it’s an official “best practice”, I’ve fallen into the habit of having as few views per design document as possible, usually one or two.  It may not be as pretty, but if I want to change one part of an application without taking down all the other parts it’s really nice.

CamelCase Tokenizer For Cloudant Search Indexes

This post will be part of a series of things I learned while building xbrl.mybluemix.net.  Some things may sound very simple to Node.js or Cloudant gurus, but as I started with zero knowledge of either these were things I had to learn the hard way.  Often with Google searches producing no meaningful or simple example solution.  So here we go…

When creating a search index over a Cloudant database a number of analyzers are available for parsing/tokenizing full text data.  When building xbrl.mybluemix.net the field I wanted to search on was camel case strings like SalesRevenueNet.

There is no built-in camel case tokenizer in Cloudant.  It’s possible to build your own Lucene tokenizer for camel case strings, but this is of no use when using Cloudant.  Luckily the flexibility of creating your search index as a JavaScript function allows you to tokenize pretty much any way you want: simply apply your own tokenizer when creating the index.

function unCamelCase (str){
    return str
        // insert a space between lower & upper
        .replace(/([a-z])([A-Z])/g, '$1 $2')
        // space before last upper in a sequence followed by lower
        .replace(/\b([A-Z]+)([A-Z])([a-z])/g, '$1 $2$3')
        // uppercase the first character
        .replace(/^./, function(str){ return str.toUpperCase(); });
}

function(doc){
	index("name", unCamelCase(doc.name), {"store": true});
}

It’s that easy: you don’t have to index a field that already exists, you can create one at “run time”. For those with keen eyes, I realize the above example doesn’t take numbers into account, but if that’s a concern it’s left as an exercise to the reader. Credit to this StackOverflow thread for providing a JavaScript example of a camel case tokenizer.
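
As a quick sanity check of the tokenizer (the function is repeated from above so the snippet runs standalone):

```javascript
function unCamelCase(str) {
    return str
        // insert a space between lower & upper
        .replace(/([a-z])([A-Z])/g, '$1 $2')
        // space before last upper in a sequence followed by lower
        .replace(/\b([A-Z]+)([A-Z])([a-z])/, '$1 $2$3')
        // uppercase the first character
        .replace(/^./, function(s) { return s.toUpperCase(); });
}

unCamelCase('SalesRevenueNet');   // → 'Sales Revenue Net'
unCamelCase('XBRLFiling');        // → 'XBRL Filing'
```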

My Mistake or Tesla’s

Take a look at the following graph:

http://xbrl.mybluemix.net/show/5bb9826826809a63184d9438ff8b66b2

We can see right away that something weird may be going on: there are two values at exactly the same time period. DeferredTaxAssetsNet can’t be two different values at the same time, can it? Using the “Show XML” link on the two data points shows that Tesla’s 2012 10-K had the following:

<context id="eol_PE665410--1210-K0016_STD_0_20121231_0">
	<entity>
		<identifier scheme="http://www.sec.gov/CIK">
			0001318605
		</identifier>
	</entity>
	<period>
		<instant>
			2012-12-31
		</instant>
	</period>
</context>

<us-gaap:DeferredTaxAssetsNet 
contextRef="eol_PE665410--1210-K0016_STD_0_20121231_0" 
unitRef="iso4217_USD" 
decimals="-3" 
id="id_241467_82B22C37-AFA1-40BD-A569-D2FEA93C0951_1_8">
	387997000
</us-gaap:DeferredTaxAssetsNet>

Following on that, Tesla’s 2013 10-K had this instead:

<context id="eol_PE665410--1310-K0010_STD_0_20121231_0">
	<entity>
		<identifier scheme="http://www.sec.gov/CIK">
			0001318605
		</identifier>
	</entity>
	<period>
		<instant>
			2012-12-31
		</instant>
	</period>
</context>

<us-gaap:DeferredTaxAssetsNet 
contextRef="eol_PE665410--1310-K0010_STD_0_20121231_0" 
unitRef="iso4217_USD" 
decimals="-3" 
id="id_4522937_93BCB9A5-8733-4C68-82F8-C674BDB934EE_2_12">
16527000
</us-gaap:DeferredTaxAssetsNet>

It’s important to note that the first filing uses US-GAAP 2012, while the latter uses US-GAAP 2013. Is this the reason? Maybe DeferredTaxAssetsNet is calculated differently in the two systems? I’m not enough of an accountant to know. On the other hand maybe Tesla made some kind of mistake in reporting the value for the previous year when they were filing their 2013 10-K. On the other other hand maybe this data isn’t meant to be time contiguous and it doesn’t make sense to display it this way?

Anyone who knows please chime in.

XBRL on IBM Bluemix

Note:  This post makes reference to my employment activities at IBM, please see the Disclaimer.

Something I’ve had going in the background for a little while is using Node.js to stream all the XBRL filing data off the SEC website, store it in a MongoDB instance, and create a web UI to display it. I’ve processed every filing ever locally, but hosting the data has proven a little more restrictive; I only have space for the Dow 30. I’m using IBM Bluemix to host the Node.js “UI” portion, and the database-building portion is running on a virtual machine that I use inside IBM.

I’ll post more about my experiences with Node.js, XBRL and Mongo a little later, but for now you can play with the app here:

http://xbrl.mybluemix.net

CDM Web

Note:  This post makes reference to my employment activities at IBM, please see the Disclaimer.

In my professional capacity I spend most of my time working in product management at IBM on a product called IBM Cognos Disclosure Management.  Long story short, it’s a Windows application that embeds Microsoft Office and allows users to collaborate on creating large documents by breaking the document down into “objects” with security and workflow.  It also connects to various data sources and can embed/refresh the data in said objects.  Since it integrates with desktop Office, there’s always been talk of somehow creating a browser version.  When Microsoft released OWA (Office Web Apps) Server, I thought I’d play around with it.

Note that OWA Server is a totally stand-alone, redistributable instance of the exact same technology that runs Office 365 (or whatever Microsoft is calling it these days).  Everything is hosted locally; no content flows through Microsoft servers at any point.

Step One: Installing OWA Server

This took a fair bit of time just to provision the environment.

  • Must be installed on Windows Server (I used 2012)
  • Server must be joined to a domain (no idea why)
  • Nothing else “server-y” may be installed like SQL Server.

Anyway, I used these two documents to get me there:

Plan Office Web Apps Server

Deploy Office Web Apps Server

Step Two: Open a Document

Not as easy as it sounds.  Actually installing the server doesn’t get you anything.  There’s no URL to navigate to where you can just open a file, you have to implement a WOPI (Web Application Open Platform Interface) web service to provide information about documents that OWA can actually open.  Fine, I used these two documents to get me most of the way there:

Introducing WOPI

Building an Office Web Apps (OWA) WOPI Host

If I remember rightly this didn’t actually get me all the way there.  I had to do a fair bit of debugging on the OWA server, this article describes where the logs are.  They are very verbose, so I just searched for “error” and “exception”. Also you’ll need to understand what ASP.NET Web API is.

Step Three: Integrate with CDM

I decided to write an MVC 4 web application that would simply talk to the CDM application server on its existing web service endpoints.  Since CDM already has a desktop client, I thought this should look like a mobile app; luckily MVC 4 has some really nice and easy routes to building one.

The toughest part for me was actually coming to grips with MVC’s “fuzzy” routing.  I found this routing debugger to be insanely helpful.  Remember, the order in which you set up your routes matters!  That one caused me to tear my hair out for a few hours.

Step Four: Profit?

Here’s a quick video of the app in action, running inside Chrome emulating a Nexus 5.
