Unzip a File to Memory in Scala

Recently I had some trouble unzipping a file in Scala.  After combing through many different Stack Overflow threads I eventually worked out this solution:


import java.io.ByteArrayOutputStream
import java.util.zip.ZipFile
import org.apache.commons.io.IOUtils
import scala.collection.JavaConverters._

def unzip(path: String): Map[String, Array[Byte]] = {
  val zipFile = new ZipFile(path)
  try {
    zipFile.entries().asScala
      .filter(!_.isDirectory)
      .map { entry =>
        val in = zipFile.getInputStream(entry)
        val out = new ByteArrayOutputStream
        IOUtils.copy(in, out)
        IOUtils.closeQuietly(in)
        IOUtils.closeQuietly(out)
        entry.getName -> out.toByteArray
      }
      .toMap // toMap forces the lazy iterator before the ZipFile is closed
  } finally {
    zipFile.close()
  }
}


Podcast

My good friend and colleague CSJ and I started recording some of our conversations, mostly as an experiment to see how hard podcasting is.  We decided to actually start publishing the material as a podcast.  Right now no one else has listened to it, but who knows, one day there may be literally dozens of listeners.

https://soundcloud.com/twodeetee


Merged Cells and Formatted Text in SpreadsheetApp

Google Apps Script is a really nice way to interact with Google Docs programmatically.  You can write add-ons or stand-alone scripts to read or modify documents, with access to a DOM, which is a much nicer way to do things than attempting to use a REST endpoint or similar.  The only problem is that while the DOM/object model is reasonably complete, it isn’t entirely so.  As of this writing there is no way, when dealing with spreadsheets, to determine whether a cell is merged or, more frustratingly, to obtain the formatted contents of a cell.  We really needed this information recently, so we decided to parse it out of the HTML you get when exporting the document.  This can be done in Apps Script itself.

You can see the gist here.

Basically we build up an associative array, where for a given cell index you can get back the row span, column span and formatted value.


parsedHtml['Sheet1:A1'] = {index: 'Sheet1:A1', text: '123,456', rowspan: 2, colspan: 2}

If there is no entry for a given cell index then that cell is either covered by a merged range or outside the data range; the latter case should be known beforehand, since your script already knows the sheet’s data range.
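
As an illustration of how you might consume this structure, here’s a hypothetical helper (not part of the gist) that distinguishes the two cases:

// Hypothetical helper, not part of the gist: look up a cell in the
// parsed structure and distinguish "anchor cell" from "covered/outside".
function getCellInfo(parsedHtml, cellIndex) {
  var entry = parsedHtml[cellIndex];
  if (entry) {
    // Anchor cell of a (possibly merged) range
    return entry;
  }
  // No entry: covered by a merged range, or outside the data range
  return null;
}

var info = getCellInfo(parsedHtml, 'Sheet1:A1');
if (info) {
  Logger.log(info.text + ' spans ' + info.rowspan + ' row(s) and ' + info.colspan + ' column(s)');
}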

NOTE: this code requires underscore.js to work; you can read an article from Google about how to use it in Apps Script projects here.


View Adobe Swatch Exchange Files Online

The other day I was trying to do some web design according to IBM’s design guidelines.  They provide color swatches as .ase files, which are read by Adobe products like Illustrator, Photoshop, etc.  I have none of these applications, but I figured “hey, how hard can it be to parse the file myself?”

Two days and a fair bit of hair tearing later, witness palettes.mybluemix.net!

I later found out that IBM Design does provide a site that shows the same information, but by that time I was way too consumed with the problem.  That happens sometimes.  But hey, now I can look at any .ase file!  And so can you!
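
If you’re curious what the parsing involves: an .ase file is a binary stream of big-endian blocks. Here’s a minimal Node.js sketch of reading just the header, based on my understanding of the format; readAseHeader and the file name are illustrative, not a real API:

// Minimal sketch: read the header of an .ase file (all fields big-endian).
var fs = require('fs');

function readAseHeader(path) {
  var buf = fs.readFileSync(path);
  var signature = buf.toString('ascii', 0, 4);
  if (signature !== 'ASEF') {
    throw new Error('Not an ASE file');
  }
  return {
    versionMajor: buf.readUInt16BE(4),
    versionMinor: buf.readUInt16BE(6),
    blockCount: buf.readUInt32BE(8) // group and color blocks follow the header
  };
}

console.log(readAseHeader('ibm-colors.ase'));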


Why Does Apple’s Revenue Spike Every Fourth Quarter?

http://xbrl.mybluemix.net/show/b4fe1406-41f8-4cb7-a22a-efd1c86b4195

It could be, again, that I’m just not displaying the data the right way.  Or maybe the fourth quarter is more important from an analyst perspective, so Apple tries to “close” more revenue in Q4?

These are the things I find confusing about accounting, though it’s interesting to learn the reasons why things are the way they are.


Substitution Ciphers – Not Just for the Ancient Roman Military

Recall from a previous post that we were trying to get multiple “distinct” fields from a Cloudant search index.  We did this by concatenating the two fields during index creation.  Recall also that there was a serious drawback with this method: we were indexing both fields, rather than indexing one and getting both back.

This isn’t a problem if the two fields use strictly different character spaces, for example one containing only alphabetical characters and the other only numeric characters.  Even if this isn’t the case, we can force it to be so using a substitution cipher.  JSON documents are not limited to ASCII characters; the UTF-8 character space is quite large, and we can simply shift one of the fields to a totally different character set when creating the index.

Here’s how the search index would look (credit to Tim Severien’s cipher example on GitHub for the rotate function):

function rotateText(text, rotation) {
    // Surrogate pair limit
    var bound = 0x10000;

    // Force the rotation to an integer within bounds, just to be safe
    rotation = parseInt(rotation, 10) % bound;

    // Might as well return the text if there's no change
    if(rotation === 0) return text;

    // Create string from character codes
    return String.fromCharCode.apply(null,
        // Turn string to character codes
        text.split('').map(function(v) {
            // Return current character code + rotation
            return (v.charCodeAt() + rotation + bound) % bound;
        })
    );
}

function(doc){
	index("companyName", rotateText(doc.identifier, 500) + ' ' + doc.name, {"store": true, "facet":true});
}

We’re creating the index on the concatenated string consisting of the cipher-ed identifier and the original name. This means the pair [‘0000051143’, ‘INTERNATIONAL BUSINESS MACHINES CORP’] will become the string “ȤȤȤȤȤȩȥȥȨȧ INTERNATIONAL BUSINESS MACHINES CORP”.

From here we continue as before, using faceting to return a list of distinct results. It’s then simply a matter of reversing the cipher to get the identifier back (in this example rotate the text by -500). You can either make the very reasonable assumption that no one will enter characters from the cipher-ed character space into your search field, or not allow searches that contain them.
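
To make that concrete, here’s what decoding a value returned by the index might look like, reusing rotateText from above:

// Split a stored value back into its parts and reverse the cipher.
var stored = 'ȤȤȤȤȤȩȥȥȨȧ INTERNATIONAL BUSINESS MACHINES CORP';
var parts = stored.split(' ');
var identifier = rotateText(parts[0], -500);  // '0000051143'
var name = parts.slice(1).join(' ');          // 'INTERNATIONAL BUSINESS MACHINES CORP'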


Financial Concept Text Cloud

I’ve been playing with the IBM RAVE engine a bit today. I tried applying it to the data at xbrl.mybluemix.net. Here’s a text cloud showing the 100 most reported financial concepts. What does this mean? Nearly all filers are likely to report certain concepts, like Assets. Furthermore, a concept that is more highly dimensionalized has more data points.  For example, here’s General Motors’ breakout of StockholdersEquityIncludingPortionAttributableToNoncontrollingInterest.  Very highly dimensionalized.

[Image: conceptCloud, a text cloud of the 100 most reported financial concepts]

Unfortunately the query to generate this text cloud takes about 40 seconds to run, so it’s not feasible to have it live.


“We’ll Do It Live!” -or- Just Give Me Every Result

In the previous two posts we’ve looked at ways to get distinct results from a search index. Well, suppose neither of those approaches works for you and you’ve decided to group and/or sort yourself. As we discovered earlier, there is a limit of 200 results from any single call to a search index. All hope is not lost, though; let’s look at the actual JSON we get back when performing a call against a search index URI:

{
	"total_rows":980,
	"bookmark":"g1AAAAGneJzLYWBgYMtgTmGQT0lKzi9KdUhJMjTUy0zKNTCw1EvOyS9NScwr0ctLLckBKmRKZEiS____f1YGk5uD0sugB0CxJAahPbYgA-TgBhjj1J-UANJRDzfifWwD2Ah28To0NxjhNCOPBUgyNAApoDHzIebMzYKYI_bxDapTzAgYswBizH6IMV9OHwAbw6-cS6yPIMYcgBhzH2LMw9UQY9inPCDNmAcQY6CB870aakx6dRYA32qFdg",
	"rows":
	[
		{
			"id":"0bd8dab855b66e643350625a33d79b00",
			"order":[9.491182327270508,925635],
			"fields":
			{
				"default":"0000789019",
				"conceptNameSplit":"Deferred Revenue Revenue Recognized"
			}
		}
		... continues ...
	]
}

Notice the first field in the result, “total_rows”. Cloudant knows how many results there are, however it will only give them to you in batches of up to 200. Notice the second field in the result “bookmark”. In order to get the next n results, simply call the same URI with a “bookmark=” parameter, using the bookmark provided from the previous n results. Using the miracles of Node.js, streaming, and recursion, we’ll look at a simple method to first stream every result, then get a list of n distinct results. All code is shown in CoffeeScript, since I like it better than plain Javascript.

First we’ll need a stream that can extract the bookmark value and store it in a pass-by-reference variable.

stream = require('stream')

#Buffer an incoming stream until a "bookmark" field is found and store the value.
class exports.BookmarkExtractorStream extends stream.Transform

  #Ensure that @bookmark is an object with a value field so that it will
  #be passed by reference.  (e.g. myBookmark = {value:'blah'})
  constructor: (@bookmark) ->
    @buffer = ""
    @found = false
    super

  _transform: (chunk, enc, next) ->

    #If we haven't found what we're looking for, keep appending the incoming chunk to our buffer.
    #The bookmark field appears near the beginning of the JSON anyway, so we shouldn't
    #have to store very much in our buffer.
    if not @found
      @buffer += chunk.toString('utf8')
      match = @buffer.match(/\"bookmark\"\:\"([^\"]+)\"/)

      #We found it, remember that and throw out the buffer
      if match?
        @bookmark.value = match[1]
        @found = true
        @buffer = null

    @push chunk
    next()

Next we’ll need a stream that takes incoming objects and outputs them as a JSON array. This presumes we’re sending the results somewhere, like a response stream, a file, or the console.

stream = require('stream')

class exports.JSONArrayTransformStream extends stream.Transform

  constructor: () ->
    @first = true
    super
      objectMode: true

  _transform: (chunk, enc, next) ->

    if (@first)
      @push('[')
      @first  = false
    else
      @push(',\n')

    @push(JSON.stringify(chunk))

    next()

  _flush: (next) ->
    if (@first)
      @push('[]')
    else
      @push(']')
    next()

Finally we’ll use a recursive function to execute each API call in sequence, using the bookmark from the previous call. The JSONStream package is very useful here.

request = require('request')
JSONStream = require('JSONStream')
#Paths to the stream classes defined above are illustrative
{BookmarkExtractorStream} = require('./bookmark-extractor-stream')
{JSONArrayTransformStream} = require('./json-array-transform-stream')

objectJSONArrayStream = new JSONArrayTransformStream()

#Pipe the array stream to its destination once; every recursive call writes into it
objectJSONArrayStream.pipe(<send the output somewhere>)

recursiveCloudantSearch = (designDocUri, query, bookmark, callback) ->

  readUrl = "#{designDocUri}?q=#{query}&limit=200"
  oldBookmarkValue = bookmark.value
  if oldBookmarkValue?
    readUrl += "&bookmark=#{oldBookmarkValue}"
  request({url: readUrl})
  .pipe(new BookmarkExtractorStream(bookmark))
  .pipe(JSONStream.parse('rows.*.fields')).on('finish', ->
    if (bookmark.value isnt oldBookmarkValue)
      recursiveCloudantSearch(designDocUri, query, bookmark, callback)
    else
      callback()
  )
  #end: false keeps the shared array stream open for later recursive calls
  .pipe(objectJSONArrayStream, {end: false})

designDocUri = "http://myAccount.cloudant.com/myDb/_design/searches/_search/mySearch"
query = "query goes here"

recursiveCloudantSearch(designDocUri, query, {value:null}, ->
  #All pages fetched; end the array stream to flush the closing bracket
  objectJSONArrayStream.end()
  console.log('done')
)

If we want distinct results, we’re going to need a new stream that accumulates results before outputting at the very end.

stream = require('stream')

class exports.DistinctingTransformStream extends stream.Transform

  #@keySelector: the name of a field in the object that we wish to find distinct values of
  #@limit: stop after finding this many distinct values
  constructor: (@keySelector, @limit) ->
    @myArray = {}
    super
      objectMode: true

  _transform: (chunk, enc, next) ->

    if (Object.keys(@myArray).length < @limit)
      @myArray[chunk[@keySelector]] = chunk[@keySelector]

    next()

  _flush: (next) ->

    for k,v of @myArray
      @push(v)
    next()

Our recursive function now looks like this:

request = require('request')
JSONStream = require('JSONStream')
#Paths to the stream classes defined above are illustrative
{BookmarkExtractorStream} = require('./bookmark-extractor-stream')
{JSONArrayTransformStream} = require('./json-array-transform-stream')
{DistinctingTransformStream} = require('./distincting-transform-stream')

objectJSONArrayStream = new JSONArrayTransformStream()
distinctingTransformStream = new DistinctingTransformStream('<key property name>', 100)

#Assemble the tail of the pipeline once; every recursive call writes into it
distinctingTransformStream
.pipe(objectJSONArrayStream)
.pipe(<send the output somewhere>)

recursiveCloudantSearch = (designDocUri, query, bookmark, callback) ->

  readUrl = "#{designDocUri}?q=#{query}&limit=200"
  oldBookmarkValue = bookmark.value
  if oldBookmarkValue?
    readUrl += "&bookmark=#{oldBookmarkValue}"
  request({url: readUrl})
  .pipe(new BookmarkExtractorStream(bookmark))
  .pipe(JSONStream.parse('rows.*.fields')).on('finish', ->
    if (bookmark.value isnt oldBookmarkValue)
      recursiveCloudantSearch(designDocUri, query, bookmark, callback)
    else
      callback()
  )
  #end: false keeps the shared distincting stream open for later recursive calls
  .pipe(distinctingTransformStream, {end: false})

designDocUri = "http://myAccount.cloudant.com/myDb/_design/searches/_search/mySearch"
query = "query goes here"

recursiveCloudantSearch(designDocUri, query, {value:null}, ->
  #All pages fetched; ending the distincting stream flushes the distinct values
  distinctingTransformStream.end()
  console.log('done')
)

Multiple “Distinct” Fields from a Cloudant Search Index

In the previous post we were trying to use a search index to replicate the following SQL:

SELECT DISTINCT 
    concept 
FROM 
    facts 
WHERE 
    company='IBM' 
    AND
    concept LIKE '%Assets%'

We were able to do this thanks to faceting. Let’s now turn our attention to the other combo-box on xbrl.mybluemix.net. Here we need to search companies, not financial concepts. Why not just use faceting again? There’s a fly in the ointment: a company’s name may not be unique, and what we’re interested in is a distinct list of id/name pairs. In SQL this would mean:

SELECT DISTINCT 
    identifier, 
    name 
FROM 
    companies 
WHERE 
    name LIKE '%International%'

Faceting won’t help here. It can give us a distinct list of identifiers and a distinct list of names, but not id/name pairs.

One solution is to simply concatenate the two fields when creating the index. Recall from this post that the fields being indexed need not actually exist, they can be created at run-time. Such a search index would look something like this:

function(doc){
    index("identifierNameConcat", 
        doc.identifier + ' :: ' + doc.name, 
        {"store": true, "facet": true});
}

Whatever separator we decide to use, it obviously must not appear in either the “id” or “name” field. Note the spaces in the separator; we don’t want “0000051143::International” to be a single token (we want “0000051143 :: International Business Machines”). We can now use a query URL like this:

http://account.cloudant.com/companies/_design/companiesSearchIndexes/_search/nameSearch?q=identifierNameConcat:International&counts=%5B"identifierNameConcat"%5D&limit=0
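
Splitting the returned facet values back into pairs is then straightforward. A minimal sketch, assuming the response has been parsed into a variable named body and that the facet counts come back under a counts object keyed by field name:

// Split the concatenated facet values back into id/name pairs.
// 'body' is assumed to hold the parsed JSON response of the query above.
var pairs = Object.keys(body.counts.identifierNameConcat).map(function(concatenated) {
    var parts = concatenated.split(' :: ');
    return {identifier: parts[0], name: parts[1]};
});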

There is a serious drawback with this method. The two fields need to be very different, ideally something like strictly numeric and strictly alphabetic. Since we are technically indexing both id and name, the results of our query will contain documents with the query string appearing in either field. For example, searching for “000*” will return “0000051143 :: International Business Machines”. This is probably not what we intended. It goes the other way as well: if we were trying to find a company whose name started with “100”, we’d get back every company that had an identifier beginning with “100”. In SQL terms, what we actually ended up creating was something like this:

SELECT DISTINCT 
    identifier, 
    name 
FROM 
    companies 
WHERE
    identifier LIKE '%International%'
    OR
    name LIKE '%International%'

Oops. This is serious enough that this method is probably unusable in most cases. I present it here in case someone does find it both usable and useful.
