CamelCase Tokenizer For Cloudant Search Indexes

This post is part of a series on things I learned while building xbrl.mybluemix.net.  Some of them may sound very simple to Node.js or Cloudant gurus, but I started with zero knowledge of either, so these were things I had to learn the hard way, often because Google searches produced no simple, meaningful example.  So here we go…

When you create a search index over a Cloudant database, a number of analyzers are available for parsing and tokenizing full-text data.  For xbrl.mybluemix.net, the field I wanted to search on held camel case strings such as SalesRevenueNet.

There is no built-in camel case tokenizer in Cloudant, so none of the available analyzers will split a string like SalesRevenueNet into its component words.  It’s possible to build your own Lucene.Net tokenizer for camel case strings, but that is of no use with Cloudant.  Luckily, because the search index is defined as a JavaScript function, you can tokenize pretty much any way you want: simply apply your own tokenizer to the value before you index it.

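function (doc) {
    // The helper is defined inside the index function because the
    // design document stores the index as a single function.
    function unCamelCase(str) {
        return str
            // insert a space between lower & upper
            .replace(/([a-z])([A-Z])/g, '$1 $2')
            // space before last upper in a sequence followed by lower
            // (global flag so every acronym in the string is handled)
            .replace(/\b([A-Z]+)([A-Z])([a-z])/g, '$1 $2$3')
            // uppercase the first character
            .replace(/^./, function (first) { return first.toUpperCase(); });
    }

    // skip documents that have no string "name" field
    if (typeof doc.name === "string") {
        index("name", unCamelCase(doc.name), {"store": true});
    }
}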
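
For anyone wondering where this function actually lives: as far as I understand it, Cloudant stores the index function as a string inside a design document, next to the analyzer setting. Below is a minimal sketch of building such a document in Node.js. The design document ID `_design/concepts`, the index name `byName` and the database name `xbrl` are made up for illustration, and picking the `standard` analyzer is my assumption, not a recommendation.

```javascript
// Index function with the helper nested inside, so that it
// serializes to one self-contained function.
// `index` is supplied by Cloudant when the function runs there.
const indexByName = function (doc) {
    function unCamelCase(str) {
        return str
            .replace(/([a-z])([A-Z])/g, '$1 $2')
            .replace(/\b([A-Z]+)([A-Z])([a-z])/g, '$1 $2$3')
            .replace(/^./, function (first) { return first.toUpperCase(); });
    }
    if (typeof doc.name === 'string') {
        index('name', unCamelCase(doc.name), { store: true });
    }
};

// Hypothetical design document: the names here are illustrative only.
const designDoc = {
    _id: '_design/concepts',
    indexes: {
        byName: {
            analyzer: 'standard',           // assumed analyzer choice
            index: indexByName.toString()   // stored as a string
        }
    }
};

// Path for querying the index once the design document is saved
// (the database name "xbrl" is an assumption).
const searchPath = '/xbrl/_design/concepts/_search/byName?q=' +
    encodeURIComponent('name:revenue');

console.log(JSON.stringify(designDoc, null, 2));
console.log(searchPath);
```

Saving `designDoc` to the database and then issuing a GET against `searchPath` should return documents whose un-camel-cased name contains the word "revenue".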

It’s that easy. You don’t have to index on a field that already exists in the document; you can create one at “run time”. Those with keen eyes will notice that the above example doesn’t take numbers into account; if that’s a concern, it’s left as an exercise for the reader. Credit to this Stack Overflow thread for providing a JavaScript example of a camel case tokenizer.
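
For anyone who wants a head start on that exercise, here is one possible sketch of a number-aware variant. The function name `unCamelCaseWithDigits` is hypothetical, and treating every run of digits as its own word is an assumption about what you want to search for, so adjust the rules to your own data.

```javascript
// Sketch: split camel case strings and also separate digit runs.
function unCamelCaseWithDigits(str) {
    return str
        // space between a lowercase letter or digit and a following uppercase letter
        .replace(/([a-z0-9])([A-Z])/g, '$1 $2')
        // space before the last uppercase letter of an acronym followed by lowercase
        .replace(/([A-Z]+)([A-Z][a-z])/g, '$1 $2')
        // space between a letter and a following digit
        .replace(/([A-Za-z])([0-9])/g, '$1 $2')
        // uppercase the first character
        .replace(/^./, function (first) { return first.toUpperCase(); });
}

console.log(unCamelCaseWithDigits('SalesRevenueNet')); // Sales Revenue Net
console.log(unCamelCaseWithDigits('Level3Assets'));    // Level 3 Assets
console.log(unCamelCaseWithDigits('XMLHttpRequest'));  // XML Http Request
```

To use it in a search index, nest it inside the index function in place of the original helper.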
