Multiple “Distinct” Fields from a Cloudant Search Index

In the previous post we were trying to use a search index to replicate the following SQL:

SELECT DISTINCT 
    concept 
FROM 
    facts 
WHERE 
    company='IBM' 
    AND
    concept LIKE '%Assets%'

We were able to do this thanks to faceting. Let’s now turn our attention to the other combo-box on xbrl.mybluemix.net. Here we need to search companies, not financial concepts. Why not just use faceting again? There’s a fly in the ointment, a company’s name may not be unique, what we’re interested in is a distinct list of id/name pairs. In SQL this would mean:

SELECT DISTINCT 
    identifier, 
    name 
FROM 
    companies 
WHERE 
    name LIKE '%International%'

Faceting won’t help here. It can give us a distinct list of identifiers and a distinct list of names, but not id/name pairs.

One solution is to simply concatenate the two fields when creating the index. Recall from this post that the fields being indexed need not actually exist, they can be created at run-time. Such a search index would look something like this:

function(doc){
    index("identifierNameConcat", 
        doc.identifier + ' :: ' + doc.name, 
        {"store": true, "facet": true});
}

Whatever separator we decide to use it obviously must not appear in either the “id” or “name” field. Note the spaces in the separator, we don’t want “0000051143::International” to be a single token (we want “0000051143 :: International Business Machines”). We can now use a query URL like this:

http://account.cloudant.com/companies/_design/companiesSearchIndexes/_search/nameSearch?q=identifierNameConcat:International&counts=%5B"identifierNameConcat"%5D&limit=0

There is a serious drawback with this method. The two fields need to be very different, ideally something like strictly numeric and strictly alphabetic. Since we are technically indexing both id and name the results of our query will contain documents with the query string appearing in either. For example, searching for “000*” will return “0000051143 :: International Business Machines”. This is probably not what we intended. It goes the other way as well, if we were trying to find a company whose name started with “100” we’d get back every company that had an identifier beginning with “100”. In SQL terms what we actually ended up creating was something this:

SELECT DISTINCT 
    identifier, 
    name 
FROM 
    companies 
WHERE
    identifier LIKE '%International%'
    OR
    name LIKE '%International%'

Oops. This is serious enough that this method is probably unusable in most cases. I present it here in case someone does find it both usable and useful.

Multiple “Distinct” Fields from a Cloudant Search Index

One thought on “Multiple “Distinct” Fields from a Cloudant Search Index

Leave a reply to Substitution Ciphers – Not Just for the Ancient Roman Military | Ted's Blog Cancel reply