Saturday, July 18, 2009

Making BooleanQueries faster

At work today, we were looking optimize boolean queries of the type:

f1:v1^b1 OR f1:v2^b2 OR f1:v3^b3 ...
f1:v1^b1 AND f1:v2^b2 AND f1:v3^b3 ...

where f1 is a field of some structured data (e.g. Analyze=No, tf=No, norm=No)

We see that this type of query has these patterns:
  1. all values are in the same field
  2. all clauses have the same operator joining them
When the number of clause is large, BooleanQuery can be rather slow.

As of Lucene 2.4, Lucene query api has adopted DocIdSet, and DocIdSetIterator abstractions, which opened doors for various query optimizations. (IMHO, this is one of the best improvements in Lucene from an API stand point.)

For our project, we have quite a few OR-clauses, and the search performance was pretty bad.

Instead, when the index loads, we load into memory a datastructure very much like the FieldCache.StringIndex. In a 5 million index shard, for a field containing N values, memory footprint is approxmiately 20MB + N Strings.

Given K clauses in our OR query, we build a BitSet with K bits turned on, each bit corresponding to an index into String array. (max bit is N)

We then built a DocIdSetIterator that iterates the order array and checks to see if the order[i]-bit is turned on in the bitset.

The resulting performance is quite encouraging: As the number of clauses increases, the iteration time is capped. From our benchmark, when we get to 10+ clauses, the performance gain is approximately 8x!

Here is a detailed write-up:

This is an example of how good API design can open doors to things such as what we were able to do.

Kudos to the Lucene team for the DocIdSet api!