Friday, August 21, 2009

Index Optimization for realtime search - Good idea?

Overview of Optimize:
There is a Lucene API: IndexWriter.optimize(), which combines all segments into 1 large segment and also expunges all deleted docids.

Searching over an optimized index is very fast because you neither pay penalty to evaluate deleted docs nor search time OR'ing over documents in different indexes. After some OS level warming, the 1 segment file is loaded into the IO cache to avoid IO costs. Hence the method name: optimize(). This is terrific for an offline indexing system, where a pristine optimized index is published for searching.

Segment merge:
Segment merge is essential incremental indexing. Lucene has a "hidden" extensible API: MergePolicy. (properly exposed in Lucene 2.9) By default, LogByteSizeMergePolicy is used. This policy will periodically choose to merge small segments into larger segments, and size of the segment is based on number of bytes of the segment file. Only during a merge, deleted docs are expunged.

Real-time indexing:
In a real-time indexing environment, indexing operations are being applied to the index constantly, and the index is fragmented quickly. A challenge here is how to maintain an optimal index for real-time indexing.

In our application, where there are many updates, e.g. old documents are deleted and then added with newer/fresher data. What happened was over time, the largest segment would contain more and more deleted docs, and they will never be expunged because the segment is never a candidate of a merge since deleted docs are merely marked, not removed from the segment, thus the segment size still remain to be large. In the worse scenario, the largest segment would contain only deleted docs.

We made an enhancement to LogMergePolicy to normalize on size taking into consideration number of deleted documents (and contributed back: LUCENE-1634)

This helped quite a bit. We still however, see the problem with the situation when segments are promoted so that they get the merge with the largest segment:

In a realtime scenario, when smaller segments are "escalated" to be merged with the larger segment, the search response time also escalates. This is because the merge itself gets more expensive as the sizes of the segments to be merged get larger. Furthermore, the newly merged segment needs to be loaded into IO cache, while that is happening, search time is impacted significantly.

To solve this problem, we have created a new MergePolicy implementation:


Instead of defining an optimized index to be 1 large segment, we redefine it to be N segments of balanced size, where N is a configurable parameter. The idea is to spread the cost of a large segment merge into smaller merge costs.


At each point of the merge operation, the segments to merge is selected to main a balanced segment structure. The selection is modeled as a state and a merge is viewed as a transition between states, and each such transition is associated with a cost function of merge. We then applied the Viterbi algorithm to identify the optimal selection(s).

Performance numbers and details be found at this wiki.

Our MergePolicy implementation has also been contributed back to Lucene: LUCENE-1924


In conclusion, I would like to emphasize how indexing can affect search performance especially in real-time search. There are often hidden problems as they are invisible to unit tests and simple performance tests. They can also be data dependent and show up after hours or even days of stressing the system. Thus, it is important to understand the details of the indexing to build a scalable and robust system.


I'd like to credit this idea and implementation to my colleague Yasuhiro Matsuda.