Tuesday, April 21, 2009

Index Partitioning and Distributed Realtime Search

When the number of documents in a single index gets very large, it is usually a good idea to partition the index and distribute the search load across different processes or search nodes, each serving a section of the corpus.


Some Background:

There are many different ways of partitioning the index depending on the application's requirements; here are some examples.
  • By terms - This is good because not every search node handles every request. Only the search nodes containing terms from the query are called to serve the request. However, some terms are normally searched more than others, and on top of that, some terms match more documents than others. Therefore it is usually very difficult to load balance this partitioning scheme.
  • By documents - This is the most common partitioning method. Even though every search node handles every search request, it is very easy to balance and re-balance search traffic, assuming documents are uniformly distributed across the partitions; a minimal routing sketch follows this list.
  • By time - This is a good idea for a corpus containing time-sensitive data, e.g. job postings, news, blogs, etc. In such cases, recent documents are by definition more relevant than older documents. Thus, we can add more replication to partitions with recent documents while slowly discarding older partitions from the index.
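
To make the document-based scheme concrete, here is a minimal routing sketch in Java; the class and method names are hypothetical and not code from our system:

    public class DocPartitionRouter {
      private final int numPartitions;

      public DocPartitionRouter(int numPartitions) {
        this.numPartitions = numPartitions;
      }

      // Every partition serves every query; this only decides where a document is indexed.
      // Hashing on the document's UID spreads documents roughly uniformly across partitions.
      public int partitionFor(long uid) {
        return (int) ((uid & Long.MAX_VALUE) % numPartitions);
      }
    }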

Our Partitioning Scheme:

At LinkedIn, we need real-time search while guaranteeing high availability, so we want a partitioning scheme that allows us to avoid re-indexing/re-partitioning as our corpus grows.

We have decided to partition by ranges of documents. We choose a maximum number of documents per partition, say N, and partition along consecutive document IDs (UIDs, not Lucene docids), e.g. partition K contains documents with UIDs ranging from K*N to (K+1)*N-1, K = 0,1,2,...

As new documents are added to the index, they are appended to the largest (most recent) partition; when a UID exceeds the maximum UID that partition can hold, a new partition is created. This partitioning scheme allows us to grow into new partitions without a "re-partition".
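
Sketched below (with hypothetical names) is the range scheme just described: with a fixed capacity of N documents per partition, the partition for a UID is simply uid / N, so new partitions are appended at the tail as the corpus grows and nothing ever needs re-partitioning.

    public class RangePartitioner {
      private final int maxDocsPerPartition; // N

      public RangePartitioner(int maxDocsPerPartition) {
        this.maxDocsPerPartition = maxDocsPerPartition;
      }

      // Partition K holds UIDs in the range [K*N, (K+1)*N - 1].
      public int partitionFor(long uid) {
        return (int) (uid / maxDocsPerPartition);
      }

      // Number of partitions needed once UIDs up to and including maxUid exist.
      public int partitionCount(long maxUid) {
        return (int) (maxUid / maxDocsPerPartition) + 1;
      }
    }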


Cross domain "joins":

Another important benefit we enjoy with this partitioning scheme is that we can load the Lucene docid to UID mapping (see my previous post) as well as the reverse, UID to Lucene docid, because we know the range a partition's UIDs fall in. We are thus very easily able to translate a filter set in the UID space into a Lucene Filter or a Lucene Query. This is important because it essentially allows us to do "joins" across different document spaces. The memory footprint for holding the two lookup arrays is docCount*4 bytes each, which nowadays is a piece of cake.
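
As an illustration of the "join" translation (a sketch, not our actual code): assume the docid-to-UID array has been loaded for a partition whose UIDs start at base and which can hold capacity documents. Building the reverse array is then a single pass, and a filter set expressed as UIDs becomes a docid bitset that could back a Lucene Filter/DocIdSet.

    import java.util.Arrays;
    import java.util.BitSet;

    public class UidDocidMapper {
      private final int[] uidByDocid;  // docid -> (UID - base), 4 bytes per document
      private final int[] docidByUid;  // (UID - base) -> docid, or -1 if absent
      private final long base;         // K * N for partition K

      public UidDocidMapper(int[] uidByDocid, long base, int capacity) {
        this.uidByDocid = uidByDocid;
        this.base = base;
        this.docidByUid = new int[capacity];
        Arrays.fill(docidByUid, -1);
        for (int docid = 0; docid < uidByDocid.length; docid++) {
          docidByUid[uidByDocid[docid]] = docid;
        }
      }

      // Translate a filter set expressed as UIDs into this partition's docid space.
      public BitSet toDocidSet(long[] uids) {
        BitSet docids = new BitSet(uidByDocid.length);
        for (long uid : uids) {
          long offset = uid - base;
          if (offset >= 0 && offset < docidByUid.length) {
            int docid = docidByUid[(int) offset];
            if (docid >= 0) {
              docids.set(docid);
            }
          }
        }
        return docids;
      }
    }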


Realtime at the Node-Level:

To guarantee realtime at the search-node level, each of our partitions is powered by Zoie (see my earlier post).

5 comments:

  1. Do you regenerate the docid to UID mapping for every document insert/update/delete, since Lucene docids are volatile?

  2. No. It is done when an IndexReader is loaded, at which time Lucene docids are fixed for the life cycle of the reader.

  3. Dear SearchMan,
    have you ever built distributed and real-time search with Zoie?
    Right now my Zoie only runs on a single machine; how do I run it on multiple machines?

  4. Check out the Sensei project:
    http://sna-projects.com/sensei

  5. Dear Search Man,
    I've checked out and run the Sensei server with ZooKeeper, but now I don't know how to run Zoie with Sensei. Could you help me?
