Tuesday, April 7, 2009

Zoie - a realtime search and indexing system

A realtime search system makes it possible for queries to find a new document immediately or almost immediately after it has been updated. With Lucene's incremental update functionality, we were able to extend Lucene to support realtime indexing/search.

We open-sourced this technology and called it Zoie: http://code.google.com/p/zoie.

Our design uses multiple indexes; one main index on disk, two additional indexes in memory to handle transient indexing events.

The disk index will grow and be rather large, therefore indexing updates on disk will be performed in batches. Processing updates in batches, allows us to merge updates on the same document to reduce redundant updates. Moreover, the disk index would not be fragmented as the indexer is not thrashed by a large number of small indexing calls/requests. We keep a shared disk-based IndexReader to serve search request. Once batch indexing is performed, we build/load a new IndexReader and then publish the new shared IndexReader. The cost of building/loading the IndexReader is thus hidden from the cost of search.

To ensure realtime behavior, we have two helper memory indexes (MemA and MemB) that alternate in their roles. One index, say MemA, accumulates events and directly serves realtime results. When a flush/commit event occurs, MemA stops receiving new events, these are sent to MemB. At this time search requests are served from MemA, MemB and the disk index. Once the disk merge is completed, MemA is cleared and MemB takes its place. Untill the next flush/commit searches will be served from the new disk index and MemB.

For each search request, we open and load a new IndexReader from each of the memory indexes, and along with the shared disk IndexReader, we return a MultiIndexReader built from those three IndexReaders.

Zoie has been running in production at http://www.linkedin.com, in distributed mode, it is handling almost 40 million documents (or user profiles) in realtime and serving over 6 million requests a day with an average latency below 50ms.

We welcome contributions in any form to move the project forward.

3 comments:

  1. Is it true that you wrote Zoie in Scala in order to make it so fast?

    ReplyDelete
  2. Hi,
    Can all the nodes in the cluster write the index to a single index directory?

    Or is there any restriction that only node can have the write access and all nodes can have read access?

    Thanks,
    Muthusamy J

    ReplyDelete
  3. All nodes operate independently in a cluster. They read from and write to there own to their own index location

    ReplyDelete