Sunday, April 12, 2009

Lucene docid,UID mapping and Payload

Lucene docids are internal integers representing documents in the index, and Lucene takes liberty in reassign docids to during segment merges and when expunging deleted documents. In general one needs to be very careful with dealing with docids when they are referenced in the application context because the document a docid refers to is expected to change at anytime.

In many applications, it is very useful to have a quick way to map a Lucene docid to an external UID, e.g. a primary key in a database. For example, result filtering depending on an external document set. In the case of LinkedIn people search, filtering by a person's network on a text search result set, UID in this case being the member id. We needed a mechanism to map from a Lucene docid in a HitCollector.collect() very quickly to the corresponding UID.

A common and naive practice is to keep the UID in a Stored field, e.g.

Field("uid",myUID,Store.YES,Index.No);

and retrieve the uid via:

int uid = Integer.parseInt(indexReader.document(docid)
.get("uid"));

When recall (the number of hit candidates) is large, this method is streaming from the stored file for each hit, and thus does not perform.

A better alternative is to use the FieldCache to load into an integer array a mapping from docid to uid for each docid in the index ,e.g. 0 to indexReader.maxDoc()
(assuming uid is kept in the indexed field) e.g.

Field("uid",myUID,Store.NO,
Index.NOT_ANALYZED_NO_NORMS);

When IndexReader loads, we load the "map array":

int[] uidArray = FieldCache.DEFAULT.getInts(indexReader,"uid");

This is done once for a new IndexReader, and at search time, the mapping is just an array lookup:

int uid = uidArray[docid];

which is rather fast.

The problem here is that the number of terms in the uid field is equal to or very close to the number of documents in the index. To load from the FieldCache, a random seek is done for each term, which makes loading the uidArray extremely slow. This is fine if IndexReader does not load very often, but unfortunately for us, in a realtime search system, we don't have the luxury of such assumption. (This also, as a side effect, increases the total number of terms in the index and may impact search performance)

Lucky for us, Lucene 2.2 introduced Payloads, (a huge step towards flexible indexing), which is the ability to allow arbitrary data to be added to the posting list. In our case, we would create an arbitrary term for every document, e.g. Term("uid","_UID_"), and attach a 4-byte uid value to each posting:

class UIDTokenStream extends TokenStream{
private Token token = new Token("_UID_",0,0);
private byte[] buffer = new byte[4];
private boolean returnToken = false;

void setUID(int uid){
buffer[0] = (byte)uid;
buffer[1] = (byte)(uid>>8);
buffer[2] = (byte)(uid>>16);
buffer[3] = (byte)(uid>>24);
token.setPayload(new Payload(buffer));
returnToken = true;
}

public Token next() throws IOException{
if (returnToken){ returnToken = false; return token; }
else { return null; }
}
}

UIDTokenStream tokenStream = new UIDTokenStream();
tokenStream.setUID(myUID);
Field("uid",tokenStream);

When we load the uidArray, we do:

TermPositions tp = null;
byte[] dataBuffer = new byte[4];
int[] uidArray = new int[indexReader.maxDoc()];
try{
int idx = 0;
tp = indexReader.termPositions(new Term("uid","_UID_"));
while(tp.next()){
int doc = tp.doc();
tp.nextPosition();
tp.getPayload(dataBuffer,0);
// convert buffer to int
int uid = ((dataBuffer[3]& 0xFF) <<24)>

uidArray[idx++]=uid;
}
}
finally{
if (tp!=null){
tp.close();
}
}

Looking up the uid is same as before, simply an array lookup:

int uid = uidArray[docid];

The difference here is when loading the uidArray, sequential seek is done for each docid while paying the penalty of byte[] -> int conversion. (Also to a previous point, this method introduces only 1 extra term to the entire index)

We ran some performance numbers on a 2M-document index, loading from the FieldCache took more than 16.5 seconds, while loading the uidArray from payload took 430 ms, this is an improvement of over 38x! (this time was taken a while ago from my MacBook Pro using Lucene 2.2)

This mechanism is built into Zoie Realtime Search and Indexing System used to filter out duplicate documents between in-memory indexes and the disk index. (Since they are independent indexes, only consistent id to filter from is the UID)

15 comments:

  1. Hi there,

    Have you done much usage of Lucene with Hibernate Search?

    I have implented Hibernate Search Annotations with Lucene but just looking for another bit of advice.

    Many thanks
    Eggsy

    ReplyDelete
  2. Hi Eggsy:
    I have not played with Lucene-Hibernate. Do you have some ideas, or.. ? I would be interesting to see how this would work with it.

    Thanks
    -John

    ReplyDelete
  3. Wanted to get back to another comment that was lost in the "moderation" step. All the tests above are done on optimized indexes.

    ReplyDelete
  4. Yeah its very easy to implement with a few annotations you can map your POJO Hibernate persistent objects and using Spring you integrate the two very easy!!

    ReplyDelete
  5. Thanks for the nice article.

    But during delete/optimize, the UID mapping payload will be different from that in cache since docid will be changed. Of course we can reload the cache after optimize but is it possible to maintain the docid-UID mapping during optimize?

    ReplyDelete
  6. Hi geo+:

    This is assuming index reader is reloaded after changes to the index, e.g. optimize, and uid array is reloaded.

    We are investigating ways to do what you are suggesting without modifying lucene code.

    Thanks
    -john

    ReplyDelete
  7. Thanks for great posting. Applied your approach and also achieved good performance.
    UID mapping seems to be such a common feature and I had expected the lucene would provide some interface/internal mechanism to do this easily.

    Anyway, thank you for sharing this.

    -metabot

    ReplyDelete
  8. Thanks metabo!
    We have built into Zoie both forwards and reverse lookup from uid to docid. Check it out.

    ReplyDelete
  9. Very good optimization strategy. I will try this.

    Thanks,

    ReplyDelete
  10. Hint for a new post: how would you recommend to use the FieldCache after updating to Lucene >=2.9, as the key for a fieldcache could be the segmentreader instead of the global indexreader?
    That should be way more efficient during warmup, am I right?

    ReplyDelete
  11. Hi Sanne:

    Sorry for the late reply.

    Per-segment load is definitely a plus for incremental index loads. If it is done for the entire index as part of index warming, the total cost is actually the same.

    I will do a new post after custom indexing format feature is out, I think it is scheduled for 3.1.

    ReplyDelete
  12. Another advantage of this approach is how you apply it within Zoie to support deletes by marking the UID mapping. That can be done without pessimistic locking like lucene uses even with their near realtime search IndexWriter.getReader functionality.

    The entirety of Zoie would be overkill for my application, but I am planning on using your mapping above by simply calling ZoieSegmentReader.fillDocumentID at index time and wrapping my IndexReader with a ZoieSegmentReader. Is that a reasonable use of your API rather than pasting the code above into my app?

    To support deletes I would then call ZoieSegmentReader.markDeletes(uidsToDelete, allDeletedUIDS).commitDeletes(); and after I close my normal reader I would do new DiskSearchIndex(...).updateIndex(allDeletedUIDs, emptyList, null, null) flush them to the actual lucene index. Again: reasonable use of the zoie api?

    Thanks,
    eric

    ReplyDelete
  13. This is a fair usage, but only at the segment level, not at index level.

    Another thing you might find useful is the reverse mapping, e.g. ZoieIndexReader.geDocIDMapper(), which maps uid -> docid.

    ReplyDelete
  14. Thanks! I am now using the zoie API as described. I realized I also have to call setDelDocIds() on my ZoieSegmentReader before every query in my query threads, otherwise the deletes will not take affect for them.

    ReplyDelete
  15. Hi John,

    Very interesting article - just what I'm after. Do you have a Lucene 3.x code example? I'm struggling a bit here...
    Thanks!

    - Chris

    ReplyDelete