Sunday, July 31, 2011
LinkedIn Intern Hackday 2011 and Node.js
Friday, July 8, 2011
Python, Tornado, Kafka, Oh My!
- High volume
- Data sent over the internet
Friday, October 1, 2010
LinkedIn Signal
- What are HP employees saying about Mark Hurd?
- What is the Jerry Brown campaign saying about Meg Whitman?
- What do MIT students think about Scala?
- What are IT professionals living in the SF Bay Area saying about Java?
- What are people saying about LinkedIn Signal in the last hour?
- ...
Sunday, January 31, 2010
LinkedIn Search Talk - SDForum
- realtime indexing/search
- streaming/live update
- faceted navigation
- section search
- distributed index partitioning
- Daniel Tunkelang - former Chief Scientist at Endeca (an enterprise search solution with an emphasis on faceted search) and author of the blog The Noisy Channel.
- Stefan Groschupf - creator of Katta, the popular open-source distributed search project.
Sunday, November 15, 2009
Numeric Range Queries - A comparison
The FieldCacheQuery used in the comparison is a custom Filter over FieldCache values, wrapped in a ConstantScoreQuery:

static Query buildFieldCacheQuery(final int start, final int end) {
  Filter f = new Filter() {
    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      // One int per document, loaded once per reader and then cached.
      final int[] data = FieldCache.DEFAULT.getInts(reader, FIELDCACHE_FIELD);
      return new DocIdSet() {
        @Override
        public DocIdSetIterator iterator() throws IOException {
          return new DocIdSetIterator() {
            int docid = -1;

            @Override
            public int advance(int target) throws IOException {
              docid = target - 1;
              return nextDoc();
            }

            @Override
            public int docID() {
              return docid;
            }

            @Override
            public int nextDoc() throws IOException {
              docid++;
              while (docid < data.length) {
                int val = data[docid];
                if (val > start && val < end) {
                  return docid;
                }
                docid++;
              }
              return DocIdSetIterator.NO_MORE_DOCS;
            }
          };
        }
      };
    }
  };
  return new ConstantScoreQuery(f);
}
Each document is indexed with three fields, one per query type:

// Trie-encoded field for NumericRangeQuery (precisionStep = 4).
NumericField numericField = new NumericField(NUMERIC_FIELD, 4);
numericField.setIntValue(n);
doc.add(numericField);

// Plain int-valued field backing the FieldCacheQuery above.
Field fieldCacheField = new Field(FIELDCACHE_FIELD, String.valueOf(n), Store.NO, Index.NOT_ANALYZED_NO_NORMS);
fieldCacheField.setOmitTermFreqAndPositions(true);
doc.add(fieldCacheField);

// Zero-padded text field for the (deprecated) RangeQuery.
Field rangeField = new Field(RANGE_FIELD, format.format(n), Store.NO, Index.NOT_ANALYZED_NO_NORMS);
rangeField.setOmitTermFreqAndPositions(true);
doc.add(rangeField);
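For reference, here is a sketch of how the three queries under comparison might be constructed, assuming a Lucene 2.9-era API; start, end, the field-name constants, and the zero-padding format are the same as in the snippets above (bounds are exclusive, matching the val > start && val < end check in buildFieldCacheQuery):

// Deprecated RangeQuery over the zero-padded text field.
Query rangeQuery = new RangeQuery(
    new Term(RANGE_FIELD, format.format(start)),
    new Term(RANGE_FIELD, format.format(end)),
    false);

// NumericRangeQuery over the trie-encoded field (precisionStep = 4).
Query numericRangeQuery = NumericRangeQuery.newIntRange(
    NUMERIC_FIELD, 4, start, end, false, false);

// FieldCache-backed query from the snippet above.
Query fieldCacheQuery = buildFieldCacheQuery(start, end);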
Following are the results:
JVM settings: -server -Xms1g -Xmx1g
The test measures different range lengths relative to the possible values in the index; e.g., a 1% range corresponds to the range [0 - 10000] on an index of 1M docs.
Index Size - 1M docs:
Range % | RangeQuery | NumericRangeQuery | FieldCacheRangeQuery |
---|---|---|---|
1% | 202 ms | 1 ms | 1 ms |
5% | 2047 ms | 3 ms | 2 ms |
20% | NA | 9 ms | 5 ms |
50% | NA | 17 ms | 9 ms |
100% | NA | 26 ms | 9 ms |
Index Size - 5M docs (RangeQuery dropped from the comparison; no point beating a dead horse):
Range % | NumericRangeQuery | FieldCacheRangeQuery |
---|---|---|
1% | 6 ms | 8 ms |
5% | 15 ms | 11 ms |
20% | 38 ms | 27 ms |
50% | 75 ms | 47 ms |
100% | 128 ms | 43 ms |
Index Size - 10M docs
Range % | NumericRangeQuery | FieldCacheRangeQuery |
---|---|---|
1% | 10 ms | 16 ms |
5% | 28 ms | 23 ms |
20% | 84 ms | 53 ms |
50% | 153 ms | 97 ms |
100% | 249 ms | 92 ms |
Conclusion & Observations:
- Don't use RangeQuery! (Good thing it is deprecated in Lucene 2.9.x)
- As index size increases, for small ranges, NumericRangeQuery slows down rather gracefully (sub-linearly) while FieldCacheQuery slows down linearly.
- As index size increases, for large ranges, NumericRangeQuery slows down almost linearly, while FieldCacheQuery plateaus once the range exceeds about 50%.
- If you expect your range covers >=5% of the corpus, FieldCacheQuery is faster.
- FieldCacheQuery code snippet above can be easily changed to support Lucene 2.4.x for those of you that have not upgraded.
- The FieldCacheQuery idea can be applied similarly to non-numeric fields, e.g. range of texts.
- FieldCacheQuery incurs a one-time cost at index load (populating the FieldCache), but that cost is necessary anyway if you want to do sorting.
- If you want to do range query on a really large index, consider sharding your index.
Friday, August 21, 2009
Index Optimization for realtime search - Good idea?
There is a Lucene API, IndexWriter.optimize(), which merges all segments into one large segment and also expunges all deleted docs.
Searching over an optimized index is very fast: you neither pay the penalty of evaluating deleted docs nor the search-time cost of OR'ing results across multiple segments. After some OS-level warming, the single segment file sits in the IO cache, avoiding IO costs. Hence the method name: optimize(). This is terrific for an offline indexing system, where a pristine optimized index is published for searching.
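A minimal sketch, assuming a Lucene 2.9-era API (dir and analyzer are placeholders), of an offline indexing system publishing an optimized index:

// Open the existing index, collapse it to a single segment, and publish.
IndexWriter writer = new IndexWriter(dir, analyzer,
    false, IndexWriter.MaxFieldLength.UNLIMITED);
writer.optimize(); // merge all segments into one, expunging deleted docs
writer.close();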
Segment merge:
Segment merging is essential to incremental indexing. Lucene has a "hidden" extensible API, MergePolicy (properly exposed in Lucene 2.9). By default, LogByteSizeMergePolicy is used. This policy periodically merges small segments into larger segments, where segment size is measured in bytes of the segment file. Deleted docs are expunged only during a merge.
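For illustration, here is a minimal sketch of tuning the default policy, assuming a Lucene 2.9-era API (the numbers are arbitrary placeholders):

// Sketch: tune the default byte-size-based merge policy on an IndexWriter.
LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy(writer); // 2.9-era ctor takes the writer
mergePolicy.setMergeFactor(10);   // merge roughly 10 similar-sized segments at a time
mergePolicy.setMaxMergeMB(1024);  // leave segments larger than ~1GB alone
writer.setMergePolicy(mergePolicy);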
Real-time indexing:
In a real-time indexing environment, indexing operations are being applied to the index constantly, and the index is fragmented quickly. A challenge here is how to maintain an optimal index for real-time indexing.
We made an enhancement to LogMergePolicy that normalizes segment size to take the number of deleted documents into account (contributed back as LUCENE-1634); the idea is sketched below.
This helped quite a bit. However, we still see a problem when segments are promoted to the point where they are merged with the largest segment.
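As a rough sketch of the idea (an illustration, not the exact LUCENE-1634 patch): a segment's effective size is discounted by its deleted-document ratio, so heavily-deleted segments look smaller and are selected for merging sooner.

// Hypothetical helper illustrating the normalization used when comparing
// segment sizes: scale the byte size by the fraction of live documents.
static long normalizedSegmentSize(long sizeInBytes, int docCount, int delCount) {
  if (docCount <= 0) return sizeInBytes;
  double liveRatio = (double) (docCount - delCount) / docCount;
  return (long) (sizeInBytes * liveRatio);
}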
Saturday, July 18, 2009
Making BooleanQueries faster
This optimization applies to BooleanQueries where:
- all values are in the same field
- all clauses are joined by the same operator (e.g. all OR'ed together)
As of Lucene 2.4, the Lucene query API has adopted the DocIdSet and DocIdSetIterator abstractions, which opened the door to various query optimizations. (IMHO, this is one of the best improvements to Lucene from an API standpoint.)
For our project we have quite a few OR clauses over a single field, and search performance with a standard BooleanQuery was pretty bad.
Instead, when the index loads, we load into memory a data structure very much like FieldCache.StringIndex. In a 5-million-document index shard, for a field containing N distinct values, the memory footprint is approximately 20MB plus N Strings.
Given K clauses in our OR query, we build a BitSet with K bits turned on, each bit corresponding to the index of a clause's value in the String array (the maximum bit is N).
We then build a DocIdSetIterator that walks the order array and checks, for each doc i, whether bit order[i] is set in the BitSet; see the sketch below.
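Here is a minimal sketch of that iterator, assuming a Lucene 2.9-era API; it reuses FieldCache.StringIndex for the order/lookup arrays (buildOrDocIdSet and its parameters are illustrative names, not our exact production code):

static DocIdSet buildOrDocIdSet(final FieldCache.StringIndex idx, String[] values) {
  // One bit per distinct value in the field; turn on the K bits
  // corresponding to the OR clauses. lookup is sorted, so ordinals can
  // be found by binary search (index 0 is reserved for "no value").
  final java.util.BitSet bits = new java.util.BitSet(idx.lookup.length);
  for (String value : values) {
    int ord = java.util.Arrays.binarySearch(idx.lookup, 1, idx.lookup.length, value);
    if (ord > 0) bits.set(ord);
  }
  return new DocIdSet() {
    @Override
    public DocIdSetIterator iterator() {
      return new DocIdSetIterator() {
        int docid = -1;

        @Override
        public int docID() {
          return docid;
        }

        @Override
        public int nextDoc() {
          // Cost per doc is one array read plus one bit test,
          // independent of the number of OR clauses.
          while (++docid < idx.order.length) {
            if (bits.get(idx.order[docid])) return docid;
          }
          return NO_MORE_DOCS;
        }

        @Override
        public int advance(int target) {
          docid = target - 1;
          return nextDoc();
        }
      };
    }
  };
}

A Filter whose getDocIdSet returns this set can then be wrapped in a ConstantScoreQuery, just like the FieldCacheQuery snippet in the range-query post above.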
The resulting performance is quite encouraging: as the number of clauses increases, the iteration time stays capped. In our benchmark, once we get to 10+ clauses, the performance gain is approximately 8x!
Here is a detailed write-up:
http://code.google.com/p/bobo-browse/wiki/QueryConstruction
Kudos to the Lucene team for the DocIdSet api!