static Query buildFieldCacheQuery(final int start, final int end) {
  Filter f = new Filter() {
    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      // One cached int per document; loaded once per reader, then reused.
      final int[] data = FieldCache.DEFAULT.getInts(reader, FIELDCACHE_FIELD);
      return new DocIdSet() {
        @Override
        public DocIdSetIterator iterator() throws IOException {
          return new DocIdSetIterator() {
            int docid = -1;
            @Override
            public int advance(int target) throws IOException {
              docid = target - 1;
              return nextDoc();
            }
            @Override
            public int docID() {
              return docid;
            }
            @Override
            public int nextDoc() throws IOException {
              int val;
              docid++;
              // Linear scan of the cached values until one falls in range.
              while (docid < data.length) {
                val = data[docid];
                if (val > start && val < end)
                  return docid;
                else
                  docid++;
              }
              return DocIdSetIterator.NO_MORE_DOCS;
            }
          };
        }
      };
    }
  };
  return new ConstantScoreQuery(f);
}
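For context, a minimal sketch of how such a query might be executed; the `searcher` variable and the concrete bounds are assumptions for illustration, not part of the original benchmark (note the bounds are exclusive, per the condition in nextDoc()):

// Hypothetical usage: an open IndexSearcher named "searcher" is assumed.
Query q = buildFieldCacheQuery(0, 10000); // ~1% of a 1M-doc value space
TopDocs hits = searcher.search(q, 10);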
// Indexing: each document carries the same value n in three forms,
// one field per query type being compared.
NumericField numericField = new NumericField(NUMERIC_FIELD, 4); // precisionStep = 4
numericField.setIntValue(n);
doc.add(numericField);
Field fieldCacheField = new Field(FIELDCACHE_FIELD, String.valueOf(n),
    Store.NO, Index.NOT_ANALYZED_NO_NORMS);
fieldCacheField.setOmitTermFreqAndPositions(true);
doc.add(fieldCacheField);
Field rangeField = new Field(RANGE_FIELD, format.format(n),
    Store.NO, Index.NOT_ANALYZED_NO_NORMS);
rangeField.setOmitTermFreqAndPositions(true);
doc.add(rangeField);
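For reference, the three queries under comparison could be constructed roughly as follows; this is a sketch, and the inclusive bounds are an assumption:

// Plain term-range query (deprecated in Lucene 2.9):
Query rangeQuery = new RangeQuery(
    new Term(RANGE_FIELD, format.format(start)),
    new Term(RANGE_FIELD, format.format(end)), true);
// Trie-encoded numeric range; precisionStep must match the indexed field:
Query numericQuery = NumericRangeQuery.newIntRange(
    NUMERIC_FIELD, 4, start, end, true, true);
// FieldCache-backed range from the snippet above:
Query fieldCacheQuery = buildFieldCacheQuery(start, end);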
Following are the results:
JVM settings: -server -Xms1g -Xmx1g
The test measures different range lengths relative to the number of possible values in the index; e.g., a 1% range corresponds to [0 - 10000] on an index of 1M docs.
Index Size - 1M docs:
| Range % | RangeQuery | NumericRangeQuery | FieldCacheRangeQuery |
|---|---|---|---|
| 1% | 202 ms | 1 ms | 1 ms |
| 5% | 2047 ms | 3 ms | 2 ms |
| 20% | NA | 9 ms | 5 ms |
| 50% | NA | 17 ms | 9 ms |
| 100% | NA | 26 ms | 9 ms |
Index Size - 5M docs (RangeQuery is no longer measured; no point beating a dead horse):

| Range % | NumericRangeQuery | FieldCacheRangeQuery |
|---|---|---|
| 1% | 6 ms | 8 ms |
| 5% | 15 ms | 11 ms |
| 20% | 38 ms | 27 ms |
| 50% | 75 ms | 47 ms |
| 100% | 128 ms | 43 ms |
Index Size - 10M docs:

| Range % | NumericRangeQuery | FieldCacheRangeQuery |
|---|---|---|
| 1% | 10 ms | 16 ms |
| 5% | 28 ms | 23 ms |
| 20% | 84 ms | 53 ms |
| 50% | 153 ms | 97 ms |
| 100% | 249 ms | 92 ms |
Conclusion & Observations:
- Don't use RangeQuery! (Good thing it is deprecated in Lucene 2.9.x.)
- As index size increases, when the range is small, NumericRangeQuery slows down rather gracefully (sub-linearly), while FieldCacheQuery slows down linearly.
- As index size increases, when the range is large, NumericRangeQuery slows down almost linearly, while FieldCacheQuery plateaus once the range covers about 50% of the value space.
- If you expect your range to cover >=5% of the corpus, FieldCacheQuery is faster.
- The FieldCacheQuery snippet above can easily be adapted to Lucene 2.4.x for those of you who have not upgraded.
- The FieldCacheQuery idea applies just as well to non-numeric fields, e.g. ranges over text terms (see the sketch after this list).
- FieldCacheQuery incurs a one-time cost at index load (populating the FieldCache), but that cost is already paid if you sort on the field.
- If you want to run range queries on a really large index, consider sharding your index.
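As a sketch of that text-range variant: FieldCache.getStringIndex (the Lucene 2.9 API) maps each doc to the ordinal of its term in a sorted term array, so a text range check reduces to an integer comparison per doc. TEXT_FIELD, lowerTerm and upperTerm here are hypothetical names:

final FieldCache.StringIndex idx =
    FieldCache.DEFAULT.getStringIndex(reader, TEXT_FIELD);
// idx.lookup is sorted; slot 0 is reserved for docs with no term.
int lowerOrd = Arrays.binarySearch(idx.lookup, 1, idx.lookup.length, lowerTerm);
if (lowerOrd < 0) lowerOrd = -lowerOrd - 1;  // insertion point if absent
int upperOrd = Arrays.binarySearch(idx.lookup, 1, idx.lookup.length, upperTerm);
if (upperOrd < 0) upperOrd = -upperOrd - 2;  // last term below upperTerm
// Accept doc when lowerOrd <= idx.order[doc] && idx.order[doc] <= upperOrd,
// using the same DocIdSetIterator pattern as buildFieldCacheQuery above.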
Hi John,
the FieldCache approach is also available in Lucene trunk, you do not need to do it yourself: FieldCacheRangeFilter, which also supports all types, including text terms.
The problems with the FieldCache approach, in addition to the longer loading time, are:
- only one value per document is supported (a limitation of FieldCache, which was made for sorting)
- FCRF gets slower when deleted docs are involved and the range contains 0 (because FCRF cannot distinguish a deleted doc from a doc whose value in the FieldCache is 0)
The numbers are similar to mine on https://issues.apache.org/jira/browse/LUCENE-1461
Have you also tried other precSteps for comparison?
Uwe
Hi Uwe:
Re: the FieldCache approach is also available in Lucene trunk
Great!
Re: only one value per document supported
True. However, in both Solr and Bobo, multivalued fieldcaches are implemented. This approach can still be applied.
Re: FCRF gets slower, when deleted docs are involved and the range contains 0
True. Deletes are indeed tricky; checking each hit against TermDocs is definitely slow. It would be nice if the deleted docset were exposed at the segment-reader level; that way this could be solved efficiently rather easily.
Having said this, one can, at index-warming time, iterate AllTermDocs once and write the deleted docids to a DocIdSet; at query time, just do a parallel walk. I should probably try that (a sketch follows).
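A minimal sketch of that warming step, assuming the stock Lucene 2.9 APIs (OpenBitSet, IndexReader.maxDoc() and isDeleted()); iterating isDeleted() here stands in for the AllTermDocs walk:

// Build a bitset of deleted docids once, at warm time.
OpenBitSet deleted = new OpenBitSet(reader.maxDoc());
for (int i = 0; i < reader.maxDoc(); i++) {
  if (reader.isDeleted(i)) deleted.set(i);
}
// At query time the FieldCache iterator can skip docs with
// deleted.get(docid) instead of consulting TermDocs per hit.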
Thanks for the pointer to the Jira issue, lotsa good information there!
I have not tried other precSteps, as the documentation says to use 4 for integers. Is that not the case?
-John
The default precStep in Lucene is 4, you are right. I just wanted to compare maybe other precSteps like 2, 6 or 8. In the javadocs there is some discussion about the differences and pros/cons (lower precSteps need more seeks in the TermEnum).
I agree with you, you can raise performance by doing everything in memory, but not everybody wants that. I have, e.g., 13 numeric fields, and keeping all of them in the FieldCache or in memory is not an option (esp. the loading time), especially if you do not want to sort against them. So NumericRangeQuery is a good option.
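(For anyone wanting to try that comparison: the precisionStep is fixed at indexing time and the query must use the same value. A sketch, with precisionStep 8 as just one of the suggested alternatives:)

// Index with a coarser trie; a larger precisionStep writes fewer terms
// per value, while a smaller one splits a range query into more
// sub-ranges and hence more TermEnum seeks.
NumericField coarse = new NumericField(NUMERIC_FIELD, 8);
coarse.setIntValue(n);
doc.add(coarse);
// Query with the matching precisionStep, or the results will be wrong:
Query q8 = NumericRangeQuery.newIntRange(NUMERIC_FIELD, 8, start, end, true, true);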
Thanks for the comment, Uwe!
13 numeric fields for a shard of, say, 10 million docs is 520MB (which can be compressed further), so it may be feasible in some applications.
I agree about the loading time, but often the index is loaded in the background or "warmed" before being published, and that is done only once. This practice is used by many applications and is actually suggested by the Solr/Lucene documentation.
very valuable information on Numeric Queries.
Thanks