<?xml version='1.0' encoding='UTF-8'?><?xml-stylesheet href="http://www.blogger.com/styles/atom.css" type="text/css"?><feed xmlns='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0'><id>tag:blogger.com,1999:blog-7998480320632723546</id><updated>2011-11-27T16:58:45.007-08:00</updated><category term='zoie'/><category term='index partition'/><category term='slides'/><category term='distributed search'/><category term='uid mapping'/><category term='payload'/><category term='java'/><category term='sdforum'/><category term='optimization'/><category term='search'/><category term='field cache'/><category term='realtime'/><category term='realtime search'/><category term='lucene'/><category term='performance'/><category term='indexing'/><category term='open source'/><category term='linkedin'/><category term='bobo'/><category term='faceted search'/><category term='boolean query'/><title type='text'>Inverted Index</title><subtitle type='html'>Tools and tips on building your own search engine with open source libraries.</subtitle><link rel='http://schemas.google.com/g/2005#feed' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/posts/default'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default?max-results=100'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/'/><link rel='hub' href='http://pubsubhubbub.appspot.com/'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><generator version='7.00' uri='http://www.blogger.com'>Blogger</generator><openSearch:totalResults>10</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>100</openSearch:itemsPerPage><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-6741079075914837773</id><published>2011-07-31T12:53:00.001-07:00</published><updated>2011-07-31T23:12:22.885-07:00</updated><title type='text'>LinkedIn Intern Hackday 2011 and Node.js</title><content type='html'>Yesterday I had the best birthday ever being part of the &lt;a href="http://hackday2011.linkedin.com/"&gt;LinkedIn Intern Hackday&lt;/a&gt;.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;First a little background of this this event:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Sponsored by &lt;a href="http://www.linkedin.com/"&gt;LinkedIn&lt;/a&gt;, put together by &lt;a href="http://www.linkedin.com/in/adamnash"&gt;Adam Nash&lt;/a&gt; and &lt;a href="http://www.linkedin.com/in/jbrikman"&gt;Jim Brikman&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Any intern in the Silicon Valley is invited. Just form a small group, come to LinkedIn Mountain View headquarters, and pull an all-nighter hacking up some awesomeness. From all the projects, we would pick top three in the likes of American Idol.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Interestingly, &lt;a href="http://nodejs.org/"&gt;Node.js&lt;/a&gt; was a popular technology stack being used.&lt;/div&gt;&lt;meta charset="utf-8"&gt;&lt;meta charset="utf-8"&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Node.js:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;The momentum behind this technology is getting stronger and stronger. I see this trend continuing from the following aspects:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Pervasiveness of JavaScript:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;JavaScript has become the de-facto programming language on the client-side. It fulfilled the void of writing complex programs in the browser where Java Applets failed. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;As JavaScript matures, I see it become more and more pervasive on the server-side. Having a consistent language stack between client and server is desirable. With support for CommonJS by companies like Google, it does seem possible JavaScript being a strong contender in the server-side landscape.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Cloud Computing:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Cloud computing is here, where the cost of running a service on the cloud is measured by usage. Therefore squeezing every ounce to power from your machine instance is highly desirable. The philosophy of Node.js for asynchronous event handling makes a lot of sense in this environment. Because every bit of CPU starvation is costly, now measurable by $.&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Trend:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;The tech trend is often set by the younger generation. The projects in this intern hackday collectively is a good sample of what the next generation of wiz's and geeks are going to work on. Node.js set a tone.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Conclusion:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;I spent sometime learning about Node.js and find although it is still very young, but its potential impact is going to be significant. I see someday it become a serious competition for Ruby/Rails and/or Python/Django. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I see server-side Java being pushed more towards the backend and eventually finding room in custom backends like NOSQL/Search systems.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am picking up a JavaScript book and it is going to be exciting!&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-6741079075914837773?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/6741079075914837773/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2011/07/linkedin-intern-hackday-2011-and-nodejs.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/6741079075914837773'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/6741079075914837773'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2011/07/linkedin-intern-hackday-2011-and-nodejs.html' title='LinkedIn Intern Hackday 2011 and Node.js'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-8964423186505144236</id><published>2011-07-08T10:10:00.000-07:00</published><updated>2011-07-08T10:36:52.767-07:00</updated><title type='text'>Python, Tornado, Kafka, Oh My!</title><content type='html'>&lt;div&gt;&lt;span class="Apple-style-span" &gt;Something about Kafka:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Recently LinkedIn's &lt;a href="http://sna-projects.com/kafka/"&gt;Kafka&lt;/a&gt; has been accepted to Apache Incubator. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Kafka is a high-throughput distributed publish-scribe messaging system written in Scala.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Kafka scales very well with increased dataset as well subscribers. For detailed performance results, check &lt;a href="http://sna-projects.com/kafka/performance.php"&gt;this&lt;/a&gt; out.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" &gt;Capturing internet events:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;We were looking to build a data application that captures mobile activities. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Requirements are:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;High volume&lt;/span&gt;&lt;/li&gt;&lt;li&gt;&lt;span class="Apple-style-span"&gt;Data sent over the internet&lt;/span&gt;&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Kafka being the obvious choice for streaming message to our backend systems, but we of course don't want to expose our Kafka endpoint on the web.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;So, we need to build a http proxy to front our Kafka cluster.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span" &gt;Python and Tornado:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Being a recent &lt;a href="http://www.python.org/"&gt;Python&lt;/a&gt; convert (by learning &lt;a href="https://www.djangoproject.com/"&gt;Django&lt;/a&gt; from &lt;a href="http://www.linkedin.com/in/wonlay"&gt;Lei&lt;/a&gt;), I wanted to build this proxy in Python.&lt;/span&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Django is a little heavy for this use-case, all I needed is a http server. &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Luckily &lt;a href="http://www.linkedin.com/in/ikailan"&gt;Ikai&lt;/a&gt; facebooked me his &lt;a href="http://www.slideshare.net/ikailan/rapid-web-development-using-tornado-web-and-mongodb"&gt;talk&lt;/a&gt; on &lt;a href="http://www.tornadoweb.org/"&gt;Tornado&lt;/a&gt; - a light-weight http server in Python.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Given Kafka already has a &lt;a href="https://github.com/kafka-dev/kafka/tree/master/clients/python"&gt;Python client&lt;/a&gt;, voila, we have a http proxy listening for events pumping to Kafka.&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;Here is the code:&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;div style="font-family: 'courier new'; "&gt;import tornado.ioloop&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;import tornado.web&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;from kafka import KafkaProducer&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;class KafkaHandler(tornado.web.RequestHandler):&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;topic = "app-update"&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;producer = KafkaProducer('localhost',9092)&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;def post(self):&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;d = self.request.body&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;self.producer.send([d], self.topic)&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;print d&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;application = tornado.web.Application([&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;    (r"/app-update", KafkaHandler),&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;])&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;if __name__ == "__main__":&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;    application.listen(8080)&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;    tornado.ioloop.IOLoop.instance().start()&lt;/div&gt;&lt;div style="font-family: 'courier new'; "&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;/span&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-8964423186505144236?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/8964423186505144236/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2011/07/python-tornado-kafka-oh-my.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/8964423186505144236'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/8964423186505144236'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2011/07/python-tornado-kafka-oh-my.html' title='Python, Tornado, Kafka, Oh My!'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-3968199217889678118</id><published>2010-10-01T14:52:00.000-07:00</published><updated>2010-10-01T23:43:00.176-07:00</updated><title type='text'>LinkedIn Signal</title><content type='html'>&lt;div&gt;A few days ago, LinkedIn Signal debuted at &lt;a href="http://disrupt.techcrunch.com/2010-sf/"&gt;Techcrunch Disrupt&lt;/a&gt;. Within hours, tweets, blogs, articles spread like wildfires across various mediums on the web. It was exciting to be part of the engineering team behind it.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://tctechcrunch.files.wordpress.com/2010/09/linkedin_1-1.jpg"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 630px; height: 582px;" src="http://tctechcrunch.files.wordpress.com/2010/09/linkedin_1-1.jpg" border="0" alt="" /&gt;&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Here is an encouraging article written about Signal:&lt;/div&gt;&lt;div&gt;&lt;a href="http://www.thedailybeast.com/blogs-and-stories/2010-09-29/linkedin-signal-combines-twitter-with-the-resume-site/"&gt;http://www.thedailybeast.com/blogs-and-stories/2010-09-29/linkedin-signal-combines-twitter-with-the-resume-site/&lt;/a&gt;&lt;br /&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Signal is a contextual social search application that leverages the realtime LinkedIn Share/Twitter stream and LinkedIn profile information. In a way, we were able to classify the realtime stream based on who the "Sharer" is, and thus tagging it with structured information.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Being a data product, having clean and abundant data is essential. With combination of LinkedIn and Twitter data, we were in heaven in building a revolutionary product: In a few clicks you are able to answer the following questions:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;What are HP employees saying about Mark Hurd?&lt;/li&gt;&lt;li&gt;What is Jerry Brown Campaign saying about Meg Whitman?&lt;/li&gt;&lt;li&gt;What do MIT students think about Scala?&lt;/li&gt;&lt;li&gt;What are the IT professions living in SF-Bayarea talking about Java?&lt;/li&gt;&lt;li&gt;What are people saying about LinkedIn Signal in the last hour?&lt;/li&gt;&lt;li&gt;...&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;These are only a few examples of the type of insights you can get from the realtime stream that has been enriched with context and precision.&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Furthermore, you can also discover trending articles you should read based on your query and selections. We were suggested to read the Techcrunch article about AOL buying Techcrunch hours before any news source picked it up via Signal, that was powerful!&lt;/div&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div&gt;I have written an under-the-hood technical post on the &lt;a href="http://sna-projects.com/blog/2010/10/linkedin-signal-a-look-under-the-hood/"&gt;LinkedIn SNA blog&lt;/a&gt;, so instead in this post I would like to talk about the development process for which Signal was created.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A few months ago, one of our rock star engineers &lt;a href="http://www.linkedin.com/in/nickd/"&gt;Nick Dellamaggiore&lt;/a&gt; wrote a data stream that merges LinkedIn shares, the Twitter stream for bounded LinkedIn accounts, LinkedIn profile, and derived LinkedIn member information, such that any LinkedIn developer can consume and build interesting applications from.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For a few of us that have been interested in making sense of semi-structured data in realtime, having access to this data stream straight from our development boxes is like mice trapped in a cheese factory, we were excited!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Then the stars aligned some more, a ridiculously awesome application developer &lt;a href="http://www.linkedin.com/in/alejandrocrosa/"&gt;Alejandro Crosa&lt;/a&gt; took our search library and presented us with a beautiful application, an incarnation of Signal appeared before our eyes. And a team was formed, led by our product counterpart: &lt;a href="http://www.linkedin.com/in/estebankozak"&gt;Esteban Kozak&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For the next couple of weeks, we went nuts with features -&gt; performance -&gt; more features, and worked through weekends and evenings, we were in startup-mode.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Judgement day came, we presented Signal to our CEO, &lt;a href="http://www.linkedin.com/in/jeffweiner08"&gt;Jeff&lt;/a&gt;, and our VP of Products, &lt;a href="http://www.linkedin.com/in/deepnishar"&gt;Deep&lt;/a&gt;. Being internet veterans, immediately saw the value, provided feedbacks and demanded execution! Yes, this is how we roll in the Silicon Valley!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We got the entire company excited, people from different groups and organizations pitched in: e.g. Operations, Engineering, Design, Products, Marketing... and a beautiful thing happened: Collaboration!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;On September 29th, we showcased Signal, our baby!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Building Signal brought out the essence of the Silicon Valley, the birth places of Google, Yahoo!, Facebook, Twitter etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-3968199217889678118?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/3968199217889678118/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2010/10/linkedin-signal.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/3968199217889678118'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/3968199217889678118'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2010/10/linkedin-signal.html' title='LinkedIn Signal'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-3754599531811413835</id><published>2010-01-31T19:58:00.000-08:00</published><updated>2010-01-31T20:24:49.757-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zoie'/><category scheme='http://www.blogger.com/atom/ns#' term='bobo'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed search'/><category scheme='http://www.blogger.com/atom/ns#' term='slides'/><category scheme='http://www.blogger.com/atom/ns#' term='faceted search'/><category scheme='http://www.blogger.com/atom/ns#' term='linkedin'/><category scheme='http://www.blogger.com/atom/ns#' term='sdforum'/><category scheme='http://www.blogger.com/atom/ns#' term='realtime search'/><title type='text'>LinkedIn Search Talk - SDForum</title><content type='html'>The past Wednesday I had the pleasure of giving a technical talk at &lt;a href="http://www.sdforum.org/index.cfm?fuseaction=Calendar.eventDetail&amp;amp;eventID=13601"&gt;SDForum&lt;/a&gt; on LinkedIn Search.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;iframe src="http://docs.google.com/present/embed?id=d7qvbkn_28cgpvm96r" frameborder="0" width="410" height="342"&gt;&lt;/iframe&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This talk came about a month after the 100% rollout of LinkedIn Faceted People Search. See &lt;a href="http://blog.linkedin.com/2009/12/14/linkedin-faceted-search/"&gt;blog&lt;/a&gt; by &lt;a href="http://www.linkedin.com/in/estebankozak"&gt;Esteban Kozak&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In this talk, I talked about the LinkedIn search infrastructure that hosts various LinkedIn search properties, e.g. people search, news search, job search etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The main features we built are:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;realtime indexing/search&lt;/li&gt;&lt;li&gt;streaming/live update&lt;/li&gt;&lt;li&gt;faceted navigation&lt;/li&gt;&lt;li&gt;section search&lt;/li&gt;&lt;li&gt;distributed index partitioning&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;The slides provide a glance through how we built these technologies through the following open source projects we built/working on:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/zoie"&gt;Zoie&lt;/a&gt; - realtime indexing update system&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/bobo-browse"&gt;Bobo&lt;/a&gt; - faceted search engine based on &lt;a href="http://lucene.apache.org/"&gt;Lucene&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="http://code.google.com/p/sensei-search"&gt;Sensei&lt;/a&gt; - distributed realtime faceted search system.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Some of the notable attendees are:&lt;/div&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt; &lt;a href="http://www.linkedin.com/in/dtunkelang"&gt;Daniel Tunkelang&lt;/a&gt; - Former Chief-Scientist at &lt;a href="http://www.endeca.com/"&gt;Endeca&lt;/a&gt; - enterprise search solution with faceted search emphasis, and author of the blog, &lt;a href="http://thenoisychannel.com/"&gt;TheNoisyChannel&lt;/a&gt;.&lt;/li&gt;&lt;li&gt;&lt;a href="http://www.linkedin.com/in/stefangroschupf"&gt;Stefan Groschupf&lt;/a&gt; - Creator of the popular distributed search open source project: &lt;a href="http://katta.sourceforge.net/"&gt;Katta&lt;/a&gt;.&lt;/li&gt;&lt;/ul&gt;&lt;div&gt;along with representations from companies such as Google, Apple and VMWare etc.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;I am glad to have learned different uses for search technology and hope the technologies we have built to be helpful in different areas.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;For learn about our team at LinkedIn and see other open source projects we are working on, visit &lt;a href="http://sna-projects.com/"&gt;http://sna-projects.com/&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-3754599531811413835?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/3754599531811413835/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2010/01/linkedin-search-talke-sdforum.html#comment-form' title='2 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/3754599531811413835'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/3754599531811413835'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2010/01/linkedin-search-talke-sdforum.html' title='LinkedIn Search Talk - SDForum'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>2</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-8864755192830703730</id><published>2009-11-15T10:53:00.000-08:00</published><updated>2009-11-15T13:58:45.303-08:00</updated><title type='text'>Numeric Range Queries - A comparison</title><content type='html'>&lt;div&gt;&lt;b&gt;Problem:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;RangeQueries have long been problematic for Lucene. Internally, it constructs an OR query of TermQueries, each correspond to an entry in the term table with value that fall within the specified range. &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When the range covers many terms, this approach has a term upper bound of BooleanQuery.getMaxClauseCount() (or a TooManyClauses runtime exception will be thrown)&lt;/div&gt;&lt;div&gt;And it can be really slow!&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Solutions:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This blog examines the available alternatives and provides some performances analysis. (Without loss of generality, we will look at range query handling only on integer values, though the approaches to be discussed support all numeric types)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;NumericRangeQuery:&lt;/b&gt;&lt;/div&gt;As part of the recent Lucene 2.9.x release, &lt;a href="http://de.linkedin.com/in/thetaphi"&gt;Uwe Schindler&lt;/a&gt; introduced NumericRangeQuery which aims to solve this problem. (good documentation in the javadoc:&lt;a href="http://lucene.apache.org/java/2_9_1/api/core/index.html"&gt; http://lucene.apache.org/java/2_9_1/api/core/index.html&lt;/a&gt;) I will not do this any injustice by trying to explain details of this algorithm.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;FieldCacheQuery:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;There is however, another approach by using the FieldCache: &lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;1) obtain the int[] from FieldCache.getInts()&lt;/div&gt;&lt;div&gt;2) iterate thru each element and collect it as a hit if it falls within the specified range.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Code snippet:&lt;/div&gt;&lt;div&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span style="color:#7f0055;"&gt;static&lt;/span&gt; Query buildFieldCacheQuery(&lt;span style="color:#7f0055;"&gt;final&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; start,&lt;span style="color:#7f0055;"&gt;final&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; end){&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;  Filter f = &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; &lt;span style="text-decoration: underline"&gt;Filter()&lt;/span&gt;{&lt;/p&gt; &lt;p  style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; color:#646464;"&gt;&lt;span class="Apple-style-span"  style="color:#000000;"&gt;  &lt;/span&gt;@Override&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;  &lt;span style="color:#7f0055;"&gt;public&lt;/span&gt; DocIdSet getDocIdSet(IndexReader reader) &lt;span style="color:#7f0055;"&gt;throws&lt;/span&gt; IOException {&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;    &lt;span style="color:#7f0055;"&gt;final&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt;[] data = FieldCache.&lt;span style="color:#0000c0;"&gt;DEFAULT&lt;/span&gt;.getInts(reader, &lt;span style="color:#0000c0;"&gt;FIELDCACHE_FIELD&lt;/span&gt;);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;    &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; DocIdSet(){&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; min-height: 15.0px"&gt;&lt;br /&gt;&lt;/p&gt; &lt;p  style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; color:#646464;"&gt;&lt;span class="Apple-style-span"  style="color:#000000;"&gt;    &lt;/span&gt;@Override&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;    &lt;span style="color:#7f0055;"&gt;public&lt;/span&gt; DocIdSetIterator iterator() &lt;span style="color:#7f0055;"&gt;throws&lt;/span&gt; IOException {&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;      &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; DocIdSetIterator() {&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;          &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; &lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;=-1;&lt;/p&gt; &lt;p  style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; color:#646464;"&gt;&lt;span style="color:#000000;"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  &lt;/span&gt;@Override&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  &lt;span style="color:#7f0055;"&gt;public&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; advance(&lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; target) &lt;span style="color:#7f0055;"&gt;throws&lt;/span&gt; IOException {&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;=target-1;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; nextDoc();&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  }&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; min-height: 15.0px"&gt;&lt;br /&gt;&lt;/p&gt; &lt;p  style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; color:#646464;"&gt;&lt;span style="color:#000000;"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  &lt;/span&gt;@Override&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  &lt;span style="color:#7f0055;"&gt;public&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; docID() {&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; &lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  }&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; min-height: 15.0px"&gt;&lt;br /&gt;&lt;/p&gt; &lt;p  style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; color:#646464;"&gt;&lt;span style="color:#000000;"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  &lt;/span&gt;@Override&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  &lt;span style="color:#7f0055;"&gt;public&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; nextDoc() &lt;span style="color:#7f0055;"&gt;throws&lt;/span&gt; IOException {&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#7f0055;"&gt;int&lt;/span&gt; val;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;++;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#7f0055;"&gt;while&lt;/span&gt;(&lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;&lt;data.&gt;&lt;span style="color:#0000c0;"&gt;length&lt;/span&gt;){&lt;/data.&gt;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;      val = data[&lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;];&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;      &lt;span style="color:#7f0055;"&gt;if&lt;/span&gt; (val&gt;start &amp;amp;&amp;amp; val&lt;end)&gt;&lt;/end)&gt;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;   &lt;/span&gt; &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; &lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;      &lt;span style="color:#7f0055;"&gt;else&lt;/span&gt; &lt;span style="color:#0000c0;"&gt;docid&lt;/span&gt;++;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    }&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;    &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; DocIdSetIterator.&lt;span style="color:#0000c0;"&gt;NO_MORE_DOCS&lt;/span&gt;;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;  }&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;};&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt; &lt;/span&gt;}&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;      };&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;    }&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;  };&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco; min-height: 15.0px"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;  &lt;/span&gt;&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;  &lt;span style="color:#7f0055;"&gt;return&lt;/span&gt; &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; ConstantScoreQuery(f);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;}&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-style-span"  style="font-family:Monaco, -webkit-fantasy;"&gt;&lt;span class="Apple-style-span"  style="font-family:Monaco, fantasy;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;&lt;b&gt;Comparison:&lt;/b&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;Index structure:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;NumericField numericField = &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; NumericField(&lt;span style="color:#0000c0;"&gt;NUMERIC_FIELD&lt;/span&gt;, 4);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;    &lt;/span&gt;numericField.setIntValue(n);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;    &lt;/span&gt;doc.add(numericField);&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;Field fieldCacheField = &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; Field(&lt;span style="color:#0000c0;"&gt;FIELDCACHE_FIELD&lt;/span&gt;,String.valueOf(n),Store.&lt;span style="color:#0000c0;"&gt;NO&lt;/span&gt;,Index.&lt;span style="color:#0000c0;"&gt;NOT_ANALYZED_NO_NORMS&lt;/span&gt;);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;    &lt;/span&gt;fieldCacheField.setOmitTermFreqAndPositions(&lt;span style="color:#7f0055;"&gt;true&lt;/span&gt;);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;    &lt;/span&gt;doc.add(fieldCacheField);&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;br /&gt;&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;Field rangeField = &lt;span style="color:#7f0055;"&gt;new&lt;/span&gt; Field(&lt;span style="color:#0000c0;"&gt;RANGE_FIELD&lt;/span&gt;,&lt;span style="color:#0000c0;"&gt;format&lt;/span&gt;.format(n),Store.&lt;span style="color:#0000c0;"&gt;NO&lt;/span&gt;,Index.&lt;span style="color:#0000c0;"&gt;NOT_ANALYZED_NO_NORMS&lt;/span&gt;);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;    &lt;/span&gt;rangeField.setOmitTermFreqAndPositions(&lt;span style="color:#7f0055;"&gt;true&lt;/span&gt;);&lt;/p&gt; &lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-tab-span" style="white-space:pre"&gt;    &lt;/span&gt;doc.add(rangeField);&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-style-span"  style="font-family:Monaco, -webkit-fantasy;"&gt;&lt;span class="Apple-style-span"  style="font-size:-webkit-xxx-large;"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;p style="margin: 0.0px 0.0px 0.0px 0.0px; font: 11.0px Monaco"&gt;&lt;span class="Apple-style-span"  style="font-family:Georgia, fantasy;"&gt;&lt;span class="Apple-style-span"  style="font-size:medium;"&gt;&lt;b&gt;Following are the results:&lt;/b&gt;&lt;/span&gt;&lt;/span&gt;&lt;/p&gt;&lt;/div&gt;&lt;br /&gt;JVM settings: -server -Xms1g -Xmx1g&lt;br /&gt;&lt;br /&gt;The test measures different range lengths with respect to the possible values in the index, e.g. Range 1% would correspond to the range [0 - 10000] on a index size 1M docs&lt;br /&gt;&lt;br /&gt;Index Size - 1M docs:&lt;br /&gt;&lt;br /&gt;&lt;table border="1" width="100%"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Range %:&lt;br /&gt;&lt;/th&gt;&lt;th&gt;RangeQuery&lt;/th&gt;&lt;th&gt;NumericRangeQuery&lt;/th&gt;&lt;th&gt;FieldCacheRangeQuery&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;1%&lt;/td&gt;&lt;td align="center"&gt;202 ms&lt;/td&gt;&lt;td align="center"&gt;1 ms&lt;/td&gt;&lt;td align="center"&gt;1 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;5%&lt;/td&gt;&lt;td align="center"&gt;2047 ms&lt;/td&gt;&lt;td align="center"&gt;3 ms&lt;/td&gt;&lt;td align="center"&gt;2 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;20%&lt;/td&gt;&lt;td align="center"&gt;NA&lt;/td&gt;&lt;td align="center"&gt;9 ms&lt;/td&gt;&lt;td align="center"&gt;5 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;NA&lt;/td&gt;&lt;td align="center"&gt;17 ms&lt;/td&gt;&lt;td align="center"&gt;9 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;100%&lt;/td&gt;&lt;td align="center"&gt;NA&lt;/td&gt;&lt;td align="center"&gt;26 ms&lt;/td&gt;&lt;td align="center"&gt;9 ms&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;Index Size - 5M docs, No longer measuring RangeQuery to stop beating the dead horse&lt;br /&gt;&lt;br /&gt;&lt;table border="1" width="100%"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Range %:&lt;br /&gt;&lt;/th&gt;&lt;th&gt;NumericRangeQuery&lt;/th&gt;&lt;th&gt;FieldCacheRangeQuery&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;1%&lt;/td&gt;&lt;td align="center"&gt;6 ms&lt;/td&gt;&lt;td align="center"&gt;8 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;5%&lt;/td&gt;&lt;td align="center"&gt;15 ms&lt;/td&gt;&lt;td align="center"&gt;11 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;20%&lt;/td&gt;&lt;td align="center"&gt;38 ms&lt;/td&gt;&lt;td align="center"&gt;27 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;75 ms&lt;/td&gt;&lt;td align="center"&gt;47 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;100%&lt;/td&gt;&lt;td align="center"&gt;128 ms&lt;/td&gt;&lt;td align="center"&gt;43 ms&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;Index Size - 10M docs&lt;br /&gt;&lt;table border="1" width="100%"&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Range %:&lt;br /&gt;&lt;/th&gt;&lt;th&gt;NumericRangeQuery&lt;/th&gt;&lt;th&gt;FieldCacheRangeQuery&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;1%&lt;/td&gt;&lt;td align="center"&gt;10 ms&lt;/td&gt;&lt;td align="center"&gt;16 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;5%&lt;/td&gt;&lt;td align="center"&gt;28 ms&lt;/td&gt;&lt;td align="center"&gt;23 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;20%&lt;/td&gt;&lt;td align="center"&gt;84 ms&lt;/td&gt;&lt;td align="center"&gt;53 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;50%&lt;/td&gt;&lt;td align="center"&gt;153 ms&lt;/td&gt;&lt;td align="center"&gt;97 ms&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td align="center"&gt;100%&lt;/td&gt;&lt;td align="center"&gt;249 ms&lt;/td&gt;&lt;td align="center"&gt;92 ms&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;br /&gt;&lt;br /&gt;&lt;b&gt;Conclusion &amp;amp; Observations:&lt;/b&gt;&lt;div&gt;&lt;ul&gt;&lt;li&gt;Don't use RangeQuery! (Good thing it is deprecated in Lucene 2.9.x)&lt;/li&gt;&lt;li&gt;As index size increases, when the range is small, NumericRangeQuery slows down rather gracefully (less than linear), FieldCacheQuery slows down linearly.&lt;/li&gt;&lt;li&gt;As index size increases, when the range is large, NumericRangeQuery slows down almost linearly, and FieldCacheQuery plateaus at 50%.&lt;/li&gt;&lt;li&gt;If you expect your range covers &gt;=5% of the corpus, FieldCacheQuery is faster.&lt;/li&gt;&lt;li&gt;FieldCacheQuery code snippet above can be easily changed to support Lucene 2.4.x for those of you that have not upgraded.&lt;/li&gt;&lt;li&gt;The FieldCacheQuery idea can be applied similarly to non-numeric fields, e.g. range of texts.&lt;/li&gt;&lt;li&gt;FieldCacheQuery assumes a one-time cost in index load (for the FieldCache), but the cost is necessary if you want to do sorting.&lt;/li&gt;&lt;li&gt;If you want to do range query on a really large index, consider sharding your index.&lt;/li&gt;&lt;/ul&gt;&lt;/div&gt;&lt;div&gt;Don't believe my numbers? Run it yourself by checking out the source code for the test at: &lt;a href="http://code.google.com/p/lucene-book/source/checkout"&gt;http://code.google.com/p/lucene-book/source/checkout&lt;/a&gt;, and run the test: book.lid.example.range.RangeTest.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-8864755192830703730?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/8864755192830703730/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2009/11/numeric-range-queries-comparison.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/8864755192830703730'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/8864755192830703730'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2009/11/numeric-range-queries-comparison.html' title='Numeric Range Queries - A comparison'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-5776695486341858447</id><published>2009-08-21T15:30:00.000-07:00</published><updated>2009-11-15T10:49:58.620-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='performance'/><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='indexing'/><category scheme='http://www.blogger.com/atom/ns#' term='optimization'/><category scheme='http://www.blogger.com/atom/ns#' term='realtime search'/><category scheme='http://www.blogger.com/atom/ns#' term='realtime'/><title type='text'>Index Optimization for realtime search - Good idea?</title><content type='html'>&lt;span style="font-weight: bold;"&gt;Overview of Optimize:&lt;/span&gt;&lt;br /&gt;There is a Lucene API: IndexWriter.optimize(), which combines all segments into 1 large segment and also expunges all deleted docids.&lt;br /&gt;&lt;br /&gt;Searching over an optimized index is very fast because you neither pay penalty to evaluate deleted docs nor search time OR'ing over documents in different indexes. After some OS level warming, the 1 segment file is loaded into the IO cache to avoid IO costs. Hence the method name: optimize(). This is terrific for an offline indexing system, where a pristine optimized index is published for searching.&lt;div&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Segment merge:&lt;br /&gt;&lt;/span&gt;Segment merge is essential incremental indexing. Lucene has a "hidden" extensible API: MergePolicy. (properly exposed in Lucene 2.9) By default, LogByteSizeMergePolicy is used. This policy will periodically choose to merge small segments into larger segments, and size of the segment is based on number of bytes of the segment file. Only during a merge, deleted docs are expunged.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Real-time indexing:&lt;/span&gt;&lt;br /&gt;In a real-time indexing environment, indexing operations are being applied to the index constantly, and the index is fragmented quickly. A challenge here is how to maintain an optimal index for real-time indexing.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In our application, where there are many updates, e.g. old documents are deleted and then added with newer/fresher data. What happened was over time, the largest segment would contain more and more deleted docs, and they will never be expunged because the segment is never a candidate of a merge since deleted docs are merely marked, not removed from the segment, thus the segment size still remain to be large. In the worse scenario, the largest segment would contain only deleted docs.&lt;br /&gt;&lt;br /&gt;We made an enhancement to LogMergePolicy to normalize on size taking into consideration number of deleted documents (and contributed back: &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1634"&gt;LUCENE-1634)&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;This helped quite a bit. We still however, see the problem with the situation when segments are promoted so that they get the merge with the largest segment:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;In a realtime scenario, when smaller segments are "escalated" to be merged with the larger segment, the search response time also escalates. This is because the merge itself gets more expensive as the sizes of the segments to be merged get larger. Furthermore, the newly merged segment needs to be loaded into IO cache, while that is happening, search time is impacted significantly.&lt;div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;To solve this problem, we have created a new MergePolicy implementation:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Idea:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Instead of defining an optimized index to be 1 large segment, we redefine it to be N segments of balanced size, where N is a configurable parameter. The idea is to spread the cost of a large segment merge into smaller merge costs.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Implementation:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;At each point of the merge operation, the segments to merge is selected to main a balanced segment structure. The selection is modeled as a state and a merge is viewed as a transition between states, and each such transition is associated with a cost function of merge. We then applied the &lt;a href="http://en.wikipedia.org/wiki/Viterbi_algorithm"&gt;Viterbi algorithm&lt;/a&gt; to identify the optimal selection(s).&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Performance numbers and details be found at this &lt;a href="http://code.google.com/p/zoie/wiki/ZoieMergePolicy"&gt;wiki&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Our MergePolicy implementation has also been contributed back to Lucene: &lt;a href="https://issues.apache.org/jira/browse/LUCENE-1924"&gt;LUCENE-1924&lt;/a&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Conclusion:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;In conclusion, I would like to emphasize how indexing can affect search performance especially in real-time search. There are often hidden problems as they are invisible to unit tests and simple performance tests. They can also be data dependent and show up after hours or even days of stressing the system. Thus, it is important to understand the details of the indexing to build a scalable and robust system.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;Credit:&lt;/b&gt;&lt;/div&gt;&lt;div&gt;&lt;b&gt;&lt;br /&gt;&lt;/b&gt;&lt;/div&gt;&lt;div&gt;I'd like to credit this idea and implementation to my colleague &lt;a href="http://www.linkedin.com/in/ymatsuda"&gt;Yasuhiro Matsuda&lt;/a&gt;.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-5776695486341858447?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/5776695486341858447/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2009/08/index-optimization-for-realtime-search.html#comment-form' title='4 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/5776695486341858447'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/5776695486341858447'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2009/08/index-optimization-for-realtime-search.html' title='Index Optimization for realtime search - Good idea?'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>4</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-786305068164019409</id><published>2009-07-18T15:24:00.000-07:00</published><updated>2009-11-15T10:50:34.263-08:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='performance'/><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='boolean query'/><category scheme='http://www.blogger.com/atom/ns#' term='search'/><category scheme='http://www.blogger.com/atom/ns#' term='optimization'/><title type='text'>Making BooleanQueries faster</title><content type='html'>At work today, we were looking optimize boolean queries of the type:&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;f1:v1^b1 OR f1:v2^b2 OR f1:v3^b3 ...&lt;/div&gt;&lt;div&gt;and&lt;/div&gt;&lt;div&gt;f1:v1^b1 AND f1:v2^b2 AND f1:v3^b3 ...&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;where f1 is a field of some structured data (e.g. Analyze=No, tf=No, norm=No)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We see that this type of query has these patterns:&lt;/div&gt;&lt;div&gt;&lt;ol&gt;&lt;li&gt;all values are in the same field&lt;/li&gt;&lt;li&gt;all clauses have the same operator joining them&lt;/li&gt;&lt;/ol&gt;When the number of clause is large, BooleanQuery can be rather slow.&lt;br /&gt;&lt;br /&gt;As of Lucene 2.4, Lucene query api has adopted &lt;span style="font-weight: bold;"&gt;DocIdSet&lt;/span&gt;, and &lt;span style="font-weight: bold;"&gt;DocIdSetIterator&lt;/span&gt; abstractions, which opened doors for various query optimizations. (IMHO, this is one of the best improvements in Lucene from an API stand point.)&lt;br /&gt;&lt;br /&gt;For our project, we have quite a few OR-clauses, and the search performance was pretty bad.&lt;br /&gt;&lt;br /&gt;Instead, when the index loads, we load into memory a datastructure very much like the FieldCache.StringIndex. In a 5 million index shard, for a field containing N values, memory footprint is approxmiately 20MB + N Strings.&lt;br /&gt;&lt;br /&gt;Given K clauses in our OR query, we build a BitSet with K bits turned on, each bit corresponding to an index into String array. (max bit is N)&lt;br /&gt;&lt;br /&gt;We then built a DocIdSetIterator that iterates the order array and checks to see if the order[i]-bit is turned on in the bitset.&lt;br /&gt;&lt;br /&gt;The resulting performance is quite encouraging: As the number of clauses increases, the iteration time is capped. From our benchmark, when we get to 10+ clauses, the performance gain is approximately 8x!&lt;br /&gt;&lt;br /&gt;Here is a detailed write-up:&lt;br /&gt;&lt;a href="http://code.google.com/p/bobo-browse/wiki/QueryConstruction"&gt;http://code.google.com/p/bobo-browse/wiki/QueryConstruction&lt;/a&gt;&lt;br /&gt;&lt;br /&gt;&lt;div&gt;This is an example of how good API design can open doors to things such as what we were able to do.&lt;br /&gt;&lt;br /&gt;Kudos to the Lucene team for the DocIdSet api!&lt;br /&gt;&lt;br /&gt;&lt;/div&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-786305068164019409?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/786305068164019409/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2009/07/making-booleanqueries-faster.html#comment-form' title='0 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/786305068164019409'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/786305068164019409'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2009/07/making-booleanqueries-faster.html' title='Making BooleanQueries faster'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>0</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-2271815744542831187</id><published>2009-04-21T21:31:00.000-07:00</published><updated>2009-04-21T23:03:01.409-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='zoie'/><category scheme='http://www.blogger.com/atom/ns#' term='distributed search'/><category scheme='http://www.blogger.com/atom/ns#' term='index partition'/><category scheme='http://www.blogger.com/atom/ns#' term='realtime search'/><title type='text'>Index Partitioning and Distributed Realtime Search</title><content type='html'>When the number of documents gets to be very large in one single index, it is usually a good idea to partition the index and distribute the search load to different processes or search nodes, each serving from a section of the corpus.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Some Background:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;There are many different ways of partitioning the index depending on the application requirements, here are some examples.&lt;br /&gt;&lt;ul&gt;&lt;li&gt;By terms - This is good because not every search node is handling every request. Only the search nodes containing terms from the query are called to serve the request. However, normally some terms are searched more than others, and on top of that, some terms contain more documents than others. Therefore it is usually very difficult to load balance this partitioning scheme.&lt;/li&gt;&lt;li&gt;By documents - This is the most common partitioning method. Even though every search node is handling every search request, (assuming documents are uniformly distributed across the partitions,) it is very easy to balance and re-balance search traffic.&lt;/li&gt;&lt;li&gt;By time - This is a good idea in corpus containing time sensitive data, e.g. job postings, news, blogs etc. In such cases, recent documents are by definition more relevant than older documents. Thus, we can add more replication on partitions with recent documents while slowly discard older partitions from the index.&lt;/li&gt;&lt;/ul&gt;&lt;span style="font-weight: bold;"&gt;&lt;br /&gt;Our Partitioning Scheme:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;We, at &lt;a href="http://www.linkedin.com/"&gt;LinkedIn&lt;/a&gt;, are in a real-time search situation while guaranteeing high availability, we want to have a partitioning scheme that would allows us to avoid re-indexing/re-partitioning as our corpus grows.&lt;br /&gt;&lt;br /&gt;We have decided to choose partition by ranges of documents. We choose the maximum number of documents in a partition, say N. And partition along consecutive document IDs (UIDs, not Lucene docids), .e.g. partition K, would containing documents ranging from K*N to (K+1)*N-1, K = 0,1,2,...&lt;br /&gt;&lt;br /&gt;As new documents are added to the index, we grow the partition by adding to the largest partition, and when the UID exceeds the maximum document ID on the largest partition, a new partition is created. This partitioning scheme allows us to grow to newer partitions without a "re-partition".&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Cross domain "joins":&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;Another important benefit we enjoy with this partitioning scheme is that we can load Lucene docid to UID mapping (see my &lt;a href="http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html"&gt;previous post&lt;/a&gt;) as well as the reverse: UID mapping to Lucene docid mapping, because we know the range UIDs fall in. We are thus very easily able to translate a filter set in the UID space to a Lucene Filter or a Lucene Query. This is important because it essentially allows us to do "joins" across different document spaces. The memory footprint for holding both lookup arrays are docCount*4 bytes each, nowadays, it's a piece of cake.&lt;br /&gt;&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Realtime at the Node-Level:&lt;/span&gt;&lt;br /&gt;&lt;br /&gt;To guarantee realtime on the search node level, each of are partitions are powered by &lt;a href="http://code.google.com/p/zoie"&gt;Zoie&lt;/a&gt;. See my &lt;a href="http://invertedindex.blogspot.com/2009/04/zoie-realtime-search-and-indexing.html"&gt;earlier post&lt;/a&gt;)&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-2271815744542831187?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/2271815744542831187/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2009/04/index-partitioning-and-distributed.html#comment-form' title='5 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/2271815744542831187'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/2271815744542831187'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2009/04/index-partitioning-and-distributed.html' title='Index Partitioning and Distributed Realtime Search'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>5</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-1936383310745610918</id><published>2009-04-12T05:08:00.000-07:00</published><updated>2009-04-21T23:04:16.984-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='field cache'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='uid mapping'/><category scheme='http://www.blogger.com/atom/ns#' term='payload'/><title type='text'>Lucene docid,UID mapping and Payload</title><content type='html'>Lucene &lt;b&gt;&lt;i&gt;docids&lt;/i&gt;&lt;/b&gt; are internal integers representing documents in the index, and Lucene takes liberty in reassign docids to during segment merges and when expunging deleted documents. In general one needs to be very careful with dealing with docids when they are referenced in the application context because the document a docid refers to is expected to change at anytime.&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;In many applications, it is very useful to have a quick way to map a Lucene docid to an external UID, e.g. a primary key in a database. For example, result filtering depending on an external document set. In the case of LinkedIn people search, filtering by a person's network on a text search result set, UID in this case being the member id. We needed a mechanism to map from a Lucene docid in a HitCollector.collect() very quickly to the corresponding UID.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A common and naive practice is to keep the UID in a Stored field, e.g.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;Field("uid",myUID,Store.YES,Index.No);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;and retrieve the uid via:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int uid = Integer.parseInt(indexReader.document(docid)&lt;br /&gt;                         .get("uid"));&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;When recall (the number of hit candidates) is large, this method is streaming from the stored file for each hit, and thus does not perform.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;A better alternative is to use the &lt;b&gt;FieldCache&lt;/b&gt; to load into an integer array a mapping from docid to uid for each docid in the index ,e.g.&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; 0 to indexReader.maxDoc()&lt;/span&gt;&lt;/div&gt;&lt;div&gt;(assuming uid is kept in the indexed field) e.g.&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;Field("uid",myUID,Store.NO,&lt;br /&gt;                Index.NOT_ANALYZED_NO_NORMS);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;When IndexReader loads, we load the "map array":&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int[] uidArray = FieldCache.DEFAULT.getInts(indexReader,"uid");&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;This is done once for a new IndexReader, and at search time, the mapping is just an array lookup:&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int uid = uidArray[docid];&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;which is rather fast.&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;The problem here is that the number of terms in the uid field is equal to or very close to the number of documents in the index. To load from the FieldCache, a random seek is done for each term, which makes loading the uidArray extremely slow. This is fine if IndexReader does not load very often, but unfortunately for us, in a realtime search system, we don't have the luxury of such assumption. (This also, as a side effect, increases the total number of terms in the index and may impact search performance)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;Lucky for us, Lucene 2.2 introduced Payloads, (a huge step towards flexible indexing), which is the ability to allow arbitrary data to be added to the posting list. In our case, we would create an arbitrary term for every document, e.g. &lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;Term("uid","_UID_")&lt;/span&gt;, and attach a 4-byte uid value to each posting:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;class UIDTokenStream extends TokenStream{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  private Token token = new Token("_UID_",0,0);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  private byte[]  buffer = new byte[4];&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  private boolean returnToken = false;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  void setUID(int uid){&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    buffer[0] = (byte)uid;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    buffer[1] = (byte)(uid&gt;&gt;8);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    buffer[2] = (byte)(uid&gt;&gt;16);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    buffer[3] = (byte)(uid&gt;&gt;24);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    token.setPayload(new Payload(buffer));&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    returnToken = true;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  }&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  public Token next() throws IOException{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  if (returnToken){ returnToken = false; return token; }&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  else { return null; }&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;UIDTokenStream tokenStream = new UIDTokenStream();&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;tokenStream.setUID(myUID);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;Field("uid",tokenStream);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;When we load the uidArray, we do:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;TermPositions tp = null;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;byte[] dataBuffer = new byte[4];&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int[] uidArray = new int[indexReader.maxDoc()];&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;try{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  int idx = 0;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  tp = indexReader.termPositions(new Term("uid","_UID_"));&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  while(tp.next()){&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    int doc = tp.doc();&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    tp.nextPosition();&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    tp.getPayload(dataBuffer,0);&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    // convert buffer to int&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    int uid = ((dataBuffer[3]&amp;amp; 0xFF) &lt;&lt;24)&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    uidArray[idx++]=uid; &lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  }&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;finally{&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  if (tp!=null){&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;    tp.close();&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;  }&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;}&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;Looking up the uid is same as before, simply an array lookup:&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;int uid = uidArray[docid];&lt;/span&gt;&lt;/div&gt;&lt;div&gt;&lt;span class="Apple-style-span"  style="font-family:'courier new';"&gt;&lt;br /&gt;&lt;/span&gt;&lt;/div&gt;&lt;div&gt;The difference here is when loading the uidArray, sequential seek is done for each docid while paying the penalty of byte[] -&gt; int conversion. (Also to a previous point, this method introduces only 1 extra term to the entire index)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;We ran some performance numbers on a 2M-document index, loading from the FieldCache took more than 16.5 seconds, while loading the uidArray from payload took 430 ms, this is an improvement of over 38x! (this time was taken a while ago from my MacBook Pro using Lucene 2.2)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div&gt;This mechanism is built into &lt;a href="http://code.google.com/p/zoie"&gt;Zoie Realtime Search and Indexing System&lt;/a&gt; used to filter out duplicate documents between in-memory indexes and the disk index. (Since they are independent indexes, only consistent id to filter from is the UID)&lt;/div&gt;&lt;div&gt;&lt;br /&gt;&lt;/div&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-1936383310745610918?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/1936383310745610918/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html#comment-form' title='15 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/1936383310745610918'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/1936383310745610918'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2009/04/lucene-dociduid-mapping-and-payload.html' title='Lucene docid,UID mapping and Payload'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>15</thr:total></entry><entry><id>tag:blogger.com,1999:blog-7998480320632723546.post-8488395236339602615</id><published>2009-04-07T19:55:00.000-07:00</published><updated>2009-04-07T20:13:32.145-07:00</updated><category scheme='http://www.blogger.com/atom/ns#' term='open source'/><category scheme='http://www.blogger.com/atom/ns#' term='zoie'/><category scheme='http://www.blogger.com/atom/ns#' term='java'/><category scheme='http://www.blogger.com/atom/ns#' term='lucene'/><category scheme='http://www.blogger.com/atom/ns#' term='search'/><category scheme='http://www.blogger.com/atom/ns#' term='realtime'/><title type='text'>Zoie - a realtime search and indexing system</title><content type='html'>A realtime search system makes it possible for queries to find a new document immediately or almost immediately after it has been updated. With &lt;span style="font-weight: bold;"&gt;Lucene&lt;/span&gt;'s incremental update functionality, we were able to extend &lt;span style="font-weight: bold;"&gt;Lucene&lt;/span&gt; to support realtime indexing/search.&lt;br /&gt;&lt;br /&gt;We open-sourced this technology and called it &lt;span style="font-weight: bold;"&gt;Zoie&lt;/span&gt;: &lt;a href="http://code.google.com/p/zoie"&gt;http://code.google.com/p/zoie&lt;/a&gt;.&lt;br /&gt;&lt;br /&gt;Our design uses multiple indexes; one main index on disk, two additional indexes in memory to handle transient indexing events.&lt;br /&gt;&lt;br /&gt;The disk index will grow and be rather large, therefore indexing updates on disk will be performed in batches.  Processing updates in batches, allows us to merge updates on the same document to reduce redundant updates. Moreover, the disk index would not be fragmented as the indexer is not thrashed by a large number of small indexing calls/requests.  We keep a shared disk-based &lt;span style="font-style: italic;"&gt;IndexReader&lt;/span&gt; to serve search request.  Once batch indexing is performed, we build/load a new &lt;span style="font-style: italic;"&gt;IndexReader&lt;/span&gt; and then publish the new shared &lt;span style="font-style: italic;"&gt;IndexReader&lt;/span&gt;. The cost of building/loading the &lt;span style="font-style: italic;"&gt;IndexReader&lt;/span&gt; is thus hidden from the cost of search.&lt;br /&gt;&lt;br /&gt;To ensure realtime behavior, we have two helper memory indexes (MemA and MemB) that alternate in their roles.  One index, say MemA, accumulates events and directly serves realtime results.  When a flush/commit event occurs, MemA stops receiving new events, these are sent to MemB.  At this time search requests are served from MemA, MemB and the disk index.  Once the disk merge is completed, MemA is cleared and MemB takes its place.  Untill the next flush/commit searches will be served from the new disk index and MemB.&lt;br /&gt;&lt;br /&gt;For each search request, we open and load a new &lt;span style="font-style: italic;"&gt;IndexReader&lt;/span&gt; from each of the memory indexes, and along with the shared disk &lt;span style="font-style: italic;"&gt;IndexReader&lt;/span&gt;, we return a &lt;span style="font-style: italic;"&gt;MultiIndexReader&lt;/span&gt; built from those three &lt;span style="font-style: italic;"&gt;IndexReaders&lt;/span&gt;.&lt;br /&gt;&lt;br /&gt;&lt;span style="font-weight: bold;"&gt;Zoie&lt;/span&gt; has been running in production at &lt;a href="http://www.linkedin.com"&gt;http://www.linkedin.com&lt;/a&gt;, in distributed mode, it is handling almost 40 million documents (or user profiles) in realtime and serving over 6 million requests a day with an average latency below 50ms.&lt;br /&gt;&lt;br /&gt;We welcome contributions in any form to move the project forward.&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/7998480320632723546-8488395236339602615?l=invertedindex.blogspot.com' alt='' /&gt;&lt;/div&gt;</content><link rel='replies' type='application/atom+xml' href='http://invertedindex.blogspot.com/feeds/8488395236339602615/comments/default' title='Post Comments'/><link rel='replies' type='text/html' href='http://invertedindex.blogspot.com/2009/04/zoie-realtime-search-and-indexing.html#comment-form' title='3 Comments'/><link rel='edit' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/8488395236339602615'/><link rel='self' type='application/atom+xml' href='http://www.blogger.com/feeds/7998480320632723546/posts/default/8488395236339602615'/><link rel='alternate' type='text/html' href='http://invertedindex.blogspot.com/2009/04/zoie-realtime-search-and-indexing.html' title='Zoie - a realtime search and indexing system'/><author><name>SearchMan</name><uri>http://www.blogger.com/profile/12064596275372471624</uri><email>noreply@blogger.com</email><gd:image rel='http://schemas.google.com/g/2005#thumbnail' width='32' height='24' src='http://4.bp.blogspot.com/_iAxmLMRbouc/TTMisp38OmI/AAAAAAAAAA8/CfgVp7kqiCA/S220/Photo%2Bon%2B2011-01-07%2Bat%2B23.32%2B%25232.jpg'/></author><thr:total>3</thr:total></entry></feed>
