Sunday, July 31, 2011
LinkedIn Intern Hackday 2011 and Node.js
Friday, July 8, 2011
Python, Tornado, Kafka, Oh My!
- High volume
- Data sent over the internet
Friday, October 1, 2010
LinkedIn Signal
- What are HP employees saying about Mark Hurd?
- What is the Jerry Brown campaign saying about Meg Whitman?
- What do MIT students think about Scala?
- What are IT professionals living in the SF Bay Area saying about Java?
- What are people saying about LinkedIn Signal in the last hour?
- ...
Sunday, January 31, 2010
LinkedIn Search Talk - SDForum
- realtime indexing/search
- streaming/live update
- faceted navigation
- section search
- distributed index partitioning
- Daniel Tunkelang - former Chief Scientist at Endeca (an enterprise search solution with an emphasis on faceted search) and author of the blog The Noisy Channel.
- Stefan Groschupf - creator of Katta, the popular open-source distributed search project.
Sunday, November 15, 2009
Numeric Range Queries - A comparison
The FieldCacheQuery used in the comparison is a custom Filter over FieldCache values, wrapped in a ConstantScoreQuery:

static Query buildFieldCacheQuery(final int start, final int end) {
  Filter f = new Filter() {
    @Override
    public DocIdSet getDocIdSet(IndexReader reader) throws IOException {
      // One int per document, loaded once per reader and then cached.
      final int[] data = FieldCache.DEFAULT.getInts(reader, FIELDCACHE_FIELD);
      return new DocIdSet() {
        @Override
        public DocIdSetIterator iterator() throws IOException {
          return new DocIdSetIterator() {
            int docid = -1;

            @Override
            public int advance(int target) throws IOException {
              docid = target - 1;
              return nextDoc();
            }

            @Override
            public int docID() {
              return docid;
            }

            @Override
            public int nextDoc() throws IOException {
              docid++;
              while (docid < data.length) {
                int val = data[docid];
                if (val > start && val < end) {
                  return docid;
                }
                docid++;
              }
              return DocIdSetIterator.NO_MORE_DOCS;
            }
          };
        }
      };
    }
  };
  return new ConstantScoreQuery(f);
}
Each document is indexed with three fields, one per query type:

// Trie-encoded field for NumericRangeQuery (precisionStep = 4).
NumericField numericField = new NumericField(NUMERIC_FIELD, 4);
numericField.setIntValue(n);
doc.add(numericField);

// Plain int-valued field backing the FieldCacheQuery above.
Field fieldCacheField = new Field(FIELDCACHE_FIELD, String.valueOf(n), Store.NO, Index.NOT_ANALYZED_NO_NORMS);
fieldCacheField.setOmitTermFreqAndPositions(true);
doc.add(fieldCacheField);

// Zero-padded text field for the (deprecated) RangeQuery.
Field rangeField = new Field(RANGE_FIELD, format.format(n), Store.NO, Index.NOT_ANALYZED_NO_NORMS);
rangeField.setOmitTermFreqAndPositions(true);
doc.add(rangeField);
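For reference, here is a sketch of how the three queries under comparison might be constructed, assuming a Lucene 2.9-era API; start, end, the field-name constants, and the zero-padding format are the same as in the snippets above (bounds are exclusive, matching the val > start && val < end check in buildFieldCacheQuery):

// Deprecated RangeQuery over the zero-padded text field.
Query rangeQuery = new RangeQuery(
    new Term(RANGE_FIELD, format.format(start)),
    new Term(RANGE_FIELD, format.format(end)),
    false);

// NumericRangeQuery over the trie-encoded field (precisionStep = 4).
Query numericRangeQuery = NumericRangeQuery.newIntRange(
    NUMERIC_FIELD, 4, start, end, false, false);

// FieldCache-backed query from the snippet above.
Query fieldCacheQuery = buildFieldCacheQuery(start, end);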
Following are the results:
JVM settings: -server -Xms1g -Xmx1g
The test measures different range lengths relative to the possible values in the index; e.g., a 1% range corresponds to the range [0 - 10000] on an index of 1M docs.
Index Size - 1M docs:
Range % | RangeQuery | NumericRangeQuery | FieldCacheRangeQuery |
---|---|---|---|
1% | 202 ms | 1 ms | 1 ms |
5% | 2047 ms | 3 ms | 2 ms |
20% | NA | 9 ms | 5 ms |
50% | NA | 17 ms | 9 ms |
100% | NA | 26 ms | 9 ms |
Index Size - 5M docs (RangeQuery dropped from the comparison; no point beating a dead horse):
Range % | NumericRangeQuery | FieldCacheRangeQuery |
---|---|---|
1% | 6 ms | 8 ms |
5% | 15 ms | 11 ms |
20% | 38 ms | 27 ms |
50% | 75 ms | 47 ms |
100% | 128 ms | 43 ms |
Index Size - 10M docs
Range % | NumericRangeQuery | FieldCacheRangeQuery |
---|---|---|
1% | 10 ms | 16 ms |
5% | 28 ms | 23 ms |
20% | 84 ms | 53 ms |
50% | 153 ms | 97 ms |
100% | 249 ms | 92 ms |
Conclusion & Observations:
- Don't use RangeQuery! (Good thing it is deprecated in Lucene 2.9.x)
- As index size increases, for small ranges, NumericRangeQuery slows down rather gracefully (sub-linearly) while FieldCacheQuery slows down linearly.
- As index size increases, for large ranges, NumericRangeQuery slows down almost linearly, while FieldCacheQuery plateaus once the range exceeds about 50%.
- If you expect your range covers >=5% of the corpus, FieldCacheQuery is faster.
- FieldCacheQuery code snippet above can be easily changed to support Lucene 2.4.x for those of you that have not upgraded.
- The FieldCacheQuery idea can be applied similarly to non-numeric fields, e.g. range of texts.
- FieldCacheQuery incurs a one-time cost at index load (populating the FieldCache), but that cost is necessary anyway if you want to do sorting.
- If you want to do range query on a really large index, consider sharding your index.
Friday, August 21, 2009
Index Optimization for realtime search - Good idea?
There is a Lucene API, IndexWriter.optimize(), which merges all segments into one large segment and also expunges all deleted docs.
Searching over an optimized index is very fast: you neither pay the penalty of evaluating deleted docs nor the search-time cost of OR'ing results across multiple segments. After some OS-level warming, the single segment file sits in the IO cache, avoiding IO costs. Hence the method name: optimize(). This is terrific for an offline indexing system, where a pristine optimized index is published for searching.
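A minimal sketch, assuming a Lucene 2.9-era API (dir and analyzer are placeholders), of an offline indexing system publishing an optimized index:

// Open the existing index, collapse it to a single segment, and publish.
IndexWriter writer = new IndexWriter(dir, analyzer,
    false, IndexWriter.MaxFieldLength.UNLIMITED);
writer.optimize(); // merge all segments into one, expunging deleted docs
writer.close();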
Segment merge:
Segment merging is essential to incremental indexing. Lucene has a "hidden" extensible API, MergePolicy (properly exposed in Lucene 2.9). By default, LogByteSizeMergePolicy is used. This policy periodically merges small segments into larger segments, where segment size is measured in bytes of the segment file. Deleted docs are expunged only during a merge.
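For illustration, here is a minimal sketch of tuning the default policy, assuming a Lucene 2.9-era API (the numbers are arbitrary placeholders):

// Sketch: tune the default byte-size-based merge policy on an IndexWriter.
LogByteSizeMergePolicy mergePolicy = new LogByteSizeMergePolicy(writer); // 2.9-era ctor takes the writer
mergePolicy.setMergeFactor(10);   // merge roughly 10 similar-sized segments at a time
mergePolicy.setMaxMergeMB(1024);  // leave segments larger than ~1GB alone
writer.setMergePolicy(mergePolicy);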
Real-time indexing:
In a real-time indexing environment, indexing operations are being applied to the index constantly, and the index is fragmented quickly. A challenge here is how to maintain an optimal index for real-time indexing.
We made an enhancement to LogMergePolicy that normalizes segment size to take the number of deleted documents into account (contributed back as LUCENE-1634); the idea is sketched below.
This helped quite a bit. However, we still see a problem when segments are promoted to the point where they are merged with the largest segment.
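As a rough sketch of the idea (an illustration, not the exact LUCENE-1634 patch): a segment's effective size is discounted by its deleted-document ratio, so heavily-deleted segments look smaller and are selected for merging sooner.

// Hypothetical helper illustrating the normalization used when comparing
// segment sizes: scale the byte size by the fraction of live documents.
static long normalizedSegmentSize(long sizeInBytes, int docCount, int delCount) {
  if (docCount <= 0) return sizeInBytes;
  double liveRatio = (double) (docCount - delCount) / docCount;
  return (long) (sizeInBytes * liveRatio);
}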
Saturday, July 18, 2009
Making BooleanQueries faster
This optimization applies to BooleanQueries where:
- all values are in the same field
- all clauses are joined by the same operator (e.g. all OR'ed together)
As of Lucene 2.4, the Lucene query API has adopted the DocIdSet and DocIdSetIterator abstractions, which opened the door to various query optimizations. (IMHO, this is one of the best improvements to Lucene from an API standpoint.)
For our project we have quite a few OR clauses over a single field, and search performance with a standard BooleanQuery was pretty bad.
Instead, when the index loads, we load into memory a data structure very much like FieldCache.StringIndex. In a 5-million-document index shard, for a field containing N distinct values, the memory footprint is approximately 20MB plus N Strings.
Given K clauses in our OR query, we build a BitSet with K bits turned on, each bit corresponding to the index of a clause's value in the String array (the maximum bit is N).
We then build a DocIdSetIterator that walks the order array and checks, for each doc i, whether bit order[i] is set in the BitSet; see the sketch below.
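Here is a minimal sketch of that iterator, assuming a Lucene 2.9-era API; it reuses FieldCache.StringIndex for the order/lookup arrays (buildOrDocIdSet and its parameters are illustrative names, not our exact production code):

static DocIdSet buildOrDocIdSet(final FieldCache.StringIndex idx, String[] values) {
  // One bit per distinct value in the field; turn on the K bits
  // corresponding to the OR clauses. lookup is sorted, so ordinals can
  // be found by binary search (index 0 is reserved for "no value").
  final java.util.BitSet bits = new java.util.BitSet(idx.lookup.length);
  for (String value : values) {
    int ord = java.util.Arrays.binarySearch(idx.lookup, 1, idx.lookup.length, value);
    if (ord > 0) bits.set(ord);
  }
  return new DocIdSet() {
    @Override
    public DocIdSetIterator iterator() {
      return new DocIdSetIterator() {
        int docid = -1;

        @Override
        public int docID() {
          return docid;
        }

        @Override
        public int nextDoc() {
          // Cost per doc is one array read plus one bit test,
          // independent of the number of OR clauses.
          while (++docid < idx.order.length) {
            if (bits.get(idx.order[docid])) return docid;
          }
          return NO_MORE_DOCS;
        }

        @Override
        public int advance(int target) {
          docid = target - 1;
          return nextDoc();
        }
      };
    }
  };
}

A Filter whose getDocIdSet returns this set can then be wrapped in a ConstantScoreQuery, just like the FieldCacheQuery snippet in the range-query post above.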
The resulting performance is quite encouraging: as the number of clauses increases, the iteration time stays capped. In our benchmark, once we get to 10+ clauses, the performance gain is approximately 8x!
Here is a detailed write-up:
http://code.google.com/p/bobo-browse/wiki/QueryConstruction
Kudos to the Lucene team for the DocIdSet api!