Sunday, April 28, 2013

Index Sorting with Lucene

When you index documents with Lucene, you often index them in some arbitrary order, usually on a first-come, first-served basis. Most applications never give a second thought to the order of the documents in the index, but some could benefit greatly from controlling it. For example, an application that always displays search results sorted by date could benefit if the index kept the documents in that order: it could then traverse just the first K documents that match the query, instead of processing the entire result set to determine the top K.

Application-level Sort

Applications can try to control the order of the documents in the index, for example by sorting them before indexing. This works if the index is created once, from a single batch of documents. However, indexes are often created incrementally, with documents indexed as they are encountered. Therefore, even if the application sorts each batch, the index will quickly get out of order as more batches of documents are indexed.
Actually, as long as segments are never merged (for example, when the application uses NoMergePolicy), each segment remains sorted internally, and the application can process the first K matching documents from each segment to determine the global top K documents. However, as soon as segments are merged, the resulting segment loses that internal sort order.
If the application indexes in multiple threads (which significantly improves indexing performance), then keeping the index and segments sorted becomes an even harder task.
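The per-segment idea above can be sketched in plain Java, without any Lucene classes. This is only an illustration under stated assumptions: PerSegmentTopK and topK are hypothetical names, each long[] stands in for the sort keys (e.g. dates) of the matching documents in one segment, every segment is assumed to be sorted in ascending key order, and k is assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch, not part of Lucene. Each long[] models the sort keys
// of the matching documents in one segment, already in ascending order.
final class PerSegmentTopK {

    /** Returns the k smallest keys overall, reading at most k entries per segment. */
    static List<Long> topK(List<long[]> sortedSegments, int k) {
        List<Long> candidates = new ArrayList<>();
        for (long[] segment : sortedSegments) {
            // A sorted segment cannot contribute more than its first k entries.
            int limit = Math.min(k, segment.length);
            for (int i = 0; i < limit; i++) {
                candidates.add(segment[i]);
            }
        }
        // Combine the per-segment candidates into the global top k.
        Collections.sort(candidates);
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        List<long[]> segments = Arrays.asList(new long[] {1, 4, 9}, new long[] {2, 3, 10});
        System.out.println(topK(segments, 3));
    }
}
```

With S segments, at most S*K keys are examined regardless of how many documents match, which is the saving the paragraph above describes.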

Static Index Sorting

In the past, Lucene offered an IndexSorter utility which allowed you to statically sort an index. It was dropped in the move to the Lucene 4.0.0 APIs, but was recently brought back as SortingAtomicReader, which exposes a sorted view of the index. It uses a Sorter which returns a permutation of the documents, so when it is asked for document X, it returns document Y, where perm(X)=Y. To sort an index with it, you should pass it to IndexWriter.addIndexes().
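To make the permutation idea concrete, here is a minimal sketch in plain Java. It does not use the Lucene API: SortedViewSketch, sortPermutation and document are hypothetical names, and a document is modeled as a string with a numeric sort key. The point is only that the underlying data never moves; the view answers a request for document X by looking up perm(X).

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical sketch, not the Lucene API: a sorted "view" over documents
// that stay in their original order.
final class SortedViewSketch {

    /** Returns perm such that perm[x] is the original id of the x-th document by ascending key. */
    static int[] sortPermutation(long[] keys) {
        Integer[] ids = new Integer[keys.length];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = i;
        }
        // Sorting an object array is stable, so documents with equal keys keep their order.
        Arrays.sort(ids, Comparator.comparingLong(id -> keys[id]));
        int[] perm = new int[ids.length];
        for (int i = 0; i < perm.length; i++) {
            perm[i] = ids[i];
        }
        return perm;
    }

    /** Asked for document x in sorted order, returns the original document perm[x]. */
    static String document(String[] docs, int[] perm, int x) {
        return docs[perm[x]];
    }

    public static void main(String[] args) {
        String[] docs = {"c", "a", "b"};
        long[] keys = {30, 10, 20};
        int[] perm = sortPermutation(keys);
        for (int x = 0; x < docs.length; x++) {
            System.out.println(x + " -> " + document(docs, perm, x));
        }
    }
}
```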

Sorting Segments During Merge

Static index sorting is less useful to applications that incrementally build their indexes. As explained above, one challenge is to keep the documents sorted in a segment that is the result of a merge. SortingMergePolicy (introduced in LUCENE-4752) wraps segments that are picked for merge with SortingAtomicReader, and guarantees that merged segments are also sorted.
NOTE: SortingMergePolicy was measured to be 2-3 times slower than other merge policies, which is not so bad considering the task at hand. If your application's characteristics justify using this merge policy, be sure to measure the indexing slowdown against the search speedup for your application!
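As a conceptual model of the guarantee (and not of how Lucene's merge code is written), the sketch below first takes a sorted view of every input segment and then merges those views so that the output is sorted as well. SortedMergeSketch and mergeSorted are hypothetical names, and a segment is reduced to an array of sort keys.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch, not Lucene code: merging segments so that the merged
// segment comes out sorted, whether or not the inputs were sorted.
final class SortedMergeSketch {

    static long[] mergeSorted(List<long[]> segments) {
        // Step 1: a sorted view of every segment (inputs are left untouched).
        List<long[]> views = new ArrayList<>();
        int total = 0;
        for (long[] segment : segments) {
            long[] view = segment.clone();
            Arrays.sort(view);
            views.add(view);
            total += view.length;
        }
        // Step 2: k-way merge. Each queue entry is {viewIndex, position},
        // ordered by the key it currently points at.
        PriorityQueue<int[]> heads = new PriorityQueue<>(
                Comparator.comparingLong((int[] h) -> views.get(h[0])[h[1]]));
        for (int i = 0; i < views.size(); i++) {
            if (views.get(i).length > 0) {
                heads.add(new int[] {i, 0});
            }
        }
        long[] merged = new long[total];
        int out = 0;
        while (!heads.isEmpty()) {
            int[] head = heads.poll();
            long[] view = views.get(head[0]);
            merged[out++] = view[head[1]];
            if (head[1] + 1 < view.length) {
                heads.add(new int[] {head[0], head[1] + 1});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<long[]> segments = Arrays.asList(new long[] {5, 1, 9}, new long[] {2, 8});
        System.out.println(Arrays.toString(mergeSorted(segments)));
    }
}
```

The extra sorting work in step 1 gives some intuition for why a sorting merge costs more than a plain one, although the real overhead depends on Lucene's data structures.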

Search Early Termination

Keeping an incrementally built index globally sorted, so that the first K documents you traverse are the ones to return as search results, is very hard, and Lucene does not yet offer a solution for it (patches are welcome!). However, SortingMergePolicy guarantees that merged segments are always sorted, and so we can apply early-termination search logic at the segment level.
EarlyTerminatingSortingCollector (introduced in LUCENE-4858) helps you do that, by early-terminating document collection on sorted segments, while doing full evaluation on unsorted ones (e.g. newly flushed segments, or segments of an existing index that was built without sorting).
NOTE: in a real-time indexing application, newly flushed segments will usually contain a small number of documents anyway, and therefore fully evaluating them is not likely to be very costly.
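The collection logic described in this section can be sketched as follows. This is a plain-Java illustration, not the EarlyTerminatingSortingCollector API: EarlyTerminationSketch, Segment, Result and search are hypothetical names, a segment is an array of sort keys plus a flag saying whether it is sorted, and the query is modeled as a predicate on the key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.function.LongPredicate;

// Hypothetical sketch, not the Lucene collector: stop early on sorted
// segments, fully evaluate unsorted ones.
final class EarlyTerminationSketch {

    static final class Segment {
        final long[] keys;
        final boolean sorted; // true if keys are in ascending order
        Segment(long[] keys, boolean sorted) {
            this.keys = keys;
            this.sorted = sorted;
        }
    }

    static final class Result {
        final List<Long> topK;
        final int docsVisited;
        Result(List<Long> topK, int docsVisited) {
            this.topK = topK;
            this.docsVisited = docsVisited;
        }
    }

    static Result search(List<Segment> segments, LongPredicate matches, int k) {
        List<Long> candidates = new ArrayList<>();
        int visited = 0;
        for (Segment segment : segments) {
            int collected = 0;
            for (long key : segment.keys) {
                // On a sorted segment, the first k matches are its best k: stop here.
                if (segment.sorted && collected >= k) {
                    break;
                }
                visited++;
                if (matches.test(key)) {
                    candidates.add(key);
                    collected++;
                }
            }
        }
        Collections.sort(candidates);
        List<Long> top = new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
        return new Result(top, visited);
    }

    public static void main(String[] args) {
        List<Segment> segments = Arrays.asList(
                new Segment(new long[] {1, 2, 3, 4, 5, 6}, true),   // merged, sorted
                new Segment(new long[] {9, 0, 7}, false));          // newly flushed, unsorted
        Result result = search(segments, key -> key % 2 == 0, 2);
        System.out.println(result.topK + " after visiting " + result.docsVisited + " documents");
    }
}
```

In this model the saving comes entirely from the sorted segments; the small unsorted segment is always scanned in full, which matches the note above about newly flushed segments.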

Applications for Index Sorting

Besides search early termination, index sorting has other applications, for example compression. Sorting the index by the documents' size, type or date attributes can result in a better representation of the documents in the index, leading to a smaller index footprint, which can eventually speed up searches too, since more of the index can fit into RAM. I'm sure that Lucene users will find even more useful applications of index sorting, so stay tuned!