Application-level Sort
Applications can try to control the order of the documents in the index by e.g. sorting them beforehand. This will work if the index is created once, from a single batch of documents. However, indexes are often created incrementally, where documents are indexed as they are encountered. Therefore, even if the application sorts each batch, the index will quickly get out-of-order as more batches of documents are indexed.Actually, as long as the application does not use a MergePolicy, each segment will be sorted by itself, and it could process the first K documents from each segment to determine the global top K documents. However, as soon as segments are merged, they will lose their internal sort order.If the application indexes in multiple threads (which significantly improves indexing performance), then keeping the index and segments sorted becomes an even harder task.
Static Index Sorting
In the past, Lucene offered anIndexSorter
utility which allowed you to statically sort an index. It was dropped following the move to Lucene 4.0.0 APIs, but recently brought back as SortingAtomicReader
, which exposes a sorted view of the index. It uses a Sorter
which returns a permutation on the documents, so when it is asked for document X, it returns document Y, where perm(X)=Y
. To sort an index with it, you should pass it to IndexWriter.addIndexes()
.
Sorting Segments During Merge
Static index sorting is less useful to applications that incrementally build their indexes. As explained above, one challenge is to keep the documents sorted in a segment that is the result of a merge.SortingMergePolicy
(introduced in LUCENE-4752) wraps segments that are picked for merge with SortingAtomicReader
, and guarantees that merged segments are also sorted.
NOTE: SortingMergePolicy
was measured to be slower by x2-3 times than other merge policies, which is not so bad considering the task at hand. If your application's characteristics justify the use of this merge policy, it is important to compare the effects it has on indexing slowdown vs search speedup for your application!
Search Early Termination
Having an incrementally built index kept globally sorted, so that the first K documents you traverse are the ones to return as search results, is very hard and a solution to this is not yet offered by Lucene (patches are welcome!). However,SortingMergePolicy
guarantees that merged segments are always sorted, and so we can apply early-termination search logic at each segment level.
EarlyTerminatingSortingCollector
(introduced in LUCENE-4858) helps you do that, by early-terminating document collection on sorted segments, while doing full evaluation on unsorted ones (e.g. newly indexed documents or segments in an existing index).
NOTE: in a real-time indexing application, newly flushed segments will usually contain a small number of documents anyway, and therefore fully evaluating them is not likely to be very costly.
Applications for Index Sorting
Besides sorting the index for search early termination purposes, there are other applications to index sorting, for example compression. Sorting the index by documents' size, type or date attributes, can result in better representation of the documents in the index, leading to a smaller index footprint, which can eventually speed up searches too since more parts of the index can fit into RAM. I'm sure that Lucene users will find even more useful applications to index sorting, so stay tuned!