Sunday, April 27, 2014

Expressions with Lucene

Lucene's expressions module allows computing values for documents using arbitrary mathematical expressions. Among other things, expressions offer an easy and powerful way to e.g. sort search results or compute geospatial facets. The module parses String expressions in JavaScript notation, and returns an object which computes a value for each requested document. Expressions are built of literals, variables and functions. For example, in the expression 5*sqrt(votes), 5 is a literal, sqrt is a function and votes is a variable.

The JavaScript parser recognizes a handful of useful functions, such as max, min, sqrt etc. The javadocs contain the full list of functions as well as operators that the parser recognizes. The variables are resolved by binding their name to a ValueSource. For example, in order to parse and execute the above expression, you need to write code similar to this:

Expression expr = JavascriptCompiler.compile("5*sqrt(votes)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("votes", SortField.Type.INT));
Sort sort = new Sort(expr.getSortField(bindings, true)); // sort by the expression's value, descending
The votes variable is bound to the numeric field "votes" (usually a NumericDocValuesField). SimpleBindings lets you bind a variable to a SortField, but internally it is bound to a ValueSource. Expression itself returns a ValueSource, and when your application asks for the value of a document, it computes it based on the formula and the bound variables' ValueSources.

Customizing expressions

JavascriptCompiler lets you pass a mapping of custom functions, where each function is implemented by a public static Method which takes up to 256 double parameters and returns a double. You can see a good example here. That, together with variables, makes expressions highly customizable.
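For example, such a custom function is just a public static method over doubles. The sketch below defines one (maxVal3 is a hypothetical name for illustration); the Lucene-specific registration is shown in comments, since it requires the expressions module on the classpath:

```java
public class CustomFunctions {

  // A custom expression function: must be public, static, take up to 256
  // double parameters and return a double.
  public static double maxVal3(double a, double b, double c) {
    return Math.max(a, Math.max(b, c));
  }

  public static void main(String[] args) {
    // Registering it with the expressions compiler would look roughly like
    // this (requires the Lucene expressions module, hence shown as comments):
    //
    //   Map<String,Method> functions = new HashMap<>(JavascriptCompiler.DEFAULT_FUNCTIONS);
    //   functions.put("maxVal3", CustomFunctions.class.getMethod(
    //       "maxVal3", double.class, double.class, double.class));
    //   Expression expr = JavascriptCompiler.compile(
    //       "5*maxVal3(a,b,c)", functions, CustomFunctions.class.getClassLoader());

    System.out.println(maxVal3(1.0, 7.5, 3.0)); // prints 7.5
  }
}
```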

Sometimes customizing expressions is not so straightforward. For example, someone recently asked on the Lucene user-list how to use a multi-valued field in an expression, for the purpose of computing different functions on it (max, sum etc.). At first, it looks like a job for a custom function, e.g. maxVal() (max() is already taken), which could be embedded in an expression like 5*maxVal(data). However, since data is a variable, and variables are bound to ValueSources (which return a single value per document), we cannot pass all values of data to maxVal().

We can implement a ValueSource though, which returns the maximum value of the field data, and bind a variable to it. Since this isn't a true function, we cannot use a more natural notation like max(data). Perhaps one day Lucene will have built-in support for multi-valued numeric fields, and the expressions module will auto-detect such fields and pass all values to the function (you're welcome to contribute patches!). Until then though, you need to implement a ValueSource for each such function, but fortunately it is quite trivial. I wrote some prototype code below which demonstrates how to do that.

Multi-valued expressions demo

Nested expressions

Suppose that your documents contain longitude and latitude fields and you want to use them to compute a document's relevance, by computing the haversine formula. You can easily do that by compiling an expression such as haversin(40.7143528,-74.0059731,latitude,longitude).

Now, what if you want to use the result of that expression in another expression, such as _score + 1/(1+haversin(40.7143528,-74.0059731,latitude,longitude))? It starts to become somewhat long to read. Wouldn't it be better if we could encapsulate that in a distance variable and read the expression _score + 1/(1+distance) instead? Fortunately, we can do that easily with nested/sub expressions:

Expression distance = JavascriptCompiler.compile(
    "haversin(40.7143528,-74.0059731,latitude,longitude)");
Expression expr = JavascriptCompiler.compile("_score + 1/(1+distance)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("latitude", SortField.Type.DOUBLE));
bindings.add(new SortField("longitude", SortField.Type.DOUBLE));
bindings.add("distance", distance.getValueSource(bindings));
Since a variable can be bound to any ValueSource, and expressions are in fact ValueSources, we can bind the result of an expression to a variable in another expression. Nested expressions also cache their result in case you use them e.g. in multiple clauses of another expression. This could be useful if you have a very expensive expression which needs to be evaluated multiple times per document.
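For reference, the distance component can be written out in plain Java. This is a sketch of the standard haversine formula, assuming a mean Earth radius of 6371 km; Lucene's built-in haversin() may use slightly different constants, so treat the results as approximate:

```java
public class Haversine {

  // Great-circle distance in kilometers between two (lat, lon) points,
  // given in degrees. Assumes a mean Earth radius of 6371 km.
  public static double haversin(double lat1, double lon1, double lat2, double lon2) {
    double dLat = Math.toRadians(lat2 - lat1);
    double dLon = Math.toRadians(lon2 - lon1);
    double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
             + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
               * Math.sin(dLon / 2) * Math.sin(dLon / 2);
    return 6371.0 * 2 * Math.atan2(Math.sqrt(a), Math.sqrt(1 - a));
  }

  public static void main(String[] args) {
    // distance from the New York coordinates used above to themselves is 0
    System.out.println(haversin(40.7143528, -74.0059731, 40.7143528, -74.0059731));
  }
}
```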

The expressions module is very powerful and lets you code custom ranking/sorting/faceting formulas easily. With custom functions, variables and ValueSources, it makes it trivial to extend and build upon even further. If you implement useful functions, you're welcome to contribute them!

Thursday, April 24, 2014

Benchmarking Updatable DocValues

In my previous post I described how updatable DocValues are implemented, and concluded with a brief discussion about alternative approaches we could take to implement them. I created a benchmark, leveraging our benchmarking framework, to measure the indexing and searching effects of updatable DocValues. In this post I compare re-indexing a document versus updating a numeric/binary doc-values field.

The Benchmark

I ran the benchmark on an IBM System x3690 X5, a 40-core machine (dual Intel® Xeon® E7-2850 @ 2.00GHz, 24MB per-core cache) with SAS 10K disks, using Oracle's 64-bit Java 1.7.0_45 (4GB heap). I also set Linux's swappiness to 0. I indexed two Wikipedia datasets: WIKIBIG (full documents, 6.64 million total docs) and WIKIMEDIUM (documents clipped to 1K chunks of text, 33.3 million total docs), with the following settings:
  • Directory: NRTCachingDirectory(MMapDirectory)
  • Analyzer: StandardAnalyzer (no stopwords)
  • ID postings: MemoryPostingsFormat (speeds up lookups by ID)
  • Postings: Lucene41 (the current default)
  • DocValues: Direct (eliminates HW factors when reading DocValues)
  • I index the document's last-modification-time as a numeric doc-values field, and the title as a binary doc-values field as well. Additionally, each document indexes several fields such as body and title, with their respective content from the dataset.
The benchmark (all sources available here) performs the following operations:

Document updates
Documents are updated by multiple threads, where each run performs a different type of update, at a set docs/sec rate. Threads record the time to execute the update:

  • reindex (writer.updateDocument()): the document is loaded from the dataset and re-indexed fully
  • numeric (writer.updateNumericDocValue()): the last-modification-time is updated to current time
  • binary (writer.updateBinaryDocValue()): the title byte[] is updated

Index Reopens
The index is reopened at a set interval by a separate thread. The thread records the time it took for each reopen as well as whether the index was in fact reopened, and whether the thread fell behind (i.e. if reopen was too slow to achieve the target interval).

While the index is being updated and reopened, search threads execute search tasks against it, and sort the results by the last-modification-time numeric doc-values field to exercise the updated doc-values.

There are different costs to re-indexing a document and updating a doc-values field. Re-indexing a document has some cost both during writer.updateDocument(), when the document's fields are analyzed, and during index reopen, when the deleted documents' bits are turned off in the in-memory bit vectors of the segments that hold them and the re-indexed information (postings, stored fields, doc-values…) is flushed. When updating a doc-values field, writer.updateNumericDocValue() merely caches the update, and is therefore almost a 0-cost operation; however, during index reopen the doc-values field is entirely rewritten for all documents in the segment, even documents that were not updated. Therefore, doc-values updates should see the majority of their cost in that phase.

The benchmark takes a few parameters in order to compare different setups:

  • docsPerSec: a comma separated list of docs/sec rates. A higher rate means more documents are updated between index reopens.
    The assumption is that higher rates are worse for re-indexing than updating a doc-values field, since writer.updateDocument() is more expensive than writer.updateNumericDocValue().
  • reopenIntervalSec: a comma separated list of index reopen intervals. A lower interval means more index reopens, but each reopen flushes a smaller number of document updates.
    The assumption is that lower intervals are bad for updatable doc-values, since they have a constant cost no matter how many documents are updated between index reopens, while even though re-indexing a document also has some cost during index reopen, it is proportionally smaller as more documents are flushed at once. Also, because the test uses NRTCachingDirectory, most likely those flushed files don’t even hit the disk.


I ran the benchmark over the two datasets with the following parameters: numIndexThreads=4, numSearchThreads=2, reopenIntervalSec=1,2,5 and docsPerSec=1,10,100,1000. The benchmark runs each reopenIntervalSec and docsPerSec combination for 30 seconds and outputs the following results for each dataset and update method, excluding the first 5 seconds (warmup):
  • docs/s: the docs/sec rate
  • reopen/s: the reopen interval
  • reopen(ms): the average time (in milliseconds) it took to reopen the reader
  • update(ms): the total time (in milliseconds) for executing the update, averaged per reopen
  • total(ms): the total time spent per index reopen (sum of reopen(ms) and update(ms))
  • perdoc(ms): the average cost (in milliseconds) for each updated document
  • query/s: the average query/sec rate that was achieved, per reopen
The WIKIBIG dataset consists of full Wikipedia documents, and therefore re-indexing a document is expected to be more expensive (proportionally) than for the WIKIMEDIUM dataset. On the other hand, it consists of fewer documents, therefore a doc-values update needs to rewrite a smaller amount of data, and (proportionally) should be cheaper on this index. The results cannot easily be compared between the two datasets, therefore let’s analyze them separately:


  • WIKIBIG: document re-indexing
  • WIKIBIG: update numeric doc-values
  • WIKIBIG: update binary doc-values
Reopen + Update times
Reopening is almost always faster when re-indexing than when updating doc-values. Also, when updating doc-values, the time does not increase significantly as more documents are updated. That is because the heavy part of a doc-values update - rewriting the doc-values field - costs the same whether we update 1 or 1000 documents in a segment. As for executing the update itself, it is negligible when updating doc-values compared to re-indexing a document.

Total update cost

The results show that when the number of document updates is small, it is faster overall to re-index the documents than update the doc-values field. However when there are many document updates, re-indexing cannot keep up with the reopen interval and starts to fall behind. On the other hand, updating doc-values reaches a "steady" cost very quickly, and is less affected by the number of document updates.

Per document cost

While the cost per document consistently decreases for all update methods, doc-values achieve nearly 90% lower costs than re-indexing the documents. E.g. for the combination of 1000 docs/sec and reopening every 5 seconds (5000 document updates on average between reopens), the per-document cost of numeric doc-values is 0.17 msec, compared to 1.41 msec when re-indexing the document. Therefore, if your application updates many documents between index reopens, updating a doc-values field seems better than re-indexing the documents.
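To make the arithmetic explicit, here is a back-of-the-envelope sketch using the per-document figures above (the 0.17/1.41 msec numbers are from this benchmark run; treat the totals as rough estimates):

```java
public class PerDocCost {

  // rough total cost per reopen = per-document cost * updates between reopens
  public static double totalMs(double perDocMs, int docsPerSec, int reopenIntervalSec) {
    return perDocMs * docsPerSec * reopenIntervalSec;
  }

  public static void main(String[] args) {
    // 1000 docs/sec with a reopen every 5 seconds = 5000 updates per reopen
    System.out.println(totalMs(0.17, 1000, 5)); // numeric doc-values: ~850 ms per reopen
    System.out.println(totalMs(1.41, 1000, 5)); // re-indexing: ~7050 ms, more than the 5000 ms interval,
                                                // which is why re-indexing falls behind
  }
}
```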

Query per second

Updating doc-values seems to affect running searches more than re-indexing documents when the reopen interval is low. However, when updating doc-values at a given indexing rate, QPS improves as the reopen interval increases. This is expected, because the I/O cost does not change much as more documents are updated, unlike when re-indexing documents. Also, the fact that we see the same drops even when updating binary doc-values (which do not participate in the search at all) strengthens the assumption that the drop is mainly due to I/O; had I used faster hardware (e.g. an SSD), the drops might not show at all, or be less significant.


  • WIKIMEDIUM: document re-indexing
  • WIKIMEDIUM: update numeric doc-values
  • WIKIMEDIUM: update binary doc-values
As described before, the cost of updating a doc-values field in a segment depends on the number of documents in the segment, and less on how many documents are updated in it. Furthermore, even at a low docs/sec rate there's a high chance that between index reopens all index segments will be "touched" (updated), and therefore on the WIKIMEDIUM index (which contains 5x more documents than the WIKIBIG index) the cost of index reopen is substantially higher.
The results clearly show that re-indexing documents is always preferable to updating doc-values at the given reopen intervals. More so, updatable doc-values almost always fell behind the reopen interval. This makes the per-document cost results irrelevant to look at, because in practice there were far fewer index reopens than in the re-index case, and therefore each reopen affected a larger number of documents (which makes the average per-document cost appear low). Perhaps at a much higher reopen interval (e.g. a few tens of seconds, or minutes), updating doc-values on that index would have performed better than re-indexing; however, in this benchmark I focused on the near-real-time aspects of updatable doc-values, and therefore chose short reopen intervals.


The current implementation of updatable doc-values can be very useful to applications with relatively small indexes (few million documents) which reopen the index every few seconds. However, for applications with bigger indexes it is currently very expensive and such applications might prefer to re-index the documents instead.

One aspect of re-indexing a document that I didn't fully measure is content acquisition. In this benchmark the content of a document is read from the local disk, while for many applications it means accessing a remote repository, which slows things down considerably. Therefore it is recommended for such applications to benchmark this part of their indexing pipeline and determine at which point updating doc-values becomes slower than entirely re-indexing the documents.

So what's next?

In my previous post I described the stacked-segments approach and how it could speed up updatable doc-values, by only writing the documents that were updated in each stack. On the other hand, this would have some impact on search, since in order to determine the value of a document, we'd need to look it up through the "stacks".

There is however a hybrid approach, which is similar in nature to document deletions. Instead of flushing the updates to the index directory in every index reopen, we can carry the updates in-memory, and only flush when the updates take too much memory. So if an application updates only a few documents between index reopens, it won’t pay the cost of rewriting the entire doc-values field in each reopen. At some point though, the doc-values field will be written to disk (e.g. on commit or when they consume too much memory), but when that happens then many updates have already accumulated anyway, and so overall the cost of flushing per document is smaller. On the search side, the value of a document will only need to be looked for in two "stacks" - the in-memory and the base stack, which hopefully won’t have a big impact on search.

I think that this approach could work well for near-real-time applications, since they often reopen very frequently with only a few document updates in between. Also, in this benchmark I measured the cost of updating a single doc-values field; in real-life applications there are often several such fields being updated, and therefore the cost of reopening an index will be much higher (since every updated field is rewritten). Applications already hold the deletions bit-vector in memory (for every segment with deletions; maxDoc/8 bytes), and I believe that spending a proportional additional amount of memory for doc-values updates is tolerable. Still, once in a while an innocent-looking reopen (only a few updates since the last reopen) could trigger a rewrite of the entire doc-values field. Avoiding that entirely requires flushing only the documents that were updated (i.e. stacked segments), but the hybrid approach could be a good compromise until we have that.
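The hybrid approach can be sketched with a toy buffer (hypothetical code, not Lucene's implementation): updates accumulate in memory, the expensive full-field rewrite stands in as a "flush" that only happens once the buffer crosses a threshold, and reads consult the in-memory "stack" before falling back to the base segment:

```java
public class BufferedDVUpdates {
  private final java.util.Map<Integer, Long> pending = new java.util.HashMap<>(); // docID -> new value
  private final int maxBufferedUpdates;
  private int flushCount = 0;

  public BufferedDVUpdates(int maxBufferedUpdates) {
    this.maxBufferedUpdates = maxBufferedUpdates;
  }

  // cache the update; only "flush" (rewrite the field) when the buffer is full
  public void update(int docID, long value) {
    pending.put(docID, value);
    if (pending.size() >= maxBufferedUpdates) {
      flush();
    }
  }

  // resolve a value: consult the in-memory "stack" first, then the base segment
  public long valueOf(int docID, long baseValue) {
    Long updated = pending.get(docID);
    return updated != null ? updated : baseValue;
  }

  // stands in for rewriting the entire doc-values field to the directory
  public void flush() {
    pending.clear();
    flushCount++;
  }

  public int flushCount() {
    return flushCount;
  }
}
```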

Looks like there is more fun work to do with updatable doc-values. If you have further ideas on how to improve them, your patches are welcome!

Wednesday, April 23, 2014

Updatable DocValues Under the Hood

We recently committed updatable numeric and binary doc-values fields, which lets you update the value of such fields in a document without re-indexing the whole document. This allows using such fields for e.g. boosting documents relevancy to a query, especially when the boosting factor is frequently changing (like date, ratings, counters etc.). In this post I describe the under-the-hood mechanics of updating those fields.

A brief overview on DocValues

DocValues have been around since Lucene 4.0, and while they have gone through several changes and improvements since then, the basic concept remained the same – they let you record a value per-document per-field in each segment, and allow for very efficient random-access retrieval of those values. Lucene offers several doc-values types: numeric for single-valued numeric fields (date, ratings, counters), sorted/sortedset for single/multi-valued string (mostly, but can be any BytesRef) fields used for sorting, faceting, grouping etc., and binary fields for storing arbitrary byte[] per-document.

Also, Lucene provides several implementations for writing and reading DocValues: Lucene45 (the current default) stores the values on disk, but keeps some information in-memory for fast lookups. Memory loads the values into compressed arrays, while Direct loads the values into uncompressed arrays (faster lookups, but higher memory consumption). If your application runs in a resource constrained environment, you can use Disk which keeps absolutely everything on disk, at the cost of slower lookups.

DocValues record a value for every document in every field and segment, even when a document has no value associated with a field. In most cases this is acceptable, because you usually store a value for most, if not all, documents in a field (e.g. scoring factors, facets). And even if you index very sparse doc-values fields, the default encoding is very efficient at indexing blocks of "missing" values. Still, if you index hundreds of millions of documents with a few dozen doc-values fields, of which only a small percentage contain a value, writing your own sparse doc-values encoding may be beneficial.

Document deletions are a form of in-place document updates!

The mechanism by which Lucene updates DocValues fields is very similar to how document deletions are handled, so let’s look at how that’s done first. When you issue e.g. a deleteDocument(term) request, Lucene caches the request until an updated SegmentReader is needed, or the index is committed. It then resolves all the issued deletes to determine which documents are affected, both in already flushed segments as well as documents that were in IndexWriter’s RAM buffer when the delete was issued.
Up-to-date SegmentReaders are needed in two cases: when a merge kicks off (SegmentMerger uses SegmentReaders to merge segments) and when the application reopens its NRT (near-real-time) reader.
If an updated reader is needed, Lucene carries the deletes in memory by turning off a bit in the in-memory bit vector of each affected segment, avoiding a flush of the bit vector to the index directory, which may reside on a slow storage device (e.g. an NFS share). If the application also has an NRT reader open, the bit vector is cloned (because the app's reader should not be made aware of the new deletes until it is reopened), leading to double RAM consumption by those vectors. However, once that cost is paid, it is "fixed" (1 bit/doc, or maxDoc/8 bytes overall), no matter how many documents are later deleted. Also, as soon as the application releases its NRT reader (usually after reopening it), one of the vectors is released and can be reclaimed by the GC.
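As a back-of-the-envelope check of that fixed cost (a sketch; the actual in-memory bit vector classes add a small constant object overhead on top of this):

```java
public class DeleteVectorCost {

  // 1 bit per document, rounded up to whole bytes (maxDoc/8 bytes overall)
  public static long bitVectorBytes(int maxDoc) {
    return (maxDoc + 7L) / 8;
  }

  public static void main(String[] args) {
    // e.g. a segment with 10 million documents costs ~1.25 MB per vector,
    // or ~2.5 MB while an NRT reader still holds the cloned copy
    System.out.println(bitVectorBytes(10_000_000));
  }
}
```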

Updatable DocValues

Like document deletions, when the application issues e.g. an updateNumericDocValue(term, field, value) request, Lucene caches the request until an updated reader is needed or the index is committed, at which point it resolves the updates against all affected documents. Unlike deletes though, doc-values consume much more RAM, which makes carrying the updates in-memory not feasible for many applications: numeric doc-values may potentially consume 8 bytes per-document, while binary values are usually even more expensive.

Therefore, when the updates are resolved, Lucene currently rewrites the entire field that was updated to the index directory. So for example, if you update a numeric doc-values field of one document in a 1-million-document segment, Lucene rewrites 1 million numeric values, or 8MB (at most). Because Lucene segments are write-once, the updated field is written to a gen'd doc-values file of the segment, and a reader then reads the field's values from the respective gen'd doc-values file. While this has a rather high cost during indexing, it does not affect search at all, because the values are read from a single file - often a tradeoff that applications are willing to make.

To illustrate, the code below indexes several documents with "date" and "ratings" NumericDocValuesFields and then lists the files in the index directory:

IndexWriterConfig conf = new IndexWriterConfig(...);
conf.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(dir, conf);
ReaderManager manager = new ReaderManager(writer, true);
for (int i = 0; i < 5; i++) {
  Document doc = new Document();
  doc.add(new StringField("id", ""+i, Store.NO));
  doc.add(new NumericDocValuesField("date", i));
  doc.add(new NumericDocValuesField("ratings", i*10));
  writer.addDocument(doc);
}
manager.maybeRefresh(); // triggers a flush
// update the 'date' field's value of document '0'
writer.updateNumericDocValue(
    new Term("id", "0"), "date", System.currentTimeMillis());
manager.maybeRefresh(); // triggers applying the update
Notice how after the "date" field is updated and the NRT reader re-opened, the index directory contains two sets of doc-values files: _0.dat stores both the "date" and "ratings" values, while _0_1.dat stores just the "date" field's values (the _1 part of the file name is the doc-values generation for the "date" field). Since the code uses SimpleTextCodec, we can easily print the contents of each file. Below is such a print, clipped to show just the headers of the files:
file: _0_1.dat
field date
  type NUMERIC

file: _0.dat
field date
  type NUMERIC
field ratings
  type NUMERIC
If we update the value of the "date" field for the same or any other document, and then list the doc-values files in the index, we'll notice that _0_1.dat no longer exists. That's because it only contains values for the "date" field, which was entirely rewritten by the second update (under generation 2), and so _0_1.dat can be deleted.

Stacked Segments

On LUCENE-4258 we’ve started to implement field updates taking a stacked-segments approach. A stacked segment can be thought of as a layer/filter onto an existing segment, which indexes only the updates to that segment. When the full view of the segment is required (e.g. during search), the segment "collapses" all the updates down, so that the most up-to-date view of the segment is presented to the application.

The approach taken to implement updatable doc-values is similar in concept, only the doc-values "stacks" index the values of all documents. If we were to implement the approach taken on LUCENE-4258, then the doc-values stacks would index only the values of the updated documents. While this could have improved the indexing performance (writing a single value ought to be faster than writing 1 million values, no matter how fast your IO system is), it would affect search time since in order to resolve the value of a document, we would need to look for its most up-to-date value in any of the stacks. There are efficient ways to do this, but as usual, each comes with its own tradeoffs.

The key point is that stacked segments trade off indexing cost against search cost, and it is important to measure that tradeoff, since in Lucene we often prefer to trade off in favor of search. I will publish in a separate post some benchmark numbers of the current approach, which will also serve as a baseline for measuring alternative approaches in the future.

Sunday, September 22, 2013

Boosting Documents in Lucene

In Information Retrieval, a document's relevance to a search is measured by how similar it is to the query. There are several similarity models implemented in Lucene, and you can implement your own by extending the Similarity class and using the index statistics that Lucene saves. Documents can also be assigned a static score to denote their importance in the overall corpus, irrespective of the query being executed, e.g. their popularity, ratings or PageRank.

Prior to Lucene 4.0, you could assign a static score to a document by calling document.setBoost. Internally, the boost was applied to every field of the document, by multiplying the field's boost factor with the document's. However, as explained here, this has never worked correctly and depending on the type of query executed, might not affect the document's rank at all.

With the addition of DocValues to Lucene, boosting documents is as easy as adding a NumericDocValuesField and using it in a CustomScoreQuery, which multiplies the computed score by the value of the 'boost' field. The code example below illustrates how to achieve that:
// add two documents to the index
Document doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 1L));
writer.addDocument(doc);

doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 2L));
writer.addDocument(doc);

// search for 'test' while boosting by field 'boost'
Query baseQuery = new TermQuery(new Term("f", "test"));
FunctionQuery boostQuery = new FunctionQuery(new LongFieldSource("boost"));
Query q = new CustomScoreQuery(baseQuery, boostQuery);
TopDocs td =, 10);

The new Expressions module can also be used for boosting documents by writing a simple formula, as depicted below. While it's more verbose than using CustomScoreQuery, it makes boosting by more complex formulas trivial, e.g. sqrt(_score) + ln(boost).
Expression expr = JavascriptCompiler.compile("_score * boost");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("boost", SortField.Type.LONG));
Sort sort = new Sort(expr.getSortField(bindings, true));
TopDocs td =, null, 10, sort);

Now that Lucene allows updating NumericDocValuesFields without re-indexing the documents, you can incorporate frequently changing fields (popularity, ratings, price, last-modified time...) in the boosting factor without re-indexing the document every time any one of them changes.

Tuesday, May 14, 2013

The Replicator

No, as much as I want to, I don't (yet) have a true Replicator at my disposal to blog about. Nor does it look like I will have one in the near future, even though scientists are making great progress towards it. I have recently made my contribution to the global replication effort though, by adding an index replication module to Lucene. It does not convert energy to mass, nor mass to energy, so I'm not sure if scientifically it qualifies as a Replicator at all, but if you are into search index replication with Lucene, read on!

In computer science, replication is defined as "sharing information so as to ensure consistency between redundant resources ... to improve reliability, fault-tolerance, or accessibility" (Wikipedia). When you design a search engine, especially at large scales, replication is one approach to achieving these. Index replicas can replace primary nodes that become unavailable (e.g. due to planned maintenance or severe hardware failures), as well as support higher query loads by load-balancing search requests across them. Even if you are not building a large-scale search engine, you can use replication to take hot backups of your primary index while searches are running on it.

Lucene's replication framework implements a primary/backup approach, where the primary process performs all indexing operations (updating the index, merging segments etc.), and the backup/replica processes copy over the changes in an asynchronous manner. The framework consists of a few key components that control the replication actions:
  • Replicator mediates between the clients and server. The primary process publishes Revisions while the replica processes update to the most recent revision following their own. When a new Revision is published, the previous one is released (unless it is currently being replicated), so that the resources it consumes can be reclaimed (e.g. remove unneeded files).

  • Revision describes a list of files and their metadata. The Revision is responsible for ensuring that the files are available as long as clients copy them. For example, IndexRevision takes a snapshot of the index using SnapshotDeletionPolicy, to guarantee the files are not deleted until the snapshot is released.

  • ReplicationClient performs the replication operation on the replica side. It first copies the needed files from the primary server (e.g. the files that the replica is missing) and then invokes the ReplicationHandler to act on the copied files. If any errors occur during the copy process, it restarts the replication session. That way, when the handler is invoked, it is guaranteed that all files were copied safely to the local storage.

  • ReplicationHandler acts on the copied files. IndexReplicationHandler copies the files over to the index directory and then syncs them to ensure the files are on stable storage. If any errors occur, it aborts the replication session and cleans up the index directory. Only after successfully copying and syncing the files, it notifies a callback that the index has been updated, so that e.g. the application can refresh its SearcherManager or perform whatever tasks it needs on the updated index.
The replication framework supports replicating any type of files. It offers built-in support for replicating a single index (as described above), as well as an index and taxonomy pair (for faceted search) via IndexAndTaxonomyRevision and IndexAndTaxonomyReplicationHandler. To replicate other types of files, you need to implement Revision and ReplicationHandler. But be advised, implementing a handler properly is ... tricky!

The following example code shows how to use the replicator on the server side:
// publish a new revision on the server
IndexWriter indexWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));
Using the replicator on the client side is a bit more involved but hey, that's where the important stuff happens!
// client can replicate either via LocalReplicator (e.g. for backups)
// or HttpReplicator (e.g. when the server is located on a different
// node).
Replicator replicator;

// refresh SearcherManager after the index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public Boolean call() throws Exception {
    // index was updated, refresh the application's SearcherManager
    searcherManager.maybeRefresh();
    return true;
  }
};
// initialize a matching handler for the published revisions
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);

// where should the client store the files before invoking the handler
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);

ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
client.startUpdateThread(30000); // check for updates every 30 seconds

So What's Next?

The replication framework currently restarts a replication session upon failures. It would be great if it supported resuming a session, e.g. by copying only the files that were not already copied successfully. This can be done by having both the server and the client compute e.g. a checksum on the files, so the client knows which ones were copied successfully and which weren't. Patches welcome!
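The checksum idea can be sketched in plain Java (no Lucene APIs; `filesToCopy` and the file names are illustrative): both sides compute a per-file checksum, and the client re-copies only the files that are missing locally or whose checksums differ:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.zip.CRC32;

public class ResumeByChecksum {

  /** CRC32 over a file's bytes; server and client compute this independently. */
  static long checksum(byte[] fileBytes) {
    CRC32 crc = new CRC32();
    crc.update(fileBytes, 0, fileBytes.length);
    return crc.getValue();
  }

  /** Files the client still needs: missing locally, or checksum mismatch. */
  static List<String> filesToCopy(Map<String, Long> serverChecksums,
                                  Map<String, Long> clientChecksums) {
    List<String> toCopy = new ArrayList<>();
    for (Map.Entry<String, Long> e : serverChecksums.entrySet()) {
      Long local = clientChecksums.get(e.getKey());
      if (local == null || !local.equals(e.getValue())) {
        toCopy.add(e.getKey());
      }
    }
    return toCopy;
  }

  public static void main(String[] args) {
    Map<String, Long> server = new HashMap<>();
    server.put("_0.cfs", checksum("segment data".getBytes()));
    server.put("segments_1", checksum("commit data".getBytes()));

    // the client copied _0.cfs fully, but its copy of segments_1 was truncated
    Map<String, Long> client = new HashMap<>();
    client.put("_0.cfs", checksum("segment data".getBytes()));
    client.put("segments_1", checksum("commit d".getBytes()));

    System.out.println(filesToCopy(server, client)); // [segments_1]
  }
}
```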

Another improvement would be to support resume at the file level, i.e. not re-copying parts of a file that were already copied safely. This can be implemented by modifying ReplicationClient to discard the first N bytes of a file it already copied, but it would be better if the server supported it as well, so that it only sends the required parts. Here too, patches are welcome!
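A minimal sketch of the client-side half of that idea, in plain Java with illustrative names: skip the first N bytes that were already safely written in a previous, interrupted session, and copy only the remainder:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ResumeCopy {

  /**
   * Copy the remainder of a file, skipping the first alreadyCopied bytes
   * that were safely written before the previous session was interrupted.
   */
  static void resumeCopy(InputStream source, ByteArrayOutputStream dest,
                         long alreadyCopied) throws IOException {
    long skipped = 0;
    while (skipped < alreadyCopied) {
      long n = source.skip(alreadyCopied - skipped);
      if (n <= 0) throw new IOException("source shorter than resume offset");
      skipped += n;
    }
    byte[] buf = new byte[8192];
    int read;
    while ((read = source.read(buf)) != -1) {
      dest.write(buf, 0, read);
    }
  }

  public static void main(String[] args) throws IOException {
    byte[] file = "0123456789".getBytes();
    // the previous session safely copied the first 4 bytes
    ByteArrayOutputStream dest = new ByteArrayOutputStream();
    dest.write(file, 0, 4);
    resumeCopy(new ByteArrayInputStream(file), dest, 4);
    System.out.println(dest.toString()); // 0123456789
  }
}
```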

Yet another improvement is to support peer-to-peer replication. Today, the primary server is the bottleneck of the replication process, since all clients access it to pull files. This can be somewhat mitigated by e.g. a tree network topology, where the primary node is accessed by only a few nodes, which are in turn accessed by other nodes and so forth. Such a topology is, however, harder to maintain, and is very sensitive to the root node going out of action. In peer-to-peer replication, however, there is no single node from which all clients replicate; rather, each server can broadcast its current revision as well as choose any server that has a newer revision available to replicate from. An example of such an implementation over Apache Solr is described here. Hmmm ... did I say patches are welcome?

Lucene has a new replication module. It's only one day old and can already do so much. You are welcome to use it and help us teach it new things!

Sunday, April 28, 2013

Index Sorting with Lucene

When you index documents with Lucene, you often index them in some arbitrary order, usually in a first-come first-served manner. Most applications may not give a second thought to the order of the documents in the index, but some applications could greatly benefit from controlling it. For example, an application which always displays search results sorted by date could benefit if the index kept the documents sorted that way, since it could then traverse exactly the first K documents that match the query, rather than process the entire result set to determine the top K.

Application-level Sort

Applications can try to control the order of the documents in the index by e.g. sorting them beforehand. This will work if the index is created once, from a single batch of documents. However, indexes are often created incrementally, where documents are indexed as they are encountered. Therefore, even if the application sorts each batch, the index will quickly get out-of-order as more batches of documents are indexed.
Actually, as long as the application prevents segment merges (e.g. by using NoMergePolicy), each segment will be sorted by itself, and the application could process the first K documents from each segment to determine the global top K documents. However, as soon as segments are merged, they lose their internal sort order.
If the application indexes in multiple threads (which significantly improves indexing performance), then keeping the index and segments sorted becomes an even harder task.
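To make the failure mode concrete, here is a tiny plain-Java sketch (no Lucene APIs; all names are illustrative) of why per-batch sorting does not keep the whole index sorted once batches are appended:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchSortDemo {

  /** True if the whole sequence is in ascending order. */
  static boolean isSorted(List<Long> values) {
    for (int i = 1; i < values.size(); i++) {
      if (values.get(i - 1) > values.get(i)) return false;
    }
    return true;
  }

  public static void main(String[] args) {
    // each batch is sorted by date before it is indexed...
    List<Long> batch1 = List.of(20130101L, 20130105L, 20130110L);
    List<Long> batch2 = List.of(20130103L, 20130107L);

    // ...but the index appends batches, so their documents line up back-to-back
    List<Long> index = new ArrayList<>(batch1);
    index.addAll(batch2);

    System.out.println(isSorted(batch1)); // true
    System.out.println(isSorted(batch2)); // true
    System.out.println(isSorted(index));  // false: global order is lost
  }
}
```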

Static Index Sorting

In the past, Lucene offered an IndexSorter utility which allowed you to statically sort an index. It was dropped following the move to the Lucene 4.0.0 APIs, but was recently brought back as SortingAtomicReader, which exposes a sorted view of the index. It uses a Sorter which returns a permutation on the documents, so that when the reader is asked for document X, it returns document Y, where perm(X)=Y. To sort an index with it, you pass the sorted reader to IndexWriter.addIndexes().
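The permutation idea can be sketched outside Lucene with a toy in-memory "reader" (all names here are illustrative, not SortingAtomicReader's actual API): the view stores a mapping from sorted position to original document ID, and routes every document request through it:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortedView {
  private final String[] docs;   // the wrapped "reader", in original index order
  private final int[] newToOld;  // permutation: sorted position -> original docID

  SortedView(String[] docs, int[] newToOld) {
    this.docs = docs;
    this.newToOld = newToOld;
  }

  /** Asked for document X of the sorted view, return the mapped document of the wrapped data. */
  String document(int x) {
    return docs[newToOld[x]];
  }

  /** Build the permutation by sorting the original docIDs by a per-document key. */
  static int[] sortPermutation(long[] keys) {
    Integer[] ids = new Integer[keys.length];
    for (int i = 0; i < keys.length; i++) ids[i] = i;
    Arrays.sort(ids, Comparator.comparingLong(i -> keys[i]));
    int[] perm = new int[keys.length];
    for (int i = 0; i < keys.length; i++) perm[i] = ids[i];
    return perm;
  }

  public static void main(String[] args) {
    String[] docs = {"doc-c", "doc-a", "doc-b"};
    long[] dates = {20130103L, 20130101L, 20130102L}; // sort key per document
    SortedView view = new SortedView(docs, sortPermutation(dates));
    for (int i = 0; i < docs.length; i++) {
      System.out.println(view.document(i));
    }
    // prints doc-a, doc-b, doc-c: a sorted view over unsorted storage
  }
}
```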

Sorting Segments During Merge

Static index sorting is less useful to applications that incrementally build their indexes. As explained above, one challenge is to keep the documents sorted in a segment that is the result of a merge. SortingMergePolicy (introduced in LUCENE-4752) wraps segments that are picked for merge with SortingAtomicReader, and guarantees that merged segments are also sorted.
NOTE: SortingMergePolicy was measured to be 2-3x slower than other merge policies, which is not so bad considering the task at hand. If your application's characteristics justify the use of this merge policy, it is important to compare its indexing slowdown against the search speedup it yields for your application!

Search Early Termination

Keeping an incrementally built index globally sorted, so that the first K documents you traverse are the ones to return as search results, is very hard, and Lucene does not yet offer a solution (patches are welcome!). However, SortingMergePolicy guarantees that merged segments are always sorted, so we can apply early-termination search logic at the level of each segment.
EarlyTerminatingSortingCollector (introduced in LUCENE-4858) helps you do that, by early-terminating document collection on sorted segments, while doing full evaluation on unsorted ones (e.g. newly indexed documents or segments in an existing index).
NOTE: in a real-time indexing application, newly flushed segments will usually contain a small number of documents anyway, and therefore fully evaluating them is not likely to be very costly.
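The per-segment logic can be sketched in plain Java (illustrative names only; Lucene's actual collector integrates with its Collector API): stop collecting after K hits on a segment known to be sorted by the query's sort order, and fully evaluate unsorted segments:

```java
import java.util.ArrayList;
import java.util.List;

public class EarlyTermination {

  /**
   * Collect matching docs from one segment. If the segment is sorted by the
   * query's sort order, the first k matches are necessarily the segment's
   * top-k, so collection can terminate early; otherwise every match must be
   * collected and ranked afterwards.
   */
  static List<Integer> collect(int[] segmentDocs, int k, boolean segmentSorted) {
    List<Integer> hits = new ArrayList<>();
    for (int doc : segmentDocs) {
      hits.add(doc);
      if (segmentSorted && hits.size() == k) {
        break; // early termination: remaining docs cannot beat these k
      }
    }
    return hits;
  }

  public static void main(String[] args) {
    int[] sortedSegment = {1, 2, 3, 4, 5, 6, 7, 8}; // e.g. a merged, sorted segment
    int[] unsortedSegment = {5, 1, 8, 2};           // e.g. a newly flushed segment
    System.out.println(collect(sortedSegment, 3, true).size());    // 3
    System.out.println(collect(unsortedSegment, 3, false).size()); // 4
  }
}
```

The per-segment top-k lists would then be merged to produce the global top-k, which is exactly why every merged segment must stay sorted.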

Applications for Index Sorting

Besides search early termination, there are other applications of index sorting, for example compression. Sorting the index by documents' size, type or date attributes can result in a better representation of the documents in the index, leading to a smaller index footprint, which can eventually speed up searches too, since more parts of the index fit into RAM. I'm sure that Lucene users will find even more useful applications for index sorting, so stay tuned!
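The compression intuition can be illustrated with a small sketch (not Lucene code; vintBytes mimics a generic 7-bits-per-byte variable-length encoding): sorting clusters similar values together, so delta encoding produces smaller gaps and therefore fewer bytes:

```java
import java.util.Arrays;

public class SortCompression {

  /** Bytes a variable-length (7-bits-per-byte) encoding needs for one value. */
  static int vintBytes(long v) {
    int bytes = 1;
    while (v >= 128) { v >>>= 7; bytes++; }
    return bytes;
  }

  /** Total varint bytes to delta-encode a sequence (first value stored as-is). */
  static int deltaEncodedSize(long[] values) {
    int size = vintBytes(values[0]);
    for (int i = 1; i < values.length; i++) {
      size += vintBytes(Math.abs(values[i] - values[i - 1]));
    }
    return size;
  }

  public static void main(String[] args) {
    long[] unsorted = {900000, 100, 850000, 200, 910000, 150};
    long[] sorted = unsorted.clone();
    Arrays.sort(sorted);
    System.out.println(deltaEncodedSize(unsorted)); // 18
    System.out.println(deltaEncodedSize(sorted));   // 11: smaller gaps, fewer bytes
  }
}
```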

Sunday, January 20, 2013

Lucene Facets just got faster!

Recently, Mike McCandless had some fun with Lucene Facets, and came up with interesting performance numbers and some ideas on how facets can be improved, by e.g. utilizing DocValues to store per-document category ordinals, instead of the payload. This triggered a series of improvements and API changes, which resulted (so far) in 2x speedups to faceted search!

From Facets nightly benchmark

The biggest contribution to the spike came from cutting facets over to DocValues (LUCENE-4602), which got us nearly 70% improvement over term payloads. Not only are DocValues more efficient than payloads, they also let you cache their information in memory for faster access, which we now take advantage of by default during faceted search.

Additional improvements came from specializing the decode logic (LUCENE-4686), as well as moving various components that take part in faceted search from an iterator-style API to a bulk-read (decode, aggregate) API (LUCENE-4620). The latter got us an additional ~20% improvement, which is amazing given that the decoding and aggregation logic wasn't changed - it was all just different software design. This is an important lesson for "hot" code - sometimes in order to make it efficient, you need to let go of some Object Oriented best practices!
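The iterator-vs-bulk difference can be sketched generically (illustrative interfaces, not Lucene's actual API): the iterator style pays a virtual call per value, while the bulk style fills a whole buffer per call, amortizing that overhead across many values:

```java
public class BulkVsIterator {

  /** Iterator-style: one virtual call per value. */
  interface OrdinalIterator {
    int next(); // returns -1 when exhausted
  }

  /** Bulk-style: one call fills a buffer, amortizing call overhead. */
  interface OrdinalReader {
    int read(int[] buffer); // returns how many ordinals were filled
  }

  static int sumIterator(OrdinalIterator it) {
    int sum = 0, ord;
    while ((ord = it.next()) != -1) sum += ord;
    return sum;
  }

  static int sumBulk(OrdinalReader reader, int[] buffer) {
    int sum = 0, count;
    while ((count = reader.read(buffer)) > 0) {
      for (int i = 0; i < count; i++) sum += buffer[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    int[] ords = {1, 2, 3, 4, 5};

    OrdinalIterator it = new OrdinalIterator() {
      int pos = 0;
      public int next() { return pos < ords.length ? ords[pos++] : -1; }
    };

    OrdinalReader reader = new OrdinalReader() {
      boolean done = false;
      public int read(int[] buffer) {
        if (done) return 0;
        System.arraycopy(ords, 0, buffer, 0, ords.length);
        done = true;
        return ords.length;
      }
    };

    System.out.println(sumIterator(it));             // 15
    System.out.println(sumBulk(reader, new int[8])); // 15: same result, fewer calls
  }
}
```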

While that's a major step forward, there's more to come. There's already work going on (LUCENE-4600 and LUCENE-4610), which will bring even more improvements for the common faceted search use cases (facet counting, no associations, no partitions etc.). Also, I plan to address two points that Mike raised in his blog: (1) making TotalFacetCounts (aka complements) segment-aware (to make it more NRT-friendly and reduce its reloading cost), and (2) experiment with a more efficient category ordinals cache, to reduce its RAM consumption as well as loading speed. There's also fun work going on here, which explores an alternative encoding for category ordinals. So stay tuned for more updates!

NOTE: cutting over facets to DocValues breaks existing indexes. You need to either rebuild them, or use FacetsPayloadMigrationReader to do a one-time migration of the payload information to DocValues.