Sunday, April 27, 2014

Expressions with Lucene

Lucene's expressions module allows computing values for documents using arbitrary mathematical expressions. Among other things, expressions allow a very easy and powerful way to e.g. sort search results and geospatial faceting. The module parses String expressions in JavaScript notation, and returns an object which computes a value for each requested document. Expressions are built of literals, variables and functions. For example, in the expression 5*sqrt(votes), 5 is a literal, sqrt is a function and votes is a variable.

The JavaScript parser recognizes a handful of useful functions, such as max, min, sqrt etc. The javadocs contain the full list of functions as well as operators that the parser recognizes. The variables are resolved by binding their name to a ValueSource. For example, in order to parse and execute the above expression, you need to write code similar to this:

Expression expr = JavascriptCompiler.compile("5*sqrt(votes)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("votes", SortField.Type.INT));
The votes variable is bound to the numeric field "votes" (usually, a NumericDocValuesField). SimpleBindings let you bind a variable to a SortField, however internally it is bound to a ValueSource. Expression itself returns a ValueSource, and when your application asks for the value of a document, it computes it based on the formula and the bounded variables' ValueSources.

Customizing expressions

JavascriptCompiler lets you pass a mapping of custom functions, where each function is implemented by a public and static Method, which takes up to 256 double parameters and returns a double. You can see a good example here. That, together with variables, provides great customization capabilities of expressions.

Sometimes customizing expressions is not so straightforward. For example, someone recently asked on the Lucene user-list how to use a multi-valued field in an expression, for the purpose of computing different functions on it (max, sum etc.). At first, it looks like a custom function, e.g. maxAll() (max() is already taken), which can be embedded in an expression like 5*maxVal(data). However, since data is a variable, and variables are bound to ValueSources (which return a single value for a document), we cannot pass all values of data to maxAll().

We can implement a ValueSource though, which returns the maximum value of the field data, and bind a variable max.data to it. Since this isn't a true function, we cannot use a more natural notation like max(data). Perhaps one day Lucene will have built-in support for multi-valued numeric fields, and the expressions module will auto-detect such fields and pass all values to the function (you're welcome to contribute patches!). Until then though, you need to implement a ValueSource for each such function, but fortunately it is quite trivial. I wrote some prototype code below which demonstrates how to do that.

▶ Multi-valued expressions demo

Nested expressions

Suppose that your documents contain longitude and latitude fields and you want to use them to compute a document's relevance, by computing the haversine formula. You can easily do that by compiling an expression such as haversin(40.7143528,-74.0059731,latitude,longitude).

Now, what if you want to use the result of that expression in another expression, such as _score + 1/(1+haversin(40.7143528,-74.0059731,latitude,longitude))? It starts to become somewhat longish to read. Wouldn't it be better if we can encapsulate that in a distance variable and read the expression _score + 1/(1+distance) instead? Fortunately, we can do that easily with nested/sub expressions:

Expression distance = JavascriptCompiler.compile(
  "haversin(40.7143528,-74.0059731,latitude,longitude)");
Expression expr = JavascriptCompiler.compile("_score + 1/(1+distance)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("latitude", SortField.Type.LONG));
bindings.add(new SortField("longitude", SortField.Type.LONG));
bindings.add("distance", distance);
Since a variable can be bound to any ValueSource, and expressions are in fact ValueSources, we can bind the result of an expression to a variable in another expression. Nested expressions also cache their result in case you use them e.g. in multiple clauses of another expression. This could be useful if you have a very expensive expression which needs to be evaluated multiple times per document.

The expressions module is very powerful and lets you code custom ranking/sorting/faceting formulas easily. With custom functions, variables and ValueSources, it makes it trivial to extend and build upon even further. If you implement useful functions, you're welcome to contribute them!

Thursday, April 24, 2014

Benchmarking Updatable DocValues

In my previous post I described how updatable DocValues are implemented, and concluded with a brief discussion about alternative approaches we could take to implement them. I created a benchmark, leveraging our benchmarking framework, to measure the indexing and searching effects of updatable DocValues. In this post I compare re-indexing a document versus updating a numeric/binary doc-values field.

The Benchmark

I ran the benchmark on an IBM System x3690 X5, a 40 cores machine (dual Intel® Xeon® E7-2850 clocked @ 2.00GHz, 24MB per-core cache) with SAS 10K disks, using Oracle's 64-bit Java 1.7.0_45 (4GB heap). I also set Linux's swappiness to 0. I indexed two Wikipedia datasets: WIKIBIG (full documents, 6.64 million total docs) and WIKIMEDIUM (documents clipped to 1K chunks of text, 33.3 million total docs), with the following settings:
  • Directory: NRTCachingDirectory(MMapDirectory)
  • Analyzer: StandardAnalyzer (no stopwords)
  • ID postings: MemoryPostingsFormat (speedup lookups by ID)
  • Postings: Lucene41 (the current default)
  • DocValues: Direct (eliminates HW factors when reading DocValues)
  • I index the document’s last-modification-time as a numeric doc-values field, and the title also as a binary doc-values field. Additionally, each document is indexed several fields such as body, title etc., with their respective content from the dataset.
The benchmark (all sources available here) performs the following operations:

Document updates
Documents are updated by multiple threads, where each run performs a different type of update, at a set docs/sec rate. Threads record the time to execute the update:

  • reindex (writer.updateDocument()): the document is loaded from the dataset and re-indexed fully
  • numeric (writer.updateNumericDocValue()): the last-modification-time is updated to current time
  • binary (writer.updateBinaryDocValue()): the title byte[] is updated

Index Reopens
The index is reopened at a set interval by a separate thread. The thread records the time it took for each reopen as well as whether the index was in fact reopened, and whether the thread fell behind (i.e. if reopen was too slow to achieve the target interval).

Searches
While the index is being updated and reopened, search threads execute search tasks against it, and sort the results by the last-modification-time numeric doc-values field to exercise the updated doc-values.

Parameters
There are different costs to re-indexing a document and updating a doc-values field. Re-indexing a document has both some cost during writer.updateDocument(), when the document’s fields are analyzed, as well as during index reopen when deleted documents’ flags are turned off in the in-memory bit vectors of the segments that hold them, and the re-indexed information (postings, stored fields, doc-values…) is flushed. When updating a doc-values field, writer.updateNumericDocValue() merely caches the update, and therefore it’s almost a 0-cost operation, however during index reopen the doc-values field is entirely rewritten for all documents in that segment, even documents that were not updated. Therefore, doc-values updates should see the majority of the cost in that phase.

The benchmark takes few parameters in order to compare different setups:

  • docsPerSec: a comma separated list of docs/sec rates. A higher rate means more documents are updated between index reopens.
    The assumption is that higher rates are worse for re-indexing than updating a doc-values field, since writer.updateDocument() is more expensive than writer.updateNumericDocValue().
  • reopenIntervalSec: a comma separated list of index reopen intervals. A lower interval means more index reopens, but each reopen flushes a smaller number of document updates.
    The assumption is that lower intervals are bad for updatable doc-values, since they have a constant cost no matter how many documents are updated between index reopens, while even though re-indexing a document also has some cost during index reopen, it is proportionally smaller as more documents are flushed at once. Also, because the test uses NRTCachingDirectory, most likely those flushed files don’t even hit the disk.

Results

I ran the benchmark over the two datasets with the following parameters: numIndexThreads=4, numSearchThreads=2, reopenIntervalSec=1,2,5 and docsPerSec=1,10,100,1000. The benchmark runs each reopenIntervalSec and docsPerSec combination for 30 seconds and outputs the following results for each dataset and update method, excluding the first 5 seconds (warmup):
  • docs/s: the docs/sec rate
  • reopen/s: the reopen interval
  • reopen(ms): the average time (in milliseconds) it took to reopen the reader
  • update(ms): the total time (in milliseconds) for executing the update, averaged per reopen
  • total(ms): the total time spent per index reopen (sum of reopen(ms) and update(ms))
  • perdoc(ms): the average cost (in milliseconds) for each updated document
  • query/s: the average query/sec rate that was achieved, per reopen
The WIKIBIG dataset consists of full Wikipedia documents, and therefore re-indexing a document is expected to be more expensive (proportionally) than for the WIKIMEDIUM dataset. On the other hand, it consists of fewer documents, therefore a doc-values update needs to rewrite a smaller amount of data, and (proportionally) should be cheaper on this index. The results cannot easily be compared between the two datasets, therefore let’s analyze them separately:

WIKIBIG

  • WIKIBIG: document re-indexing
  • WIKIBIG: update numeric doc-values
  • WIKIBIG: update binary doc-values
Reopen + Update times
Reopening is almost always faster when re-indexing than updating doc-values. Also, when updating doc-values, the time does not significantly increase as more documents are updated. That is due to the fact that the heavy part of doc-values updates - rewriting the doc-values field - is the same whether we update 1 or 1000 documents in a segment. As for executing the update itself, it is negligible when updating doc-values compared to re-indexing a document.

Total update cost

The results show that when the number of document updates is small, it is faster overall to re-index the documents than update the doc-values field. However when there are many document updates, re-indexing cannot keep up with the reopen interval and starts to fall behind. On the other hand, updating doc-values reaches a "steady" cost very quickly, and is less affected by the number of document updates.

Per document cost

While the cost per document consistently decreases in all update methods, doc-values achieve nearly 90% lower costs than re-indexing the documents. E.g. in the combination of 1000 docs/sec and reopen every 5 seconds (5000 document updates on average between reopens), the per-document cost of numeric doc-values is 0.17 msec compared to 1.41 msec when re-indexing the document. Therefore it looks like that if your application updates many documents between index reopens, updating the doc-values field seems better than re-indexing them.

Query per second

Updating doc-values seems to affect running searches more than re-indexing documents when the reopen interval is low. However, when updating doc-values at a given indexing rate, QPS improves as the reopen interval increases. This is expected because the I/O cost does not change much as more documents are updated, unlike when re-indexing documents. Also, given that we see same drops even when updating binary doc-values (which do not participate in the search at all), strengthens the assumption that the drop is mainly due to I/O, and could be that if I had used a faster hardware (e.g. SSD), the drops would not show at all, or be less significant.

WIKIMEDIUM

  • WIKIMEDIUM: document re-indexing
  • WIKIMEDIUM: update numeric doc-values
  • WIKIMEDIUM: update binary doc-values
As described before, the cost of updating a doc-values field in a segment depends on the number of documents in the segment, and less on how many documents are updated in it. Furthermore, even at a low docs/sec rate there’s a high chance that between index reopens all index segments will be "touched" (updated), and therefore on the WIKIMEDIUM index (which contains x5 more documents than the WIKIBIG index) the cost of index reopen is substantially higher.
The results clearly show that re-indexing documents is always preferred to updating doc-values at the given reopen intervals. More so, updatable doc-values almost always fell behind with the reopen interval. This makes the per-document cost results irrelevant to look at because in practice there were much fewer index reopens than in the re-index case, and therefore each reopen affected a larger number of documents (which makes the average per-document appear low). Maybe at a much higher reopen interval (e.g. few tens seconds, or minutes), updating doc-values on that index would have performed better than re-indexing, however in this benchmark I focused on the near-real-time aspects of updatable doc-values, and therefore chose short reopen intervals.

Summary

The current implementation of updatable doc-values can be very useful to applications with relatively small indexes (few million documents) which reopen the index every few seconds. However, for applications with bigger indexes it is currently very expensive and such applications might prefer to re-index the documents instead.

One aspect of re-indexing a document that I didn't fully measure is content acquisition. In this benchmark, the content of the document is read from the local disk while for many applications it usually means accessing a remote repository, which slows things down considerably. Therefore it is recommended for such applications to benchmark this part in their indexing pipeline and determine what is the point where updating doc-values become slower than entirely re-indexing the documents.

So what's next?

In this post I described the stacked segments approach and how it could speed up updatable doc-values, by only writing the documents that were updated in each stack. On the other hand this would have some impact on search, since in order to determine the value of a document, we’d need to look it up through the "stacks".

There is however a hybrid approach, which is similar in nature to document deletions. Instead of flushing the updates to the index directory in every index reopen, we can carry the updates in-memory, and only flush when the updates take too much memory. So if an application updates only a few documents between index reopens, it won’t pay the cost of rewriting the entire doc-values field in each reopen. At some point though, the doc-values field will be written to disk (e.g. on commit or when they consume too much memory), but when that happens then many updates have already accumulated anyway, and so overall the cost of flushing per document is smaller. On the search side, the value of a document will only need to be looked for in two "stacks" - the in-memory and the base stack, which hopefully won’t have a big impact on search.

I think that this approach could work well for near-real-time applications, since they often reopen very frequently with only few document updates in between. Also, in this benchmark I've measured the cost of updating a single doc-values field, however in real-life applications there are often several such fields that are updated, and therefore the cost of reopening an index will be much higher (since every field will be rewritten). Applications already hold the deletions bit-vector in memory (for every segment with deletions; maxDoc/8 bytes), and I believe that spending an additional proportional amount of that memory for doc-values updates is tolerable. Still, once in a while a "naive"-looking reopen (only few updates since last reopen) could trigger a rewrite of the entire doc-values field. To avoid that we need to only flush the documents that were updated (i.e. stacked segments), but this could be a good compromise until we have that.

Looks like there is more fun work to do with updatable doc-values. If you have further ideas on how to improve them, your patches are welcome!

Wednesday, April 23, 2014

Updatable DocValues Under the Hood

We recently committed updatable numeric and binary doc-values fields, which lets you update the value of such fields in a document without re-indexing the whole document. This allows using such fields for e.g. boosting documents relevancy to a query, especially when the boosting factor is frequently changing (like date, ratings, counters etc.). In this post I describe the under-the-hood mechanics of updating those fields.

A brief overview on DocValues

DocValues have been around since Lucene 4.0, and while they have gone through several changes and improvements since then, the basic concept remained the same – they let you record a value per-document per-field in each segment, and allow for very efficient random-access retrieval of those values. Lucene offers several doc-values types: numeric for single-valued numeric fields (date, ratings, counters), sorted/sortedset for single/multi-valued string (mostly, but can be any BytesRef) fields used for sorting, faceting, grouping etc., and binary fields for storing arbitrary byte[] per-document.

Also, Lucene provides several implementations for writing and reading DocValues: Lucene45 (the current default) stores the values on disk, but keeps some information in-memory for fast lookups. Memory loads the values into compressed arrays, while Direct loads the values into uncompressed arrays (faster lookups, but higher memory consumption). If your application runs in a resource constrained environment, you can use Disk which keeps absolutely everything on disk, at the cost of slower lookups.

DocValues record a value for every document in every field and segment even when a document has no value associated with a field. In most cases this is acceptable because you usually store a value for most, if not all, documents in a field (e.g. scoring factors, facets). But even if you index very sparse doc-values fields, the default encoding is very efficient at indexing blocks of "missing" values. Still, if you index hundreds of millions of documents with few dozen doc-values fields, but only a small percentage of them contain a value, writing your own sparse doc-values encoding may be beneficial.

Document deletions are a form of in-place document updates!

The mechanism by which Lucene updates DocValues fields is very similar to how document deletions are handled, so let’s look at how that’s done first. When you issue e.g. a deleteDocument(term) request, Lucene caches the request until an updated SegmentReader is needed, or the index is committed. It then resolves all the issued deletes to determine which documents are affected, both in already flushed segments as well as documents that were in IndexWriter’s RAM buffer when the delete was issued.
Up-to-date SegmentReaders are needed in two cases: when a merge kicks off (SegmentMerger uses SegmentReaders to merge segments) and when the application reopens its NRT (near-real-time) reader.
If an updated reader is needed, Lucene carries the deletes in memory by turning off a bit in the in-memory bit vector of each affected segment, and avoids flushing the bit vector to the index directory, which may reside on a slow storage device (e.g. NFS share). If the application also has an NRT reader open, the bit vector is cloned (because the app’s reader should not be made aware of the new deletes until it’s reopened), leading to double RAM consumption by those vectors. However, once that cost is paid once, it is "fixed" (1 bit/doc, or maxDoc/8 bytes overall), no matter how many documents are later deleted. Also, as soon as the application releases its NRT reader (usually after reopening it), one of the vectors is released and can be reclaimed by the GC.

Updatable DocValues

Like document deletions, when the application issues e.g. an updateNumericDocValue(term, field, value) request, Lucene caches the request until an updated reader is needed or the index is committed, at which point it resolves the updates against all affected documents. Unlike deletes though, doc-values consume much more RAM, which makes carrying the updates in-memory not feasible for many applications: numeric doc-values may potentially consume 8 bytes per-document, while binary values are usually even more expensive.

Therefore, when the updates are resolved, Lucene currently rewrites the entire field that was updated to the index directory. So for example, if you update a numeric doc-values field of a document in a 1 million documents segment, Lucene rewrites 1 million numeric values, or 8MB (at most). Because Lucene segments are write-once, the updated field is written to a gen’d doc-values file of the segment. A reader then reads the values of a field from the respective gen’d doc-values file. While this has a highish cost during indexing, it does not affect search at all, because the values are read from a single file, and often this is a tradeoff that most applications are willing to make.

To illustrate, the code below indexes several documents with "date" and "ratings" NumericDocValuesFields and then lists the files in the index directory:

IndexWriterConfig conf = new IndexWriterConfig(...);
conf.setUseCompoundFile(false);
conf.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(dir, conf);
ReaderManager manager = new ReaderManager(writer, true);
for (int i = 0; i < 5; i++) {
  Document doc = new Document();
  doc.add(new StringField("id", ""+i, Store.NO));
  doc.add(new NumericDocValuesField("date", i));
  doc.add(new NumericDocValuesField("ratings", i*10));
  writer.addDocument(doc);
}

manager.maybeRefresh(); // triggers a flush
listDocValuesFiles(dir);
     
// update the 'date' field's value of document '0'
writer.updateNumericDocValue(
  new Term("id", "0"), "date", System.currentTimeMillis());
manager.maybeRefresh(); // triggers applying the update
listDocValuesFiles(dir);
You will see the following print:
files:
  _0.dat

files:
  _0_1.dat
  _0.dat
Notice how after the "date" field was updated and NRT reader re-opened, the index directory contains two sets of doc-values files: _0.dat stores both the "date" and "ratings" values and _0_1.dat stores just the "date" field’s values (the _1 part in the file name is the doc-values generation for the "date" field). Since the code uses SimpleTextCodec, we can easily print the contents of each file. Below is such print, clipped to show just the header of the files:
file: _0_1.dat
field date
  type NUMERIC

file: _0.dat
field date
  type NUMERIC
field ratings
  type NUMERIC
If we update the value of the "date" field for the same or any other document and then list the doc-values files in the index, we will see:
files:
  _0_2.dat
  _0.dat
Notice how _0_1.dat no longer exists in the index - that’s because it only contains values for the "date" field, which was entirely rewritten by the second update (under generation 2), and so _0_1.dat can be deleted.

Stacked Segments

On LUCENE-4258 we’ve started to implement field updates taking a stacked-segments approach. A stacked segment can be thought of as a layer/filter onto an existing segment, which indexes only the updates to that segment. When the full view of the segment is required (e.g. during search), the segment "collapses" all the updates down, so that the most up-to-date view of the segment is presented to the application.

The approach taken to implement updatable doc-values is similar in concept, only the doc-values "stacks" index the values of all documents. If we were to implement the approach taken on LUCENE-4258, then the doc-values stacks would index only the values of the updated documents. While this could have improved the indexing performance (writing a single value ought to be faster than writing 1 million values, no matter how fast your IO system is), it would affect search time since in order to resolve the value of a document, we would need to look for its most up-to-date value in any of the stacks. There are efficient ways to do this, but as usual, each comes with its own tradeoffs.

The key point is that stacked segments have a tradeoff between indexing and search and it is important to measure that tradeoff, as in Lucene, we often prefer to tradeoff in favor of search. I will publish in a separate post some benchmark numbers of the current approach, which will also serve as a baseline for measuring alternative approaches in the future.