Thursday, April 24, 2014

Benchmarking Updatable DocValues

In my previous post I described how updatable DocValues are implemented, and concluded with a brief discussion about alternative approaches we could take to implement them. I created a benchmark, leveraging our benchmarking framework, to measure the indexing and searching effects of updatable DocValues. In this post I compare re-indexing a document versus updating a numeric/binary doc-values field.

The Benchmark

I ran the benchmark on an IBM System x3690 X5, a 40 cores machine (dual Intel® Xeon® E7-2850 clocked @ 2.00GHz, 24MB per-core cache) with SAS 10K disks, using Oracle's 64-bit Java 1.7.0_45 (4GB heap). I also set Linux's swappiness to 0. I indexed two Wikipedia datasets: WIKIBIG (full documents, 6.64 million total docs) and WIKIMEDIUM (documents clipped to 1K chunks of text, 33.3 million total docs), with the following settings:
  • Directory: NRTCachingDirectory(MMapDirectory)
  • Analyzer: StandardAnalyzer (no stopwords)
  • ID postings: MemoryPostingsFormat (speedup lookups by ID)
  • Postings: Lucene41 (the current default)
  • DocValues: Direct (eliminates HW factors when reading DocValues)
  • I index the document’s last-modification-time as a numeric doc-values field, and the title also as a binary doc-values field. Additionally, each document is indexed several fields such as body, title etc., with their respective content from the dataset.
The benchmark (all sources available here) performs the following operations:

Document updates
Documents are updated by multiple threads, where each run performs a different type of update, at a set docs/sec rate. Threads record the time to execute the update:

  • reindex (writer.updateDocument()): the document is loaded from the dataset and re-indexed fully
  • numeric (writer.updateNumericDocValue()): the last-modification-time is updated to current time
  • binary (writer.updateBinaryDocValue()): the title byte[] is updated

Index Reopens
The index is reopened at a set interval by a separate thread. The thread records the time it took for each reopen as well as whether the index was in fact reopened, and whether the thread fell behind (i.e. if reopen was too slow to achieve the target interval).

Searches
While the index is being updated and reopened, search threads execute search tasks against it, and sort the results by the last-modification-time numeric doc-values field to exercise the updated doc-values.

Parameters
There are different costs to re-indexing a document and updating a doc-values field. Re-indexing a document has both some cost during writer.updateDocument(), when the document’s fields are analyzed, as well as during index reopen when deleted documents’ flags are turned off in the in-memory bit vectors of the segments that hold them, and the re-indexed information (postings, stored fields, doc-values…) is flushed. When updating a doc-values field, writer.updateNumericDocValue() merely caches the update, and therefore it’s almost a 0-cost operation, however during index reopen the doc-values field is entirely rewritten for all documents in that segment, even documents that were not updated. Therefore, doc-values updates should see the majority of the cost in that phase.

The benchmark takes few parameters in order to compare different setups:

  • docsPerSec: a comma separated list of docs/sec rates. A higher rate means more documents are updated between index reopens.
    The assumption is that higher rates are worse for re-indexing than updating a doc-values field, since writer.updateDocument() is more expensive than writer.updateNumericDocValue().
  • reopenIntervalSec: a comma separated list of index reopen intervals. A lower interval means more index reopens, but each reopen flushes a smaller number of document updates.
    The assumption is that lower intervals are bad for updatable doc-values, since they have a constant cost no matter how many documents are updated between index reopens, while even though re-indexing a document also has some cost during index reopen, it is proportionally smaller as more documents are flushed at once. Also, because the test uses NRTCachingDirectory, most likely those flushed files don’t even hit the disk.

Results

I ran the benchmark over the two datasets with the following parameters: numIndexThreads=4, numSearchThreads=2, reopenIntervalSec=1,2,5 and docsPerSec=1,10,100,1000. The benchmark runs each reopenIntervalSec and docsPerSec combination for 30 seconds and outputs the following results for each dataset and update method, excluding the first 5 seconds (warmup):
  • docs/s: the docs/sec rate
  • reopen/s: the reopen interval
  • reopen(ms): the average time (in milliseconds) it took to reopen the reader
  • update(ms): the total time (in milliseconds) for executing the update, averaged per reopen
  • total(ms): the total time spent per index reopen (sum of reopen(ms) and update(ms))
  • perdoc(ms): the average cost (in milliseconds) for each updated document
  • query/s: the average query/sec rate that was achieved, per reopen
The WIKIBIG dataset consists of full Wikipedia documents, and therefore re-indexing a document is expected to be more expensive (proportionally) than for the WIKIMEDIUM dataset. On the other hand, it consists of fewer documents, therefore a doc-values update needs to rewrite a smaller amount of data, and (proportionally) should be cheaper on this index. The results cannot easily be compared between the two datasets, therefore let’s analyze them separately:

WIKIBIG

  • WIKIBIG: document re-indexing
  • WIKIBIG: update numeric doc-values
  • WIKIBIG: update binary doc-values
Reopen + Update times
Reopening is almost always faster when re-indexing than updating doc-values. Also, when updating doc-values, the time does not significantly increase as more documents are updated. That is due to the fact that the heavy part of doc-values updates - rewriting the doc-values field - is the same whether we update 1 or 1000 documents in a segment. As for executing the update itself, it is negligible when updating doc-values compared to re-indexing a document.

Total update cost

The results show that when the number of document updates is small, it is faster overall to re-index the documents than update the doc-values field. However when there are many document updates, re-indexing cannot keep up with the reopen interval and starts to fall behind. On the other hand, updating doc-values reaches a "steady" cost very quickly, and is less affected by the number of document updates.

Per document cost

While the cost per document consistently decreases in all update methods, doc-values achieve nearly 90% lower costs than re-indexing the documents. E.g. in the combination of 1000 docs/sec and reopen every 5 seconds (5000 document updates on average between reopens), the per-document cost of numeric doc-values is 0.17 msec compared to 1.41 msec when re-indexing the document. Therefore it looks like that if your application updates many documents between index reopens, updating the doc-values field seems better than re-indexing them.

Query per second

Updating doc-values seems to affect running searches more than re-indexing documents when the reopen interval is low. However, when updating doc-values at a given indexing rate, QPS improves as the reopen interval increases. This is expected because the I/O cost does not change much as more documents are updated, unlike when re-indexing documents. Also, given that we see same drops even when updating binary doc-values (which do not participate in the search at all), strengthens the assumption that the drop is mainly due to I/O, and could be that if I had used a faster hardware (e.g. SSD), the drops would not show at all, or be less significant.

WIKIMEDIUM

  • WIKIMEDIUM: document re-indexing
  • WIKIMEDIUM: update numeric doc-values
  • WIKIMEDIUM: update binary doc-values
As described before, the cost of updating a doc-values field in a segment depends on the number of documents in the segment, and less on how many documents are updated in it. Furthermore, even at a low docs/sec rate there’s a high chance that between index reopens all index segments will be "touched" (updated), and therefore on the WIKIMEDIUM index (which contains x5 more documents than the WIKIBIG index) the cost of index reopen is substantially higher.
The results clearly show that re-indexing documents is always preferred to updating doc-values at the given reopen intervals. More so, updatable doc-values almost always fell behind with the reopen interval. This makes the per-document cost results irrelevant to look at because in practice there were much fewer index reopens than in the re-index case, and therefore each reopen affected a larger number of documents (which makes the average per-document appear low). Maybe at a much higher reopen interval (e.g. few tens seconds, or minutes), updating doc-values on that index would have performed better than re-indexing, however in this benchmark I focused on the near-real-time aspects of updatable doc-values, and therefore chose short reopen intervals.

Summary

The current implementation of updatable doc-values can be very useful to applications with relatively small indexes (few million documents) which reopen the index every few seconds. However, for applications with bigger indexes it is currently very expensive and such applications might prefer to re-index the documents instead.

One aspect of re-indexing a document that I didn't fully measure is content acquisition. In this benchmark, the content of the document is read from the local disk while for many applications it usually means accessing a remote repository, which slows things down considerably. Therefore it is recommended for such applications to benchmark this part in their indexing pipeline and determine what is the point where updating doc-values become slower than entirely re-indexing the documents.

So what's next?

In this post I described the stacked segments approach and how it could speed up updatable doc-values, by only writing the documents that were updated in each stack. On the other hand this would have some impact on search, since in order to determine the value of a document, we’d need to look it up through the "stacks".

There is however a hybrid approach, which is similar in nature to document deletions. Instead of flushing the updates to the index directory in every index reopen, we can carry the updates in-memory, and only flush when the updates take too much memory. So if an application updates only a few documents between index reopens, it won’t pay the cost of rewriting the entire doc-values field in each reopen. At some point though, the doc-values field will be written to disk (e.g. on commit or when they consume too much memory), but when that happens then many updates have already accumulated anyway, and so overall the cost of flushing per document is smaller. On the search side, the value of a document will only need to be looked for in two "stacks" - the in-memory and the base stack, which hopefully won’t have a big impact on search.

I think that this approach could work well for near-real-time applications, since they often reopen very frequently with only few document updates in between. Also, in this benchmark I've measured the cost of updating a single doc-values field, however in real-life applications there are often several such fields that are updated, and therefore the cost of reopening an index will be much higher (since every field will be rewritten). Applications already hold the deletions bit-vector in memory (for every segment with deletions; maxDoc/8 bytes), and I believe that spending an additional proportional amount of that memory for doc-values updates is tolerable. Still, once in a while a "naive"-looking reopen (only few updates since last reopen) could trigger a rewrite of the entire doc-values field. To avoid that we need to only flush the documents that were updated (i.e. stacked segments), but this could be a good compromise until we have that.

Looks like there is more fun work to do with updatable doc-values. If you have further ideas on how to improve them, your patches are welcome!

3 comments:

  1. your blog very useful i am intrested your blog Big Data for telecom

    ReplyDelete
  2. This is very informative article,
    "benchmarking"

    ReplyDelete
  3. Thanks for doing this - how much has changed with Lucene 5.5 and Lucene 6? I can try to port forward your source if you make it available (port to Lucene 5.5 at least).

    ReplyDelete