The Benchmark

I ran the benchmark on an IBM System x3690 X5, a 40-core machine (dual Intel® Xeon® E7-2850 clocked @ 2.00GHz, 24MB cache) with SAS 10K disks, using Oracle's 64-bit Java 1.7.0_45 (4GB heap). I also set Linux's swappiness to 0. I indexed two Wikipedia datasets: WIKIBIG (full documents, 6.64 million total docs) and WIKIMEDIUM (documents clipped to 1K chunks of text, 33.3 million total docs), with the following settings:
- ID postings:
  - MemoryPostingsFormat (speeds up lookups by ID)
  - Lucene41 (the current default)
  - Direct (eliminates HW factors when reading DocValues)
- I index the document's last-modification-time as a numeric doc-values field, and the title also as a binary doc-values field. Additionally, each document is indexed with several fields, such as body and title, with their respective content from the dataset.
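Put together, each document might be built roughly like this (a sketch against the Lucene 4.x API; the field names, `pageId`, and the `writer` variable are illustrative, not taken from the benchmark code):

```java
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

Document doc = new Document();
doc.add(new StringField("id", pageId, Field.Store.NO));      // lookup key for updates
doc.add(new TextField("title", title, Field.Store.YES));     // regular indexed fields
doc.add(new TextField("body", body, Field.Store.NO));
doc.add(new NumericDocValuesField("lastMod", lastModTime));  // numeric doc-values field
doc.add(new BinaryDocValuesField("titleDV", new BytesRef(title))); // binary doc-values field
writer.addDocument(doc);
```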
Documents are updated by multiple threads, where each run performs a different type of update, at a set docs/sec rate. The threads record the time to execute each update:

- reindex (writer.updateDocument()): the document is loaded from the dataset and re-indexed fully
- numeric (writer.updateNumericDocValue()): the last-modification-time is updated to the current time
- binary (writer.updateBinaryDocValue()): the title binary doc-values field is updated
The index is reopened at a set interval by a separate thread. The thread records the time each reopen took, whether the index was in fact reopened, and whether the thread fell behind (i.e. whether reopen was too slow to achieve the target interval).

Searches

While the index is being updated and reopened, search threads execute search tasks against it, sorting the results by the last-modification-time numeric doc-values field in order to exercise the updated doc-values.

Parameters

Re-indexing a document and updating a doc-values field have different costs. Re-indexing a document has some cost during writer.updateDocument(), when the document's fields are analyzed, as well as during index reopen, when deleted documents' flags are turned off in the in-memory bit vectors of the segments that hold them and the re-indexed information (postings, stored fields, doc-values…) is flushed. Updating a doc-values field, by contrast, is almost a zero-cost operation at call time, since writer.updateNumericDocValue() merely caches the update; during index reopen, however, the doc-values field is entirely rewritten for all documents in the affected segment, even documents that were not updated. Doc-values updates should therefore see the majority of their cost in that phase. The benchmark takes a few parameters in order to compare different setups:
- docsPerSec: a comma-separated list of docs/sec rates. A higher rate means more documents are updated between index reopens. The assumption is that higher rates are worse for re-indexing than for updating a doc-values field, since writer.updateDocument() is more expensive than writer.updateNumericDocValue().
- reopenIntervalSec: a comma-separated list of index reopen intervals. A lower interval means more index reopens, but each reopen flushes a smaller number of document updates. The assumption is that lower intervals are bad for updatable doc-values, since they carry a constant cost no matter how many documents are updated between index reopens; re-indexing a document also has some cost during index reopen, but that cost becomes proportionally smaller as more documents are flushed at once. Also, because the test uses NRTCachingDirectory, those flushed files most likely don't even hit the disk.
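The reopen and search threads described above can be sketched roughly as follows (assuming Lucene 4.x APIs; the cache sizes, `indexDir`, `query` and the field name are illustrative):

```java
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

// Small NRT flushes are cached in RAM and may never hit the disk
Directory dir = new NRTCachingDirectory(FSDirectory.open(indexDir), 5.0, 60.0);
// ... an IndexWriter is created over `dir` ...

DirectoryReader reader = DirectoryReader.open(writer, true);

// Reopen thread, executed every reopenIntervalSec seconds:
DirectoryReader newReader = DirectoryReader.openIfChanged(reader, writer, true);
if (newReader != null) { // null means nothing changed since the last reopen
  reader.close();
  reader = newReader;
}

// Search threads: sort results by the updated numeric doc-values field
IndexSearcher searcher = new IndexSearcher(reader);
Sort byLastMod = new Sort(new SortField("lastMod", SortField.Type.LONG, true));
TopDocs hits = searcher.search(query, 10, byLastMod);
```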
Results

I ran the benchmark over the two datasets with the following parameters: docsPerSec=1,10,100,1000. The benchmark runs each docsPerSec combination for 30 seconds and outputs the following results for each dataset and update method, excluding the first 5 seconds (warmup):
- docs/s: the docs/sec rate
- reopen/s: the reopen interval
- reopen(ms): the average time (in milliseconds) it took to reopen the reader
- update(ms): the total time (in milliseconds) for executing the update, averaged per reopen
- total(ms): the total time spent per index reopen (sum of reopen(ms) and update(ms))
- perdoc(ms): the average cost (in milliseconds) for each updated document
- query/s: the average query/sec rate that was achieved, per reopen
- WIKIMEDIUM: document re-indexing
- WIKIMEDIUM: update numeric doc-values
- WIKIMEDIUM: update binary doc-values
Summary

The current implementation of updatable doc-values can be very useful to applications with relatively small indexes (a few million documents) which reopen the index every few seconds. For applications with bigger indexes, however, it is currently very expensive, and such applications might prefer to re-index the documents instead. One aspect of re-indexing a document that I didn't fully measure is content acquisition: in this benchmark the content of a document is read from the local disk, while for many applications it means accessing a remote repository, which slows things down considerably. Such applications should therefore benchmark this part of their indexing pipeline and determine the point at which updating doc-values becomes slower than re-indexing the documents entirely.
So what's next?

In this post I described the stacked-segments approach and how it could speed up updatable doc-values by writing only the documents that were updated in each stack. On the other hand, this would have some impact on search, since in order to determine the value of a document we'd need to look it up through the "stacks".

There is, however, a hybrid approach, similar in nature to document deletions. Instead of flushing the updates to the index directory on every index reopen, we can carry the updates in memory and only flush when they take up too much memory. That way, an application that updates only a few documents between index reopens won't pay the cost of rewriting the entire doc-values field on each reopen. At some point the doc-values field will be written to disk (e.g. on commit, or when the updates consume too much memory), but by then many updates will have accumulated anyway, so the overall flush cost per document is smaller. On the search side, the value of a document will only need to be looked up in two "stacks" (the in-memory stack and the base stack), which hopefully won't have a big impact on search. I think this approach could work well for near-real-time applications, since they often reopen very frequently with only a few document updates in between. Also, in this benchmark I measured the cost of updating a single doc-values field, but in real-life applications there are often several such fields being updated, and therefore the cost of reopening the index will be much higher (since every field will be rewritten).

Applications already hold the deletions bit-vector in memory (maxDoc/8 bytes for every segment with deletions), and I believe that spending an additional, proportional amount of memory on doc-values updates is tolerable. Still, once in a while a "naive"-looking reopen (only a few updates since the last reopen) could trigger a rewrite of the entire doc-values field. Avoiding that entirely requires flushing only the documents that were updated (i.e. stacked segments), but until we have that, the hybrid approach could be a good compromise.

Looks like there is more fun work to do with updatable doc-values. If you have further ideas on how to improve them, your patches are welcome!
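The maxDoc/8 figure is just one bit per document, so for the datasets in this benchmark the amounts involved are small. A minimal illustration (plain Java; the dataset sizes are taken from the benchmark description above, and the single-segment assumption is a deliberate worst case):

```java
public class BitVectorMemory {
    // A deletions bit-vector holds one bit per document: maxDoc / 8 bytes.
    static long bitVectorBytes(long maxDoc) {
        return maxDoc / 8;
    }

    public static void main(String[] args) {
        // WIKIBIG: 6.64M docs -> ~0.8 MB even if held in a single segment
        System.out.println(bitVectorBytes(6_640_000));  // 830000
        // WIKIMEDIUM: 33.3M docs -> ~4 MB
        System.out.println(bitVectorBytes(33_300_000)); // 4162500
    }
}
```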