Wednesday, April 23, 2014

Updatable DocValues Under the Hood

We recently committed updatable numeric and binary doc-values fields, which lets you update the value of such fields in a document without re-indexing the whole document. This allows using such fields for e.g. boosting documents relevancy to a query, especially when the boosting factor is frequently changing (like date, ratings, counters etc.). In this post I describe the under-the-hood mechanics of updating those fields.

A brief overview on DocValues

DocValues have been around since Lucene 4.0, and while they have gone through several changes and improvements since then, the basic concept remained the same – they let you record a value per-document per-field in each segment, and allow for very efficient random-access retrieval of those values. Lucene offers several doc-values types: numeric for single-valued numeric fields (date, ratings, counters), sorted/sortedset for single/multi-valued string (mostly, but can be any BytesRef) fields used for sorting, faceting, grouping etc., and binary fields for storing arbitrary byte[] per-document.

Also, Lucene provides several implementations for writing and reading DocValues: Lucene45 (the current default) stores the values on disk, but keeps some information in-memory for fast lookups. Memory loads the values into compressed arrays, while Direct loads the values into uncompressed arrays (faster lookups, but higher memory consumption). If your application runs in a resource constrained environment, you can use Disk which keeps absolutely everything on disk, at the cost of slower lookups.

DocValues record a value for every document in every field and segment even when a document has no value associated with a field. In most cases this is acceptable because you usually store a value for most, if not all, documents in a field (e.g. scoring factors, facets). But even if you index very sparse doc-values fields, the default encoding is very efficient at indexing blocks of "missing" values. Still, if you index hundreds of millions of documents with few dozen doc-values fields, but only a small percentage of them contain a value, writing your own sparse doc-values encoding may be beneficial.

Document deletions are a form of in-place document updates!

The mechanism by which Lucene updates DocValues fields is very similar to how document deletions are handled, so let’s look at how that’s done first. When you issue e.g. a deleteDocument(term) request, Lucene caches the request until an updated SegmentReader is needed, or the index is committed. It then resolves all the issued deletes to determine which documents are affected, both in already flushed segments as well as documents that were in IndexWriter’s RAM buffer when the delete was issued.
Up-to-date SegmentReaders are needed in two cases: when a merge kicks off (SegmentMerger uses SegmentReaders to merge segments) and when the application reopens its NRT (near-real-time) reader.
If an updated reader is needed, Lucene carries the deletes in memory by turning off a bit in the in-memory bit vector of each affected segment, and avoids flushing the bit vector to the index directory, which may reside on a slow storage device (e.g. NFS share). If the application also has an NRT reader open, the bit vector is cloned (because the app’s reader should not be made aware of the new deletes until it’s reopened), leading to double RAM consumption by those vectors. However, once that cost is paid once, it is "fixed" (1 bit/doc, or maxDoc/8 bytes overall), no matter how many documents are later deleted. Also, as soon as the application releases its NRT reader (usually after reopening it), one of the vectors is released and can be reclaimed by the GC.

Updatable DocValues

Like document deletions, when the application issues e.g. an updateNumericDocValue(term, field, value) request, Lucene caches the request until an updated reader is needed or the index is committed, at which point it resolves the updates against all affected documents. Unlike deletes though, doc-values consume much more RAM, which makes carrying the updates in-memory not feasible for many applications: numeric doc-values may potentially consume 8 bytes per-document, while binary values are usually even more expensive.

Therefore, when the updates are resolved, Lucene currently rewrites the entire field that was updated to the index directory. So for example, if you update a numeric doc-values field of a document in a 1 million documents segment, Lucene rewrites 1 million numeric values, or 8MB (at most). Because Lucene segments are write-once, the updated field is written to a gen’d doc-values file of the segment. A reader then reads the values of a field from the respective gen’d doc-values file. While this has a highish cost during indexing, it does not affect search at all, because the values are read from a single file, and often this is a tradeoff that most applications are willing to make.

To illustrate, the code below indexes several documents with "date" and "ratings" NumericDocValuesFields and then lists the files in the index directory:

IndexWriterConfig conf = new IndexWriterConfig(...);
conf.setUseCompoundFile(false);
conf.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(dir, conf);
ReaderManager manager = new ReaderManager(writer, true);
for (int i = 0; i < 5; i++) {
  Document doc = new Document();
  doc.add(new StringField("id", ""+i, Store.NO));
  doc.add(new NumericDocValuesField("date", i));
  doc.add(new NumericDocValuesField("ratings", i*10));
  writer.addDocument(doc);
}

manager.maybeRefresh(); // triggers a flush
listDocValuesFiles(dir);
     
// update the 'date' field's value of document '0'
writer.updateNumericDocValue(
  new Term("id", "0"), "date", System.currentTimeMillis());
manager.maybeRefresh(); // triggers applying the update
listDocValuesFiles(dir);
You will see the following print:
files:
  _0.dat

files:
  _0_1.dat
  _0.dat
Notice how after the "date" field was updated and NRT reader re-opened, the index directory contains two sets of doc-values files: _0.dat stores both the "date" and "ratings" values and _0_1.dat stores just the "date" field’s values (the _1 part in the file name is the doc-values generation for the "date" field). Since the code uses SimpleTextCodec, we can easily print the contents of each file. Below is such print, clipped to show just the header of the files:
file: _0_1.dat
field date
  type NUMERIC

file: _0.dat
field date
  type NUMERIC
field ratings
  type NUMERIC
If we update the value of the "date" field for the same or any other document and then list the doc-values files in the index, we will see:
files:
  _0_2.dat
  _0.dat
Notice how _0_1.dat no longer exists in the index - that’s because it only contains values for the "date" field, which was entirely rewritten by the second update (under generation 2), and so _0_1.dat can be deleted.

Stacked Segments

On LUCENE-4258 we’ve started to implement field updates taking a stacked-segments approach. A stacked segment can be thought of as a layer/filter onto an existing segment, which indexes only the updates to that segment. When the full view of the segment is required (e.g. during search), the segment "collapses" all the updates down, so that the most up-to-date view of the segment is presented to the application.

The approach taken to implement updatable doc-values is similar in concept, only the doc-values "stacks" index the values of all documents. If we were to implement the approach taken on LUCENE-4258, then the doc-values stacks would index only the values of the updated documents. While this could have improved the indexing performance (writing a single value ought to be faster than writing 1 million values, no matter how fast your IO system is), it would affect search time since in order to resolve the value of a document, we would need to look for its most up-to-date value in any of the stacks. There are efficient ways to do this, but as usual, each comes with its own tradeoffs.

The key point is that stacked segments have a tradeoff between indexing and search and it is important to measure that tradeoff, as in Lucene, we often prefer to tradeoff in favor of search. I will publish in a separate post some benchmark numbers of the current approach, which will also serve as a baseline for measuring alternative approaches in the future.

2 comments:

  1. This comment has been removed by the author.

    ReplyDelete
  2. There has been such values and codes prescribed which are considered to be important to some extent and in most of the cases they are nearly proved to be efficient, so there will be a good combination. document typing services

    ReplyDelete