A brief overview on DocValues
DocValues have been around since Lucene 4.0, and while they have gone through several changes and improvements since then, the basic concept remained the same – they let you record a value per-document per-field in each segment, and allow for very efficient random-access retrieval of those values. Lucene offers several doc-values types: numeric for single-valued numeric fields (date, ratings, counters), sorted/sortedset for single/multi-valued string (mostly, but can be anyBytesRef
) fields used for sorting, faceting, grouping etc., and binary fields for storing arbitrary byte[]
per-document.
Also, Lucene provides several implementations for writing and reading DocValues: Lucene45
(the current default) stores the values on disk, but keeps some information in-memory for fast lookups. Memory
loads the values into compressed arrays, while Direct
loads the values into uncompressed arrays (faster lookups, but higher memory consumption). If your application runs in a resource constrained environment, you can use Disk
which keeps absolutely everything on disk, at the cost of slower lookups.
DocValues record a value for every document in every field and segment even when a document has no value associated with a field. In most cases this is acceptable because you usually store a value for most, if not all, documents in a field (e.g. scoring factors, facets). But even if you index very sparse doc-values fields, the default encoding is very efficient at indexing blocks of "missing" values. Still, if you index hundreds of millions of documents with few dozen doc-values fields, but only a small percentage of them contain a value, writing your own sparse doc-values encoding may be beneficial.
Document deletions are a form of in-place document updates!
The mechanism by which Lucene updates DocValues fields is very similar to how document deletions are handled, so let’s look at how that’s done first. When you issue e.g. adeleteDocument(term)
request, Lucene caches the request until an updated SegmentReader
is needed, or the index is committed. It then resolves all the issued deletes to determine which documents are affected, both in already flushed segments as well as documents that were in IndexWriter’s RAM buffer when the delete was issued.
Up-to-dateIf an updated reader is needed, Lucene carries the deletes in memory by turning off a bit in the in-memory bit vector of each affected segment, and avoids flushing the bit vector to the index directory, which may reside on a slow storage device (e.g. NFS share). If the application also has an NRT reader open, the bit vector is cloned (because the app’s reader should not be made aware of the new deletes until it’s reopened), leading to double RAM consumption by those vectors. However, once that cost is paid once, it is "fixed" (1 bit/doc, orSegmentReaders
are needed in two cases: when a merge kicks off (SegmentMerger
usesSegmentReaders
to merge segments) and when the application reopens its NRT (near-real-time) reader.
maxDoc/8
bytes overall), no matter how many documents are later deleted. Also, as soon as the application releases its NRT reader (usually after reopening it), one of the vectors is released and can be reclaimed by the GC.
Updatable DocValues
Like document deletions, when the application issues e.g. anupdateNumericDocValue(term, field, value)
request, Lucene caches the request until an updated reader is needed or the index is committed, at which point it resolves the updates against all affected documents. Unlike deletes though, doc-values consume much more RAM, which makes carrying the updates in-memory not feasible for many applications: numeric doc-values may potentially consume 8 bytes per-document, while binary values are usually even more expensive.
Therefore, when the updates are resolved, Lucene currently rewrites the entire field that was updated to the index directory. So for example, if you update a numeric doc-values field of a document in a 1 million documents segment, Lucene rewrites 1 million numeric values, or 8MB (at most). Because Lucene segments are write-once, the updated field is written to a gen’d doc-values file of the segment. A reader then reads the values of a field from the respective gen’d doc-values file. While this has a highish cost during indexing, it does not affect search at all, because the values are read from a single file, and often this is a tradeoff that most applications are willing to make.
To illustrate, the code below indexes several documents with "date" and "ratings" NumericDocValuesField
s and then lists the files in the index directory:
IndexWriterConfig conf = new IndexWriterConfig(...); conf.setUseCompoundFile(false); conf.setCodec(new SimpleTextCodec()); IndexWriter writer = new IndexWriter(dir, conf); ReaderManager manager = new ReaderManager(writer, true); for (int i = 0; i < 5; i++) { Document doc = new Document(); doc.add(new StringField("id", ""+i, Store.NO)); doc.add(new NumericDocValuesField("date", i)); doc.add(new NumericDocValuesField("ratings", i*10)); writer.addDocument(doc); } manager.maybeRefresh(); // triggers a flush listDocValuesFiles(dir); // update the 'date' field's value of document '0' writer.updateNumericDocValue( new Term("id", "0"), "date", System.currentTimeMillis()); manager.maybeRefresh(); // triggers applying the update listDocValuesFiles(dir);You will see the following print:
files: _0.dat files: _0_1.dat _0.datNotice how after the "date" field was updated and NRT reader re-opened, the index directory contains two sets of doc-values files:
_0.dat
stores both the "date" and "ratings" values and _0_1.dat
stores just the "date" field’s values (the _1
part in the file name is the doc-values generation for the "date" field). Since the code uses SimpleTextCodec, we can easily print the contents of each file. Below is such print, clipped to show just the header of the files:
file: _0_1.dat field date type NUMERIC file: _0.dat field date type NUMERIC field ratings type NUMERICIf we update the value of the "date" field for the same or any other document and then list the doc-values files in the index, we will see:
files: _0_2.dat _0.datNotice how
_0_1.dat
no longer exists in the index - that’s because it only contains values for the "date" field, which was entirely rewritten by the second update (under generation 2
), and so _0_1.dat
can be deleted.
Stacked Segments
On LUCENE-4258 we’ve started to implement field updates taking a stacked-segments approach. A stacked segment can be thought of as a layer/filter onto an existing segment, which indexes only the updates to that segment. When the full view of the segment is required (e.g. during search), the segment "collapses" all the updates down, so that the most up-to-date view of the segment is presented to the application. The approach taken to implement updatable doc-values is similar in concept, only the doc-values "stacks" index the values of all documents. If we were to implement the approach taken onLUCENE-4258
, then the doc-values stacks would index only the values of the updated documents. While this could have improved the indexing performance (writing a single value ought to be faster than writing 1 million values, no matter how fast your IO system is), it would affect search time since in order to resolve the value of a document, we would need to look for its most up-to-date value in any of the stacks. There are efficient ways to do this, but as usual, each comes with its own tradeoffs.
The key point is that stacked segments have a tradeoff between indexing and search and it is important to measure that tradeoff, as in Lucene, we often prefer to tradeoff in favor of search. I will publish in a separate post some benchmark numbers of the current approach, which will also serve as a baseline for measuring alternative approaches in the future.
This comment has been removed by the author.
ReplyDeleteThere has been such values and codes prescribed which are considered to be important to some extent and in most of the cases they are nearly proved to be efficient, so there will be a good combination. document typing services
ReplyDelete