Code Reduction

Monday, January 11, 2016

Indexing Tagged Data with Lucene (part 4)

Indexing single-word annotations is cool, but will likely not be enough. Let's take the following text (with crazy-colored animals): quick rosy brown fox and a pale violet red dog. Our simple ColorAnnotator only accepts single color words, such as "brown", "red" and "violet", however, it cannot detect "rosy brown" and "pale violet red" (and well, that is just unacceptable, isn't it!).

Annotating multiple words together can be done either before indexing, or during indexing by implementing another TokenFilter. In this post I focus on how to index those annotations, and therefore let's assume that our indexer receives the annotation markers together with the text to index. For simplicity, let's also assume that the information is given to us in an array of integer pairs, where the first integer denotes the start (word) position of the annotation in the text, and the second denotes the annotation's length. For example, for the text above we might get the following color annotation markers: [1,2,2,1,6,3,7,1,8,1].

quick rosy brown fox and a pale violet red dog
      ---------- (1,2)     --------------- (6,3)
           ----- (2,1)          ------     (7,1)
                                       --- (8,1)

To index these annotations we can choose between two approaches. Both index each word that is covered by an annotation marker, however they index the _any_ term differently. The first (and simpler one) will index it at every position as the original terms, while the second approach will represent it as a single token which spans multiple positions. The way to do that is to encode the annotation length in Lucene's PayloadAttribute.

`SimplePreAnnotatedTokenFilter`

The filter takes annotation markers as input (as described above), and filters tokens that do not fall within any annotation. It extends FilteringTokenFilter and only accepts or rejects tokens, but does not modify their attributes or injects new terms (hence its called Simple). Let's go through its code:

public final class SimplePreAnnotatedTokenFilter extends FilteringTokenFilter {

  private final PositionIncrementAttribute posIncrAtt = ...;
  private final int[] markers;

  private int absPosition;
  private int curStart;
  private int curEnd;
  private int pairIdx;

  public SimplePreAnnotatedTokenFilter(TokenStream input, int... markers) {
    super(input);
    this.markers = getSortedFilteredMarkers(markers);                 // (1)
  }

  @Override
  protected boolean accept() throws IOException {
    absPosition += posIncrAtt.getPositionIncrement();
    return acceptCurrentToken();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    absPosition = -1;
    pairIdx = 0;
    updateCurrentStartEnd();
  }

  private boolean acceptCurrentToken() {                                // (2)
    if (absPosition < curStart) {
      return false;
    }
    if (absPosition > curEnd) {
      pairIdx += 2;
      if (pairIdx < markers.length) {
        updateCurrentStartEnd();
        return acceptCurrentToken();                                    // (3)
      }
      // No more annotation markers                                     // (4)
      curStart = Integer.MAX_VALUE;
      curEnd = Integer.MAX_VALUE;
      return false;
    }
    return true;
 }

  private void updateCurrentStartEnd() {
    curStart = markers[pairIdx];
    curEnd = curStart + markers[pairIdx + 1] - 1;
  }

}

(1): Some annotation markers may be covered by others. E.g. in the example above, both (7,1) and (8,1) are covered by (6,3) and therefore are redundant to keep.
(2): Checks if the current token's position is covered by any annotation marker.
(3): Since we move through the markers sequentially, if the current token appears after the current annotation marker, continue to search for other markers that may cover it.
(4): All annotation markers were visited, therefore no additional tokens can be accepted. Since we extend FilteringTokenFilter, we can only accept/reject a token, but we cannot short-circuit the entire stream, so we will just ignore all future tokens.

Let's index few documents using this token filter and inspect the indexed terms as well try out some searches:

private void addDocument(IndexWriter writer, String text, int... colorAnnotations) {
  final Tokenizer tokenizer = ...;
  final TeeSinkTokenFilter textStream = ...;
  final TokenStream colorsStream = 
      new AnyAnnotationTokenFilter(                                // (1)
          new SimplePreAnnotatedTokenFilter(                       // (2)
              textStream.newSinkTokenStream(), colorAnnotations));
  ...
}

addDocument(writer, "quick rosy brown fox and a pale violet red dog", 
                    1, 2, 2, 1, 6, 3, 7, 1, 8, 1);
addDocument(writer, "only red dog", 1, 1);
addDocument(writer, "man with red pale face", 2, 1);

(1): As before, add the _any_ token in addition to all annotated color tokens.
(2): Our SimplePreAnnotatedTokenFilter, taking the input stream and annotation markers.

Let's print the information about the terms of the "color" field:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=5, pos=[1, 2, 6, 7, 8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  brown
    doc=0, freq=1, pos=[2]
  pale
    doc=0, freq=1, pos=[6]
  red
    doc=0, freq=1, pos=[8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  rosy
    doc=0, freq=1, pos=[1]
  violet
    doc=0, freq=1, pos=[7]

As you can see, besides just the clear color terms ("red", "brown", "violet"), the "color" field also contains tokens that are not colors, but do appear as part of the color annotations. This is useful as it allows us to e.g. search for pale or rosy colors (color:(pale OR rosy)). Also, you can notice that the _any_ term appears in all documents and at the same position of other terms that were annotated as colors. This allows us to search for e.g. colored foxes, using this code:

final SpanQuery anyColor = 
    new SpanTermQuery(new Term(COLOR_FIELD, ANY_ANNOTATION_TERM));
final SpanQuery colorAsText = new FieldMaskingSpanQuery(anyColor, TEXT_FIELD);
final SpanQuery fox = new SpanTermQuery(new Term(TEXT_FIELD, "fox"));
final SpanQuery coloredFox = 
    new SpanNearQuery(new SpanQuery[] { colorAsText, fox }, 0, true);

You can notice how I set slop=0 on the SpanNearQuery, which is just like executing a phrase query. In this case our phrase query is "color:_any_ text:fox", only we cannot parse phrase queries across fields, so SpanNearQuery gives us what we need. Executing this query gives the following:

Searching for [spanNear([mask(color:_any_) as text, text:fox], 0, true)]:
  doc=0, text=quick rosy brown fox and a pale violet red dog

This document is found because the term _any_ appears at position 2 (same as brown), and since the term fox appears at position 3, these two follow each other and are considered a phrase. We now start to see the power of indexing multi-word annotations!

Indexing multi-word annotations as a single term

One down side of indexing an _any_ token for each original annotated term, is that if you have very long annotations, this might balloon your index more than you need, as well impact query performance. Instead, we can index a single _any_ term for each annotation, with enough information that will tell us the length of that annotation.

To achieve that, let's implement PreAnnotatedTokenFilter. Like its simpler cousin, it takes an array of annotation markers and filters terms that are not covered by any of the markers. Unlike it though, it also handles adding the _any_ term, and therefore does not extend FilteringTokenFilter. Let's take a look at its code:

public final class PreAnnotatedTokenFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";                 // (1)

  private static final int MAX_BYTES_IN_VINT = 5;

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;
  private final PayloadAttribute payloadAtt = ...;
  private final int[] markers;

  private final BytesRef payloadBytes = new BytesRef(MAX_BYTES_IN_VINT);    // (2)
  private final ByteArrayDataOutput out = ...;

  private int skippedPositions;
  private int absPosition;
  private int curStart;
  private int curEnd;
  private int pairIdx;

  private State state = null;

  public PreAnnotatedTokenFilter(TokenStream input, int... markers) {
    super(input);
    this.markers = getSortedFilteredMarkers(markers);
  }

  /** Update the payload attribute for the {@link #ANY_ANNOTATION_TERM}. */
  private void outputAnyTerm() throws IOException {                         // (3)
    state = captureState();
    termAtt.setEmpty().append(ANY_ANNOTATION_TERM);
    out.reset(payloadBytes.bytes);
    out.writeVInt(curEnd - curStart + 1);
    payloadBytes.length = out.getPosition();
    payloadAtt.setPayload(payloadBytes);
  }

  private void outputFirstAnnotatedTerm() {                                 // (4)
    restoreState(state);
    state = null;
    // Output first annotated term at same position as ANY_ANNOTATION_TERM
    posIncrAtt.setPositionIncrement(0);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (state != null) {
      outputFirstAnnotatedTerm();
      return true;
    }

    skippedPositions = 0;
    while (input.incrementToken()) {
      final int posIncr = posIncrAtt.getPositionIncrement();
      absPosition += posIncr;
      if (acceptCurrentToken()) {                                           // (5)
        posIncrAtt.setPositionIncrement(posIncr + skippedPositions);
        // Output the ANY_ANNOTATION_TERM term first
        if (absPosition == curStart) {
          outputAnyTerm();
        }
        return true;
      }
      skippedPositions += posIncr;
    }
    return false;
  }

  ...

}

(1): We need to handle the _any term ourselves now, because we index it differently.
(2): The payload of the _any_ term holds the length of the annotation as VInt.
(3): When the filter outputs the _any_ term, it sets the payloadAtt value to the length of the annotation. Before it does so though, it captures the state of the token stream, so that whatever attributes it changes can later be restored.
(4): Restore the state of the stream to the one before the _any_ term. Also sets the position increment of the term to 0, so that it is output at the same position of the _any_ term.
(5): acceptCurrentToken() is identical to the implementation of the simple version.

So, it's slightly more involved, but hopefully still understandable. The main difference is the way this filter outputs the _any_ term, and because it adds a PayloadAttribute to that term only, it needs to be more careful with the state of the stream, and therefore captures and restores it. But other than that, it outputs each term that is covered by an annotation, so we should still be able to do everything that its simpler version allowed us. Let's take a look at the indexed information for the "color" field:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=2
      pos=1, payload=[2]
      pos=6, payload=[3]
    doc=1, freq=1
      pos=1, payload=[1]
    doc=2, freq=1
      pos=2, payload=[1]
  brown
    doc=0, freq=1, pos=[2]
  pale
    doc=0, freq=1, pos=[6]
  red
    doc=0, freq=1, pos=[8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  rosy
    doc=0, freq=1, pos=[1]
  violet
    doc=0, freq=1, pos=[7]

First, notice that the indexed information of the annotated terms is identical to the simple version of the filter. The only difference is how the _any_ term is indexed. You can see it has freq=2 (as opposed to 5 before) which exactly matches the number of "color" annotation markers in the document (after removing redundant ones). You'll also notice that it appears at positions 1 and 6 (the start of the annotation markers), and the payload of each position captures the length of the annotation. Nice, we were able to capture the same information in the index, with less term positions, and using Lucene's payloads!

However, if we try to search for colored foxes using the same search code from above, we will get no results. The reason is that SpanTermQuery defines the end position of the span to be 1+span.start() (which is what it should be for normal terms). Therefore, the terms _any_ (which appears at position 1) and fox (which appears at position 3) do not form a phrase. Fortunately, we can quite easily extend Lucene's SpanTermQuery to a MultiPositionSpanQuery, and override the end position to be span.start()+payloadValue. We then can search for colored foxes using this code:

final SpanQuery anyColor = 
    new MultiPositionSpanQuery(new Term(COLOR_FIELD, ANY_ANNOTATION_TERM));
final SpanQuery colorAsText = new FieldMaskingSpanQuery(anyColor, TEXT_FIELD);
final SpanQuery fox = new SpanTermQuery(new Term(TEXT_FIELD, "fox"));
final SpanQuery coloredFox = 
    new SpanNearQuery(new SpanQuery[] { colorAsText, fox }, 0, true);

Which, as expected, yields the following:

Searching for [spanNear([mask(color:_any_) as text, text:fox], 0, true)]:
  doc=0, text=quick rosy brown fox and a pale violet red dog

Which approach to choose?

So first, it's important to note that both approaches index the original annotated terms under the annotation field ("color" in this example). If we only indexed the _any_ term, then searching for a brown fox would translate into something like (text:brown WITHIN color:_any_) NEAR text:fox, which means that we now need to traverse 3 postings lists ("text:brown", "color:_any_" and "text:fox") which is of course more expensive.

The difference then is in how the _any_ term is indexed. Which approach is better for indexing it really depends on the nature of the index, annotations and search. For instance, if all your annotation markers span one position, then SimplePreAnnotatedTokenFilter is likely to perform better as it does not have to index or read the payload. If the annotation markers however span many positions, then PreAnnotatedTokenFilter is likely to perform better, since it indexes less data for the _any_ term.

If your search interface does not make frequent usage of the _any_ term (i.e. the queries are usually specific), then I guess that only if you know that your annotations span a single position, I would go with PreAnnotatedTokenFilter, as it is likely to index less data overall.

Lucene's TokenStream is a very powerful extension point. Knowing how to take advantage of it can greatly help in implementing rich and sophisticated search solutions!

Tuesday, January 5, 2016

Indexing Tagged Data with Lucene (part 3)

So far we've seen two approaches for indexing tagged data with Lucene: processing the text for every annotation and using TeeSinkTokenFilter to share the common analysis chain with the annotating filters. We've also seen that AnnotatorTokenFilter adds the terms that were accepted by the annotator to the index, which allows us to search for e.g. color:red or animal:fox.

We have also seen how to search for all documents with colors, by sending the query color:*. However, this query can be very expensive when the "color" field has many terms, as the query is expanded to all the terms in the field. According to this table, there are 300+ named colors. With animals (well, species on earth) the situation is even worse, as our "animals" field could potentially have >8 million tokens. If we executed the query animal:*, it's going to expand into one humongous query...

Adding an `_any_` term

Now that we know how to implement a TokenFilter, we can implement one which adds a fixed term, say _any_, to every document that contains a token that is accepted by the annotator. Then, we can simply search for animal:_any_, and our query becomes a simple TermQuery which traverses only one term's indexed list (also referred to as a posting list). Let's just do that:

public final class AnyAnnotationFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;

  private boolean addedOrigTerm = false;

  public AnyAnnotationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!addedOrigTerm) {
      addedOrigTerm = true;                          // (1)
      return input.incrementToken();
    }

    termAtt.setEmpty().append(ANY_ANNOTATION_TERM);  // (2)
    posIncrAtt.setPositionIncrement(0);              // (3)
    addedOrigTerm = false;
    return true;
  }

  @Override
  public void reset() throws IOException {
    addedOrigTerm = false;
    super.reset();
  }
}

(1): First return the original term.
(2): Set the token's value to _any_ so that it's added to the index.
(3): Set the position increment to 0, so that the _any_ term is added at the same position as the original term.

We should now change the code of addDocument() as shown here to wrap the token stream with the new filter:

private void addDocument(IndexWriter writer, String text) {
  ...
  final TokenStream colorsStream = 
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  final TokenStream animalsStream = 
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  ...

Let's print the terms of the "color" field, to see the information of its _any_ term:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=2, pos=[0, 4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]

You can see that the term _any_ is added to every document with colors, and that in doc=0 it was added twice, at the same positions of the color terms ("brown" and "red"). This now allows us to search for color:red AND animals:_any_ (documents with red animals). Actually that query will match any document which contains an animal, and the color "red", but not necessarily "a red animal". We can construct a SpanQuery programmatically to achieve that:

final SpanQuery red = new SpanTermQuery(new Term(COLOR_FIELD, "red"));
final SpanQuery redColorAsAnimal = new FieldMaskingSpanQuery(red, ANIMAL_FIELD);
final SpanQuery anyAnimal = new SpanTermQuery(new Term(ANIMAL_FIELD, "_any_"));
final SpanQuery redAnimals = new SpanNearQuery(
    new SpanQuery[] { redColorAsAnimal, anyAnimal }, 1, true);

What's nice about this query is that we search only on the annotation fields, i.e. that metadata that never explicitly existed in the input text! And indeed, this returns the following results:

Searching for [spanNear([mask(color:red) as animal, animal:_any_], 1, true)]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog

The _any_ term adds more information to the index (i.e. it repeats the indexed information of all annotations). However, in practice this would normally not have a big impact on your index, since it will already contain plenty of other indexed data. But, if you don't expect to execute such global matching queries (animal:*), you can avoid the indexing of the extra term by removing the AnyAnnotationFilter from your analysis chain.

Multi word annotations

So far we have seen how to index single word annotations with Lucene. The techniques that I covered are simple, but not enough if you want to detect and index multi-word annotations, e.g. the color Dark Sea Green. To do that, we will need to use another of Lucene's very useful data structures, the Payload. I will demonstrate how to use it for our multi-word annotations in a follow-up post.

Indexing Tagged Data with Lucene (part 2)

In the previous post I demonstrated how to index tagged data with Lucene. I used an Analyzer which filters out terms that are not accepted/recognized by an Annotator and added the text twice to the document: once for the "text" field and again for the "color" field (where the color-annotated terms were indexed).

I also said that this approach is inefficient when indexing large amounts of data, or when using multiple annotators, and that Lucene allows us to improve that by using its TeeSinkTokenFilter. In this post I will show how to use it, but before I do that, let's index some more data and add an AnimalAnnotator to also detect animals.

private void indexDocs(IndexWriter writer) {
  addDocument(writer, "brown fox and a red dog");
  addDocument(writer, "black and white cow");
  addDocument(writer, "no red animals here");
  writer.commit();
}

private void addDocument(IndexWriter writer, String text) {
  final Document doc = new Document();
  doc.add(new TextField(TEXT_FIELD, text, Store.YES));
  doc.add(new TextField(COLOR_FIELD, text, Store.NO));
  doc.add(new TextField(ANIMAL_FIELD, text, Store.NO));         // (1)
  writer.addDocument(doc);
}

public final class AnimalAnnotatorAnalyzer extends Analyzer {   // (2)
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer tokenizer = new WhitespaceTokenizer();
    final TokenStream stream = new AnnotatorTokenFilter(tokenizer, 
        AnimalAnnotator.withDefaultAnimals());
    return new TokenStreamComponents(tokenizer, stream);
  }
}

private Analyzer createAnalyzer() {
  final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer();
  final Analyzer animalAnnotatorAnalyzer = new AnimalAnnotatorAnalyzer();
  final Analyzer defaultAnalyzer = new WhitespaceAnalyzer();
  return new PerFieldAnalyzerWrapper(defaultAnalyzer,
        ImmutableMap. of(
            COLOR_FIELD, colorAnnotatorAnalyzer,
            ANIMAL_FIELD, animalAnnotatorAnalyzer));            // (3)
}

(1): Add an "animal" field, in addition to "text" and "color".
(2): Similar to ColorAnnotatorAnalyzer, only chains AnnotatorFilter with AnimalAnnotator.
(3): Add AnimalAnnotatorAnalyzer to the PerFieldAnalyzerWrapper.

We can now also print the terms of the "animal" field, and with their position information we get the following:

Terms for field [color], with positional info:
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
Terms for field [animal], with positional info:
  dog
    doc=0, freq=1, pos=[5]
    doc=1, freq=1, pos=[2]
  fox
    doc=0, freq=1, pos=[1]

Now that we have "color" and "animal" annotations, we can explore more fun queries:

// Search for documents with animals and colors
Searching for [+animal:* +color:*]:
  doc=0, text=brown fox and a red dog
  doc=1, text=only red dog

// Search for documents with red animals
Searching for [+animal:* +color:red]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog

Indexing annotations with `TeeSinkTokenFilter`

TeeSinkTokenFilter is a TokenFilter which allows sharing analysis chains between multiple fields, that at some point need to go their separate ways. Named after Unix's tee command, it pipes the output of the source stream to one or more "sinks", which can then continue processing the token independently of the source as well the other "sinks".

This sounds perfect for our use case! We want to have the input text tokenized by whitespaces (and potentially also lowercased and have stopwords removed) and indexed in a "text" field (for regular searches). We also want to extract colors and animals, and index them in separate fields. With TeeSinkTokenFilter, our addDocument() code changes as follows:

private void addDocument(IndexWriter writer, String text) {
  final Tokenizer tokenizer = new WhitespaceTokenizer();
  tokenizer.setReader(new StringReader(text));
  final TeeSinkTokenFilter textStream = new TeeSinkTokenFilter(tokenizer);    // (1)
  final TokenStream colorsStream = new AnnotatorTokenFilter(                  // (2)
      textStream.newSinkTokenStream(), ColorAnnotator.withDefaultColors());
  final TokenStream animalsStream = new AnnotatorTokenFilter(                 // (2)
      textStream.newSinkTokenStream(), AnimalAnnotator.withDefaultAnimals());

  final Document doc = new Document();
  doc.add(new StoredField(TEXT_FIELD, text));                                 // (3)
  doc.add(new TextField(TEXT_FIELD, textStream));                             // (4)
  doc.add(new TextField(COLOR_FIELD, colorsStream));                          // (4)
  doc.add(new TextField(ANIMAL_FIELD, animalsStream));                        // (4)
  writer.addDocument(doc);

(1): Create the TeeSinkTokenFilter over WhitespaceTokenizer. You can chain additional TokenFilters for additional analysis (lowercasing, stopwrods removal), and wrap the full chain with TeeSinkTokenFilter.
(2): Create a SinkTokenStream from textStream, so that it receives all processed tokens and can perform independent analysis on them (here, applying AnnotatorTokenFilter). Notice that we create a separate sink for "colors" and "animals".
(3): Since we now add the value of the "text" field as a TokenStream, we also need to separately add it as a StoredField (the former does not store the value).
(4): Add each TokenStream to its respective field.

OK, so definitely the indexing code is not as simple as it was before. We now need to add fields to the document that depend on each other, and therefore we need to write a slightly more involved code. However, we do gain from the input text being processed only once, as opposed to 3 times before (the "text" and two annotation fields).

NOTE: when you work with TeeSinkTokenFilter, you should add the source field (textStream above) before you add the "sink" fields to the document, since it needs to be processed before they are.

Besides the changes to addDocument(), we no longer need to configure IndexWriter with the PerFieldAnalyzerWrapper that we created before. In fact, the way we add documents above, it doesn't matter what analyzer we configure the indexer with, since the indexer will use the pre-built TokenStream that we added the fields with.

Furthermore, if we run the same search code as we did before, we will get the exact same results. That's because we only changed the way the fields are processed, but not the way they are indexed. This is pretty cool!

Conclusions

Lucene does not allow one field to inject tokens into other fields, but with TeeSinkTokenFilter we can overcome that limitation. Each token that is processed by the source, is further sent to the "sinks", who can continue processing it independently of the source and other "sinks" (changes to the stream are only visible to them).

Using TeeSinkTokenFilter does come with a cost though. With the current implementation, every "sink" caches the state of the attributes of every produced token by the source. Furthermore, the "sinks" do not share these states with each other. That means that you might see a memory increase on your heap during indexing time, although these objects may be collected as soon as you're done indexing that document.

This can likely be improved by having the "sinks" share the cached states with each other. But even without it, if you're indexing large amounts of data, and especially if you have an expensive analysis chain that you can share with the "sinks", using TeeSinkTokenFilter might be beneficial to you.

Happy tee'ing!

Sunday, January 3, 2016

Indexing Tagged Data with Lucene

If you want to index the text quick brown fox and a red dog with Lucene, you just add a TextField to a Document and index it with IndexWriter. You can then search for it by sending a query which contains one or more of the indexed words. The following code snippet demonstrates how easy it is:

final String text = "quick brown fox and a red dog";
final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, text, Store.YES);           // (1)
writer.addDocument(doc);                                      // (2)
writer.commit();                                              // (3)

final Query q = new TermQuery(new Term(TEXT_FIELD, "quick")); // (4)
final TopDocs results = searcher.search(q, 10);               // (5)
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));

(1): Add the text to index as a TextField.
(2): Index the document. This will use the default Analyzer that was configured for IndexWriter to extract and process the field's terms.
(3): Commit the changes to the index so that they are visible to the searcher. Note, you rarely want to commit() after every indexed document, and in many cases it is preferred to use Lucene's near-realtime search.
(4): Create a Query to search for the text. Usually you will use a QueryParser to parse a query such as text:quick.
(5): Execute the search. TopDocs holds the list of matching documents; in the example above only one document.

However, what if you (or your users) are not interested in the direct information that’s embedded in the text, but rather about meta-information such as "documents with red colors" or "documents about foxes (animal)"? Clearly, feeding search queries such as animal:fox or color:red isn’t going to work since we never added "color" nor "animal" fields to the document.

Detecting color terms

Named-entity recognition (also known as Named-entity extraction) is a task that seeks to classify elements in the text into categories. There are various tools that cope with that task, and a quick search for entity extraction tools returns quite a few, as well this discussion. Such extraction techniques and tools are beyond the scope of this post though. Let's assume that we have a simple Annotator interface, which can either accept or reject tokens, with two implementations: a ColorAnnotator which accepts red and brown, and an AnimalAnnotator which accepts fox and dog.

How could we use this annotator to tag our text, such that the index captured the tagged data, and allowed us to find the document by searching for color:red or animal:fox?

An annotating `TokenFilter`

As mentioned above, when we add text fields to documents, they are processed by the Analyzer that was configured for the indexer. Lucene's Analyzer produces a TokenStream which processes the text and produces index tokens. That token stream comprises a Tokenizer and TokenFilter. The former is responsible for breaking the input text into tokens and the latter is responsible for processing them. Lucene offers a great variety of filters out of the box. Some drop tokens (e.g. stopwords), some replace tokens (e.g. stemming) and others inject additional ones (e.g. synonyms).

You can learn more about Lucene's analysis chain here.

Since we are not after breaking the input text into words, let's write an AnnotatorTokenFilter which emits only words that are accepted by an annotator. Fortunately, Lucene provides a base class for filters that keep only some of the input tokens, called FilteringTokenFilter. It lets us focus only on the task of accepting/rejecting tokens, while it handles the rest of the analysis stuff, which mainly consists of updating token attributes (it can sometimes be quite tricky). By extending it, and using our Annotator interface, we come up with the following implementation:

/**
 * A {@link FilteringTokenFilter} which uses an {@link Annotator} to 
 * {@link #accept()} tokens.
 */
public final class AnnotatorTokenFilter extends FilteringTokenFilter {

  private final CharTermAttribute termAtt = 
      addAttribute(CharTermAttribute.class);                             // (1)
  private final Annotator annotator;

  public AnnotatorTokenFilter(TokenStream input, Annotator annotator) {
    super(input);
    this.annotator = annotator;
  }

  @Override
  protected boolean accept() throws IOException {                        // (2)
    return annotator.accept(termAtt.buffer(), 0, termAtt.length());
  }
}

(1): The standard way to get access to the term attribute on the chain. Note that we don't need to explicitly populate it, as it's being populated by the downstream tokenizer and additional filters.
(2): The only method that we need to implement. If our annotator accepts the current token, we retain it on the stream. Otherwise, it will not be indexed.

Lucene also provides a KeepWordFilter which takes a set of words to keep. We could use it by passing a list of words for each category, however in reality a simple list of words may not be enough to extract entities from the text (e.g. consider muli-word colors).

Sweet! It really can't get any simpler than that. Now we just need to build an Analyzer which uses our filter and we can move on to indexing and searching colors:

/**
 * An {@link Analyzer} which chains {@link WhitespaceTokenizer} and 
 * {@link AnnotatingTokenFilter} with {@link ColorAnnotator}.
 */
public static final class ColorAnnotatorAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer tokenizer = new WhitespaceTokenizer();           // (1)
    final TokenStream stream = new AnnotatorTokenFilter(tokenizer,   // (2)
        ColorAnnotator.withDefaultColors());
    return new TokenStreamComponents(tokenizer, stream);
  }
}

(1): For simplicity only, use a whitespace tokenizer.
(2): You will most likely want to chain additional filters before the annotator, e.g. lower-casing, stemming etc.

Index and search annotated colors

To index the color annotations, we are going to index two fields: "text" and "color". The "text" field will index the original text's tokens while the "color" field will index only the ones that are also colors. Lucene provides PerFieldAnalyzerWrapper which can return a different analyzer per field. It's quite handy and easy to construct:

private static Analyzer createAnalyzer() {
  final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer(); // (1)
  final Analyzer defaultAnalyzer = new WhitespaceAnalyzer();            // (2)
  return new PerFieldAnalyzerWrapper(defaultAnalyzer,
      ImmutableMap.<String, Analyzer> of(
          COLOR_FIELD, colorAnnotatorAnalyzer));                        // (3)
}

(1): The color annotating analyzer that we created before.
(2): The default analyzer to be used for all fields without an explicit set analyzer.
(3): Set the color annotating analyzer to the "color" field.

Our indexing code then changes as follows:

final Analyzer analyzer = createAnalyzer();
final IndexWriterConfig conf = new IndexWriterConfig(analyzer); // (1)
final IndexWriter writer = new IndexWriter(dir, conf);

final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, TEXT, Store.YES));
doc.add(new TextField(COLOR_FIELD, TEXT, Store.NO));            // (2)
writer.addDocument(doc);
writer.commit();

(1): Configure IndexWriter with our PerFieldAnalyzerWrapper.
(2): In addition to adding the text to the "text" field, also add it to the "color" field. Notice that there is no point storing it for this field too.

Printing the indexed terms for each field will yield the following output:

Terms for field [text]:
  a
  and
  brown
  dog
  fox
  quick
  red
Terms for field [color]:
  brown
  red

As you can see, the "text" field contains all words from the text that we indexed, while the "color" field contains only the words that were identified as colors by our ColorAnnotator. We can now also search for color:red to get the document back:

final Query q = new TermQuery(new Term(COLOR_FIELD, "red"));
final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));

That prints the following output, which is the value of the "text" field of the document:

quick brown fox and a red dog

Annotated terms' position

If we also print the full indexed information about the "color" terms, we will notice that they retain their original text position:

Terms for field [color], with additional info:
  brown
    doc=0, freq=1
      pos=1
  red
    doc=0, freq=1
      pos=5

This is important since it provides us a direct back-reference to the input text. That way we can search for "foxes that are brown-colored", while if we omit the exact position information of 'brown', we may not be able to associate the token 'fox' with the color 'brown'. Let's demonstrate that capability using Lucene's a SpanQuery:

final SpanQuery brown = new SpanTermQuery(new Term(COLOR_FIELD, "brown")); // (1)
final SpanQuery brownText = new FieldMaskingSpanQuery(brown, TEXT_FIELD);  // (2)
final SpanQuery quick = new SpanTermQuery(new Term(TEXT_FIELD, "quick"));  // (3)
final Query q = new SpanNearQuery(                                         // (4)
    new SpanQuery[] { quick, brownText }, 1, true);

final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));

(1): Matches "brown" in the "color" field.
(2): Matches "quick" in the "text" field.
(3): Since SpanNearQuery requires the input SpanQuery-ies to refer to the same field, we mask the "color" one under the "text" field.
(4): Search for "quick" followed by "brown".

A note about efficiency

In this post I've demonstrated a very basic approach to index and search tagged data with Lucene. As I noted above, the input data is added to a separate field for every annotator that you will apply to it. This may be OK in case you index short texts and only apply few annotators. However, if you index large data, and especially since most likely your analysis chain comprises something more complex than just whitespace tokenization, there will be performance implications to this approach.

In a follow-up post I will demonstrate how to process the input data only once, by using another of Lucene's cool TokenStream implementations, TeeSinkTokenFilter.

Sunday, April 27, 2014

Expressions with Lucene

Lucene's expressions module allows computing values for documents using arbitrary mathematical expressions. Among other things, expressions allow a very easy and powerful way to e.g. sort search results and geospatial faceting. The module parses String expressions in JavaScript notation, and returns an object which computes a value for each requested document. Expressions are built of literals, variables and functions. For example, in the expression 5*sqrt(votes), 5 is a literal, sqrt is a function and votes is a variable.

The JavaScript parser recognizes a handful of useful functions, such as max, min, sqrt etc. The javadocs contain the full list of functions as well as operators that the parser recognizes. The variables are resolved by binding their name to a ValueSource. For example, in order to parse and execute the above expression, you need to write code similar to this:

Expression expr = JavascriptCompiler.compile("5*sqrt(votes)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("votes", SortField.Type.INT));

The votes variable is bound to the numeric field "votes" (usually, a NumericDocValuesField). SimpleBindings let you bind a variable to a SortField, however internally it is bound to a ValueSource. Expression itself returns a ValueSource, and when your application asks for the value of a document, it computes it based on the formula and the bounded variables' ValueSources.

Customizing expressions

JavascriptCompiler lets you pass a mapping of custom functions, where each function is implemented by a public and static Method, which takes up to 256 double parameters and returns a double. You can see a good example here. That, together with variables, provides great customization capabilities of expressions.

Sometimes customizing expressions is not so straightforward. For example, someone recently asked on the Lucene user-list how to use a multi-valued field in an expression, for the purpose of computing different functions on it (max, sum etc.). At first, it looks like a custom function, e.g. maxAll() (max() is already taken), which can be embedded in an expression like 5*maxVal(data). However, since data is a variable, and variables are bound to ValueSources (which return a single value for a document), we cannot pass all values of data to maxAll().

We can implement a ValueSource though, which returns the maximum value of the field data, and bind a variable max.data to it. Since this isn't a true function, we cannot use a more natural notation like max(data). Perhaps one day Lucene will have built-in support for multi-valued numeric fields, and the expressions module will auto-detect such fields and pass all values to the function (you're welcome to contribute patches!). Until then though, you need to implement a ValueSource for each such function, but fortunately it is quite trivial. I wrote some prototype code below which demonstrates how to do that.

▶ Multi-valued expressions demo

First, to index a multi-valued numeric field, you can use BinaryDocValues by encoding the list of numbers into a byte[]. The following code indexes five documents with a binary doc-values field which encodes some integers using variable-length coding:

BytesRef scratch = new BytesRef(24); // max for three longs
ByteArrayDataOutput out = new ByteArrayDataOutput(scratch.bytes);
for (int i = 0; i < 5; i++) {
  Document doc = new Document();
  out.reset(scratch.bytes);
  out.writeVLong(i);
  out.writeVLong(i*2);
  out.writeVLong(i*3);
  scratch.length = out.getPosition();
  doc.add(new BinaryDocValuesField("data", scratch));
  writer.addDocument(doc);
}

Next, we need to write a MaxValuesSource which reads from the binary doc-values field and returns the maximum value. Its getValues method looks like this:

@Override
public FunctionValues getValues(Map context, AtomicReaderContext context)
  throws IOException {
  final BinaryDocValues bdv = context.reader().getBinaryDocValues(field);
  final BytesRef scratch = new BytesRef(24); // max for three longs
  final ByteArrayDataInput in = new ByteArrayDataInput();
  return new DoubleDocValues(this) {
    @Override
    public double doubleVal(int doc) {
      bdv.get(doc, scratch);
      in.reset(scratch.bytes, 0, scratch.length);
      long v1 = in.readVLong();
      long v2 = in.readVLong();
      long v3 = in.readVLong();
      return Math.max(Math.max(v1, v2), v3);
    }
  };
}

Since SimpleBindings does not allow binding variables to arbitrary ValueSources, we can write our own Bindings extension. For simplicity, I wrapped a SimpleBindings instance for all SortFields, but the custom ones:

class CustomBindings extends Bindings {

  private final Map<String,ValueSource> vsNames = new HashMap<>();
  private final SimpleBindings simple = new SimpleBindings();
  
  public void add(SortField sortField) {
    simple.add(sortField);
  }
  
  public void add(String name, ValueSource valueSource) {
    vsNames.put(name, valueSource);
  }
  
  @Override
  public ValueSource getValueSource(String name) {
    if (vsNames.containsKey(name)) {
      return vsNames.get(name);
    } else {
      return simple.getValueSource(name);
    }
  }
  
}

And finally, to use them all in an expression we should write this code:

Expression expr = JavascriptCompiler.compile("5*sqrt(max.data)");
CustomBindings bindings = new CustomBindings();
// bind the variable to the custom ValueSource
bindings.add("max.data", new MaxValueSource("data"));

Nested expressions

Suppose that your documents contain longitude and latitude fields and you want to use them to compute a document's relevance, by computing the haversine formula. You can easily do that by compiling an expression such as haversin(40.7143528,-74.0059731,latitude,longitude).

Now, what if you want to use the result of that expression in another expression, such as _score + 1/(1+haversin(40.7143528,-74.0059731,latitude,longitude))? It starts to become somewhat longish to read. Wouldn't it be better if we can encapsulate that in a distance variable and read the expression _score + 1/(1+distance) instead? Fortunately, we can do that easily with nested/sub expressions:

Expression distance = JavascriptCompiler.compile(
  "haversin(40.7143528,-74.0059731,latitude,longitude)");
Expression expr = JavascriptCompiler.compile("_score + 1/(1+distance)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("latitude", SortField.Type.LONG));
bindings.add(new SortField("longitude", SortField.Type.LONG));
bindings.add("distance", distance);

Since a variable can be bound to any ValueSource, and expressions are in fact ValueSources, we can bind the result of an expression to a variable in another expression. Nested expressions also cache their result in case you use them e.g. in multiple clauses of another expression. This could be useful if you have a very expensive expression which needs to be evaluated multiple times per document.

The expressions module is very powerful and lets you code custom ranking/sorting/faceting formulas easily. With custom functions, variables and ValueSources, it makes it trivial to extend and build upon even further. If you implement useful functions, you're welcome to contribute them!

Thursday, April 24, 2014

Benchmarking Updatable DocValues

In my previous post I described how updatable DocValues are implemented, and concluded with a brief discussion about alternative approaches we could take to implement them. I created a benchmark, leveraging our benchmarking framework, to measure the indexing and searching effects of updatable DocValues. In this post I compare re-indexing a document versus updating a numeric/binary doc-values field.

The Benchmark

I ran the benchmark on an IBM System x3690 X5, a 40 cores machine (dual Intel® Xeon® E7-2850 clocked @ 2.00GHz, 24MB per-core cache) with SAS 10K disks, using Oracle's 64-bit Java 1.7.0_45 (4GB heap). I also set Linux's swappiness to 0. I indexed two Wikipedia datasets: WIKIBIG (full documents, 6.64 million total docs) and WIKIMEDIUM (documents clipped to 1K chunks of text, 33.3 million total docs), with the following settings:

Directory: NRTCachingDirectory(MMapDirectory)
Analyzer: StandardAnalyzer (no stopwords)
ID postings: MemoryPostingsFormat (speedup lookups by ID)
Postings: Lucene41 (the current default)
DocValues: Direct (eliminates HW factors when reading DocValues)
I index the document’s last-modification-time as a numeric doc-values field, and the title also as a binary doc-values field. Additionally, each document is indexed several fields such as body, title etc., with their respective content from the dataset.

The benchmark (all sources available here) performs the following operations:

Document updates
Documents are updated by multiple threads, where each run performs a different type of update, at a set docs/sec rate. Threads record the time to execute the update:

reindex (writer.updateDocument()): the document is loaded from the dataset and re-indexed fully
numeric (writer.updateNumericDocValue()): the last-modification-time is updated to current time
binary (writer.updateBinaryDocValue()): the title byte[] is updated

Index Reopens
The index is reopened at a set interval by a separate thread. The thread records the time it took for each reopen as well as whether the index was in fact reopened, and whether the thread fell behind (i.e. if reopen was too slow to achieve the target interval).

Searches
While the index is being updated and reopened, search threads execute search tasks against it, and sort the results by the last-modification-time numeric doc-values field to exercise the updated doc-values.

Parameters
There are different costs to re-indexing a document and updating a doc-values field. Re-indexing a document has both some cost during writer.updateDocument(), when the document’s fields are analyzed, as well as during index reopen when deleted documents’ flags are turned off in the in-memory bit vectors of the segments that hold them, and the re-indexed information (postings, stored fields, doc-values…) is flushed. When updating a doc-values field, writer.updateNumericDocValue() merely caches the update, and therefore it’s almost a 0-cost operation, however during index reopen the doc-values field is entirely rewritten for all documents in that segment, even documents that were not updated. Therefore, doc-values updates should see the majority of the cost in that phase.

The benchmark takes few parameters in order to compare different setups:

docsPerSec: a comma separated list of docs/sec rates. A higher rate means more documents are updated between index reopens.
The assumption is that higher rates are worse for re-indexing than updating a doc-values field, since writer.updateDocument() is more expensive than writer.updateNumericDocValue().
reopenIntervalSec: a comma separated list of index reopen intervals. A lower interval means more index reopens, but each reopen flushes a smaller number of document updates.
The assumption is that lower intervals are bad for updatable doc-values, since they have a constant cost no matter how many documents are updated between index reopens, while even though re-indexing a document also has some cost during index reopen, it is proportionally smaller as more documents are flushed at once. Also, because the test uses NRTCachingDirectory, most likely those flushed files don’t even hit the disk.

Results

I ran the benchmark over the two datasets with the following parameters: numIndexThreads=4, numSearchThreads=2, reopenIntervalSec=1,2,5 and docsPerSec=1,10,100,1000. The benchmark runs each reopenIntervalSec and docsPerSec combination for 30 seconds and outputs the following results for each dataset and update method, excluding the first 5 seconds (warmup):

docs/s: the docs/sec rate
reopen/s: the reopen interval
reopen(ms): the average time (in milliseconds) it took to reopen the reader
update(ms): the total time (in milliseconds) for executing the update, averaged per reopen
total(ms): the total time spent per index reopen (sum of reopen(ms) and update(ms))
perdoc(ms): the average cost (in milliseconds) for each updated document
query/s: the average query/sec rate that was achieved, per reopen

The WIKIBIG dataset consists of full Wikipedia documents, and therefore re-indexing a document is expected to be more expensive (proportionally) than for the WIKIMEDIUM dataset. On the other hand, it consists of fewer documents, therefore a doc-values update needs to rewrite a smaller amount of data, and (proportionally) should be cheaper on this index. The results cannot easily be compared between the two datasets, therefore let’s analyze them separately:

WIKIBIG

WIKIBIG: document re-indexing

docs/s reopen/s reopen(ms) update(ms)  total(ms) perdoc(ms) query/s
     1        1      27.13       7.14      34.26      28.99    9.44
     1        2      39.72      11.36      51.08      26.49    9.33
     1        5      91.96      28.33     120.29      24.06    9.40
    10        1      68.13      64.42     132.56      13.68    8.98
    10        2      95.42     157.71     253.14      12.84    9.02
    10        5     189.03     412.33     601.36      12.36    9.43
   100        1     145.55     342.42     487.97       4.96    8.78
   100        2     195.56     523.13     718.70       3.85    8.44
   100        5     513.92   2,089.50   2,603.42       5.31    8.40
  1000        1     531.59     821.79   1,353.38       1.67    7.54
  1000        2   1,050.62   1,768.60   2,819.22       1.91    7.52
  1000        5   2,528.44   4,264.67   6,793.11       1.41    8.45

WIKIBIG: update numeric doc-values

docs/s reopen/s reopen(ms) update(ms)  total(ms) perdoc(ms) query/s
     1        1     225.05       0.00     225.05     207.05   10.70
     1        2     294.69       0.00     294.69     152.80   10.85
     1        5     451.21       0.00     451.21      93.35   10.76
    10        1     592.13       0.00     592.13      61.09    7.92
    10        2     754.99       0.00     754.99      38.30    9.57
    10        5     909.14       0.17     909.30      18.43    9.72
   100        1     794.02       5.42     799.45       8.08    6.56
   100        2     761.09       8.53     769.63       4.16    9.28
   100        5     806.43      20.67     827.10       1.66    9.67
  1000        1   1,066.50      11.29   1,077.79       1.02    3.96
  1000        2     747.82      25.87     773.69       0.42    8.76
  1000        5     787.40      74.00     861.40       0.17    9.50

WIKIBIG: update binary doc-values

docs/s reopen/s reopen(ms) update(ms)  total(ms) perdoc(ms) query/s
     1        1     351.30       0.00     351.30     351.30   10.30
     1        2     712.00       0.00     712.00     356.00    9.72
     1        5   1,138.40       0.00   1,138.40     227.68   10.38
    10        1   1,971.81       0.00   1,971.81     100.13    5.04
    10        2   2,002.26       0.79   2,003.04     101.60    6.11
    10        5   2,093.48       0.33   2,093.82      40.79   10.43
   100        1   2,205.83       9.15   2,214.98      10.62    5.46
   100        2   2,146.07       8.00   2,154.07      10.44    5.93
   100        5   2,334.82      21.17   2,355.99       4.46   10.21
  1000        1   2,187.46      55.00   2,242.46       1.07    4.52
  1000        2   2,352.30      26.67   2,378.97       1.03    5.85
  1000        5   2,390.13      63.67   2,453.79       0.46    9.05

Reopen + Update times

Reopening is almost always faster when re-indexing than updating doc-values. Also, when updating doc-values, the time does not significantly increase as more documents are updated. That is due to the fact that the heavy part of doc-values updates - rewriting the doc-values field - is the same whether we update 1 or 1000 documents in a segment. As for executing the update itself, it is negligible when updating doc-values compared to re-indexing a document.

Total update cost

The results show that when the number of document updates is small, it is faster overall to re-index the documents than update the doc-values field. However when there are many document updates, re-indexing cannot keep up with the reopen interval and starts to fall behind. On the other hand, updating doc-values reaches a "steady" cost very quickly, and is less affected by the number of document updates.

Per document cost

While the cost per document consistently decreases in all update methods, doc-values achieve nearly 90% lower costs than re-indexing the documents. E.g. in the combination of 1000 docs/sec and reopen every 5 seconds (5000 document updates on average between reopens), the per-document cost of numeric doc-values is 0.17 msec compared to 1.41 msec when re-indexing the document. Therefore it looks like that if your application updates many documents between index reopens, updating the doc-values field seems better than re-indexing them.

Query per second

Updating doc-values seems to affect running searches more than re-indexing documents when the reopen interval is low. However, when updating doc-values at a given indexing rate, QPS improves as the reopen interval increases. This is expected because the I/O cost does not change much as more documents are updated, unlike when re-indexing documents. Also, given that we see same drops even when updating binary doc-values (which do not participate in the search at all), strengthens the assumption that the drop is mainly due to I/O, and could be that if I had used a faster hardware (e.g. SSD), the drops would not show at all, or be less significant.

WIKIMEDIUM

WIKIMEDIUM: document re-indexing

docs/s reopen/s reopen(ms) update(ms)  total(ms) perdoc(ms) query/s
     1        1      21.48       2.43      23.92      22.00    4.82
     1        2      28.90       5.93      34.83      18.06    4.91
     1        5      36.53       7.67      44.20       9.14    4.90
    10        1      36.78      15.27      52.05       5.37    4.78
    10        2      50.95      63.50     114.45       5.81    4.43
    10        5      73.93      96.00     169.93       3.44    4.97
   100        1      65.79      36.19     101.98       1.04    4.58
   100        2     107.99      76.29     184.27       0.94    4.48
   100        5     209.22     228.17     437.39       0.89    4.41
  1000        1     127.11     141.04     268.15       0.28    4.46
  1000        2     253.23     364.27     617.50       0.32    4.17
  1000        5     306.16     790.00   1,096.16       0.23    4.43

WIKIMEDIUM: update numeric doc-values

docs/s reopen/s reopen(ms) update(ms)  total(ms) perdoc(ms) query/s
     1        1     796.04       0.00     796.04     700.52    3.78
     1        2   1,218.09       0.00   1,218.09     609.04    4.19
     1        5   2,833.26       0.00   2,833.26     531.24    4.19
    10        1   4,543.56       1.71   4,545.28     113.63    2.48
    10        2   4,276.63       1.14   4,277.77     103.97    2.41
    10        5   4,484.67       1.67   4,486.33      81.08    3.93
   100        1   4,158.31      18.86   4,177.16      11.46    2.00
   100        2   4,069.35      25.29   4,094.64      10.48    2.48
   100        5   4,042.75      33.50   4,076.25       7.54    4.14
  1000        1   5,139.27      47.17   5,186.44       1.14    2.06
  1000        2   4,221.89      49.43   4,271.32       1.05    2.26
  1000        5   4,165.38      78.33   4,243.71       0.77    3.62

WIKIMEDIUM: update binary doc-values

docs/s reopen/s reopen(ms) update(ms)  total(ms) perdoc(ms) query/s
     1        1   7,085.58       0.17   7,085.74   1,036.94    1.44
     1        2   6,294.42       0.00   6,294.42   1,124.00    2.59
     1        5   7,519.37       0.00   7,519.37     911.44    3.12
    10        1   8,874.71      19.00   8,893.71      69.48    2.48
    10        2   8,882.72      27.00   8,909.72      65.51    2.72
    10        5   8,660.44     290.50   8,950.94      53.92    3.14
   100        1   9,960.65     120.50  10,081.15       8.03    2.54
   100        2   9,961.24      87.00  10,048.24       7.30    2.80
   100        5   9,659.66      80.50   9,740.16       6.42    3.17
  1000        1  10,652.19     173.50  10,825.69       0.87    3.84
  1000        2  10,840.01     192.50  11,032.51       0.80    3.07
  1000        5   9,516.62     817.50  10,334.12       0.67    3.47

As described before, the cost of updating a doc-values field in a segment depends on the number of documents in the segment, and less on how many documents are updated in it. Furthermore, even at a low docs/sec rate there’s a high chance that between index reopens all index segments will be "touched" (updated), and therefore on the WIKIMEDIUM index (which contains x5 more documents than the WIKIBIG index) the cost of index reopen is substantially higher.

The results clearly show that re-indexing documents is always preferred to updating doc-values at the given reopen intervals. More so, updatable doc-values almost always fell behind with the reopen interval. This makes the per-document cost results irrelevant to look at because in practice there were much fewer index reopens than in the re-index case, and therefore each reopen affected a larger number of documents (which makes the average per-document appear low). Maybe at a much higher reopen interval (e.g. few tens seconds, or minutes), updating doc-values on that index would have performed better than re-indexing, however in this benchmark I focused on the near-real-time aspects of updatable doc-values, and therefore chose short reopen intervals.

Summary

The current implementation of updatable doc-values can be very useful to applications with relatively small indexes (few million documents) which reopen the index every few seconds. However, for applications with bigger indexes it is currently very expensive and such applications might prefer to re-index the documents instead.

One aspect of re-indexing a document that I didn't fully measure is content acquisition. In this benchmark, the content of the document is read from the local disk while for many applications it usually means accessing a remote repository, which slows things down considerably. Therefore it is recommended for such applications to benchmark this part in their indexing pipeline and determine what is the point where updating doc-values become slower than entirely re-indexing the documents.

So what's next?

In this post I described the stacked segments approach and how it could speed up updatable doc-values, by only writing the documents that were updated in each stack. On the other hand this would have some impact on search, since in order to determine the value of a document, we’d need to look it up through the "stacks".

There is however a hybrid approach, which is similar in nature to document deletions. Instead of flushing the updates to the index directory in every index reopen, we can carry the updates in-memory, and only flush when the updates take too much memory. So if an application updates only a few documents between index reopens, it won’t pay the cost of rewriting the entire doc-values field in each reopen. At some point though, the doc-values field will be written to disk (e.g. on commit or when they consume too much memory), but when that happens then many updates have already accumulated anyway, and so overall the cost of flushing per document is smaller. On the search side, the value of a document will only need to be looked for in two "stacks" - the in-memory and the base stack, which hopefully won’t have a big impact on search.

I think that this approach could work well for near-real-time applications, since they often reopen very frequently with only few document updates in between. Also, in this benchmark I've measured the cost of updating a single doc-values field, however in real-life applications there are often several such fields that are updated, and therefore the cost of reopening an index will be much higher (since every field will be rewritten). Applications already hold the deletions bit-vector in memory (for every segment with deletions; maxDoc/8 bytes), and I believe that spending an additional proportional amount of that memory for doc-values updates is tolerable. Still, once in a while a "naive"-looking reopen (only few updates since last reopen) could trigger a rewrite of the entire doc-values field. To avoid that we need to only flush the documents that were updated (i.e. stacked segments), but this could be a good compromise until we have that.

Looks like there is more fun work to do with updatable doc-values. If you have further ideas on how to improve them, your patches are welcome!

Wednesday, April 23, 2014

Updatable DocValues Under the Hood

We recently committed updatable numeric and binary doc-values fields, which lets you update the value of such fields in a document without re-indexing the whole document. This allows using such fields for e.g. boosting documents relevancy to a query, especially when the boosting factor is frequently changing (like date, ratings, counters etc.). In this post I describe the under-the-hood mechanics of updating those fields.

A brief overview on DocValues

DocValues have been around since Lucene 4.0, and while they have gone through several changes and improvements since then, the basic concept remained the same – they let you record a value per-document per-field in each segment, and allow for very efficient random-access retrieval of those values. Lucene offers several doc-values types: numeric for single-valued numeric fields (date, ratings, counters), sorted/sortedset for single/multi-valued string (mostly, but can be any BytesRef) fields used for sorting, faceting, grouping etc., and binary fields for storing arbitrary byte[] per-document.

Also, Lucene provides several implementations for writing and reading DocValues: Lucene45 (the current default) stores the values on disk, but keeps some information in-memory for fast lookups. Memory loads the values into compressed arrays, while Direct loads the values into uncompressed arrays (faster lookups, but higher memory consumption). If your application runs in a resource constrained environment, you can use Disk which keeps absolutely everything on disk, at the cost of slower lookups.

DocValues record a value for every document in every field and segment even when a document has no value associated with a field. In most cases this is acceptable because you usually store a value for most, if not all, documents in a field (e.g. scoring factors, facets). But even if you index very sparse doc-values fields, the default encoding is very efficient at indexing blocks of "missing" values. Still, if you index hundreds of millions of documents with few dozen doc-values fields, but only a small percentage of them contain a value, writing your own sparse doc-values encoding may be beneficial.

Document deletions are a form of in-place document updates!

The mechanism by which Lucene updates DocValues fields is very similar to how document deletions are handled, so let’s look at how that’s done first. When you issue e.g. a deleteDocument(term) request, Lucene caches the request until an updated SegmentReader is needed, or the index is committed. It then resolves all the issued deletes to determine which documents are affected, both in already flushed segments as well as documents that were in IndexWriter’s RAM buffer when the delete was issued.

Up-to-date SegmentReaders are needed in two cases: when a merge kicks off (SegmentMerger uses SegmentReaders to merge segments) and when the application reopens its NRT (near-real-time) reader.

If an updated reader is needed, Lucene carries the deletes in memory by turning off a bit in the in-memory bit vector of each affected segment, and avoids flushing the bit vector to the index directory, which may reside on a slow storage device (e.g. NFS share). If the application also has an NRT reader open, the bit vector is cloned (because the app’s reader should not be made aware of the new deletes until it’s reopened), leading to double RAM consumption by those vectors. However, once that cost is paid once, it is "fixed" (1 bit/doc, or maxDoc/8 bytes overall), no matter how many documents are later deleted. Also, as soon as the application releases its NRT reader (usually after reopening it), one of the vectors is released and can be reclaimed by the GC.

Updatable DocValues

Like document deletions, when the application issues e.g. an updateNumericDocValue(term, field, value) request, Lucene caches the request until an updated reader is needed or the index is committed, at which point it resolves the updates against all affected documents. Unlike deletes though, doc-values consume much more RAM, which makes carrying the updates in-memory not feasible for many applications: numeric doc-values may potentially consume 8 bytes per-document, while binary values are usually even more expensive.

Therefore, when the updates are resolved, Lucene currently rewrites the entire field that was updated to the index directory. So for example, if you update a numeric doc-values field of a document in a 1 million documents segment, Lucene rewrites 1 million numeric values, or 8MB (at most). Because Lucene segments are write-once, the updated field is written to a gen’d doc-values file of the segment. A reader then reads the values of a field from the respective gen’d doc-values file. While this has a highish cost during indexing, it does not affect search at all, because the values are read from a single file, and often this is a tradeoff that most applications are willing to make.

To illustrate, the code below indexes several documents with "date" and "ratings" NumericDocValuesFields and then lists the files in the index directory:

IndexWriterConfig conf = new IndexWriterConfig(...);
conf.setUseCompoundFile(false);
conf.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(dir, conf);
ReaderManager manager = new ReaderManager(writer, true);
for (int i = 0; i < 5; i++) {
  Document doc = new Document();
  doc.add(new StringField("id", ""+i, Store.NO));
  doc.add(new NumericDocValuesField("date", i));
  doc.add(new NumericDocValuesField("ratings", i*10));
  writer.addDocument(doc);
}

manager.maybeRefresh(); // triggers a flush
listDocValuesFiles(dir);
     
// update the 'date' field's value of document '0'
writer.updateNumericDocValue(
  new Term("id", "0"), "date", System.currentTimeMillis());
manager.maybeRefresh(); // triggers applying the update
listDocValuesFiles(dir);

You will see the following print:

files:
  _0.dat

files:
  _0_1.dat
  _0.dat

Notice how after the "date" field was updated and NRT reader re-opened, the index directory contains two sets of doc-values files: _0.dat stores both the "date" and "ratings" values and _0_1.dat stores just the "date" field’s values (the _1 part in the file name is the doc-values generation for the "date" field). Since the code uses SimpleTextCodec, we can easily print the contents of each file. Below is such print, clipped to show just the header of the files:

file: _0_1.dat
field date
  type NUMERIC

file: _0.dat
field date
  type NUMERIC
field ratings
  type NUMERIC

If we update the value of the "date" field for the same or any other document and then list the doc-values files in the index, we will see:

files:
  _0_2.dat
  _0.dat

Notice how _0_1.dat no longer exists in the index - that’s because it only contains values for the "date" field, which was entirely rewritten by the second update (under generation 2), and so _0_1.dat can be deleted.

Stacked Segments

On LUCENE-4258 we’ve started to implement field updates taking a stacked-segments approach. A stacked segment can be thought of as a layer/filter onto an existing segment, which indexes only the updates to that segment. When the full view of the segment is required (e.g. during search), the segment "collapses" all the updates down, so that the most up-to-date view of the segment is presented to the application.

The approach taken to implement updatable doc-values is similar in concept, only the doc-values "stacks" index the values of all documents. If we were to implement the approach taken on LUCENE-4258, then the doc-values stacks would index only the values of the updated documents. While this could have improved the indexing performance (writing a single value ought to be faster than writing 1 million values, no matter how fast your IO system is), it would affect search time since in order to resolve the value of a document, we would need to look for its most up-to-date value in any of the stacks. There are efficient ways to do this, but as usual, each comes with its own tradeoffs.

The key point is that stacked segments have a tradeoff between indexing and search and it is important to measure that tradeoff, as in Lucene, we often prefer to tradeoff in favor of search. I will publish in a separate post some benchmark numbers of the current approach, which will also serve as a baseline for measuring alternative approaches in the future.