Code Reduction: Indexing Tagged Data with Lucene (part 4)

Indexing single-word annotations is cool, but will likely not be enough. Take the following text (with crazy-colored animals): quick rosy brown fox and a pale violet red dog. Our simple ColorAnnotator only accepts single color words, such as "brown", "red" and "violet". It cannot detect "rosy brown" and "pale violet red" (and well, that is just unacceptable, isn't it!).

Annotating multiple words together can be done either before indexing, or during indexing by implementing another TokenFilter. In this post I focus on how to index those annotations, so let's assume that our indexer receives the annotation markers together with the text to index. For simplicity, let's also assume that the information is given to us in an array of integer pairs, where the first integer denotes the start (word) position of the annotation in the text, counting from 0, and the second denotes the annotation's length in words. For example, for the text above we might get the following color annotation markers: [1,2,2,1,6,3,7,1,8,1].

quick rosy brown fox and a pale violet red dog
      ---------- (1,2)     --------------- (6,3)
           ----- (2,1)          ------     (7,1)
                                       --- (8,1)
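
To make the marker format concrete, here is a small sketch in plain Java (no Lucene involved) that decodes the flat array into the words each pair covers. The class and method names are hypothetical and only serve the illustration; it also assumes the text is split on single spaces, which is a simplification of what a real tokenizer does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper, not part of the post's code: decodes (start, length) marker pairs. */
final class MarkerDecoder {

  /** Returns the words covered by each (start, length) pair, in input order. */
  static List<String> coveredPhrases(String text, int... markers) {
    if (markers.length % 2 != 0) {
      throw new IllegalArgumentException("markers must be (start, length) pairs");
    }
    // Assumption: words are separated by single spaces, positions are 0-based.
    final String[] words = text.split(" ");
    final List<String> phrases = new ArrayList<>();
    for (int i = 0; i < markers.length; i += 2) {
      final int start = markers[i];
      final int length = markers[i + 1];
      phrases.add(String.join(" ", Arrays.copyOfRange(words, start, start + length)));
    }
    return phrases;
  }

  public static void main(String[] args) {
    System.out.println(coveredPhrases(
        "quick rosy brown fox and a pale violet red dog",
        1, 2, 2, 1, 6, 3, 7, 1, 8, 1));
  }
}
```

For the example text, the pairs decode to "rosy brown", "brown", "pale violet red", "violet" and "red", which matches the diagram above.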

To index these annotations we can choose between two approaches. Both index each word that is covered by an annotation marker, but they index the _any_ term differently. The first (and simpler) approach indexes _any_ at the same position as every annotated term, while the second represents it as a single token which spans multiple positions, by encoding the annotation length in Lucene's PayloadAttribute.

`SimplePreAnnotatedTokenFilter`

The filter takes annotation markers as input (as described above), and filters out tokens that do not fall within any annotation. It extends FilteringTokenFilter and only accepts or rejects tokens; it does not modify their attributes or inject new terms (hence it's called Simple). Let's go through its code:

public final class SimplePreAnnotatedTokenFilter extends FilteringTokenFilter {

  private final PositionIncrementAttribute posIncrAtt = ...;
  private final int[] markers;

  private int absPosition;
  private int curStart;
  private int curEnd;
  private int pairIdx;

  public SimplePreAnnotatedTokenFilter(TokenStream input, int... markers) {
    super(input);
    this.markers = getSortedFilteredMarkers(markers);                 // (1)
  }

  @Override
  protected boolean accept() throws IOException {
    absPosition += posIncrAtt.getPositionIncrement();
    return acceptCurrentToken();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    absPosition = -1;
    pairIdx = 0;
    updateCurrentStartEnd();
  }

  private boolean acceptCurrentToken() {                                // (2)
    if (absPosition < curStart) {
      return false;
    }
    if (absPosition > curEnd) {
      pairIdx += 2;
      if (pairIdx < markers.length) {
        updateCurrentStartEnd();
        return acceptCurrentToken();                                    // (3)
      }
      // No more annotation markers                                     // (4)
      curStart = Integer.MAX_VALUE;
      curEnd = Integer.MAX_VALUE;
      return false;
    }
    return true;
  }

  private void updateCurrentStartEnd() {
    curStart = markers[pairIdx];
    curEnd = curStart + markers[pairIdx + 1] - 1;
  }

}

(1): Some annotation markers may be covered by others. E.g. in the example above, both (7,1) and (8,1) are covered by (6,3) and are therefore redundant.
(2): Checks if the current token's position is covered by any annotation marker.
(3): Since we move through the markers sequentially, if the current token appears after the current annotation marker, continue to search for other markers that may cover it.
(4): All annotation markers were visited, therefore no additional tokens can be accepted. Since we extend FilteringTokenFilter, we can only accept/reject a token, but we cannot short-circuit the entire stream, so we will just ignore all future tokens.
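
The post does not show getSortedFilteredMarkers, so the following is only a sketch of one way it could behave, under the assumption that "filtered" means dropping every marker that is fully covered by another one. It is plain Java with no Lucene dependency, and the enclosing class name is made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; the real getSortedFilteredMarkers is not shown in the post. */
final class MarkerUtil {

  /** Sorts (start, length) pairs by start and drops pairs fully covered by another pair. */
  static int[] getSortedFilteredMarkers(int... markers) {
    if (markers.length % 2 != 0) {
      throw new IllegalArgumentException("markers must be (start, length) pairs");
    }
    final List<int[]> pairs = new ArrayList<>();
    for (int i = 0; i < markers.length; i += 2) {
      pairs.add(new int[] { markers[i], markers[i + 1] });
    }
    // Sort by start position; for equal starts, the longer marker comes first.
    pairs.sort((a, b) -> a[0] != b[0]
        ? Integer.compare(a[0], b[0])
        : Integer.compare(b[1], a[1]));

    final int[] result = new int[markers.length];
    int size = 0;
    int maxEnd = Integer.MIN_VALUE; // exclusive end of the furthest-reaching kept marker
    for (int[] pair : pairs) {
      final int end = pair[0] + pair[1];
      // Skip empty markers and markers that an earlier (or equal) start already covers.
      if (pair[1] <= 0 || end <= maxEnd) {
        continue;
      }
      result[size++] = pair[0];
      result[size++] = pair[1];
      maxEnd = end;
    }
    return Arrays.copyOf(result, size);
  }
}
```

With the markers from the example, [1,2,2,1,6,3,7,1,8,1], this keeps only (1,2) and (6,3). Markers that overlap without one covering the other are both kept, which the sequential scan in acceptCurrentToken() can still handle.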

Let's index a few documents using this token filter, inspect the indexed terms, and try out some searches:

private void addDocument(IndexWriter writer, String text, int... colorAnnotations) {
  final Tokenizer tokenizer = ...;
  final TeeSinkTokenFilter textStream = ...;
  final TokenStream colorsStream = 
      new AnyAnnotationTokenFilter(                                // (1)
          new SimplePreAnnotatedTokenFilter(                       // (2)
              textStream.newSinkTokenStream(), colorAnnotations));
  ...
}

addDocument(writer, "quick rosy brown fox and a pale violet red dog", 
                    1, 2, 2, 1, 6, 3, 7, 1, 8, 1);
addDocument(writer, "only red dog", 1, 1);
addDocument(writer, "man with red pale face", 2, 1);

(1): As before, add the _any_ token in addition to all annotated color tokens.
(2): Our SimplePreAnnotatedTokenFilter, taking the input stream and annotation markers.

Let's print the information about the terms of the "color" field:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=5, pos=[1, 2, 6, 7, 8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  brown
    doc=0, freq=1, pos=[2]
  pale
    doc=0, freq=1, pos=[6]
  red
    doc=0, freq=1, pos=[8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  rosy
    doc=0, freq=1, pos=[1]
  violet
    doc=0, freq=1, pos=[7]

As you can see, besides the clear color terms ("red", "brown", "violet"), the "color" field also contains tokens that are not colors on their own, but do appear as part of the color annotations. This is useful as it allows us to, e.g., search for pale or rosy colors (color:(pale OR rosy)). Also notice that the _any_ term appears in all documents, at the same positions as the terms that were annotated as colors. This allows us to search for, e.g., colored foxes, using this code:

final SpanQuery anyColor = 
    new SpanTermQuery(new Term(COLOR_FIELD, ANY_ANNOTATION_TERM));
final SpanQuery colorAsText = new FieldMaskingSpanQuery(anyColor, TEXT_FIELD);
final SpanQuery fox = new SpanTermQuery(new Term(TEXT_FIELD, "fox"));
final SpanQuery coloredFox = 
    new SpanNearQuery(new SpanQuery[] { colorAsText, fox }, 0, true);

Notice that I set slop=0 on the SpanNearQuery, which makes it behave just like a phrase query. In this case our phrase query is "color:_any_ text:fox", only we cannot parse phrase queries across fields, so SpanNearQuery gives us what we need. Executing this query gives the following:

Searching for [spanNear([mask(color:_any_) as text, text:fox], 0, true)]:
  doc=0, text=quick rosy brown fox and a pale violet red dog

This document is found because the term _any_ appears at position 2 (same as brown), and since the term fox appears at position 3, these two follow each other and are considered a phrase. We now start to see the power of indexing multi-word annotations!

Indexing multi-word annotations as a single term

One downside of indexing an _any_ token for each original annotated term is that if you have very long annotations, this might balloon your index more than you need, as well as impact query performance. Instead, we can index a single _any_ term for each annotation, with enough information to tell us the length of that annotation.

To achieve that, let's implement PreAnnotatedTokenFilter. Like its simpler cousin, it takes an array of annotation markers and filters out terms that are not covered by any of the markers. Unlike it though, it also handles adding the _any_ term, and therefore does not extend FilteringTokenFilter. Let's take a look at its code:

public final class PreAnnotatedTokenFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";                 // (1)

  private static final int MAX_BYTES_IN_VINT = 5;

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;
  private final PayloadAttribute payloadAtt = ...;
  private final int[] markers;

  private final BytesRef payloadBytes = new BytesRef(MAX_BYTES_IN_VINT);    // (2)
  private final ByteArrayDataOutput out = ...;

  private int skippedPositions;
  private int absPosition;
  private int curStart;
  private int curEnd;
  private int pairIdx;

  private State state = null;

  public PreAnnotatedTokenFilter(TokenStream input, int... markers) {
    super(input);
    this.markers = getSortedFilteredMarkers(markers);
  }

  /** Update the payload attribute for the {@link #ANY_ANNOTATION_TERM}. */
  private void outputAnyTerm() throws IOException {                         // (3)
    state = captureState();
    termAtt.setEmpty().append(ANY_ANNOTATION_TERM);
    out.reset(payloadBytes.bytes);
    out.writeVInt(curEnd - curStart + 1);
    payloadBytes.length = out.getPosition();
    payloadAtt.setPayload(payloadBytes);
  }

  private void outputFirstAnnotatedTerm() {                                 // (4)
    restoreState(state);
    state = null;
    // Output first annotated term at same position as ANY_ANNOTATION_TERM
    posIncrAtt.setPositionIncrement(0);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (state != null) {
      outputFirstAnnotatedTerm();
      return true;
    }

    skippedPositions = 0;
    while (input.incrementToken()) {
      final int posIncr = posIncrAtt.getPositionIncrement();
      absPosition += posIncr;
      if (acceptCurrentToken()) {                                           // (5)
        posIncrAtt.setPositionIncrement(posIncr + skippedPositions);
        // Output the ANY_ANNOTATION_TERM term first
        if (absPosition == curStart) {
          outputAnyTerm();
        }
        return true;
      }
      skippedPositions += posIncr;
    }
    return false;
  }

  ...

}

(1): We need to handle the _any_ term ourselves now, because we index it differently.
(2): The payload of the _any_ term holds the length of the annotation as VInt.
(3): When the filter outputs the _any_ term, it sets the payloadAtt value to the length of the annotation. Before it does so though, it captures the state of the token stream, so that whatever attributes it changes can later be restored.
(4): Restore the state of the stream to the one before the _any_ term. Also set the position increment of the term to 0, so that it is output at the same position as the _any_ term.
(5): acceptCurrentToken() is identical to the implementation of the simple version.
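
To show why five bytes are enough for the payload buffer, here is a standalone sketch of a variable-length int encoding of the kind used for VInts: seven data bits per byte, low-order bits first, with the high bit marking that more bytes follow. The class and method names are made up for the example; in the filter itself the encoding is done by ByteArrayDataOutput.writeVInt, not by this code.

```java
/** Illustrative VInt codec in plain Java; not Lucene's own classes. */
final class VIntSketch {

  static final int MAX_BYTES_IN_VINT = 5; // 32 bits / 7 bits per byte, rounded up

  /** Encodes value into buf starting at index 0 and returns the number of bytes written. */
  static int writeVInt(byte[] buf, int value) {
    int pos = 0;
    while ((value & ~0x7F) != 0) {
      // More than 7 significant bits remain: emit the low 7 bits with the continuation bit set.
      buf[pos++] = (byte) ((value & 0x7F) | 0x80);
      value >>>= 7;
    }
    buf[pos++] = (byte) value;
    return pos;
  }

  /** Decodes a value previously written by writeVInt at index 0 of buf. */
  static int readVInt(byte[] buf) {
    int value = 0;
    int shift = 0;
    int pos = 0;
    byte b;
    do {
      b = buf[pos++];
      value |= (b & 0x7F) << shift;
      shift += 7;
    } while ((b & 0x80) != 0);
    return value;
  }
}
```

Annotation lengths are typically small, so in practice the payload of each _any_ position takes a single byte; only lengths of 128 or more need a second one.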

So, it's slightly more involved, but hopefully still understandable. The main difference is the way this filter outputs the _any_ term, and because it adds a PayloadAttribute to that term only, it needs to be more careful with the state of the stream, and therefore captures and restores it. But other than that, it outputs each term that is covered by an annotation, so we should still be able to do everything that its simpler version allowed us. Let's take a look at the indexed information for the "color" field:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=2
      pos=1, payload=[2]
      pos=6, payload=[3]
    doc=1, freq=1
      pos=1, payload=[1]
    doc=2, freq=1
      pos=2, payload=[1]
  brown
    doc=0, freq=1, pos=[2]
  pale
    doc=0, freq=1, pos=[6]
  red
    doc=0, freq=1, pos=[8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  rosy
    doc=0, freq=1, pos=[1]
  violet
    doc=0, freq=1, pos=[7]

First, notice that the indexed information of the annotated terms is identical to the simple version of the filter. The only difference is how the _any_ term is indexed. You can see it has freq=2 (as opposed to 5 before), which exactly matches the number of "color" annotation markers in the document (after removing redundant ones). You'll also notice that it appears at positions 1 and 6 (the start of the annotation markers), and the payload of each position captures the length of the annotation. Nice, we were able to capture the same information in the index with fewer term positions, using Lucene's payloads!

However, if we try to search for colored foxes using the same search code from above, we will get no results. The reason is that SpanTermQuery defines the end position of the span to be 1+span.start() (which is what it should be for normal terms). Therefore, the terms _any_ (which appears at position 1) and fox (which appears at position 3) do not form a phrase. Fortunately, we can quite easily extend Lucene's SpanTermQuery to a MultiPositionSpanQuery, and override the end position to be span.start()+payloadValue. We then can search for colored foxes using this code:

final SpanQuery anyColor = 
    new MultiPositionSpanQuery(new Term(COLOR_FIELD, ANY_ANNOTATION_TERM));
final SpanQuery colorAsText = new FieldMaskingSpanQuery(anyColor, TEXT_FIELD);
final SpanQuery fox = new SpanTermQuery(new Term(TEXT_FIELD, "fox"));
final SpanQuery coloredFox = 
    new SpanNearQuery(new SpanQuery[] { colorAsText, fox }, 0, true);

Which, as expected, yields the following:

Searching for [spanNear([mask(color:_any_) as text, text:fox], 0, true)]:
  doc=0, text=quick rosy brown fox and a pale violet red dog
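
The effect of the span's end position on a slop=0 match can be sketched without Lucene. The code below is not the MultiPositionSpanQuery implementation (which the post does not show); it is a hypothetical model in which a span covers positions [start, end) and two spans form a phrase only when the second starts exactly where the first ends.

```java
/** Hypothetical model of slop=0 span matching; does not use Lucene's span classes. */
final class SpanEndSketch {

  /**
   * Returns true if some _any_ span is immediately followed by the next term.
   *
   * @param anyStarts         start positions of the _any_ term
   * @param anyLengths        annotation lengths, as stored in the payloads
   * @param nextTermPositions positions of the term that should follow (e.g. "fox")
   * @param usePayloadLength  false models end = start + 1, true models end = start + payload
   */
  static boolean matchesSlopZero(
      int[] anyStarts, int[] anyLengths, int[] nextTermPositions, boolean usePayloadLength) {
    for (int i = 0; i < anyStarts.length; i++) {
      final int end = anyStarts[i] + (usePayloadLength ? anyLengths[i] : 1);
      for (int pos : nextTermPositions) {
        if (pos == end) {
          return true;
        }
      }
    }
    return false;
  }
}
```

For the first document, _any_ starts at positions 1 and 6 with lengths 2 and 3, and fox is at position 3. With end = start + 1 the spans end at 2 and 7, so nothing lines up with fox; with end = start + payload they end at 3 and 9, and the first one is followed directly by fox.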

Which approach to choose?

So first, it's important to note that both approaches index the original annotated terms under the annotation field ("color" in this example). If we only indexed the _any_ term, then searching for a brown fox would translate into something like (text:brown WITHIN color:_any_) NEAR text:fox, which means that we now need to traverse 3 postings lists ("text:brown", "color:_any_" and "text:fox") which is of course more expensive.

The difference then is in how the _any_ term is indexed. Which approach is better for indexing it really depends on the nature of the index, annotations and search. For instance, if all your annotation markers span one position, then SimplePreAnnotatedTokenFilter is likely to perform better as it does not have to index or read the payload. If the annotation markers however span many positions, then PreAnnotatedTokenFilter is likely to perform better, since it indexes less data for the _any_ term.
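
As a rough way to compare the two approaches for a given document, one can count how many _any_ positions each would index. The sketch below assumes the markers are already sorted and filtered (start, length) pairs; the class and method names are invented for the example, and the counts ignore the extra payload bytes that the second approach stores per position.

```java
import java.util.BitSet;

/** Hypothetical cost estimate for the _any_ term; counts positions only. */
final class AnyTermCost {

  /** Simple approach: one _any_ position per distinct word position covered by a marker. */
  static int simplePositions(int... markers) {
    final BitSet covered = new BitSet();
    for (int i = 0; i < markers.length; i += 2) {
      // Marks the range [start, start + length) as covered.
      covered.set(markers[i], markers[i] + markers[i + 1]);
    }
    return covered.cardinality();
  }

  /** Payload approach: one _any_ position per marker, each carrying a length payload. */
  static int payloadPositions(int... markers) {
    return markers.length / 2;
  }
}
```

For the markers (1,2) and (6,3) of the first document this gives 5 positions for the simple approach and 2 for the payload approach, matching the freq values in the two term listings. For single-word markers both counts are equal, and the payload approach only adds the payload on top.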

If your search interface does not make frequent use of the _any_ term (i.e. the queries are usually specific), then I guess that unless you know that your annotations span a single position, I would go with PreAnnotatedTokenFilter, as it is likely to index less data overall.

Lucene's TokenStream is a very powerful extension point. Knowing how to take advantage of it can greatly help in implementing rich and sophisticated search solutions!

Monday, January 11, 2016
