Monday, January 11, 2016

Indexing Tagged Data with Lucene (part 4)

Indexing single-word annotations is cool, but will likely not be enough. Let's take the following text (with crazy-colored animals): quick rosy brown fox and a pale violet red dog. Our simple ColorAnnotator only accepts single color words, such as "brown", "red" and "violet", however, it cannot detect "rosy brown" and "pale violet red" (and well, that is just unacceptable, isn't it!).

Annotating multiple words together can be done either before indexing, or during indexing by implementing another TokenFilter. In this post I focus on how to index those annotations, and therefore let's assume that our indexer receives the annotation markers together with the text to index. For simplicity, let's also assume that the information is given to us in an array of integer pairs, where the first integer denotes the start (word) position of the annotation in the text, and the second denotes the annotation's length. For example, for the text above we might get the following color annotation markers: [1,2,2,1,6,3,7,1,8,1].

quick rosy brown fox and a pale violet red dog
      ---------- (1,2)     --------------- (6,3)
           ----- (2,1)          ------     (7,1)
                                       --- (8,1)

To index these annotations we can choose between two approaches. Both index each word that is covered by an annotation marker, however they index the _any_ term differently. The first (and simpler one) will index it at every position as the original terms, while the second approach will represent it as a single token which spans multiple positions. The way to do that is to encode the annotation length in Lucene's PayloadAttribute.

SimplePreAnnotatedTokenFilter

The filter takes annotation markers as input (as described above), and filters tokens that do not fall within any annotation. It extends FilteringTokenFilter and only accepts or rejects tokens, but does not modify their attributes or injects new terms (hence its called Simple). Let's go through its code:

public final class SimplePreAnnotatedTokenFilter extends FilteringTokenFilter {

  private final PositionIncrementAttribute posIncrAtt = ...;
  private final int[] markers;

  private int absPosition;
  private int curStart;
  private int curEnd;
  private int pairIdx;

  public SimplePreAnnotatedTokenFilter(TokenStream input, int... markers) {
    super(input);
    this.markers = getSortedFilteredMarkers(markers);                 // (1)
  }

  @Override
  protected boolean accept() throws IOException {
    absPosition += posIncrAtt.getPositionIncrement();
    return acceptCurrentToken();
  }

  @Override
  public void reset() throws IOException {
    super.reset();
    absPosition = -1;
    pairIdx = 0;
    updateCurrentStartEnd();
  }

  private boolean acceptCurrentToken() {                                // (2)
    if (absPosition < curStart) {
      return false;
    }
    if (absPosition > curEnd) {
      pairIdx += 2;
      if (pairIdx < markers.length) {
        updateCurrentStartEnd();
        return acceptCurrentToken();                                    // (3)
      }
      // No more annotation markers                                     // (4)
      curStart = Integer.MAX_VALUE;
      curEnd = Integer.MAX_VALUE;
      return false;
    }
    return true;
 }

  private void updateCurrentStartEnd() {
    curStart = markers[pairIdx];
    curEnd = curStart + markers[pairIdx + 1] - 1;
  }

}
(1)
Some annotation markers may be covered by others. E.g. in the example above, both (7,1) and (8,1) are covered by (6,3) and therefore are redundant to keep.
(2)
Checks if the current token's position is covered by any annotation marker.
(3)
Since we move through the markers sequentially, if the current token appears after the current annotation marker, continue to search for other markers that may cover it.
(4)
All annotation markers were visited, therefore no additional tokens can be accepted. Since we extend FilteringTokenFilter, we can only accept/reject a token, but we cannot short-circuit the entire stream, so we will just ignore all future tokens.

Let's index few documents using this token filter and inspect the indexed terms as well try out some searches:

private void addDocument(IndexWriter writer, String text, int... colorAnnotations) {
  final Tokenizer tokenizer = ...;
  final TeeSinkTokenFilter textStream = ...;
  final TokenStream colorsStream = 
      new AnyAnnotationTokenFilter(                                // (1)
          new SimplePreAnnotatedTokenFilter(                       // (2)
              textStream.newSinkTokenStream(), colorAnnotations));
  ...
}

addDocument(writer, "quick rosy brown fox and a pale violet red dog", 
                    1, 2, 2, 1, 6, 3, 7, 1, 8, 1);
addDocument(writer, "only red dog", 1, 1);
addDocument(writer, "man with red pale face", 2, 1);
(1)
As before, add the _any_ token in addition to all annotated color tokens.
(2)
Our SimplePreAnnotatedTokenFilter, taking the input stream and annotation markers.

Let's print the information about the terms of the "color" field:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=5, pos=[1, 2, 6, 7, 8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  brown
    doc=0, freq=1, pos=[2]
  pale
    doc=0, freq=1, pos=[6]
  red
    doc=0, freq=1, pos=[8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  rosy
    doc=0, freq=1, pos=[1]
  violet
    doc=0, freq=1, pos=[7]

As you can see, besides just the clear color terms ("red", "brown", "violet"), the "color" field also contains tokens that are not colors, but do appear as part of the color annotations. This is useful as it allows us to e.g. search for pale or rosy colors (color:(pale OR rosy)). Also, you can notice that the _any_ term appears in all documents and at the same position of other terms that were annotated as colors. This allows us to search for e.g. colored foxes, using this code:

final SpanQuery anyColor = 
    new SpanTermQuery(new Term(COLOR_FIELD, ANY_ANNOTATION_TERM));
final SpanQuery colorAsText = new FieldMaskingSpanQuery(anyColor, TEXT_FIELD);
final SpanQuery fox = new SpanTermQuery(new Term(TEXT_FIELD, "fox"));
final SpanQuery coloredFox = 
    new SpanNearQuery(new SpanQuery[] { colorAsText, fox }, 0, true);

You can notice how I set slop=0 on the SpanNearQuery, which is just like executing a phrase query. In this case our phrase query is "color:_any_ text:fox", only we cannot parse phrase queries across fields, so SpanNearQuery gives us what we need. Executing this query gives the following:

Searching for [spanNear([mask(color:_any_) as text, text:fox], 0, true)]:
  doc=0, text=quick rosy brown fox and a pale violet red dog

This document is found because the term _any_ appears at position 2 (same as brown), and since the term fox appears at position 3, these two follow each other and are considered a phrase. We now start to see the power of indexing multi-word annotations!

Indexing multi-word annotations as a single term

One down side of indexing an _any_ token for each original annotated term, is that if you have very long annotations, this might balloon your index more than you need, as well impact query performance. Instead, we can index a single _any_ term for each annotation, with enough information that will tell us the length of that annotation.

To achieve that, let's implement PreAnnotatedTokenFilter. Like its simpler cousin, it takes an array of annotation markers and filters terms that are not covered by any of the markers. Unlike it though, it also handles adding the _any_ term, and therefore does not extend FilteringTokenFilter. Let's take a look at its code:

public final class PreAnnotatedTokenFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";                 // (1)

  private static final int MAX_BYTES_IN_VINT = 5;

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;
  private final PayloadAttribute payloadAtt = ...;
  private final int[] markers;

  private final BytesRef payloadBytes = new BytesRef(MAX_BYTES_IN_VINT);    // (2)
  private final ByteArrayDataOutput out = ...;

  private int skippedPositions;
  private int absPosition;
  private int curStart;
  private int curEnd;
  private int pairIdx;

  private State state = null;

  public PreAnnotatedTokenFilter(TokenStream input, int... markers) {
    super(input);
    this.markers = getSortedFilteredMarkers(markers);
  }

  /** Update the payload attribute for the {@link #ANY_ANNOTATION_TERM}. */
  private void outputAnyTerm() throws IOException {                         // (3)
    state = captureState();
    termAtt.setEmpty().append(ANY_ANNOTATION_TERM);
    out.reset(payloadBytes.bytes);
    out.writeVInt(curEnd - curStart + 1);
    payloadBytes.length = out.getPosition();
    payloadAtt.setPayload(payloadBytes);
  }

  private void outputFirstAnnotatedTerm() {                                 // (4)
    restoreState(state);
    state = null;
    // Output first annotated term at same position as ANY_ANNOTATION_TERM
    posIncrAtt.setPositionIncrement(0);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (state != null) {
      outputFirstAnnotatedTerm();
      return true;
    }

    skippedPositions = 0;
    while (input.incrementToken()) {
      final int posIncr = posIncrAtt.getPositionIncrement();
      absPosition += posIncr;
      if (acceptCurrentToken()) {                                           // (5)
        posIncrAtt.setPositionIncrement(posIncr + skippedPositions);
        // Output the ANY_ANNOTATION_TERM term first
        if (absPosition == curStart) {
          outputAnyTerm();
        }
        return true;
      }
      skippedPositions += posIncr;
    }
    return false;
  }

  ...

}
(1)
We need to handle the _any term ourselves now, because we index it differently.
(2)
The payload of the _any_ term holds the length of the annotation as VInt.
(3)
When the filter outputs the _any_ term, it sets the payloadAtt value to the length of the annotation. Before it does so though, it captures the state of the token stream, so that whatever attributes it changes can later be restored.
(4)
Restore the state of the stream to the one before the _any_ term. Also sets the position increment of the term to 0, so that it is output at the same position of the _any_ term.
(5)
acceptCurrentToken() is identical to the implementation of the simple version.

So, it's slightly more involved, but hopefully still understandable. The main difference is the way this filter outputs the _any_ term, and because it adds a PayloadAttribute to that term only, it needs to be more careful with the state of the stream, and therefore captures and restores it. But other than that, it outputs each term that is covered by an annotation, so we should still be able to do everything that its simpler version allowed us. Let's take a look at the indexed information for the "color" field:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=2
      pos=1, payload=[2]
      pos=6, payload=[3]
    doc=1, freq=1
      pos=1, payload=[1]
    doc=2, freq=1
      pos=2, payload=[1]
  brown
    doc=0, freq=1, pos=[2]
  pale
    doc=0, freq=1, pos=[6]
  red
    doc=0, freq=1, pos=[8]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[2]
  rosy
    doc=0, freq=1, pos=[1]
  violet
    doc=0, freq=1, pos=[7]

First, notice that the indexed information of the annotated terms is identical to the simple version of the filter. The only difference is how the _any_ term is indexed. You can see it has freq=2 (as opposed to 5 before) which exactly matches the number of "color" annotation markers in the document (after removing redundant ones). You'll also notice that it appears at positions 1 and 6 (the start of the annotation markers), and the payload of each position captures the length of the annotation. Nice, we were able to capture the same information in the index, with less term positions, and using Lucene's payloads!

However, if we try to search for colored foxes using the same search code from above, we will get no results. The reason is that SpanTermQuery defines the end position of the span to be 1+span.start() (which is what it should be for normal terms). Therefore, the terms _any_ (which appears at position 1) and fox (which appears at position 3) do not form a phrase. Fortunately, we can quite easily extend Lucene's SpanTermQuery to a MultiPositionSpanQuery, and override the end position to be span.start()+payloadValue. We then can search for colored foxes using this code:

final SpanQuery anyColor = 
    new MultiPositionSpanQuery(new Term(COLOR_FIELD, ANY_ANNOTATION_TERM));
final SpanQuery colorAsText = new FieldMaskingSpanQuery(anyColor, TEXT_FIELD);
final SpanQuery fox = new SpanTermQuery(new Term(TEXT_FIELD, "fox"));
final SpanQuery coloredFox = 
    new SpanNearQuery(new SpanQuery[] { colorAsText, fox }, 0, true);

Which, as expected, yields the following:

Searching for [spanNear([mask(color:_any_) as text, text:fox], 0, true)]:
  doc=0, text=quick rosy brown fox and a pale violet red dog

Which approach to choose?

So first, it's important to note that both approaches index the original annotated terms under the annotation field ("color" in this example). If we only indexed the _any_ term, then searching for a brown fox would translate into something like (text:brown WITHIN color:_any_) NEAR text:fox, which means that we now need to traverse 3 postings lists ("text:brown", "color:_any_" and "text:fox") which is of course more expensive.

The difference then is in how the _any_ term is indexed. Which approach is better for indexing it really depends on the nature of the index, annotations and search. For instance, if all your annotation markers span one position, then SimplePreAnnotatedTokenFilter is likely to perform better as it does not have to index or read the payload. If the annotation markers however span many positions, then PreAnnotatedTokenFilter is likely to perform better, since it indexes less data for the _any_ term.

If your search interface does not make frequent usage of the _any_ term (i.e. the queries are usually specific), then I guess that only if you know that your annotations span a single position, I would go with PreAnnotatedTokenFilter, as it is likely to index less data overall.

Lucene's TokenStream is a very powerful extension point. Knowing how to take advantage of it can greatly help in implementing rich and sophisticated search solutions!

Tuesday, January 5, 2016

Indexing Tagged Data with Lucene (part 3)

So far we've seen two approaches for indexing tagged data with Lucene: processing the text for every annotation and using TeeSinkTokenFilter to share the common analysis chain with the annotating filters. We've also seen that AnnotatorTokenFilter adds the terms that were accepted by the annotator to the index, which allows us to search for e.g. color:red or animal:fox.

We have also seen how to search for all documents with colors, by sending the query color:*. However, this query can be very expensive when the "color" field has many terms, as the query is expanded to all the terms in the field. According to this table, there are 300+ named colors. With animals (well, species on earth) the situation is even worse, as our "animals" field could potentially have >8 million tokens. If we executed the query animal:*, it's going to expand into one humongous query...

Adding an _any_ term

Now that we know how to implement a TokenFilter, we can implement one which adds a fixed term, say _any_, to every document that contains a token that is accepted by the annotator. Then, we can simply search for animal:_any_, and our query becomes a simple TermQuery which traverses only one term's indexed list (also referred to as a posting list). Let's just do that:

public final class AnyAnnotationFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;

  private boolean addedOrigTerm = false;

  public AnyAnnotationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!addedOrigTerm) {
      addedOrigTerm = true;                          // (1)
      return input.incrementToken();
    }

    termAtt.setEmpty().append(ANY_ANNOTATION_TERM);  // (2)
    posIncrAtt.setPositionIncrement(0);              // (3)
    addedOrigTerm = false;
    return true;
  }

  @Override
  public void reset() throws IOException {
    addedOrigTerm = false;
    super.reset();
  }
}
(1)
First return the original term.
(2)
Set the token's value to _any_ so that it's added to the index.
(3)
Set the position increment to 0, so that the _any_ term is added at the same position as the original term.

We should now change the code of addDocument() as shown here to wrap the token stream with the new filter:

private void addDocument(IndexWriter writer, String text) {
  ...
  final TokenStream colorsStream = 
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  final TokenStream animalsStream = 
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  ...

Let's print the terms of the "color" field, to see the information of its _any_ term:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=2, pos=[0, 4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]

You can see that the term _any_ is added to every document with colors, and that in doc=0 it was added twice, at the same positions of the color terms ("brown" and "red"). This now allows us to search for color:red AND animals:_any_ (documents with red animals). Actually that query will match any document which contains an animal, and the color "red", but not necessarily "a red animal". We can construct a SpanQuery programmatically to achieve that:

final SpanQuery red = new SpanTermQuery(new Term(COLOR_FIELD, "red"));
final SpanQuery redColorAsAnimal = new FieldMaskingSpanQuery(red, ANIMAL_FIELD);
final SpanQuery anyAnimal = new SpanTermQuery(new Term(ANIMAL_FIELD, "_any_"));
final SpanQuery redAnimals = new SpanNearQuery(
    new SpanQuery[] { redColorAsAnimal, anyAnimal }, 1, true);

What's nice about this query is that we search only on the annotation fields, i.e. that metadata that never explicitly existed in the input text! And indeed, this returns the following results:

Searching for [spanNear([mask(color:red) as animal, animal:_any_], 1, true)]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog

The _any_ term adds more information to the index (i.e. it repeats the indexed information of all annotations). However, in practice this would normally not have a big impact on your index, since it will already contain plenty of other indexed data. But, if you don't expect to execute such global matching queries (animal:*), you can avoid the indexing of the extra term by removing the AnyAnnotationFilter from your analysis chain.

Multi word annotations

So far we have seen how to index single word annotations with Lucene. The techniques that I covered are simple, but not enough if you want to detect and index multi-word annotations, e.g. the color Dark Sea Green. To do that, we will need to use another of Lucene's very useful data structures, the Payload. I will demonstrate how to use it for our multi-word annotations in a follow-up post.

Indexing Tagged Data with Lucene (part 2)

In the previous post I demonstrated how to index tagged data with Lucene. I used an Analyzer which filters out terms that are not accepted/recognized by an Annotator and added the text twice to the document: once for the "text" field and again for the "color" field (where the color-annotated terms were indexed).

I also said that this approach is inefficient when indexing large amounts of data, or when using multiple annotators, and that Lucene allows us to improve that by using its TeeSinkTokenFilter. In this post I will show how to use it, but before I do that, let's index some more data and add an AnimalAnnotator to also detect animals.

private void indexDocs(IndexWriter writer) {
  addDocument(writer, "brown fox and a red dog");
  addDocument(writer, "black and white cow");
  addDocument(writer, "no red animals here");
  writer.commit();
}

private void addDocument(IndexWriter writer, String text) {
  final Document doc = new Document();
  doc.add(new TextField(TEXT_FIELD, text, Store.YES));
  doc.add(new TextField(COLOR_FIELD, text, Store.NO));
  doc.add(new TextField(ANIMAL_FIELD, text, Store.NO));         // (1)
  writer.addDocument(doc);
}

public final class AnimalAnnotatorAnalyzer extends Analyzer {   // (2)
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer tokenizer = new WhitespaceTokenizer();
    final TokenStream stream = new AnnotatorTokenFilter(tokenizer, 
        AnimalAnnotator.withDefaultAnimals());
    return new TokenStreamComponents(tokenizer, stream);
  }
}

private Analyzer createAnalyzer() {
  final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer();
  final Analyzer animalAnnotatorAnalyzer = new AnimalAnnotatorAnalyzer();
  final Analyzer defaultAnalyzer = new WhitespaceAnalyzer();
  return new PerFieldAnalyzerWrapper(defaultAnalyzer,
        ImmutableMap. of(
            COLOR_FIELD, colorAnnotatorAnalyzer,
            ANIMAL_FIELD, animalAnnotatorAnalyzer));            // (3)
}
(1)
Add an "animal" field, in addition to "text" and "color".
(2)
Similar to ColorAnnotatorAnalyzer, only chains AnnotatorFilter with AnimalAnnotator.
(3)
Add AnimalAnnotatorAnalyzer to the PerFieldAnalyzerWrapper.

We can now also print the terms of the "animal" field, and with their position information we get the following:

Terms for field [color], with positional info:
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
Terms for field [animal], with positional info:
  dog
    doc=0, freq=1, pos=[5]
    doc=1, freq=1, pos=[2]
  fox
    doc=0, freq=1, pos=[1]

Now that we have "color" and "animal" annotations, we can explore more fun queries:

// Search for documents with animals and colors
Searching for [+animal:* +color:*]:
  doc=0, text=brown fox and a red dog
  doc=1, text=only red dog

// Search for documents with red animals
Searching for [+animal:* +color:red]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog

Indexing annotations with TeeSinkTokenFilter

TeeSinkTokenFilter is a TokenFilter which allows sharing analysis chains between multiple fields, that at some point need to go their separate ways. Named after Unix's tee command, it pipes the output of the source stream to one or more "sinks", which can then continue processing the token independently of the source as well the other "sinks".

This sounds perfect for our use case! We want to have the input text tokenized by whitespaces (and potentially also lowercased and have stopwords removed) and indexed in a "text" field (for regular searches). We also want to extract colors and animals, and index them in separate fields. With TeeSinkTokenFilter, our addDocument() code changes as follows:

private void addDocument(IndexWriter writer, String text) {
  final Tokenizer tokenizer = new WhitespaceTokenizer();
  tokenizer.setReader(new StringReader(text));
  final TeeSinkTokenFilter textStream = new TeeSinkTokenFilter(tokenizer);    // (1)
  final TokenStream colorsStream = new AnnotatorTokenFilter(                  // (2)
      textStream.newSinkTokenStream(), ColorAnnotator.withDefaultColors());
  final TokenStream animalsStream = new AnnotatorTokenFilter(                 // (2)
      textStream.newSinkTokenStream(), AnimalAnnotator.withDefaultAnimals());

  final Document doc = new Document();
  doc.add(new StoredField(TEXT_FIELD, text));                                 // (3)
  doc.add(new TextField(TEXT_FIELD, textStream));                             // (4)
  doc.add(new TextField(COLOR_FIELD, colorsStream));                          // (4)
  doc.add(new TextField(ANIMAL_FIELD, animalsStream));                        // (4)
  writer.addDocument(doc);
(1)
Create the TeeSinkTokenFilter over WhitespaceTokenizer. You can chain additional TokenFilters for additional analysis (lowercasing, stopwrods removal), and wrap the full chain with TeeSinkTokenFilter.
(2)
Create a SinkTokenStream from textStream, so that it receives all processed tokens and can perform independent analysis on them (here, applying AnnotatorTokenFilter). Notice that we create a separate sink for "colors" and "animals".
(3)
Since we now add the value of the "text" field as a TokenStream, we also need to separately add it as a StoredField (the former does not store the value).
(4)
Add each TokenStream to its respective field.

OK, so definitely the indexing code is not as simple as it was before. We now need to add fields to the document that depend on each other, and therefore we need to write a slightly more involved code. However, we do gain from the input text being processed only once, as opposed to 3 times before (the "text" and two annotation fields).

NOTE: when you work with TeeSinkTokenFilter, you should add the source field (textStream above) before you add the "sink" fields to the document, since it needs to be processed before they are.

Besides the changes to addDocument(), we no longer need to configure IndexWriter with the PerFieldAnalyzerWrapper that we created before. In fact, the way we add documents above, it doesn't matter what analyzer we configure the indexer with, since the indexer will use the pre-built TokenStream that we added the fields with.

Furthermore, if we run the same search code as we did before, we will get the exact same results. That's because we only changed the way the fields are processed, but not the way they are indexed. This is pretty cool!

Conclusions

Lucene does not allow one field to inject tokens into other fields, but with TeeSinkTokenFilter we can overcome that limitation. Each token that is processed by the source, is further sent to the "sinks", who can continue processing it independently of the source and other "sinks" (changes to the stream are only visible to them).

Using TeeSinkTokenFilter does come with a cost though. With the current implementation, every "sink" caches the state of the attributes of every produced token by the source. Furthermore, the "sinks" do not share these states with each other. That means that you might see a memory increase on your heap during indexing time, although these objects may be collected as soon as you're done indexing that document.

This can likely be improved by having the "sinks" share the cached states with each other. But even without it, if you're indexing large amounts of data, and especially if you have an expensive analysis chain that you can share with the "sinks", using TeeSinkTokenFilter might be beneficial to you.

Happy tee'ing!

Sunday, January 3, 2016

Indexing Tagged Data with Lucene

If you want to index the text quick brown fox and a red dog with Lucene, you just add a TextField to a Document and index it with IndexWriter. You can then search for it by sending a query which contains one or more of the indexed words. The following code snippet demonstrates how easy it is:

final String text = "quick brown fox and a red dog";
final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, text, Store.YES);           // (1)
writer.addDocument(doc);                                      // (2)
writer.commit();                                              // (3)

final Query q = new TermQuery(new Term(TEXT_FIELD, "quick")); // (4)
final TopDocs results = searcher.search(q, 10);               // (5)
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
(1)
Add the text to index as a TextField.
(2)
Index the document. This will use the default Analyzer that was configured for IndexWriter to extract and process the field's terms.
(3)
Commit the changes to the index so that they are visible to the searcher. Note, you rarely want to commit() after every indexed document, and in many cases it is preferred to use Lucene's near-realtime search.
(4)
Create a Query to search for the text. Usually you will use a QueryParser to parse a query such as text:quick.
(5)
Execute the search. TopDocs holds the list of matching documents; in the example above only one document.

However, what if you (or your users) are not interested in the direct information that’s embedded in the text, but rather about meta-information such as "documents with red colors" or "documents about foxes (animal)"? Clearly, feeding search queries such as animal:fox or color:red isn’t going to work since we never added "color" nor "animal" fields to the document.

Detecting color terms

Named-entity recognition (also known as Named-entity extraction) is a task that seeks to classify elements in the text into categories. There are various tools that cope with that task, and a quick search for entity extraction tools returns quite a few, as well this discussion. Such extraction techniques and tools are beyond the scope of this post though. Let's assume that we have a simple Annotator interface, which can either accept or reject tokens, with two implementations: a ColorAnnotator which accepts red and brown, and an AnimalAnnotator which accepts fox and dog.

How could we use this annotator to tag our text, such that the index captured the tagged data, and allowed us to find the document by searching for color:red or animal:fox?

An annotating TokenFilter

As mentioned above, when we add text fields to documents, they are processed by the Analyzer that was configured for the indexer. Lucene's Analyzer produces a TokenStream which processes the text and produces index tokens. That token stream comprises a Tokenizer and TokenFilter. The former is responsible for breaking the input text into tokens and the latter is responsible for processing them. Lucene offers a great variety of filters out of the box. Some drop tokens (e.g. stopwords), some replace tokens (e.g. stemming) and others inject additional ones (e.g. synonyms).

You can learn more about Lucene's analysis chain here.

Since we are not after breaking the input text into words, let's write an AnnotatorTokenFilter which emits only words that are accepted by an annotator. Fortunately, Lucene provides a base class for filters that keep only some of the input tokens, called FilteringTokenFilter. It lets us focus only on the task of accepting/rejecting tokens, while it handles the rest of the analysis stuff, which mainly consists of updating token attributes (it can sometimes be quite tricky). By extending it, and using our Annotator interface, we come up with the following implementation:

/**
 * A {@link FilteringTokenFilter} which uses an {@link Annotator} to 
 * {@link #accept()} tokens.
 */
public final class AnnotatorTokenFilter extends FilteringTokenFilter {

  private final CharTermAttribute termAtt = 
      addAttribute(CharTermAttribute.class);                             // (1)
  private final Annotator annotator;

  public AnnotatorTokenFilter(TokenStream input, Annotator annotator) {
    super(input);
    this.annotator = annotator;
  }

  @Override
  protected boolean accept() throws IOException {                        // (2)
    return annotator.accept(termAtt.buffer(), 0, termAtt.length());
  }
}
(1)
The standard way to get access to the term attribute on the chain. Note that we don't need to explicitly populate it, as it's being populated by the downstream tokenizer and additional filters.
(2)
The only method that we need to implement. If our annotator accepts the current token, we retain it on the stream. Otherwise, it will not be indexed.
Lucene also provides a KeepWordFilter which takes a set of words to keep. We could use it by passing a list of words for each category, however in reality a simple list of words may not be enough to extract entities from the text (e.g. consider muli-word colors).

Sweet! It really can't get any simpler than that. Now we just need to build an Analyzer which uses our filter and we can move on to indexing and searching colors:

/**
 * An {@link Analyzer} which chains {@link WhitespaceTokenizer} and 
 * {@link AnnotatingTokenFilter} with {@link ColorAnnotator}.
 */
public static final class ColorAnnotatorAnalyzer extends Analyzer {
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer tokenizer = new WhitespaceTokenizer();           // (1)
    final TokenStream stream = new AnnotatorTokenFilter(tokenizer,   // (2)
        ColorAnnotator.withDefaultColors());
    return new TokenStreamComponents(tokenizer, stream);
  }
}
(1)
For simplicity only, use a whitespace tokenizer.
(2)
You will most likely want to chain additional filters before the annotator, e.g. lower-casing, stemming etc.

Index and search annotated colors

To index the color annotations, we are going to index two fields: "text" and "color". The "text" field will index the original text's tokens while the "color" field will index only the ones that are also colors. Lucene provides PerFieldAnalyzerWrapper which can return a different analyzer per field. It's quite handy and easy to construct:

private static Analyzer createAnalyzer() {
  final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer(); // (1)
  final Analyzer defaultAnalyzer = new WhitespaceAnalyzer();            // (2)
  return new PerFieldAnalyzerWrapper(defaultAnalyzer,
      ImmutableMap.<String, Analyzer> of(
          COLOR_FIELD, colorAnnotatorAnalyzer));                        // (3)
}
(1)
The color annotating analyzer that we created before.
(2)
The default analyzer to be used for all fields without an explicit set analyzer.
(3)
Set the color annotating analyzer to the "color" field.

Our indexing code then changes as follows:

final Analyzer analyzer = createAnalyzer();
final IndexWriterConfig conf = new IndexWriterConfig(analyzer); // (1)
final IndexWriter writer = new IndexWriter(dir, conf);

final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, TEXT, Store.YES));
doc.add(new TextField(COLOR_FIELD, TEXT, Store.NO));            // (2)
writer.addDocument(doc);
writer.commit();
(1)
Configure IndexWriter with our PerFieldAnalyzerWrapper.
(2)
In addition to adding the text to the "text" field, also add it to the "color" field. Notice that there is no point storing it for this field too.

Printing the indexed terms for each field will yield the following output:

Terms for field [text]:
  a
  and
  brown
  dog
  fox
  quick
  red
Terms for field [color]:
  brown
  red

As you can see, the "text" field contains all words from the text that we indexed, while the "color" field contains only the words that were identified as colors by our ColorAnnotator. We can now also search for color:red to get the document back:

final Query q = new TermQuery(new Term(COLOR_FIELD, "red"));
final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));

That prints the following output, which is the value of the "text" field of the document:

quick brown fox and a red dog

Annotated terms' position

If we also print the full indexed information about the "color" terms, we will notice that they retain their original text position:

Terms for field [color], with additional info:
  brown
    doc=0, freq=1
      pos=1
  red
    doc=0, freq=1
      pos=5

This is important since it provides us a direct back-reference to the input text. That way we can search for "foxes that are brown-colored", while if we omit the exact position information of 'brown', we may not be able to associate the token 'fox' with the color 'brown'. Let's demonstrate that capability using Lucene's a SpanQuery:

final SpanQuery brown = new SpanTermQuery(new Term(COLOR_FIELD, "brown")); // (1)
final SpanQuery brownText = new FieldMaskingSpanQuery(brown, TEXT_FIELD);  // (2)
final SpanQuery quick = new SpanTermQuery(new Term(TEXT_FIELD, "quick"));  // (3)
final Query q = new SpanNearQuery(                                         // (4)
    new SpanQuery[] { quick, brownText }, 1, true);

final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
(1)
Matches "brown" in the "color" field.
(2)
Matches "quick" in the "text" field.
(3)
Since SpanNearQuery requires the input SpanQuery-ies to refer to the same field, we mask the "color" one under the "text" field.
(4)
Search for "quick" followed by "brown".

A note about efficiency

In this post I've demonstrated a very basic approach to index and search tagged data with Lucene. As I noted above, the input data is added to a separate field for every annotator that you will apply to it. This may be OK in case you index short texts and only apply few annotators. However, if you index large data, and especially since most likely your analysis chain comprises something more complex than just whitespace tokenization, there will be performance implications to this approach.

In a follow-up post I will demonstrate how to process the input data only once, by using another of Lucene's cool TokenStream implementations, TeeSinkTokenFilter.