If you want to index the text "quick brown fox and a red dog" with Lucene, you just add a TextField to a Document and index it with IndexWriter. You can then search for it by sending a query which contains one or more of the indexed words. The following code snippet demonstrates how easy it is:
final String text = "quick brown fox and a red dog";
final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, text, Store.YES)); // (1)
writer.addDocument(doc); // (2)
writer.commit(); // (3)
final Query q = new TermQuery(new Term(TEXT_FIELD, "quick")); // (4)
final TopDocs results = searcher.search(q, 10); // (5)
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
- (1) Add the text to index as a TextField.
- (2) Index the document. This uses the default Analyzer that was configured for the IndexWriter to extract and process the field's terms.
- (3) Commit the changes to the index so that they are visible to the searcher. Note that you rarely want to commit() after every indexed document; in many cases it is preferable to use Lucene's near-real-time search.
- (4) Create a Query to search for the text. Usually you will use a QueryParser to parse a query such as text:quick.
- (5) Execute the search. TopDocs holds the list of matching documents; in the example above, only one document.
However, what if you (or your users) are not interested in the information embedded directly in the text, but rather in meta-information such as "documents with red colors" or "documents about foxes (animal)"? Clearly, queries such as animal:fox or color:red aren't going to work, since we never added a "color" or an "animal" field to the document.
Detecting color terms
Named-entity recognition (also known as named-entity extraction) is a task that seeks to classify elements of a text into categories. There are various tools that tackle this task, and a quick search for entity extraction tools returns quite a few, as well as this discussion. Such extraction techniques and tools are beyond the scope of this post though. Let's assume that we have a simple Annotator interface, which can either accept or reject tokens, with two implementations: a ColorAnnotator which accepts "red" and "brown", and an AnimalAnnotator which accepts "fox" and "dog".
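The Annotator interface itself is not spelled out in the post. A minimal sketch might look like the following; the accept(char[], int, int) signature matches how the token filter below calls it, but the class names and the shared SetAnnotator base are assumptions for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Accepts or rejects a single token, given as a slice of a char buffer. */
interface Annotator {
  boolean accept(char[] buffer, int offset, int length);
}

/** A simple Annotator backed by a fixed set of accepted words (illustrative). */
class SetAnnotator implements Annotator {
  private final Set<String> accepted;

  SetAnnotator(String... words) {
    accepted = new HashSet<>(Arrays.asList(words));
  }

  @Override
  public boolean accept(char[] buffer, int offset, int length) {
    return accepted.contains(new String(buffer, offset, length));
  }
}

/** Accepts the color terms "red" and "brown". */
class ColorAnnotator extends SetAnnotator {
  ColorAnnotator() { super("red", "brown"); }
  static ColorAnnotator withDefaultColors() { return new ColorAnnotator(); }
}

/** Accepts the animal terms "fox" and "dog". */
class AnimalAnnotator extends SetAnnotator {
  AnimalAnnotator() { super("fox", "dog"); }
}
```

A real annotator would of course wrap an NER library rather than a hard-coded word set; the fixed sets merely keep the example self-contained.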
How could we use this annotator to tag our text, such that the index captured the tagged data, and allowed us to find the document by searching for color:red or animal:fox?
An annotating TokenFilter
As mentioned above, when we add text fields to documents, they are processed by the Analyzer that was configured for the indexer. Lucene's Analyzer produces a TokenStream, which processes the text and produces index tokens. That token stream comprises a Tokenizer and TokenFilters: the former is responsible for breaking the input text into tokens, and the latter for processing them. Lucene offers a great variety of filters out of the box. Some drop tokens (e.g. stopwords), some replace tokens (e.g. stemming) and others inject additional ones (e.g. synonyms).
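Those three behaviors can be illustrated outside Lucene as plain list transformations. This is a toy sketch of the drop/replace/inject ideas, not Lucene API; the method names and the crude suffix-stripping "stemmer" are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FilterBehaviors {
  /** Drop: remove stopwords from the token list. */
  static List<String> drop(List<String> tokens, Set<String> stopwords) {
    List<String> out = new ArrayList<>();
    for (String t : tokens) if (!stopwords.contains(t)) out.add(t);
    return out;
  }

  /** Replace: a crude "es"-suffix stripper standing in for real stemming. */
  static List<String> replace(List<String> tokens) {
    List<String> out = new ArrayList<>();
    for (String t : tokens) out.add(t.endsWith("es") ? t.substring(0, t.length() - 2) : t);
    return out;
  }

  /** Inject: emit each token, plus a synonym alongside it when one is known. */
  static List<String> inject(List<String> tokens, Map<String, String> synonyms) {
    List<String> out = new ArrayList<>();
    for (String t : tokens) {
      out.add(t);
      if (synonyms.containsKey(t)) out.add(synonyms.get(t));
    }
    return out;
  }
}
```

Real Lucene filters do the same kinds of transformations, only incrementally over a TokenStream while maintaining token attributes such as positions and offsets.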
Since breaking the input text into words is already handled by the tokenizer, all we need is an AnnotatorTokenFilter which emits only the words that are accepted by an annotator. Fortunately, Lucene provides a base class for filters that keep only some of the input tokens, called FilteringTokenFilter. It lets us focus solely on accepting or rejecting tokens, while it handles the rest of the analysis work, mainly updating token attributes (which can sometimes be quite tricky). By extending it and using our Annotator interface, we come up with the following implementation:
/**
* A {@link FilteringTokenFilter} which uses an {@link Annotator} to
* {@link #accept()} tokens.
*/
public final class AnnotatorTokenFilter extends FilteringTokenFilter {
private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class); // (1)
private final Annotator annotator;
public AnnotatorTokenFilter(TokenStream input, Annotator annotator) {
super(input);
this.annotator = annotator;
}
@Override
protected boolean accept() throws IOException { // (2)
return annotator.accept(termAtt.buffer(), 0, termAtt.length());
}
}
- (1) The standard way to get access to the term attribute on the chain. Note that we don't need to populate it explicitly; it is populated by the tokenizer and any filters that precede ours in the chain.
- (2) The only method that we need to implement. If our annotator accepts the current token, we keep it on the stream; otherwise it will not be indexed.
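Under the hood, FilteringTokenFilter's incrementToken() keeps pulling tokens from its input and skips the rejected ones. The loop can be modeled in plain Java like this; it is a simplified sketch of the contract, not the real Lucene code, which also adjusts position increments for skipped tokens:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Predicate;

/** Simplified model of FilteringTokenFilter: pull tokens, keep accepted ones. */
class FilteringModel {
  private final Iterator<String> input; // stands in for the upstream TokenStream
  private String current;

  FilteringModel(Iterator<String> input) { this.input = input; }

  /** Like incrementToken(): advance to the next accepted token, if any. */
  boolean incrementToken(Predicate<String> accept) {
    while (input.hasNext()) {      // keep pulling from the input stream
      String token = input.next();
      if (accept.test(token)) {    // the accept() hook that subclasses implement
        current = token;
        return true;               // a token survived; expose it
      }
      // rejected tokens are silently skipped
    }
    return false;                  // input exhausted
  }

  String current() { return current; }
}
```

Feeding it the sample words with a color predicate surfaces "brown" and "red" and nothing else, which is exactly what AnnotatorTokenFilter does on a real TokenStream.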
Sweet! It really can't get any simpler than that. Now we just need to build an Analyzer which uses our filter, and we can move on to indexing and searching colors:
/**
* An {@link Analyzer} which chains {@link WhitespaceTokenizer} and
* {@link AnnotatorTokenFilter} with {@link ColorAnnotator}.
*/
public static final class ColorAnnotatorAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final Tokenizer tokenizer = new WhitespaceTokenizer(); // (1)
final TokenStream stream = new AnnotatorTokenFilter(tokenizer, // (2)
ColorAnnotator.withDefaultColors());
return new TokenStreamComponents(tokenizer, stream);
}
}
- (1) For simplicity only, use a whitespace tokenizer.
- (2) You will most likely want to chain additional filters before the annotator, e.g. lower-casing, stemming etc.
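To see what this chain produces, we can simulate it on the sample text with a plain whitespace split. This is a toy model of WhitespaceTokenizer plus the color filter (real Lucene analysis runs through a TokenStream); the class and method names are invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class AnalysisSimulation {
  /** Returns the surviving color tokens mapped to their original token positions. */
  static Map<String, Integer> colorTokens(String text, Set<String> colors) {
    Map<String, Integer> out = new LinkedHashMap<>();
    String[] tokens = text.split("\\s+"); // whitespace tokenization
    for (int pos = 0; pos < tokens.length; pos++) {
      if (colors.contains(tokens[pos])) {
        out.put(tokens[pos], pos); // keep the token together with its position
      }
    }
    return out;
  }
}
```

Running it on "quick brown fox and a red dog" with the colors {red, brown} keeps only "brown" at position 1 and "red" at position 5, matching the per-field term listings shown in the next section.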
Index and search annotated colors
To index the color annotations, we are going to index two fields: "text" and "color". The "text" field will index the original text's tokens, while the "color" field will index only the ones that are also colors. Lucene provides PerFieldAnalyzerWrapper, which can return a different analyzer per field. It's quite handy and easy to construct:
private static Analyzer createAnalyzer() {
final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer(); // (1)
final Analyzer defaultAnalyzer = new WhitespaceAnalyzer(); // (2)
return new PerFieldAnalyzerWrapper(defaultAnalyzer,
ImmutableMap.<String, Analyzer> of(
COLOR_FIELD, colorAnnotatorAnalyzer)); // (3)
}
- (1) The color annotating analyzer that we created before.
- (2) The default analyzer to use for all fields without an explicitly set analyzer.
- (3) Map the color annotating analyzer to the "color" field.
Our indexing code then changes as follows:
final Analyzer analyzer = createAnalyzer();
final IndexWriterConfig conf = new IndexWriterConfig(analyzer); // (1)
final IndexWriter writer = new IndexWriter(dir, conf);
final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, TEXT, Store.YES));
doc.add(new TextField(COLOR_FIELD, TEXT, Store.NO)); // (2)
writer.addDocument(doc);
writer.commit();
- (1) Configure the IndexWriter with our PerFieldAnalyzerWrapper.
- (2) In addition to adding the text to the "text" field, also add it to the "color" field. Notice that there is no point in storing it for this field too.
Printing the indexed terms for each field will yield the following output:
Terms for field [text]:
a
and
brown
dog
fox
quick
red
Terms for field [color]:
brown
red
As you can see, the "text" field contains all the words from the text that we indexed, while the "color" field contains only the words that were identified as colors by our ColorAnnotator. We can now also search for color:red to get the document back:
final Query q = new TermQuery(new Term(COLOR_FIELD, "red"));
final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
That prints the following output, which is the value of the "text" field of the document:
quick brown fox and a red dog
Annotated terms' position
If we also print the full indexed information about the "color" terms, we will notice that they retain their original text position:
Terms for field [color], with additional info:
brown
doc=0, freq=1
pos=1
red
doc=0, freq=1
pos=5
This is important since it provides a direct back-reference to the input text. That way we can search for "foxes that are brown-colored"; had we omitted the exact position information of 'brown', we might not be able to associate the token 'fox' with the color 'brown'. Let's demonstrate that capability using Lucene's SpanQuery:
final SpanQuery brown = new SpanTermQuery(new Term(COLOR_FIELD, "brown")); // (1)
final SpanQuery brownText = new FieldMaskingSpanQuery(brown, TEXT_FIELD); // (2)
final SpanQuery quick = new SpanTermQuery(new Term(TEXT_FIELD, "quick")); // (3)
final Query q = new SpanNearQuery( // (4)
new SpanQuery[] { quick, brownText }, 1, true);
final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
- (1) Matches "brown" in the "color" field.
- (2) Since SpanNearQuery requires its input SpanQuery-ies to refer to the same field, we mask the "color" one under the "text" field.
- (3) Matches "quick" in the "text" field.
- (4) Search for "quick" followed by "brown".
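Field masking works because both fields index the same text and therefore share token positions; the ordered near match then reduces to simple position arithmetic. A toy sketch of that check (not Lucene's actual span-matching algorithm):

```java
/** Toy model of an ordered span-near check over shared token positions. */
class SpanNearModel {
  /**
   * Does a match at firstPos occur before a match at secondPos,
   * with at most `slop` tokens between them?
   */
  static boolean nearOrdered(int firstPos, int secondPos, int slop) {
    int gap = secondPos - firstPos - 1; // tokens between the two matches
    return gap >= 0 && gap <= slop;
  }
}
```

In our example "quick" is at position 0 in "text" and "brown" is at position 1 in the masked "color" field, so the ordered near check with slop 1 succeeds.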
A note about efficiency
In this post I've demonstrated a very basic approach to indexing and searching tagged data with Lucene. As noted above, the input data is added to a separate field for every annotator that you apply to it. This may be fine if you index short texts and only apply a few annotators. However, if you index large amounts of data, and especially since your analysis chain most likely comprises something more complex than whitespace tokenization, this approach has performance implications.
In a follow-up post I will demonstrate how to process the input data only once, by using another of Lucene's cool TokenStream implementations, TeeSinkTokenFilter.