Tuesday, January 5, 2016

Indexing Tagged Data with Lucene (part 2)

In the previous post I demonstrated how to index tagged data with Lucene. I used an Analyzer which filters out terms that are not accepted/recognized by an Annotator and added the text twice to the document: once for the "text" field and again for the "color" field (where the color-annotated terms were indexed).

I also said that this approach is inefficient when indexing large amounts of data, or when using multiple annotators, and that Lucene allows us to improve that by using its TeeSinkTokenFilter. In this post I will show how to use it, but before I do that, let's index some more data and add an AnimalAnnotator to also detect animals.

private void indexDocs(IndexWriter writer) throws IOException {
  addDocument(writer, "brown fox and a red dog");
  addDocument(writer, "only red dog");
  addDocument(writer, "no red animals here");
  writer.commit();
}

private void addDocument(IndexWriter writer, String text) throws IOException {
  final Document doc = new Document();
  doc.add(new TextField(TEXT_FIELD, text, Store.YES));
  doc.add(new TextField(COLOR_FIELD, text, Store.NO));
  doc.add(new TextField(ANIMAL_FIELD, text, Store.NO));         // (1)
  writer.addDocument(doc);
}

public final class AnimalAnnotatorAnalyzer extends Analyzer {   // (2)
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer tokenizer = new WhitespaceTokenizer();
    final TokenStream stream = new AnnotatorTokenFilter(tokenizer, 
        AnimalAnnotator.withDefaultAnimals());
    return new TokenStreamComponents(tokenizer, stream);
  }
}

private Analyzer createAnalyzer() {
  final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer();
  final Analyzer animalAnnotatorAnalyzer = new AnimalAnnotatorAnalyzer();
  final Analyzer defaultAnalyzer = new WhitespaceAnalyzer();
  return new PerFieldAnalyzerWrapper(defaultAnalyzer,
        ImmutableMap.of(
            COLOR_FIELD, colorAnnotatorAnalyzer,
            ANIMAL_FIELD, animalAnnotatorAnalyzer));            // (3)
}
(1)
Add an "animal" field, in addition to "text" and "color".
(2)
Similar to ColorAnnotatorAnalyzer, only it chains AnnotatorTokenFilter with AnimalAnnotator.
(3)
Add AnimalAnnotatorAnalyzer to the PerFieldAnalyzerWrapper.
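
For completeness, wiring this analyzer into the IndexWriter might look roughly like this. It's only a sketch: the helper name and the Directory parameter are my own, and the setup from the previous post may differ slightly.

private IndexWriter createIndexWriter(Directory dir) throws IOException {
  // Configure the writer with the per-field analyzer built above.
  final IndexWriterConfig config = new IndexWriterConfig(createAnalyzer());
  return new IndexWriter(dir, config);
}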

We can now also print the terms of the "animal" field, and together with their position information we get the following:

Terms for field [color], with positional info:
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
Terms for field [animal], with positional info:
  dog
    doc=0, freq=1, pos=[5]
    doc=1, freq=1, pos=[2]
  fox
    doc=0, freq=1, pos=[1]
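
The printing code itself isn't shown in this post, but for reference, a minimal sketch of how such output can be produced with Lucene's (5.x) postings API might look like this. The helper name is made up and error handling is omitted:

private void printFieldTermsWithPositions(IndexReader reader, String field) throws IOException {
  System.out.println("Terms for field [" + field + "], with positional info:");
  final Terms terms = MultiFields.getTerms(reader, field);
  if (terms == null) {
    return; // no terms were indexed for this field
  }
  final TermsEnum termsEnum = terms.iterator();
  PostingsEnum postings = null;
  BytesRef term;
  while ((term = termsEnum.next()) != null) {
    System.out.println("  " + term.utf8ToString());
    // Ask for positions in addition to doc IDs and frequencies.
    postings = termsEnum.postings(postings, PostingsEnum.POSITIONS);
    int docID;
    while ((docID = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      final int freq = postings.freq();
      final List<Integer> positions = new ArrayList<>();
      for (int i = 0; i < freq; i++) {
        positions.add(postings.nextPosition());
      }
      System.out.println("    doc=" + docID + ", freq=" + freq + ", pos=" + positions);
    }
  }
}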

Now that we have "color" and "animal" annotations, we can explore more fun queries:

// Search for documents with animals and colors
Searching for [+animal:* +color:*]:
  doc=0, text=brown fox and a red dog
  doc=1, text=only red dog

// Search for documents with red animals
Searching for [+animal:* +color:red]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog
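
For illustration, the second query above ([+animal:* +color:red]) could be built programmatically along these lines. This is a sketch rather than the post's actual search code, and it assumes Lucene 5.3+'s BooleanQuery.Builder:

// Match documents that contain any "animal" term and the color "red".
final Query redAnimals = new BooleanQuery.Builder()
    .add(new WildcardQuery(new Term(ANIMAL_FIELD, "*")), BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term(COLOR_FIELD, "red")), BooleanClause.Occur.MUST)
    .build();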

Indexing annotations with TeeSinkTokenFilter

TeeSinkTokenFilter is a TokenFilter which allows sharing an analysis chain between multiple fields that at some point need to go their separate ways. Named after Unix's tee command, it pipes the output of the source stream to one or more "sinks", which can then continue processing the tokens independently of the source as well as of the other "sinks".

This sounds perfect for our use case! We want the input text tokenized on whitespace (and potentially also lowercased, with stopwords removed) and indexed in a "text" field (for regular searches). We also want to extract colors and animals, and index them in separate fields. With TeeSinkTokenFilter, our addDocument() code changes as follows:

private void addDocument(IndexWriter writer, String text) throws IOException {
  final Tokenizer tokenizer = new WhitespaceTokenizer();
  tokenizer.setReader(new StringReader(text));
  final TeeSinkTokenFilter textStream = new TeeSinkTokenFilter(tokenizer);    // (1)
  final TokenStream colorsStream = new AnnotatorTokenFilter(                  // (2)
      textStream.newSinkTokenStream(), ColorAnnotator.withDefaultColors());
  final TokenStream animalsStream = new AnnotatorTokenFilter(                 // (2)
      textStream.newSinkTokenStream(), AnimalAnnotator.withDefaultAnimals());

  final Document doc = new Document();
  doc.add(new StoredField(TEXT_FIELD, text));                                 // (3)
  doc.add(new TextField(TEXT_FIELD, textStream));                             // (4)
  doc.add(new TextField(COLOR_FIELD, colorsStream));                          // (4)
  doc.add(new TextField(ANIMAL_FIELD, animalsStream));                        // (4)
  writer.addDocument(doc);
}
(1)
Create the TeeSinkTokenFilter over WhitespaceTokenizer. You can chain additional TokenFilters for further analysis (lowercasing, stopword removal) and wrap the full chain with TeeSinkTokenFilter.
(2)
Create a SinkTokenStream from textStream, so that it receives all processed tokens and can perform independent analysis on them (here, applying AnnotatorTokenFilter). Notice that we create separate sinks for "colors" and "animals".
(3)
Since we now add the value of the "text" field as a TokenStream, we also need to separately add it as a StoredField (the former does not store the value).
(4)
Add each TokenStream to its respective field.

OK, so the indexing code is definitely not as simple as it was before. We now add fields to the document that depend on each other, and therefore the code is slightly more involved. However, we do gain from the input text being analyzed only once, as opposed to three times before (once for the "text" field and once for each of the two annotation fields).

NOTE: when you work with TeeSinkTokenFilter, you should add the source field (textStream above) before you add the "sink" fields to the document, since it needs to be processed before they are.

Besides the changes to addDocument(), we no longer need to configure the IndexWriter with the PerFieldAnalyzerWrapper that we created before. In fact, given the way we add documents above, it doesn't matter which analyzer we configure the indexer with, since the indexer will use the pre-built TokenStreams that we attached to the fields.
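
For example, a plain configuration like the following is now enough (just a sketch; the directory variable is assumed):

// Any analyzer works here -- the pre-built TokenStreams we attach to the fields bypass it.
final IndexWriterConfig config = new IndexWriterConfig(new WhitespaceAnalyzer());
final IndexWriter writer = new IndexWriter(directory, config);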

Furthermore, if we run the same search code as we did before, we will get the exact same results. That's because we only changed the way the fields are processed, but not the way they are indexed. This is pretty cool!

Conclusions

Lucene does not allow one field to inject tokens into other fields, but with TeeSinkTokenFilter we can overcome that limitation. Each token that is processed by the source is also sent to the "sinks", which can continue processing it independently of the source and of the other "sinks" (their changes to the stream are visible only to them).

Using TeeSinkTokenFilter does come with a cost though. With the current implementation, every "sink" caches the attribute state of every token produced by the source. Furthermore, the "sinks" do not share these cached states with each other. That means you might see increased heap usage during indexing, although these objects can be collected as soon as you're done indexing the document.

This can likely be improved by having the "sinks" share the cached states with each other. But even without it, if you're indexing large amounts of data, and especially if you have an expensive analysis chain that you can share with the "sinks", using TeeSinkTokenFilter might be beneficial to you.

Happy tee'ing!
