Tuesday, January 5, 2016

Indexing Tagged Data with Lucene (part 3)

So far we've seen two approaches for indexing tagged data with Lucene: processing the text for every annotation and using TeeSinkTokenFilter to share the common analysis chain with the annotating filters. We've also seen that AnnotatorTokenFilter adds the terms that were accepted by the annotator to the index, which allows us to search for e.g. color:red or animal:fox.

We have also seen how to search for all documents with colors, by sending the query color:*. However, this query can be very expensive when the "color" field has many terms, as the query is expanded to all the terms in the field. By one count, there are 300+ named colors. With animals (well, species on earth) the situation is even worse, as our "animal" field could potentially have >8 million terms. If we execute the query animal:*, it will expand into one humongous query...
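To make the cost concrete, here is a toy model of what "expanding to all the terms" means. This is not Lucene code: the class WildcardCostSketch and its methods are hypothetical names made up for illustration, and the field is just a sorted map from term to the ids of the documents that contain it. The sample postings are the two color terms from the example index printed later in this post.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Toy model only (hypothetical names, no Lucene dependency).
final class WildcardCostSketch {

  /** Result of a toy "field:*" query: matching docs and how many postings lists were read. */
  static final class Result {
    final TreeSet<Integer> docs = new TreeSet<>();
    int postingsListsRead = 0;
  }

  /** Finds every document with at least one term in the field by visiting each term in turn. */
  static Result matchAll(Map<String, List<Integer>> field) {
    Result result = new Result();
    for (List<Integer> postings : field.values()) {
      result.postingsListsRead++;   // one postings list per distinct term
      result.docs.addAll(postings);
    }
    return result;
  }

  /** The "color" field of the example index: brown in doc 0, red in docs 0, 1 and 2. */
  static Map<String, List<Integer>> exampleColorField() {
    Map<String, List<Integer>> field = new TreeMap<>();
    field.put("brown", Arrays.asList(0));
    field.put("red", Arrays.asList(0, 1, 2));
    return field;
  }

  public static void main(String[] args) {
    Result r = matchAll(exampleColorField());
    System.out.println("docs=" + r.docs + ", postings lists read=" + r.postingsListsRead);
  }
}
```

With two terms the loop is trivial, but the number of lists read grows with the number of distinct terms in the field, which is exactly the problem with hundreds of colors or millions of species.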

Adding an _any_ term

Now that we know how to implement a TokenFilter, we can implement one which adds a fixed term, say _any_, to every document that contains a token that is accepted by the annotator. Then, we can simply search for animal:_any_, and our query becomes a simple TermQuery which traverses only one term's indexed list (also referred to as a postings list). Let's do just that:

public final class AnyAnnotationFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;

  private boolean addedOrigTerm = false;

  public AnyAnnotationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!addedOrigTerm) {
      addedOrigTerm = true;                          // (1)
      return input.incrementToken();
    }

    termAtt.setEmpty().append(ANY_ANNOTATION_TERM);  // (2)
    posIncrAtt.setPositionIncrement(0);              // (3)
    addedOrigTerm = false;
    return true;
  }

  @Override
  public void reset() throws IOException {
    addedOrigTerm = false;
    super.reset();
  }
}
(1) First return the original term.
(2) Set the token's value to _any_ so that it's added to the index.
(3) Set the position increment to 0, so that the _any_ term is added at the same position as the original term.
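If the alternation between the original term and _any_ is hard to picture, the following Lucene-free sketch may help. It is only a simulation under simplifying assumptions: AnyTermSketch, Token, addAnyTerm and positionsOf are hypothetical names, a token is reduced to a term plus a position increment, and positions are accumulated starting from -1, the way an indexer derives absolute positions from increments.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simulation only (hypothetical names, no Lucene dependency).
final class AnyTermSketch {

  static final String ANY = "_any_";

  /** A token reduced to the two attributes the filter touches. */
  static final class Token {
    final String term;
    final int posIncr;

    Token(String term, int posIncr) {
      this.term = term;
      this.posIncr = posIncr;
    }
  }

  /** After every input token, emits an extra _any_ token with position increment 0. */
  static List<Token> addAnyTerm(List<Token> input) {
    List<Token> out = new ArrayList<>();
    for (Token t : input) {
      out.add(t);                  // (1) the original term
      out.add(new Token(ANY, 0));  // (2) + (3) _any_ at the same position
    }
    return out;
  }

  /** Absolute positions of a term, accumulating position increments from -1. */
  static List<Integer> positionsOf(String term, List<Token> tokens) {
    List<Integer> positions = new ArrayList<>();
    int pos = -1;
    for (Token t : tokens) {
      pos += t.posIncr;
      if (t.term.equals(term)) {
        positions.add(pos);
      }
    }
    return positions;
  }

  public static void main(String[] args) {
    // Color tokens of "brown fox and a red dog": brown at position 0, red at position 4.
    List<Token> colors = Arrays.asList(new Token("brown", 1), new Token("red", 4));
    System.out.println(positionsOf(ANY, addAnyTerm(colors)));
  }
}
```

For the two color tokens in main, _any_ lands at positions 0 and 4, one occurrence per annotated token.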

We should now change the code of addDocument() as shown here to wrap the token stream with the new filter:

private void addDocument(IndexWriter writer, String text) {
  ...
  final TokenStream colorsStream = 
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  final TokenStream animalsStream = 
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  ...

Let's print the terms of the "color" field, to see the information of its _any_ term:

Terms for field [color], with positional info:
  _any_
    doc=0, freq=2, pos=[0, 4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]

You can see that the term _any_ is added to every document with colors, and that in doc=0 it was added twice, at the same positions as the color terms ("brown" and "red"). This now allows us to search for color:red AND animal:_any_ (documents with red animals). Actually, that query will match any document which contains an animal and the color "red", but not necessarily "a red animal". We can construct a SpanQuery programmatically to achieve that:

final SpanQuery red = new SpanTermQuery(new Term(COLOR_FIELD, "red"));
final SpanQuery redColorAsAnimal = new FieldMaskingSpanQuery(red, ANIMAL_FIELD);
final SpanQuery anyAnimal = new SpanTermQuery(new Term(ANIMAL_FIELD, "_any_"));
final SpanQuery redAnimals = new SpanNearQuery(
    new SpanQuery[] { redColorAsAnimal, anyAnimal }, 1, true);

What's nice about this query is that we search only on the annotation fields, i.e. on metadata that never explicitly existed in the input text! And indeed, this returns the following results:

Searching for [spanNear([mask(color:red) as animal, animal:_any_], 1, true)]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog
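To see why those two documents match, here is a rough, Lucene-free model of the position check. It is a simplification under stated assumptions: SpanNearSketch and its methods are hypothetical names, every term is treated as a one-position span, and masking is modeled by simply comparing positions from the two fields as if they belonged to one field. Real span matching is more general, and this sketch does no scoring, so documents come back in index order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Simplified model only (hypothetical names, no Lucene dependency).
final class SpanNearSketch {

  /** True if a position in 'second' follows a position in 'first' with at most 'slop' positions in between. */
  static boolean orderedNear(List<Integer> first, List<Integer> second, int slop) {
    for (int p : first) {
      for (int q : second) {
        int gap = q - p - 1;  // positions strictly between the two terms
        if (gap >= 0 && gap <= slop) {
          return true;
        }
      }
    }
    return false;
  }

  /** Ids of documents where the two per-document position lists satisfy orderedNear. */
  static List<Integer> matchingDocs(List<List<Integer>> firstByDoc,
                                    List<List<Integer>> secondByDoc, int slop) {
    List<Integer> docs = new ArrayList<>();
    for (int doc = 0; doc < firstByDoc.size(); doc++) {
      if (orderedNear(firstByDoc.get(doc), secondByDoc.get(doc), slop)) {
        docs.add(doc);
      }
    }
    return docs;
  }

  public static void main(String[] args) {
    // Positions of "red" in the color field, per document (taken from the printed index).
    List<List<Integer>> red = Arrays.asList(
        Arrays.asList(4), Arrays.asList(1), Arrays.asList(1));
    // Positions of _any_ in the animal field: fox and dog in doc 0, dog in doc 1.
    // Assumption: doc 2 has no animal annotation, consistent with it missing from the results.
    List<List<Integer>> anyAnimal = Arrays.asList(
        Arrays.asList(1, 5), Arrays.asList(2), Collections.<Integer>emptyList());
    System.out.println(matchingDocs(red, anyAnimal, 1));
  }
}
```

In doc 0 the fox at position 1 comes before "red" and is ignored because of the ordering requirement, while the dog at position 5 directly follows "red" at position 4. That is the difference between "contains red and an animal" and "a red animal".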

The _any_ term adds more information to the index (i.e. it repeats the indexed information of all annotations). However, in practice this would normally not have a big impact on your index, since it will already contain plenty of other indexed data. But, if you don't expect to execute such global matching queries (animal:*), you can avoid the indexing of the extra term by removing the AnyAnnotationFilter from your analysis chain.

Multi word annotations

So far we have seen how to index single word annotations with Lucene. The techniques that I covered are simple, but not enough if you want to detect and index multi-word annotations, e.g. the color Dark Sea Green. To do that, we will need to use another of Lucene's very useful data structures, the Payload. I will demonstrate how to use it for our multi-word annotations in a follow-up post.
