So far we've seen two approaches for indexing tagged data with Lucene: processing the text for every annotation, and using TeeSinkTokenFilter to share the common analysis chain with the annotating filters. We've also seen that AnnotatorTokenFilter adds the terms that were accepted by the annotator to the index, which allows us to search for e.g. color:red or animal:fox.
We have also seen how to search for all documents with colors, by sending the query color:*. However, this query can be very expensive when the "color" field has many terms, as the query is expanded to all the terms in the field. According to this table, there are 300+ named colors. With animals (well, species on earth) the situation is even worse, as our "animal" field could potentially have >8 million terms. If we executed the query animal:*, it would expand into one humongous query...
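To make the cost concrete, here is a minimal sketch in plain Java (it does not use Lucene's API). It models a field as a map from term to posting list and counts how many posting lists a match-all-terms query has to visit compared to a single-term query. The class and method names are hypothetical and exist only for this illustration; a real Lucene query does more work than this, but the growth with the number of distinct terms is the point.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy inverted index for a single field: term -> posting list of doc IDs. Not Lucene code. */
public class WildcardCostSketch {
    private final Map<String, List<Integer>> postings = new TreeMap<>();

    /** Records that the given document contains the given term. */
    public void add(String term, int docId) {
        List<Integer> list = postings.computeIfAbsent(term, t -> new ArrayList<>());
        if (list.isEmpty() || list.get(list.size() - 1) != docId) {
            list.add(docId);
        }
    }

    /** A query like color:* expands to every term, so it visits every posting list. */
    public int postingListsForWildcard() {
        return postings.size();
    }

    /** A single-term query such as color:red visits at most one posting list. */
    public int postingListsForTerm(String term) {
        return postings.containsKey(term) ? 1 : 0;
    }

    public static void main(String[] args) {
        WildcardCostSketch color = new WildcardCostSketch();
        String[] names = {"brown", "red", "green", "blue"}; // made-up sample terms
        for (int doc = 0; doc < 8; doc++) {
            color.add(names[doc % names.length], doc);
        }
        System.out.println("color:*   visits " + color.postingListsForWildcard() + " posting lists");
        System.out.println("color:red visits " + color.postingListsForTerm("red") + " posting list");
    }
}
```

With four distinct terms the difference is trivial, but the wildcard count grows with the number of distinct terms in the field, while the single-term count stays at one.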
Adding an _any_ term
Now that we know how to implement a TokenFilter, we can implement one which adds a fixed term, say _any_, to every document that contains a token that is accepted by the annotator. Then, we can simply search for animal:_any_, and our query becomes a simple TermQuery which traverses only one term's indexed list (also referred to as a posting list). Let's do just that:
public final class AnyAnnotationFilter extends TokenFilter {

  public static final String ANY_ANNOTATION_TERM = "_any_";

  private final CharTermAttribute termAtt = ...;
  private final PositionIncrementAttribute posIncrAtt = ...;

  private boolean addedOrigTerm = false;

  public AnyAnnotationFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!addedOrigTerm) {
      addedOrigTerm = true; // (1)
      return input.incrementToken();
    }

    termAtt.setEmpty().append(ANY_ANNOTATION_TERM); // (2)
    posIncrAtt.setPositionIncrement(0); // (3)
    addedOrigTerm = false;
    return true;
  }

  @Override
  public void reset() throws IOException {
    addedOrigTerm = false;
    super.reset();
  }
}
- (1) First return the original term.
- (2) Set the token's value to _any_ so that it's added to the index.
- (3) Set the position increment to 0, so that the _any_ term is added at the same position as the original term.
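To see what the position increment of 0 does, here is a small plain-Java simulation of the filter's output (again, not Lucene code). It assumes the indexer derives absolute positions by starting at -1 and adding each token's increment, which is how I understand Lucene's indexing to behave. The Token class and both helper methods are hypothetical, and the input increments (1 and 4) are chosen so that "brown" and "red" land on positions 0 and 4.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simulates the effect of emitting "_any_" with a position increment of 0. Not Lucene code. */
public class AnyTermPositionsSketch {

    /** A token as the indexer sees it: a term plus its position increment. */
    public static final class Token {
        public final String term;
        public final int posIncr;

        public Token(String term, int posIncr) {
            this.term = term;
            this.posIncr = posIncr;
        }
    }

    /** Mimics the filter: after every accepted token, emit "_any_" with increment 0. */
    public static List<Token> withAnyTerm(List<Token> input) {
        List<Token> out = new ArrayList<>();
        for (Token t : input) {
            out.add(t);                      // (1) the original term
            out.add(new Token("_any_", 0));  // (2) + (3) fixed term, same position
        }
        return out;
    }

    /** Absolute positions, assuming the indexer starts at -1 and adds each increment. */
    public static List<String> positions(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        int pos = -1;
        for (Token t : tokens) {
            pos += t.posIncr;
            out.add(t.term + "@" + pos);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> colors = Arrays.asList(new Token("brown", 1), new Token("red", 4));
        // prints [brown@0, _any_@0, red@4, _any_@4]
        System.out.println(positions(withAnyTerm(colors)));
    }
}
```

Had the filter left the increment at 1, every _any_ token would have been pushed one position past its original term, and position-sensitive queries would no longer line up with the annotated word.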
We should now change the code of addDocument() as shown here to wrap the token stream with the new filter:
private void addDocument(IndexWriter writer, String text) {
  ...
  final TokenStream colorsStream =
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  final TokenStream animalsStream =
      new AnyAnnotationFilter(new AnnotatorTokenFilter(...));
  ...
Let's print the terms of the "color" field, to see the information of its _any_ term:
Terms for field [color], with positional info:
  _any_
    doc=0, freq=2, pos=[0, 4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]
You can see that the term _any_ is added to every document with colors, and that in doc=0 it was added twice, at the same positions as the color terms ("brown" and "red").
This now allows us to search for color:red AND animal:_any_ (documents with red animals). Actually, that query will match any document which contains an animal and the color "red", but not necessarily "a red animal". We can construct a SpanQuery programmatically to achieve that:
final SpanQuery red = new SpanTermQuery(new Term(COLOR_FIELD, "red"));
final SpanQuery redColorAsAnimal = new FieldMaskingSpanQuery(red, ANIMAL_FIELD);
final SpanQuery anyAnimal = new SpanTermQuery(new Term(ANIMAL_FIELD, "_any_"));
final SpanQuery redAnimals = new SpanNearQuery(
    new SpanQuery[] { redColorAsAnimal, anyAnimal }, 1, true);
What's nice about this query is that we search only on the annotation fields, i.e. metadata that never explicitly existed in the input text! And indeed, this returns the following results:
Searching for [spanNear([mask(color:red) as animal, animal:_any_], 1, true)]:
doc=1, text=only red dog
doc=0, text=brown fox and a red dog
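The position logic behind that match can be sketched in plain Java (not Lucene code). The simplified check below treats each term as a one-position span and accepts a pair when the second term follows the first with at most `slop` positions in between, which is my understanding of how an in-order SpanNearQuery treats two single-term clauses. The method name is hypothetical, and the animal positions are inferred from the word order of the two result texts rather than taken from a printed listing.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Simplified in-order "near" check over term positions. Not Lucene code. */
public class OrderedNearSketch {

    /**
     * Returns true if some position in first is followed by a position in second
     * with at most slop positions in between (single-term spans, in order).
     */
    public static boolean orderedNear(List<Integer> first, List<Integer> second, int slop) {
        for (int a : first) {
            for (int b : second) {
                int gap = b - (a + 1); // number of positions strictly between the two terms
                if (gap >= 0 && gap <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // doc 0, "brown fox and a red dog": color:red at 4; animal:_any_ at 1 (fox) and 5 (dog)
        System.out.println(orderedNear(Arrays.asList(4), Arrays.asList(1, 5), 1)); // true

        // doc 1, "only red dog": color:red at 1; animal:_any_ at 2 (dog)
        System.out.println(orderedNear(Arrays.asList(1), Arrays.asList(2), 1)); // true

        // hypothetical document with "red" at 1 and no animal tokens at all
        System.out.println(orderedNear(Arrays.asList(1), Collections.<Integer>emptyList(), 1)); // false
    }
}
```

The sketch also hints at why the color clause is wrapped in a FieldMaskingSpanQuery: positions from two different fields can only be compared like this once both clauses report the same field.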
The _any_ term adds more information to the index (i.e. it repeats the indexed information of all annotations). In practice, however, this would normally not have a big impact on your index, since it will already contain plenty of other indexed data. But if you don't expect to execute such global matching queries (animal:*), you can avoid indexing the extra term by removing the AnyAnnotationFilter from your analysis chain.
Multi-word annotations
So far we have seen how to index single-word annotations with Lucene. The techniques that I covered are simple, but not enough if you want to detect and index multi-word annotations, e.g. the color Dark Sea Green. To do that, we will need to use another of Lucene's very useful data structures, the Payload. I will demonstrate how to use it for our multi-word annotations in a follow-up post.