In the previous post I
demonstrated how to index tagged data with Lucene. I used an Analyzer
which filters out terms that are not accepted by an Annotator,
and added the text to the document twice: once for the
"text" field and again for the "color" field (where the color-annotated terms were indexed).
I also said that this approach is inefficient when indexing large amounts of data, or when using multiple
annotators, and that Lucene lets us improve on that with its
TeeSinkTokenFilter. In this post I will show how to use it, but before that, let's index
some more data and add an AnimalAnnotator to also detect animals.
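To keep this post self-contained, here is a minimal sketch of what such an annotator might look like. The Annotator interface comes from the previous post; the accept(String) signature and the default animal list below are assumptions for illustration only:

public final class AnimalAnnotator implements Annotator {
  // Hypothetical default list, for illustration only.
  private static final Set<String> DEFAULT_ANIMALS =
      new HashSet<>(Arrays.asList("fox", "dog", "cow", "cat"));

  private final Set<String> animals;

  private AnimalAnnotator(Set<String> animals) {
    this.animals = animals;
  }

  public static AnimalAnnotator withDefaultAnimals() {
    return new AnimalAnnotator(DEFAULT_ANIMALS);
  }

  @Override
  public boolean accept(String term) { // assumed Annotator method
    return animals.contains(term);
  }
}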
private void indexDocs(IndexWriter writer) throws IOException {
  addDocument(writer, "brown fox and a red dog");
  addDocument(writer, "only red dog");
  addDocument(writer, "no red animals here");
  writer.commit();
}
private void addDocument(IndexWriter writer, String text) throws IOException {
  final Document doc = new Document();
  doc.add(new TextField(TEXT_FIELD, text, Store.YES));
  doc.add(new TextField(COLOR_FIELD, text, Store.NO));
  doc.add(new TextField(ANIMAL_FIELD, text, Store.NO)); // (1)
  writer.addDocument(doc);
}
public final class AnimalAnnotatorAnalyzer extends Analyzer { // (2)
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    final Tokenizer tokenizer = new WhitespaceTokenizer();
    final TokenStream stream = new AnnotatorTokenFilter(tokenizer,
        AnimalAnnotator.withDefaultAnimals());
    return new TokenStreamComponents(tokenizer, stream);
  }
}
private Analyzer createAnalyzer() {
  final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer();
  final Analyzer animalAnnotatorAnalyzer = new AnimalAnnotatorAnalyzer();
  final Analyzer defaultAnalyzer = new WhitespaceAnalyzer();
  return new PerFieldAnalyzerWrapper(defaultAnalyzer,
      ImmutableMap.of(
          COLOR_FIELD, colorAnnotatorAnalyzer,
          ANIMAL_FIELD, animalAnnotatorAnalyzer)); // (3)
}
- (1) Add an "animal" field, in addition to "text" and "color".
- (2) Similar to ColorAnnotatorAnalyzer, only chains AnnotatorTokenFilter with AnimalAnnotator.
- (3) Add AnimalAnnotatorAnalyzer to the PerFieldAnalyzerWrapper.
We can now print the terms of the "color" and "animal" fields, along with their position information:
Terms for field [color], with positional info:
  brown
    doc=0, freq=1, pos=[0]
  red
    doc=0, freq=1, pos=[4]
    doc=1, freq=1, pos=[1]
    doc=2, freq=1, pos=[1]

Terms for field [animal], with positional info:
  dog
    doc=0, freq=1, pos=[5]
    doc=1, freq=1, pos=[2]
  fox
    doc=0, freq=1, pos=[1]
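For reference, a dump like the one above can be produced with Lucene's postings APIs. Here is a minimal sketch (printFieldTerms is a hypothetical helper; imports are elided, as in the other snippets):

private void printFieldTerms(IndexReader reader, String field) throws IOException {
  System.out.println("Terms for field [" + field + "], with positional info:");
  for (LeafReaderContext leaf : reader.leaves()) {
    final Terms terms = leaf.reader().terms(field);
    if (terms == null) {
      continue; // no terms for this field in this segment
    }
    final TermsEnum termsEnum = terms.iterator();
    BytesRef term;
    while ((term = termsEnum.next()) != null) {
      System.out.println("  " + term.utf8ToString());
      final PostingsEnum postings = termsEnum.postings(null, PostingsEnum.POSITIONS);
      int doc;
      // doc IDs are segment-relative; with a single segment they equal global IDs
      while ((doc = postings.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
        final int freq = postings.freq();
        final StringBuilder pos = new StringBuilder();
        for (int i = 0; i < freq; i++) {
          if (i > 0) pos.append(", ");
          pos.append(postings.nextPosition());
        }
        System.out.println("    doc=" + doc + ", freq=" + freq + ", pos=[" + pos + "]");
      }
    }
  }
}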
Now that we have "color" and "animal" annotations, we can explore more fun queries:
// Search for documents with animals and colors
Searching for [+animal:* +color:*]:
  doc=0, text=brown fox and a red dog
  doc=1, text=only red dog

// Search for documents with red animals
Searching for [+animal:* +color:red]:
  doc=1, text=only red dog
  doc=0, text=brown fox and a red dog
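For illustration, here is one way the second query could be built. The surrounding search helper is from the previous post, so only the query construction is sketched here:

// Match documents that contain any animal term and the color "red".
final Query query = new BooleanQuery.Builder()
    .add(new WildcardQuery(new Term(ANIMAL_FIELD, "*")), Occur.MUST)
    .add(new TermQuery(new Term(COLOR_FIELD, "red")), Occur.MUST)
    .build();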
Indexing annotations with TeeSinkTokenFilter
TeeSinkTokenFilter is a TokenFilter
which allows sharing an analysis chain between multiple
fields that at some point need to go their separate ways. Named after Unix's
tee command, it pipes the output of the source stream to one
or more "sinks", which can then continue processing the tokens independently of the source as well as the other "sinks".
This sounds perfect for our use case! We want the input text tokenized on whitespace (and potentially also
lowercased and stripped of stopwords) and indexed in a "text" field (for regular searches). We also want to extract
colors and animals and index them in separate fields. With TeeSinkTokenFilter
, our addDocument()
code changes as follows:
private void addDocument(IndexWriter writer, String text) throws IOException {
  final Tokenizer tokenizer = new WhitespaceTokenizer();
  tokenizer.setReader(new StringReader(text));
  final TeeSinkTokenFilter textStream = new TeeSinkTokenFilter(tokenizer); // (1)
  final TokenStream colorsStream = new AnnotatorTokenFilter( // (2)
      textStream.newSinkTokenStream(), ColorAnnotator.withDefaultColors());
  final TokenStream animalsStream = new AnnotatorTokenFilter( // (2)
      textStream.newSinkTokenStream(), AnimalAnnotator.withDefaultAnimals());

  final Document doc = new Document();
  doc.add(new StoredField(TEXT_FIELD, text)); // (3)
  doc.add(new TextField(TEXT_FIELD, textStream)); // (4)
  doc.add(new TextField(COLOR_FIELD, colorsStream)); // (4)
  doc.add(new TextField(ANIMAL_FIELD, animalsStream)); // (4)
  writer.addDocument(doc);
}
- (1) Create the TeeSinkTokenFilter over WhitespaceTokenizer. You can chain additional TokenFilters for further analysis (lowercasing, stopword removal) and wrap the full chain with TeeSinkTokenFilter.
- (2) Create a SinkTokenStream from textStream, so that it receives all processed tokens and can perform independent analysis on them (here, applying AnnotatorTokenFilter). Notice that we create a separate sink for "colors" and "animals".
- (3) Since we now add the value of the "text" field as a TokenStream, we also need to add it separately as a StoredField (the former does not store the value).
- (4) Add each TokenStream to its respective field.
OK, so the indexing code is definitely not as simple as it was before. We now need to add fields to the document that depend on each other, which makes the code slightly more involved. However, we gain from the input text being processed only once, as opposed to three times before (once for "text" and once for each annotation field).
Note: when using TeeSinkTokenFilter, you should add the source field (textStream above) to the document before you add the
"sink" fields, since it needs to be processed before they are.
Besides the changes to addDocument(), we no longer need to configure the IndexWriter
with
the PerFieldAnalyzerWrapper
that we created before. In fact, given the way we add documents above, it doesn't
matter which analyzer we configure the indexer with, since the indexer will use the pre-built TokenStreams
that we added to the fields.
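For example, a plain WhitespaceAnalyzer (or any other analyzer) works just as well now. A minimal sketch, assuming a Directory named dir:

// Any analyzer will do here: the pre-built TokenStreams bypass it.
final IndexWriterConfig config = new IndexWriterConfig(new WhitespaceAnalyzer());
final IndexWriter writer = new IndexWriter(dir, config);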
Furthermore, if we run the same search code as we did before, we will get the exact same results. That's because we only changed the way the fields are processed, but not the way they are indexed. This is pretty cool!
Conclusions
Lucene does not allow one field to inject tokens into other fields, but with TeeSinkTokenFilter
we can
overcome that limitation. Each token processed by the source is also sent to the "sinks", which can continue
processing it independently of the source and the other "sinks" (changes to the stream are visible only to them).
Using TeeSinkTokenFilter
does come with a cost though. With the current implementation, every "sink"
caches the state of the attributes of every token produced by the source. Furthermore, the "sinks" do not share these
states with each other. That means you might see increased memory usage on your heap during indexing, although
these objects may be collected as soon as you're done indexing that document.
This can likely be improved by having the "sinks" share the cached states with each other. But even without it, if
you're indexing large amounts of data, and especially if you have an expensive analysis chain that you can share
with the "sinks", using TeeSinkTokenFilter
might be beneficial to you.
Happy tee'ing!