If you want to index the text "quick brown fox and a red dog" with Lucene, you just add a TextField to a Document and index it with IndexWriter. You can then search for it by sending a query which contains one or more of the indexed words. The following code snippet demonstrates how easy it is:
final String text = "quick brown fox and a red dog";
final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, text, Store.YES)); // (1)
writer.addDocument(doc); // (2)
writer.commit(); // (3)
final Query q = new TermQuery(new Term(TEXT_FIELD, "quick")); // (4)
final TopDocs results = searcher.search(q, 10); // (5)
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
- (1) Add the text to index as a TextField.
- (2) Index the document. This uses the default Analyzer that was configured for the IndexWriter to extract and process the field's terms.
- (3) Commit the changes to the index so that they are visible to the searcher. Note that you rarely want to commit() after every indexed document; in many cases it is preferable to use Lucene's near-real-time search.
- (4) Create a Query to search for the text. Usually you will use a QueryParser to parse a query such as text:quick.
- (5) Execute the search. TopDocs holds the list of matching documents; in the example above, only one document.
However, what if you (or your users) are not interested in the information embedded directly in the text, but rather in meta-information such as "documents with red colors" or "documents about foxes (animal)"? Clearly, queries such as animal:fox or color:red aren't going to work, since we never added a "color" or an "animal" field to the document.
Detecting color terms
Named-entity recognition (also known as named-entity extraction) is a task that seeks to classify elements of a text into categories. There are various tools that tackle this task, and a quick search for entity extraction tools returns quite a few, as well as this discussion. Such extraction techniques and tools are beyond the scope of this post though. Let's assume that we have a simple Annotator interface, which can either accept or reject tokens, with two implementations: a ColorAnnotator which accepts "red" and "brown", and an AnimalAnnotator which accepts "fox" and "dog".
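The Annotator interface itself is not spelled out in the post. A minimal sketch might look like the following; the accept(char[], int, int) signature matches how the token filter below calls it, but the class names and the shared SetAnnotator base are assumptions for illustration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Accepts or rejects a single token, given as a slice of a char buffer. */
interface Annotator {
  boolean accept(char[] buffer, int offset, int length);
}

/** A simple Annotator backed by a fixed set of accepted words (illustrative). */
class SetAnnotator implements Annotator {
  private final Set<String> accepted;

  SetAnnotator(String... words) {
    accepted = new HashSet<>(Arrays.asList(words));
  }

  @Override
  public boolean accept(char[] buffer, int offset, int length) {
    return accepted.contains(new String(buffer, offset, length));
  }
}

/** Accepts the color terms "red" and "brown". */
class ColorAnnotator extends SetAnnotator {
  ColorAnnotator() { super("red", "brown"); }
  static ColorAnnotator withDefaultColors() { return new ColorAnnotator(); }
}

/** Accepts the animal terms "fox" and "dog". */
class AnimalAnnotator extends SetAnnotator {
  AnimalAnnotator() { super("fox", "dog"); }
}
```

A real annotator would of course wrap an NER library rather than a hard-coded word set; the fixed sets merely keep the example self-contained.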
How could we use this annotator to tag our text, such that the index captured the tagged data, and allowed us to find the document by searching for color:red or animal:fox?
An annotating TokenFilter
As mentioned above, when we add text fields to documents, they are processed by the Analyzer that was configured for the indexer. Lucene's Analyzer produces a TokenStream, which processes the text and produces index tokens. That token stream comprises a Tokenizer and TokenFilters: the former is responsible for breaking the input text into tokens, and the latter for processing them. Lucene offers a great variety of filters out of the box. Some drop tokens (e.g. stopwords), some replace tokens (e.g. stemming) and others inject additional ones (e.g. synonyms).
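Those three behaviors can be illustrated outside Lucene as plain list transformations. This is a toy sketch of the drop/replace/inject ideas, not Lucene API; the method names and the crude suffix-stripping "stemmer" are made up for the example:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;

class FilterBehaviors {
  /** Drop: remove stopwords from the token list. */
  static List<String> drop(List<String> tokens, Set<String> stopwords) {
    List<String> out = new ArrayList<>();
    for (String t : tokens) if (!stopwords.contains(t)) out.add(t);
    return out;
  }

  /** Replace: a crude "es"-suffix stripper standing in for real stemming. */
  static List<String> replace(List<String> tokens) {
    List<String> out = new ArrayList<>();
    for (String t : tokens) out.add(t.endsWith("es") ? t.substring(0, t.length() - 2) : t);
    return out;
  }

  /** Inject: emit each token, plus a synonym alongside it when one is known. */
  static List<String> inject(List<String> tokens, Map<String, String> synonyms) {
    List<String> out = new ArrayList<>();
    for (String t : tokens) {
      out.add(t);
      if (synonyms.containsKey(t)) out.add(synonyms.get(t));
    }
    return out;
  }
}
```

Real Lucene filters do the same kinds of transformations, only incrementally over a TokenStream while maintaining token attributes such as positions and offsets.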
Since breaking the input text into words is already handled by the tokenizer, all we need is an AnnotatorTokenFilter which emits only the words that are accepted by an annotator. Fortunately, Lucene provides a base class for filters that keep only some of the input tokens, called FilteringTokenFilter. It lets us focus solely on accepting or rejecting tokens, while it handles the rest of the analysis work, mainly updating token attributes (which can sometimes be quite tricky). By extending it and using our Annotator interface, we come up with the following implementation:
/**
* A {@link FilteringTokenFilter} which uses an {@link Annotator} to
* {@link #accept()} tokens.
*/
public final class AnnotatorTokenFilter extends FilteringTokenFilter {
private final CharTermAttribute termAtt =
addAttribute(CharTermAttribute.class); // (1)
private final Annotator annotator;
public AnnotatorTokenFilter(TokenStream input, Annotator annotator) {
super(input);
this.annotator = annotator;
}
@Override
protected boolean accept() throws IOException { // (2)
return annotator.accept(termAtt.buffer(), 0, termAtt.length());
}
}
- (1) The standard way to get access to the term attribute on the chain. Note that we don't need to populate it explicitly; it is populated by the tokenizer and any filters that precede ours in the chain.
- (2) The only method that we need to implement. If our annotator accepts the current token, we keep it on the stream; otherwise it will not be indexed.
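Under the hood, FilteringTokenFilter's incrementToken() keeps pulling tokens from its input and skips the rejected ones. The loop can be modeled in plain Java like this; it is a simplified sketch of the contract, not the real Lucene code, which also adjusts position increments for skipped tokens:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.function.Predicate;

/** Simplified model of FilteringTokenFilter: pull tokens, keep accepted ones. */
class FilteringModel {
  private final Iterator<String> input; // stands in for the upstream TokenStream
  private String current;

  FilteringModel(Iterator<String> input) { this.input = input; }

  /** Like incrementToken(): advance to the next accepted token, if any. */
  boolean incrementToken(Predicate<String> accept) {
    while (input.hasNext()) {      // keep pulling from the input stream
      String token = input.next();
      if (accept.test(token)) {    // the accept() hook that subclasses implement
        current = token;
        return true;               // a token survived; expose it
      }
      // rejected tokens are silently skipped
    }
    return false;                  // input exhausted
  }

  String current() { return current; }
}
```

Feeding it the sample words with a color predicate surfaces "brown" and "red" and nothing else, which is exactly what AnnotatorTokenFilter does on a real TokenStream.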
Sweet! It really can't get any simpler than that. Now we just need to build an Analyzer which uses our filter, and we can move on to indexing and searching colors:
/**
* An {@link Analyzer} which chains {@link WhitespaceTokenizer} and
* {@link AnnotatorTokenFilter} with {@link ColorAnnotator}.
*/
public static final class ColorAnnotatorAnalyzer extends Analyzer {
@Override
protected TokenStreamComponents createComponents(String fieldName) {
final Tokenizer tokenizer = new WhitespaceTokenizer(); // (1)
final TokenStream stream = new AnnotatorTokenFilter(tokenizer, // (2)
ColorAnnotator.withDefaultColors());
return new TokenStreamComponents(tokenizer, stream);
}
}
- (1) For simplicity only, use a whitespace tokenizer.
- (2) You will most likely want to chain additional filters before the annotator, e.g. lower-casing, stemming etc.
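To see what this chain produces, we can simulate it on the sample text with a plain whitespace split. This is a toy model of WhitespaceTokenizer plus the color filter (real Lucene analysis runs through a TokenStream); the class and method names are invented for the example:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

class AnalysisSimulation {
  /** Returns the surviving color tokens mapped to their original token positions. */
  static Map<String, Integer> colorTokens(String text, Set<String> colors) {
    Map<String, Integer> out = new LinkedHashMap<>();
    String[] tokens = text.split("\\s+"); // whitespace tokenization
    for (int pos = 0; pos < tokens.length; pos++) {
      if (colors.contains(tokens[pos])) {
        out.put(tokens[pos], pos); // keep the token together with its position
      }
    }
    return out;
  }
}
```

Running it on "quick brown fox and a red dog" with the colors {red, brown} keeps only "brown" at position 1 and "red" at position 5, matching the per-field term listings shown in the next section.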
Index and search annotated colors
To index the color annotations, we are going to index two fields: "text" and "color". The "text" field will index the original text's tokens, while the "color" field will index only the ones that are also colors. Lucene provides PerFieldAnalyzerWrapper, which can return a different analyzer per field. It's quite handy and easy to construct:
private static Analyzer createAnalyzer() {
final Analyzer colorAnnotatorAnalyzer = new ColorAnnotatorAnalyzer(); // (1)
final Analyzer defaultAnalyzer = new WhitespaceAnalyzer(); // (2)
return new PerFieldAnalyzerWrapper(defaultAnalyzer,
ImmutableMap.<String, Analyzer> of(
COLOR_FIELD, colorAnnotatorAnalyzer)); // (3)
}
- (1) The color annotating analyzer that we created before.
- (2) The default analyzer to use for all fields without an explicitly set analyzer.
- (3) Map the color annotating analyzer to the "color" field.
Our indexing code then changes as follows:
final Analyzer analyzer = createAnalyzer();
final IndexWriterConfig conf = new IndexWriterConfig(analyzer); // (1)
final IndexWriter writer = new IndexWriter(dir, conf);
final Document doc = new Document();
doc.add(new TextField(TEXT_FIELD, TEXT, Store.YES));
doc.add(new TextField(COLOR_FIELD, TEXT, Store.NO)); // (2)
writer.addDocument(doc);
writer.commit();
- (1) Configure the IndexWriter with our PerFieldAnalyzerWrapper.
- (2) In addition to adding the text to the "text" field, also add it to the "color" field. Notice that there is no point in storing it for this field too.
Printing the indexed terms for each field will yield the following output:
Terms for field [text]:
a
and
brown
dog
fox
quick
red
Terms for field [color]:
brown
red
As you can see, the "text" field contains all the words from the text that we indexed, while the "color" field contains only the words that were identified as colors by our ColorAnnotator. We can now also search for color:red to get the document back:
final Query q = new TermQuery(new Term(COLOR_FIELD, "red"));
final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
That prints the following output, which is the value of the "text" field of the document:
quick brown fox and a red dog
Annotated terms' position
If we also print the full indexed information about the "color" terms, we will notice that they retain their original text position:
Terms for field [color], with additional info:
brown
doc=0, freq=1
pos=1
red
doc=0, freq=1
pos=5
This is important since it provides a direct back-reference to the input text. That way we can search for "foxes that are brown-colored"; had we omitted the exact position information of 'brown', we might not be able to associate the token 'fox' with the color 'brown'. Let's demonstrate that capability using Lucene's SpanQuery:
final SpanQuery brown = new SpanTermQuery(new Term(COLOR_FIELD, "brown")); // (1)
final SpanQuery brownText = new FieldMaskingSpanQuery(brown, TEXT_FIELD); // (2)
final SpanQuery quick = new SpanTermQuery(new Term(TEXT_FIELD, "quick")); // (3)
final Query q = new SpanNearQuery( // (4)
new SpanQuery[] { quick, brownText }, 1, true);
final TopDocs results = searcher.search(q, 10);
System.out.println(searcher.doc(results.scoreDocs[0].doc).get(TEXT_FIELD));
- (1) Matches "brown" in the "color" field.
- (2) Since SpanNearQuery requires its input SpanQuery-ies to refer to the same field, we mask the "color" one under the "text" field.
- (3) Matches "quick" in the "text" field.
- (4) Search for "quick" followed by "brown".
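Field masking works because both fields index the same text and therefore share token positions; the ordered near match then reduces to simple position arithmetic. A toy sketch of that check (not Lucene's actual span-matching algorithm):

```java
/** Toy model of an ordered span-near check over shared token positions. */
class SpanNearModel {
  /**
   * Does a match at firstPos occur before a match at secondPos,
   * with at most `slop` tokens between them?
   */
  static boolean nearOrdered(int firstPos, int secondPos, int slop) {
    int gap = secondPos - firstPos - 1; // tokens between the two matches
    return gap >= 0 && gap <= slop;
  }
}
```

In our example "quick" is at position 0 in "text" and "brown" is at position 1 in the masked "color" field, so the ordered near check with slop 1 succeeds.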
A note about efficiency
In this post I've demonstrated a very basic approach to indexing and searching tagged data with Lucene. As noted above, the input data is added to a separate field for every annotator that you apply to it. This may be fine if you index short texts and only apply a few annotators. However, if you index large amounts of data, and especially since your analysis chain most likely comprises something more complex than whitespace tokenization, this approach has performance implications.
In a follow-up post I will demonstrate how to process the input data only once, by using another of Lucene's cool TokenStream implementations, TeeSinkTokenFilter.