Author and Pub Date facets. Since I would like to focus on the aspects of indexing and searching with facets, and in order to keep the examples simple, I index only facets for each book.
Lucene Facets are packaged with example code that you can use to start building your faceted search application. Examples are provided for simple as well as advanced scenarios (e.g. using sampling and complements).
First, let's define the facets of each book:
List<CategoryPath> book1 = new ArrayList<CategoryPath>();
book1.add(new CategoryPath("Author", "Erik Hatcher"));
book1.add(new CategoryPath("Author", "Otis Gospodnetić"));
book1.add(new CategoryPath("Pub Date", "2004", "December", "1"));

List<CategoryPath> book2 = new ArrayList<CategoryPath>();
book2.add(new CategoryPath("Author", "Michael McCandless"));
book2.add(new CategoryPath("Author", "Erik Hatcher"));
book2.add(new CategoryPath("Author", "Otis Gospodnetić"));
book2.add(new CategoryPath("Pub Date", "2010", "July", "28"));
Note how each category is initialized as a CategoryPath, which holds the category hierarchy. It can be initialized with the category path components passed separately to the constructor (as in the example above), or by passing the full hierarchy string with a delimiter, e.g. new CategoryPath("Author/Erik Hatcher", '/').
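To make the equivalence of the two construction styles concrete, here is a minimal plain-Java sketch that splits a delimited string into the same components you would otherwise pass one by one. It does not use Lucene; the class CategoryPathSketch and the helper components are hypothetical names introduced only for this illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class CategoryPathSketch {
    // Splits a delimited category string into its hierarchy components.
    // Hypothetical helper for illustration; this is not Lucene's CategoryPath.
    static List<String> components(String path, char delimiter) {
        // Quote the delimiter so characters such as '.' or '|' are treated literally.
        return Arrays.asList(path.split(Pattern.quote(String.valueOf(delimiter))));
    }

    public static void main(String[] args) {
        // Same components as passing "Author", "Erik Hatcher" separately.
        System.out.println(components("Author/Erik Hatcher", '/'));
        // A deeper hierarchy: dimension, year, month, day.
        System.out.println(components("Pub Date/2004/December/1", '/'));
    }
}
```

The delimiter form is convenient when categories arrive as single strings, but it only works if the delimiter never appears inside a component.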
Facets Indexing
Next, we need to initialize some components that are required for indexing the books and their facets. I previously mentioned that Lucene manages a hierarchical taxonomy of categories. That management is done by DirectoryTaxonomyWriter, which is responsible for adding categories to the taxonomy. Let's take a look at the following code:
Directory indexDir = new RAMDirectory();
Directory taxoDir = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(indexDir,
    new IndexWriterConfig(Version.LUCENE_50, new KeywordAnalyzer()));
// the taxonomy writer gets its own directory (taxoDir), separate from the search index
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
FacetFields facetFields = new FacetFields(taxoWriter);
- Note that the taxonomy and search index use their own Directory instances. This is currently mandatory and cannot be avoided.
- FacetFields is the helper class which takes care of adding the categories to the taxonomy (via DirectoryTaxonomyWriter), as well as adding the needed fields to the search index. It is usually used either before or after you add the normal search fields to the Document.
Document bookDoc = new Document();
// now you will normally add all the fields that your application
// needs to index / store, e.g. title, content, price etc.

// add the categories to the taxonomy and the needed fields to the document
// (bookCategories is the List<CategoryPath> of the book, i.e. book1 or book2 from above)
facetFields.addFields(bookDoc, bookCategories);
indexWriter.addDocument(bookDoc);
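To get a feel for what managing a hierarchical taxonomy involves, the following plain-Java toy model assigns an integer ordinal to every category and to each of its ancestors the first time it is seen. This is only a sketch of the idea, not how DirectoryTaxonomyWriter is implemented; the class ToyTaxonomy, its methods, and the specific ordinal numbering are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ToyTaxonomy {
    // Maps each full category path (e.g. "Author/Erik Hatcher") to an integer ordinal.
    private final Map<String, Integer> ordinals = new LinkedHashMap<>();

    // Adds a category together with any missing ancestors.
    // Returns the ordinal of the full path (or -1 if no components were given).
    int addCategory(String... components) {
        StringBuilder path = new StringBuilder();
        int ordinal = -1;
        for (String component : components) {
            if (path.length() > 0) {
                path.append('/');
            }
            path.append(component);
            String key = path.toString();
            Integer existing = ordinals.get(key);
            if (existing == null) {
                // First time this path is seen: give it the next free ordinal.
                existing = ordinals.size();
                ordinals.put(key, existing);
            }
            ordinal = existing;
        }
        return ordinal;
    }

    int size() {
        return ordinals.size();
    }

    public static void main(String[] args) {
        ToyTaxonomy taxo = new ToyTaxonomy();
        System.out.println(taxo.addCategory("Author", "Erik Hatcher"));
        System.out.println(taxo.addCategory("Author", "Michael McCandless"));
        System.out.println(taxo.addCategory("Pub Date", "2004", "December", "1"));
        // Adding an existing category returns the ordinal it already has.
        System.out.println(taxo.addCategory("Author", "Erik Hatcher"));
        System.out.println("categories: " + taxo.size());
    }
}
```

The point of the sketch is that each distinct path, including intermediate levels such as Pub Date/2004, becomes a compact integer that documents can reference, which is what makes later aggregation cheap.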
Faceted Search
In order to execute faceted search, we need to initialize some search components:
DirectoryReader indexr = DirectoryReader.open(indexWriter, false);
IndexSearcher searcher = new IndexSearcher(indexr);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
DirectoryReader and IndexSearcher are needed for executing queries on the search index. DirectoryTaxonomyReader is used to fetch data from the taxonomy. We open the readers on their respective writers in order to get NRT behavior, which allows the readers to view the changes made by the writers without those changes being committed first (notice that the code doesn't call commit()).
Faceted search aggregates requested facets on all documents that match a query via FacetsCollector. You first define a FacetRequest per root node for which you want to aggregate top categories. The root node can be any node in the taxonomy tree, e.g. Pub Date/2004. You can also specify the lowest level in the taxonomy tree for which you would like to receive the aggregations. That is, level=1 (default) means that you want to aggregate the immediate children of the root node, while level=2 means that you want the top categories of the immediate children of root, as well as each of their top categories. This is a powerful capability of Lucene Facets, which allows you to do recursive top categories aggregations.
The following code initializes FacetSearchParams to aggregate the top 10 immediate categories of the Author and Pub Date facets, and finally executes a MatchAllDocsQuery to aggregate facets on all indexed documents (books):
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));
FacetsCollector facetsCollector = FacetsCollector.create(fsp, indexr, taxor);
searcher.search(new MatchAllDocsQuery(), facetsCollector);
NOTE: in a normal search you will usually execute a different Query, e.g. one that was parsed from the user's request, and wrap FacetsCollector and e.g. TopDocsCollector with MultiCollector, to retrieve both the top ranking documents for the query and the top categories for the entire result set.
We can print the top categories and their weight using the following code:
for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  // the weights print as floating-point values (e.g. 2.0), so format them
  // with %s; %d rejects floating-point arguments
  System.out.println(String.format("%s (%s)", root.label, root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    System.out.println(String.format("  %s (%s)", cat.label.components[1], cat.value));
  }
}
Let's take a moment to review the code.
FacetsCollector returns a list of FacetResult, and there is one item in the list per requested facet. FacetResult exposes a tree-like API and the root node corresponds to the one that was specified in the request. The root will have as many children as were requested in the request (or fewer, as in this case). The traversal starts by getting the root and then its children (which denote the top categories for this request).
The code prints the label and weight of each category that it visits. Note that for the root node, the weight denotes the aggregated weight of all of its children (even those that did not make it to the top list). This can quickly tell you how many documents in the result set are not associated with the root node at all. If you execute the code, you will get the following print:
Author (2.0)
  Otis Gospodnetić (2.0)
  Erik Hatcher (2.0)
  Michael McCandless (1.0)
Pub Date (2.0)
  2010 (1.0)
  2004 (1.0)
NOTE: the label of each child node is actually the full path from the root, e.g. Author/Erik Hatcher. For brevity, I excluded the level of the root from the print (see the use of label.components[1]).
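The per-category counts printed above can be reproduced by hand, which may help in understanding what a count request with level=1 computes. The following plain-Java sketch counts, for each immediate child of a root, how many of the two books carry it, and then keeps the top k. It is a toy model under stated assumptions, not Lucene's aggregation code; ToyFacetCounts, countChildren and topK are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ToyFacetCounts {
    // The two books from the article; each category path is an array of components.
    // "\u0107" is the final letter of "Gospodnetić", escaped to keep the source ASCII-only.
    static final List<List<String[]>> BOOKS = Arrays.asList(
        Arrays.asList(
            new String[] {"Author", "Erik Hatcher"},
            new String[] {"Author", "Otis Gospodneti\u0107"},
            new String[] {"Pub Date", "2004", "December", "1"}),
        Arrays.asList(
            new String[] {"Author", "Michael McCandless"},
            new String[] {"Author", "Erik Hatcher"},
            new String[] {"Author", "Otis Gospodneti\u0107"},
            new String[] {"Pub Date", "2010", "July", "28"}));

    // Counts, for every immediate child of root, how many documents carry it.
    static Map<String, Integer> countChildren(List<List<String[]>> docs, String root) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (List<String[]> doc : docs) {
            Set<String> seen = new LinkedHashSet<>(); // count each child once per document
            for (String[] path : doc) {
                if (path.length > 1 && path[0].equals(root)) {
                    seen.add(path[1]);
                }
            }
            for (String child : seen) {
                counts.merge(child, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns up to k entries with the highest counts; the sort is stable,
    // so ties keep the order in which the categories were first seen.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        return entries.subList(0, Math.min(k, entries.size()));
    }

    public static void main(String[] args) {
        for (String root : new String[] {"Author", "Pub Date"}) {
            System.out.println(root);
            for (Map.Entry<String, Integer> e : topK(countChildren(BOOKS, root), 10)) {
                System.out.println("  " + e.getKey() + " (" + e.getValue() + ")");
            }
        }
    }
}
```

The counts agree with the output above (Erik Hatcher and Otis Gospodnetić in two books, Michael McCandless in one, each year in one), although the ordering among ties may differ from what Lucene returns.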
Drill-down / Narrowing on facet
One of the operations that users will probably want to do with the returned facets is to use them as a filter on the query, to narrow down the result set. This can be easily achieved with the DrillDownQuery class, as follows:
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(FacetIndexingParams.DEFAULT, base);
ddq.add(new CategoryPath("Author", "Michael McCandless"));
This code returns all documents that matched the original query and are associated with the category Author/Michael McCandless. If we execute that query by calling searcher.search(ddq, 10) (i.e., return the top 10 documents), we'll receive only the second book, since only it is associated with that category.
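Conceptually, drill-down is just a filter that keeps the documents of the base result set that carry the selected category. The plain-Java sketch below shows that idea on the two books; it is a toy model and not how DrillDownQuery is implemented. The names ToyDrillDown, books and drillDown, as well as the document ids book1 and book2, are introduced only for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ToyDrillDown {
    // Document id -> category paths, written as "dimension/value" strings for brevity.
    static Map<String, List<String>> books() {
        Map<String, List<String>> docs = new LinkedHashMap<>();
        docs.put("book1", Arrays.asList(
            "Author/Erik Hatcher", "Pub Date/2004"));
        docs.put("book2", Arrays.asList(
            "Author/Michael McCandless", "Author/Erik Hatcher", "Pub Date/2010"));
        return docs;
    }

    // Keeps only the documents associated with the given category.
    static List<String> drillDown(Map<String, List<String>> docs, String category) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> doc : docs.entrySet()) {
            if (doc.getValue().contains(category)) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Only the second book is associated with this author.
        System.out.println(drillDown(books(), "Author/Michael McCandless"));
        // Both books share this author, so nothing is filtered out.
        System.out.println(drillDown(books(), "Author/Erik Hatcher"));
    }
}
```

Here the base result set is every document, mirroring the MatchAllDocsQuery used above; with a real user query the filter would apply to that query's matches instead.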
That's it! You now know the basics of indexing and searching with Lucene Facets. You can use these code examples (and the ones that are packaged with Lucene) to start building your faceted search application.
You've saved me from a long coding night.
Thanks for this great article.
Thorsten
Hello
Thank you for this great article.
Is it possible to add simple facets at search-only time or is it necessary to bake them in at index time?
clive
Clive, you need to specify the facets at indexing time. While this requires some planning in advance, this is what makes Lucene facets so fast, because we are able to encode the facets in optimized data structures so that faceted search does (relatively) the easy job!
How can we perform facet searching in Lucene 4.6?
Do you have a downloadable solution for this tutorial?
Facets Search is really complicated... the pronounced "Simple API" is a big lie...
I am new to Lucene. Can you tell me how to exclude the "dim=Author path=[] value=17 childCount=3" part from the result? It would be of great help.
Hey, I want to do a double search. For example, if the search term is "A B", I want to first search for "A", then dynamically create a new index from the results, and then search for "B" only on these new documents...
Is it possible with facets?
Not sure why you want to build a new index with a set of search results. Why not just do an AND query to apply both criteria?
If you really need to build a second index, facets are not going to help with that. You could theoretically do a normal search, then create a temporary index with those search results and apply the second query to that, but it's going to be slow and has no benefit over an AND query that I can think of.