Computers/Software/Information Retrieval/Fulltext/Apache Lucene.
Faceted search gives the user a quick overview of the break down of the search results (e.g., how many documents are in category
Year/2011). It can also be used to simplify the user's interaction with the search application. For example, the screenshot below was taken from a book search application, which uses faceted search to let the user narrow his search by clicking facets of interest. As you can see, the query returned 46,392 books, out of which 5,772 were published in 2005. To narrow the search to books that were published only in that year, the user only needs to click the "2005" link, and does not need to know how the books were indexed (i.e. the index field which contains the publication date information), nor the application's search syntax.
In order to narrow the search to books that were published in May, 2005, the user needs to click the "2005" link, followed by the "May" link.
Lucene FacetsFaceted search was added to Lucene in LUCENE-3079. It contains a very rich and extensive userguide which describes its capabilities and provides code examples for using it. Below, I summarize at a high-level some of the capabilities that the package supports:
TaxonomyThe facets module manages a taxonomy. The taxonomy is discovered and built on-the-fly as categories are indexed for documents. It is used to write the needed data in the search index, in order to later execute faceted search efficiently.
Indexing and search helper classes
CategoryDocumentBuilder can be used to easily index categories for documents. It receives a list of categories for a document and then adds them to the taxonomy as well as the needed fields to the document so that it can later support narrowing down a search by a facet value, as well as be used for facet aggregation.
FacetsCollector can be used to easily execute faceted search and retrieve the top categories for a query. It receives a list of category roots to aggregate and returns the top categories for each requested root. The root can be any node in the taxonomy tree (e.g. "count all facets under
/date/2011"), and top categories can be computed for the immediate children of root, or any in its sub-tree.
More than just counting
Faceted search is executed by assigning weights to categories that are associated with matching documents. Traditionally, the weight of a category denotes its number of occurrences in the result set (hence 'counting'). Lucene facets, however, allow you to compute any weight for categories.
For instance, the function
sum(#relevance) will set a category's weight to the sum of relevance scores of the documents that matched the query. This function returns as the top categories the ones that are cumulatively associated with higher ranking documents, while the 'counting' function may return categories that are associated with documents that are less relevant to the query (and hence were ranked lower). The screenshot below shows the
Publication Date facet, when categories are weighted by
Notice how the facet
Publication Date/2005, which exists in 5,772 result documents is now second in the list, with cumulative relevance weight of 1482.55 where the facet
Publication Date/2006, which exists in 4,995 result documents (~15% less) is now first in the list, with cumulative relevance weight of 1525.92. An interesting side effect from weighting the facets like so, is that we can now observe that books that were written in 2006 are more relevant to this query (which we couldn't tell when we only counted facet occurrences).
When the number of results is high (e.g. millions), aggregating the facets of every document may be expensive. Instead, you can aggregate the facets of a sample of these documents, and compute the top categories on that sample only. You can also request to get an exact aggregation of these top categories.
When the number of search results is larger than 50% of the number of documents in the index, it may be cheaper to aggregate the facets of the complement results set (i.e. those that didn't match the query). To do that, we keep a
TotalFacetCounts which maintains total counts for requested facets (i.e. their total count over all documents), and during search, we count the complement set of the matching documents. The complement counts are then subtracted from the total counts, to compute the top categories.
NOTE that I use the term 'count' and not 'aggregation'. That is because not every aggregation function can support facet complements counting, e.g.
Near-real-time (NRT) search was added to Lucene in version 2.9. It eliminated the need to perform the expensive commit operation in order for readers to pick up the index changes. With NRT, you can choose to commit only when you would like to ensure that the index data resides on stable storage, and use the NRT capabilities in order to expose index changes to the readers. Lucene NRT is discussed in many blogs and web pages; you can read about it here, here and here.
Faceted search recently became NRT (to be released with Lucene 4.1). This means that you no longer need to commit changes to the search and taxonomy indexes, but rather reopen your
TaxonomyReader, just like you would reopen an
IndexReader. If you are a Lucene facets user, you can give it a try by using a 4.1-SNAPSHOT version. Don't forget to report back any issues that you find!
One of the requirements to being NRT is for the reopen task to do work that is proportional to the amount of new data that was indexed. I.e., if your index already has 100 million documents that are visible to an
IndexReader, and in the latest session you indexed only 10,000 documents, then reopening that reader should only read data relevant to these new documents. Lucene facets support NRT in two ways: (1) when you reopen the taxonomy index, only new categories are read, and (2) faceted search does not use any global data structures in the search index, and therefore support NRT as well.