Saturday, November 24, 2012

Lucene Facets, Part 1

Faceted search, also called Faceted Navigation, is a technique for accessing documents that were classified into a taxonomy of categories. An example of a taxonomy is the Open Directory Project (ODP), which is an open source project aimed at building a catalog for web pages. For example, Lucene is classified in the ODP under Computers/Software/Information Retrieval/Fulltext/Apache Lucene.

Faceted search gives the user a quick overview of the break down of the search results (e.g., how many documents are in category Year/2012 vs. Year/2011). It can also be used to simplify the user's interaction with the search application. For example, the screenshot below was taken from a book search application, which uses faceted search to let the user narrow his search by clicking facets of interest. As you can see, the query returned 46,392 books, out of which 5,772 were published in 2005. To narrow the search to books that were published only in that year, the user only needs to click the "2005" link, and does not need to know how the books were indexed (i.e. the index field which contains the publication date information), nor the application's search syntax.


In order to narrow the search to books that were published in May, 2005, the user needs to click the "2005" link, followed by the "May" link.

Lucene Facets

Faceted search was added to Lucene in LUCENE-3079. It contains a very rich and extensive userguide which describes its capabilities and provides code examples for using it. Below, I summarize at a high-level some of the capabilities that the package supports:

Taxonomy

The facets module manages a taxonomy. The taxonomy is discovered and built on-the-fly as categories are indexed for documents. It is used to write the needed data in the search index, in order to later execute faceted search efficiently.

Indexing and search helper classes

FacetFields can be used to easily index categories for documents. It receives a list of categories for a document and then adds them to the taxonomy as well as the needed fields to the document so that it can later support narrowing down a search by a facet value, as well as be used for facet aggregation.

FacetsCollector can be used to easily execute faceted search and retrieve the top categories for a query. It receives a list of category roots to aggregate and returns the top categories for each requested root. The root can be any node in the taxonomy tree (e.g. "count all facets under /date/2011"), and top categories can be computed for the immediate children of root, or any in its sub-tree.

More than just counting

Faceted search is executed by assigning weights to categories that are associated with matching documents. Traditionally, the weight of a category denotes its number of occurrences in the result set (hence 'counting'). Lucene facets, however, allow you to compute any weight for categories.

For instance, the function sum(#relevance) will set a category's weight to the sum of relevance scores of the documents that matched the query. This function returns as the top categories the ones that are cumulatively associated with higher ranking documents, while the 'counting' function may return categories that are associated with documents that are less relevant to the query (and hence were ranked lower). The screenshot below shows the Publication Date facet, when categories are weighted by sum(#relevance):


Notice how the facet Publication Date/2005, which exists in 5,772 result documents is now second in the list, with cumulative relevance weight of 1482.55 where the facet Publication Date/2006, which exists in 4,995 result documents (~15% less) is now first in the list, with cumulative relevance weight of 1525.92. An interesting side effect from weighting the facets like so, is that we can now observe that books that were written in 2006 are more relevant to this query (which we couldn't tell when we only counted facet occurrences).

Sampled aggregation

When the number of results is high (e.g. millions), aggregating the facets of every document may be expensive. Instead, you can aggregate the facets of a sample of these documents, and compute the top categories on that sample only. You can also request to get an exact aggregation of these top categories.

Complements aggregation

When the number of search results is larger than 50% of the number of documents in the index, it may be cheaper to aggregate the facets of the complement results set (i.e. those that didn't match the query). To do that, we keep a TotalFacetCounts which maintains total counts for requested facets (i.e. their total count over all documents), and during search, we count the complement set of the matching documents. The complement counts are then subtracted from the total counts, to compute the top categories.

NOTE that I use the term 'count' and not 'aggregation'. That is because not every aggregation function can support facet complements counting, e.g. sum(#relevance).

Near-real-time faceting

Near-real-time (NRT) search was added to Lucene in version 2.9. It eliminated the need to perform the expensive commit operation in order for readers to pick up the index changes. With NRT, you can choose to commit only when you would like to ensure that the index data resides on stable storage, and use the NRT capabilities in order to expose index changes to the readers. Lucene NRT is discussed in many blogs and web pages; you can read about it here, here and here.

Faceted search recently became NRT (to be released with Lucene 4.1). This means that you no longer need to commit changes to the search and taxonomy indexes, but rather reopen your TaxonomyReader, just like you would reopen an IndexReader. If you are a Lucene facets user, you can give it a try by using a 4.1-SNAPSHOT version. Don't forget to report back any issues that you find!

One of the requirements to being NRT is for the reopen task to do work that is proportional to the amount of new data that was indexed. I.e., if your index already has 100 million documents that are visible to an IndexReader, and in the latest session you indexed only 10,000 documents, then reopening that reader should only read data relevant to these new documents. Lucene facets support NRT in two ways: (1) when you reopen the taxonomy index, only new categories are read, and (2) faceted search does not use any global data structures in the search index, and therefore support NRT as well.

8 comments:

  1. Very nice stuff!

    Is there a way to quickly get the Facet counts without doing a search? For indexes that contain over 100M documents it would be nice to show the initial facets; but it is very expensive on a MatchAllDocumentsQuery()...

    ReplyDelete
  2. The 'complements aggregation' feature is implemented by maintaining an in-memory structure of the counts of all categories in the index. You could use it to compute that overview on all facets. Currently, that's how you can achieve that:

    {code}
    ScoredDocIDs docIDs = ScoredDocIdsUtils.createAllDocsScoredDocIDs(reader);
    StandardFacetsAccumulator sfa = new StandardFacetsAccumulator(...);
    sfa.setComplementThreshold(StandardFacetsAccumulator.FORCE_COMPLEMENT);
    List results = sfa.accumulate(docIDs);
    {code}

    ReplyDelete
  3. This comment has been removed by the author.

    ReplyDelete
  4. Hi Shai,
    I think the following phrase:
    "Near-real-time (NRT) search was added to Lucene in version 2.9. It eliminated the need to perform the expensive commit operation in order for readers to pick up the index changes. "

    may suggest to some readers that NRTReader is reading from the same segment IndexWriter is writing to. (Someone sited your article as a proof of it :) )

    If I am not wrong, Every reopen of NRT reader commits a segment to a RAMDirectory(NRTCachingDirectory), although this is not a durable commit.

    ReplyDelete
  5. I don't see how this sentence may suggest that, but that's why I didn't go into great details about NRT and put pointers to 3 other blogs which discuss it more. Also, when you reopen an NRT reader a segment is flushed to the Directory you are using. It can be a RAMDirectory or FSDirectory. If you are using NRTCachingDirectory, then the segment is flushed to a private RAMDirectory, but that's not the default behavior - only when you are using NRTCachingDirectory. The important thing about NRT is that it doesn't "commit" the changes to the Directory, which avoids the expensive fsync call on every file. A segment is still flushed to the Directory though, and the cost of that flush depends on the Directory you are using.

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. How can we change the traditional counting for a facet with our custom function aggregation in LUCENE 4.4?

    ReplyDelete
  8. Hi,
    sometimes it could be useful to set a alphabethic order on a particular facet, for example 'Publisher'.
    is it possible? how?

    thanks,
    Max

    ReplyDelete