Monday, December 10, 2012

Lucene Facets Under The Hood

In my previous posts I gave an overview of Lucene Facets and showed how to write a simple faceted search program. This time I go through some of the internals of the facets module, in particular the structure of the taxonomy and search index as well as how faceted search is performed.

The Taxonomy Index

The job of the taxonomy is to map hierarchical categories to ordinals (integers), and provide taxonomy browsing services. The taxonomy is managed through a sidecar Lucene index through DirectoryTaxonomyWriter and DirectoryTaxonomyReader. Often, I refer to the taxonomy as the "taxonomy index".

When categories are added to the taxonomy, they are broken down to their path elements and an ordinal is assigned to every level in the path. In addition, the taxonomy index guarantees that every node in the taxonomy will have a smaller ordinal than all its children, and also that the node that was added last at level N will have a greater ordinal than all its existing siblings. For example, if you add the category Date/2012/March/20 to an empty taxonomy index, the following entries will be added:
  • Date, with ordinal=1
  • Date/2012, with ordinal=2
  • Date/2012/March, with ordinal=3
  • Date/2012/March/20, with ordinal=4
You can then use the taxonomy to retrieve the ordinal of Date/2012 (and vice versa), its child categories (Date/2012/March) as well as its siblings (categories at the same level in the tree; none in this example).

The taxonomy maintains the tree information in three parallel arrays:
  • Parents: parents[i] denotes the parent of category ordinal i. Note that all root categories (e.g. Author, Date and so forth), often called "dimensions", are put under a virtual root which always has the ordinal 0. Therefore parents[0]=-1 (always), and for any i≠0, parents[i]≠-1. Also, parents[i]=0 for all root categories.
  • Children: children[i] denotes the ordinal of the youngest child of category ordinal i. The youngest child is defined as the one that was added last to the taxonomy (and has a larger ordinal than all its siblings). children[i]=-1 for all leaf nodes.
  • Siblings: siblings[i] denotes an ordinal j, such that j<i and parents[i]=parents[j] (i.e. the two categories are children of the same parent). j denotes the category ordinal that was the former "youngest child" of the parent ordinal. siblings[0] is always -1, and the first child of a parent will always have siblings[i]=-1.
The following diagram gives a visual representation of the information above. Note how the parents, children and siblings structures allow easy and efficient traversal of the taxonomy tree, as well as updating when new categories are added.

The Search Index

When categories are added to a document, they are added as drill-down terms to the search index. In addition, their ordinals are encoded in a special term's payload (called the "fulltree posting"). For example, if you add the category Date/2012/March/20 to a document, then four drill-down terms are added, one for each level, e.g. $facets:Date, $facets:Date/2012 and so forth. Additionally, all four category ordinals are encoded in the fulltree posting's payload, and are used during faceted search to aggregate the respective categories.

Search Time

Now that we understand the role of the taxonomy index and the data structures it provides for traversing the taxonomy, as well as the structure of the facets in the search index, we can now look at how faceted search is executed. As mentioned in my previous post, every faceted search consists of a Query which matches indexed documents and one or more FacetRequests for facets aggregation (one instance per requested root). The aggregation is done by FacetsCollector, from which you obtain a FacetResult per FacetRequest.

FacetsCollector computes the top categories in three steps. First, all documents that match the query are marked in a bitset (and depends on the facet request, their scores may also be stored in memory). Then it traverses the fulltree posting, fetches the ordinals of every matching document and aggregates them in an in-memory array, where every entry holds the aggregated weight of that ordinal. After all documents have been visited and their categories aggregated, we use the children[] and siblings[] information from the taxonomy to traverse the children of the root category which was specified by the request, and compute the top categories among them. In the end, the taxonomy index is used to label the top categories.

When you create a FacetRequest, you can set the depth in the taxonomy for which it aggregates categories. The default depth=1 means that top categories are computed among the immediate children of root. An interesting capability is to compute the top categories recursively (in one search) up to depth=N. For example, if you create a FacetRequest over the Date root, and call facetRequest.setDepth(2) as well as facetRequest.setResultMode(ResultMode.PER_NODE_IN_TREE), you will retrieve the top categories of Date, as well as each of their top children, as depicted below:
Date (123)
  2010 (34)
    October  (18)
    March    (8)
  2008 (23)
    November (12)
    October  (7)

Discussion

Global Ordinals
The global ordinals that are encoded in the search index allow very efficient per-segment faceting and aggregation, as all operations are done on ordinals (integers). If the ordinals were stored per-segment, i.e. the category Date/2010 would receive different ordinals in different segments, computing the top categories would need to be done on the category string, or hold some in-memory data structures that e.g. map local ordinals to global ones - that would have affected the scalability of the facets significantly.

Global ordinals do have a downside though. When you want to merge two search indexes which have respective taxonomies, their ordinals need to be translated. That means that you cannot use the efficient IndexWriter.addIndexes(Directory...) API to merge the indexes, but rather the .addIndexes(IndexReader...) expensive API, in order to translate the ordinals that are encoded in the source search index, so they match the target one.

Two-Phase Aggregation
The two-phase aggregation done by FacetsCollector may seem unnecessary and expensive. Mike McCandless, who has recently added faceted search to Lucene's nightly benchmark, started exploring doing the aggregation in one step, as documents are collected. The two-phase aggregation does have some strengths though. For example, it allows to compute facet weights on a sample set of the results, rather than over all documents. That is an important capability for large indexes, since the cost of doing faceted search is proportional to the number of documents that matched the query.

The two-phase aggregation is also useful when you want to compute the facet weights on the complement set of the matching documents (i.e. those that did not match the query). This is quite difficult to do when facets are aggregated as documents are collected. First, you can only aggregate the facets of the complement set retroactively, i.e., only after seeing docs 1 and 17 you know that docs 2-16 did not match the query. But, when out of order query matching is done, you can tell the complement set only after all documents have been collected! Second, complements aggregation is useful only when the number of results is more than 50% of the documents in the index, but that is also unknown until the query matching phase is done.

Lucene facets are very powerful and they come with a multitude of extension points to help you customize and optimize them to your application needs. I'm sure that now that facets are part of the nightly benchmark, they will see a lot of improvements, both to their API and more importantly, their performance. Stay tuned for updates!

Wednesday, November 28, 2012

Lucene Facets, Part 2

In my last post I gave a quick introduction to Lucene's facets; this time I'll drill down into some working code, by writing a very simple faceted search program which demonstrates the basic principles of Lucene Facets. The program indexes the two (fantastic !) Lucene books: Lucene in Action and Lucene in Action, Second Edition. For each book it adds an Author and Pub Date facets. Since I would like to focus on the aspects of indexing and searching with facets, and in order to keep the examples simple, I index only facets for each book.
Lucene Facets are packaged with example code that you can use to start building your faceted search application. Examples are provided for simple as well as advanced scenarios (e.g. using sampling and complements).
First, let's define the facets of each book:
List<CategoryPath> book1 = new ArrayList<CategoryPath>();
book1.add(new CategoryPath("Author", "Erik Hatcher"));
book1.add(new CategoryPath("Author", "Otis Gospodnetić"));
book1.add(new CategoryPath("Pub Date", "2004", "December", "1"));

List<CategoryPath> book2 = new ArrayList<CategoryPath>();
book2.add(new CategoryPath("Author", "Michael McCandless"));
book2.add(new CategoryPath("Author", "Erik Hatcher"));
book2.add(new CategoryPath("Author", "Otis Gospodnetić"));
book2.add(new CategoryPath("Pub Date", "2010", "July", "28"));

Note how each category is initialized as a CategoryPath, which holds the category hierarchy. It can be initialized with the category path components passed separately to the constructor (as in the example above), or by passing the full hierarchy string with a delimiter, e.g. new CategoryPath("Author/Erik Hatcher", '/').

Facets Indexing

Next, we need to initialize some components that are required for indexing the books and their facets. I previously mentioned that Lucene manages a hierarchical taxonomy of categories. That management is done by DirectoryTaxonomyWriter, which is responsible for adding categories to the taxonomy. Let's take a look at the following code:
Directory indexDir = new RAMDirectory();
Directory taxoDir = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(indexDir,
  new IndexWriterConfig(Version.LUCENE_50, new KeywordAnalyzer()));
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxDir);
FacetFields facetFields = new FacetFields(taxoWriter);
  • Note that the taxonomy and search index use their own Directory instances. This is currently mandatory and cannot be avoided.
  • FacetFields is the helper class which takes care of adding the categories to the taxonomy (via DirectoryTaxonomyWriter), as well as adding the needed fields to the search index. It is usually used either before or after you add the normal search fields to the Document.
We can now index the books. The following code indexes the categories of a book, and is executed for each book:
Document bookDoc = new Document();

// now you will normally add all the fields that your application
// needs to index / store, e.g. title, content, price etc.

// add the categories to the taxonomy and the needed fields to the document
facetFields.addFields(bookDoc, bookCategories);
indexWriter.addDocument(bookDoc);

Faceted Search

In order to execute faceted search, we need to initialize some search components:
DirectoryReader indexr = DirectoryReader.open(indexWriter, false);
IndexSearcher searcher = new IndexSearcher(indexr);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
DirectoryReader and IndexSearcher are needed for executing queries on the search index. DirectoryTaxonomyReader is used to fetch data from the taxonomy. We open the readers on their respective writers in order to get NRT behavior, which allows the readers to view the changes made by the writers, without those changes committed first (notice that the code doesn't call commit()).

Faceted search aggregates requested facets on all documents that match a query via FacetsCollector. You first define a FacetRequest per root node for which you are interested to aggregate top categories. The root node can be any node in the taxonomy tree, e.g. Pub Date/2004. You can also specify the lowest level in the taxonomy tree for which you would like to receive the aggregations. I.e., level=1 (default) means that you are interested in aggregating the immediate children of the root node, while level=2 means that you are interested to get the top categories of the immediate children of root, as well as each of their top categories. This is a powerful capability of Lucene Facets, which allows you to do recursive top categories aggregations.

The following code initializes FacetSearchParams to aggregate the top 10 immediate categories of the Author and Pub Date facets, and finally executes a MatchAllDocsQuery to aggregate facets on all indexed documents (books):
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));
FacetsCollector facetsCollector = FacetsCollector.create(fsp, indexr, taxor);
searcher.search(new MatchAllDocsQuery(), facetsCollector);
NOTE: in a normal search you will usually execute a different Query, e.g. one that was parsed from the user's request, and wrap FacetsCollector and e.g. TopDocsCollector with MultiCollector, to retrieve both the top ranking documents to the query, as well as the top categories for the entire result set.
We can print the top categories and their weight using the following code:
for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  System.out.println(String.format("%s (%d)", root.label, root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    System.out.println(String.format("  %s (%d)", cat.label.components[1], 
                       cat.value));
  }
}
Let's take a moment to review the code. FacetsCollector returns a list of FacetResult, and there is one item in the list per requested facet. FacetResult exposes a tree-like API and the root node corresponds to the one that was specified in the request. The root will have as many children as were requested in the request (or less, as in this case). The traversal starts by getting the root and then its children (which denote the top categories for this request).

The code prints the label and weight of each category that it visits. Note that for the root node, the weight denotes the aggregated weight of all of its children (even those that did not make it to the top list). This can quickly tell you how many documents in the result set are not associated with the root node at all. If you execute the code, you will get the following print:
Author (2.0)
  Otis Gospodnetić (2.0)
  Erik Hatcher (2.0)
  Michael McCandless (1.0)
Pub Date (2.0)
  2010 (1.0)
  2004 (1.0)
NOTE: the label of each child node is actually the full path from the root, e.g. Author/Erik Hatcher. For brevity, I excluded the level of the root from the print (see the call to .getComponent(1)).

Drill-down / Narrowing on facet

One of the operations that users will probably want to do with the returned facets, is to use them as a filter to the query, to narrow down the result set. This can be easily achieved with the DrillDown helper class, as follows:
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(FacetIndexingParams.DEFAULT, base);
ddq.add(new CategoryPath("Author", "Michael McCandless"));
This code returns all documents that matched the original query and are associated with the category Author/Michael McCandless. If we execute that query by calling searcher.search(q, 10) (i.e., return the top 10 documents), we'll receive only the second book, since only it is associated with that category.

That's it ! You now know the basics of indexing and searching with Lucene Facets. You can use these code examples (and the ones that are packaged with Lucene) to start building your faceted search application.

Saturday, November 24, 2012

Lucene Facets, Part 1

Faceted search, also called Faceted Navigation, is a technique for accessing documents that were classified into a taxonomy of categories. An example of a taxonomy is the Open Directory Project (ODP), which is an open source project aimed at building a catalog for web pages. For example, Lucene is classified in the ODP under Computers/Software/Information Retrieval/Fulltext/Apache Lucene.

Faceted search gives the user a quick overview of the break down of the search results (e.g., how many documents are in category Year/2012 vs. Year/2011). It can also be used to simplify the user's interaction with the search application. For example, the screenshot below was taken from a book search application, which uses faceted search to let the user narrow his search by clicking facets of interest. As you can see, the query returned 46,392 books, out of which 5,772 were published in 2005. To narrow the search to books that were published only in that year, the user only needs to click the "2005" link, and does not need to know how the books were indexed (i.e. the index field which contains the publication date information), nor the application's search syntax.


In order to narrow the search to books that were published in May, 2005, the user needs to click the "2005" link, followed by the "May" link.

Lucene Facets

Faceted search was added to Lucene in LUCENE-3079. It contains a very rich and extensive userguide which describes its capabilities and provides code examples for using it. Below, I summarize at a high-level some of the capabilities that the package supports:

Taxonomy

The facets module manages a taxonomy. The taxonomy is discovered and built on-the-fly as categories are indexed for documents. It is used to write the needed data in the search index, in order to later execute faceted search efficiently.

Indexing and search helper classes

FacetFields can be used to easily index categories for documents. It receives a list of categories for a document and then adds them to the taxonomy as well as the needed fields to the document so that it can later support narrowing down a search by a facet value, as well as be used for facet aggregation.

FacetsCollector can be used to easily execute faceted search and retrieve the top categories for a query. It receives a list of category roots to aggregate and returns the top categories for each requested root. The root can be any node in the taxonomy tree (e.g. "count all facets under /date/2011"), and top categories can be computed for the immediate children of root, or any in its sub-tree.

More than just counting

Faceted search is executed by assigning weights to categories that are associated with matching documents. Traditionally, the weight of a category denotes its number of occurrences in the result set (hence 'counting'). Lucene facets, however, allow you to compute any weight for categories.

For instance, the function sum(#relevance) will set a category's weight to the sum of relevance scores of the documents that matched the query. This function returns as the top categories the ones that are cumulatively associated with higher ranking documents, while the 'counting' function may return categories that are associated with documents that are less relevant to the query (and hence were ranked lower). The screenshot below shows the Publication Date facet, when categories are weighted by sum(#relevance):


Notice how the facet Publication Date/2005, which exists in 5,772 result documents is now second in the list, with cumulative relevance weight of 1482.55 where the facet Publication Date/2006, which exists in 4,995 result documents (~15% less) is now first in the list, with cumulative relevance weight of 1525.92. An interesting side effect from weighting the facets like so, is that we can now observe that books that were written in 2006 are more relevant to this query (which we couldn't tell when we only counted facet occurrences).

Sampled aggregation

When the number of results is high (e.g. millions), aggregating the facets of every document may be expensive. Instead, you can aggregate the facets of a sample of these documents, and compute the top categories on that sample only. You can also request to get an exact aggregation of these top categories.

Complements aggregation

When the number of search results is larger than 50% of the number of documents in the index, it may be cheaper to aggregate the facets of the complement results set (i.e. those that didn't match the query). To do that, we keep a TotalFacetCounts which maintains total counts for requested facets (i.e. their total count over all documents), and during search, we count the complement set of the matching documents. The complement counts are then subtracted from the total counts, to compute the top categories.

NOTE that I use the term 'count' and not 'aggregation'. That is because not every aggregation function can support facet complements counting, e.g. sum(#relevance).

Near-real-time faceting

Near-real-time (NRT) search was added to Lucene in version 2.9. It eliminated the need to perform the expensive commit operation in order for readers to pick up the index changes. With NRT, you can choose to commit only when you would like to ensure that the index data resides on stable storage, and use the NRT capabilities in order to expose index changes to the readers. Lucene NRT is discussed in many blogs and web pages; you can read about it here, here and here.

Faceted search recently became NRT (to be released with Lucene 4.1). This means that you no longer need to commit changes to the search and taxonomy indexes, but rather reopen your TaxonomyReader, just like you would reopen an IndexReader. If you are a Lucene facets user, you can give it a try by using a 4.1-SNAPSHOT version. Don't forget to report back any issues that you find!

One of the requirements to being NRT is for the reopen task to do work that is proportional to the amount of new data that was indexed. I.e., if your index already has 100 million documents that are visible to an IndexReader, and in the latest session you indexed only 10,000 documents, then reopening that reader should only read data relevant to these new documents. Lucene facets support NRT in two ways: (1) when you reopen the taxonomy index, only new categories are read, and (2) faceted search does not use any global data structures in the search index, and therefore support NRT as well.