Sunday, September 22, 2013

Boosting Documents in Lucene

In Information Retrieval, a document's relevance to a search is measured by how similar it is to the query. There are several similarity models implemented in Lucene, and you can implement your own by extending the Similarity class and using the index statistics that Lucene saves. Documents can also be assigned a static score to denote their importance in the overall corpus, irrespective of the query that is being executed, e.g. their popularity, ratings or PageRank.

Prior to Lucene 4.0, you could assign a static score to a document by calling document.setBoost. Internally, the boost was applied to every field of the document, by multiplying the field's boost factor with the document's. However, as explained here, this never worked correctly, and depending on the type of query executed, it might not affect the document's rank at all.

With the addition of DocValues to Lucene, boosting documents is as easy as adding a NumericDocValuesField and using it in a CustomScoreQuery, which multiplies the computed score by the value of the 'boost' field. The code example below illustrates how to achieve that:
// add two documents to the index
Document doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 1L));
writer.addDocument(doc);

doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 2L));
writer.addDocument(doc);

// search for 'test' while boosting by field 'boost'
Query baseQuery = new TermQuery(new Term("f", "test"));
Query boostQuery = new FunctionQuery(new LongFieldSource("boost"));
Query q = new CustomScoreQuery(baseQuery, boostQuery);
searcher.search(q, 10);

The new Expressions module can also be used for boosting documents by writing a simple formula, as depicted below. While it's more verbose than using CustomScoreQuery, it makes boosting by more complex formulas trivial, e.g. sqrt(_score) + ln(boost).
Expression expr = JavascriptCompiler.compile("_score * boost");
SimpleBindings bindings = new SimpleBindings();    
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("boost", SortField.Type.LONG));
Sort sort = new Sort(expr.getSortField(bindings, true));
searcher.search(baseQuery, null, 10, sort);
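For example, the sqrt(_score) + ln(boost) formula mentioned above compiles the same way; a quick sketch that reuses the bindings from the snippet above:
Expression complexExpr = JavascriptCompiler.compile("sqrt(_score) + ln(boost)");
Sort complexSort = new Sort(complexExpr.getSortField(bindings, true));
searcher.search(baseQuery, null, 10, complexSort);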

Now that Lucene allows updating NumericDocValuesFields without re-indexing the documents, you can incorporate frequently changing fields (popularity, ratings, price, last-modified time...) in the boosting factor without re-indexing the document every time any one of them changes.
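As a sketch, assuming the IndexWriter.updateNumericDocValue API and a hypothetical unique-key field named 'id' (not part of the example above), bumping a document's boost looks roughly like this:
// update the 'boost' doc-values field of an existing document without re-indexing it
writer.updateNumericDocValue(new Term("id", "doc-1"), "boost", 3L);
writer.commit();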

Tuesday, May 14, 2013

The Replicator

No, as much as I want to, I don't (yet) have a true Replicator at my disposal to blog about. Nor does it look like I will have one in the near future, even though scientists are making great progress towards it. I have recently made my contribution to the global replication effort though, by adding an index replication module to Lucene. It does not convert energy to mass, nor mass to mass, so I'm not sure if scientifically it qualifies as a Replicator at all, but if you are into search indexes replication with Lucene, read on!

In computer science, replication is defined as "sharing information so as to ensure consistency between redundant resources ... to improve reliability, fault-tolerance, or accessibility" (Wikipedia). When you design a search engine, especially at large scale, replication is one approach to achieving these goals. Index replicas can replace primary nodes that become unavailable (e.g. due to planned maintenance or severe hardware failures), as well as support higher query loads by load-balancing search requests across them. Even if you are not building a large-scale search engine, you can use replication to take hot backups of your primary index while searches are running on it.

Lucene's replication framework implements a primary/backup approach, where the primary process performs all indexing operations (updating the index, merging segments etc.), and the backup/replica processes copy over the changes in an asynchronous manner. The framework consists of a few key components that control the replication actions:
  • Replicator mediates between the clients and server. The primary process publishes Revisions while the replica processes update to the most recent revision following their own. When a new Revision is published, the previous one is released (unless it is currently being replicated), so that the resources it consumes can be reclaimed (e.g. remove unneeded files).

  • Revision describes a list of files and their metadata. The Revision is responsible for ensuring that the files remain available as long as clients are copying them. For example, IndexRevision takes a snapshot of the index using SnapshotDeletionPolicy, to guarantee that the files are not deleted until the snapshot is released.

  • ReplicationClient performs the replication operation on the replica side. It first copies the needed files from the primary server (e.g. the files that the replica is missing) and then invokes the ReplicationHandler to act on the copied files. If any errors occur during the copy process, it restarts the replication session. That way, when the handler is invoked, it is guaranteed that all files were copied safely to the local storage.

  • ReplicationHandler acts on the copied files. IndexReplicationHandler copies the files over to the index directory and then syncs them to ensure they are on stable storage. If any errors occur, it aborts the replication session and cleans up the index directory. Only after successfully copying and syncing the files does it notify a callback that the index has been updated, so that e.g. the application can refresh its SearcherManager or perform whatever tasks it needs on the updated index.
The replication framework supports replicating any type of files. It offers built-in support for replicating a single index (as described above), as well as an index and taxonomy pair (for faceted search) via IndexAndTaxonomyRevision and IndexAndTaxonomyReplicationHandler. To replicate other types of files, you need to implement Revision and ReplicationHandler. But be advised, implementing a handler properly is ... tricky!

The following example code shows how to use the replicator on the server side:
// publish a new revision on the server
IndexWriter indexWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));
Using the replicator on the client side is a bit more involved but hey, that's where the important stuff happens!
// client can replicate either via LocalReplicator (e.g. for backups)
// or HttpReplicator (e.g. when the server is located on a different
// node).
Replicator replicator;

// refresh SearcherManager after the index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public Boolean call() throws Exception {
    // index was updated, refresh manager
    searcherManager.maybeRefresh();
    return true;
  }
};

// initialize a matching handler for the published revisions
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);

// where should the client store the files before invoking the handler
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);

ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
client.startUpdateThread(30000); // check for updates every 30 seconds

So What's Next?

The replication framework currently restarts a replication session upon failures. It would be great if it supported resuming a session by e.g. copying only the files that weren't already successfully copied. This can be done by having both server and client e.g. compute a checksum on the files, so the client knows which ones were successfully copied and which weren't. Patches welcome!

Another improvement would be to support resume at the file level, i.e. don't copy parts of a file that were already copied safely. This can be implemented by modifying ReplicationClient to discard the first N bytes of a file it has already partially copied, but it would be better if the server supported it as well, so that it only sends the required parts. Here too, patches are welcome!

Yet another improvement is to support peer-to-peer replication. Today, the primary server is the bottleneck of the replication process since all clients access it for pulling files. This can be somewhat mitigated by e.g. having a tree network topology, where the primary node is accessed by only a few nodes, which are then accessed by other nodes and so forth. Such a topology is, however, harder to maintain and very sensitive to the root node going out of action. With peer-to-peer replication, however, there is no single node from which all clients replicate; rather, each server can broadcast its current revision and choose any server that has a newer revision available to replicate from. An example of such implementation over Apache Solr is described here. Hmmm ... did I say patches are welcome?

Lucene has a new replication module. It's only one day old and can already do so much. You are welcome to use it and help us teach it new things!

Sunday, April 28, 2013

Index Sorting with Lucene

When you index documents with Lucene, you often index them in some arbitrary order, usually in a first-come first-served manner. Most applications may not give a second thought to the order of the documents in the index, but some applications could greatly benefit if they could control it. For example, an application which always displays search results sorted by date could benefit if the index kept the documents sorted like that, since then it could traverse exactly the first K documents that match the query, instead of processing the entire result set to determine the first K.

Application-level Sort

Applications can try to control the order of the documents in the index by e.g. sorting them beforehand. This will work if the index is created once, from a single batch of documents. However, indexes are often created incrementally, where documents are indexed as they are encountered. Therefore, even if the application sorts each batch, the index will quickly get out-of-order as more batches of documents are indexed.
Actually, as long as segments are never merged (e.g. if the application uses NoMergePolicy), each segment will remain sorted by itself, and the application could process the first K documents from each segment to determine the global top K documents. However, as soon as segments are merged, they lose their internal sort order.
If the application indexes in multiple threads (which significantly improves indexing performance), then keeping the index and segments sorted becomes an even harder task.

Static Index Sorting

In the past, Lucene offered an IndexSorter utility which allowed you to statically sort an index. It was dropped following the move to Lucene 4.0.0 APIs, but recently brought back as SortingAtomicReader, which exposes a sorted view of the index. It uses a Sorter which returns a permutation on the documents, so when it is asked for document X, it returns document Y, where perm(X)=Y. To sort an index with it, you should pass it to IndexWriter.addIndexes().
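As a rough sketch (assuming the 4.3-era SortingAtomicReader.wrap and NumericDocValuesSorter signatures, and a numeric doc-values field named 'date' to sort by), statically sorting an existing index could look like this:
// open the unsorted index as a single (slow) atomic view
Directory srcDir = FSDirectory.open(new File("/path/to/unsorted-index"));
Directory destDir = FSDirectory.open(new File("/path/to/sorted-index"));
AtomicReader unsorted = SlowCompositeReaderWrapper.wrap(DirectoryReader.open(srcDir));
// wrap it with a sorted view, ordered by the 'date' NumericDocValues field
AtomicReader sorted = SortingAtomicReader.wrap(unsorted, new NumericDocValuesSorter("date"));
// write the sorted view out to a new index
IndexWriter sortWriter = new IndexWriter(destDir,
  new IndexWriterConfig(Version.LUCENE_43, new KeywordAnalyzer()));
sortWriter.addIndexes(sorted);
sortWriter.close();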

Sorting Segments During Merge

Static index sorting is less useful to applications that incrementally build their indexes. As explained above, one challenge is to keep the documents sorted in a segment that is the result of a merge. SortingMergePolicy (introduced in LUCENE-4752) wraps segments that are picked for merge with SortingAtomicReader, and guarantees that merged segments are also sorted.
NOTE: SortingMergePolicy was measured to be 2-3x slower than other merge policies, which is not so bad considering the task at hand. If your application's characteristics justify the use of this merge policy, it is important to compare the effects it has on indexing slowdown vs search speedup for your application!
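Wiring it in is just a matter of wrapping whatever merge policy you already use; a sketch, assuming the LUCENE-4752 constructor that takes the wrapped policy and a Sorter, and reusing dir and analyzer from your indexing code:
IndexWriterConfig conf = new IndexWriterConfig(Version.LUCENE_43, analyzer);
// wrap the default merge policy so that merged segments come out sorted by 'date'
conf.setMergePolicy(new SortingMergePolicy(conf.getMergePolicy(), new NumericDocValuesSorter("date")));
IndexWriter writer = new IndexWriter(dir, conf);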

Search Early Termination

Keeping an incrementally built index globally sorted, so that the first K documents you traverse are the ones to return as search results, is very hard, and Lucene does not yet offer a solution for it (patches are welcome!). However, SortingMergePolicy guarantees that merged segments are always sorted, and so we can apply early-termination search logic at each segment level.
EarlyTerminatingSortingCollector (introduced in LUCENE-4858) helps you do that, by early-terminating document collection on sorted segments, while doing full evaluation on unsorted ones (e.g. newly indexed documents or segments in an existing index).
NOTE: in a real-time indexing application, newly flushed segments will usually contain a small number of documents anyway, and therefore fully evaluating them is not likely to be very costly.
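A sketch of its usage, assuming the LUCENE-4858 constructor that takes the wrapped collector, the Sorter the segments are sorted by, and the number of documents to collect per segment:
Sort dateSort = new Sort(new SortField("date", SortField.Type.LONG));
TopFieldCollector topDocs = TopFieldCollector.create(dateSort, 10, true, false, false, false);
// early-terminate collection on segments known to be sorted by 'date',
// while unsorted segments are still fully evaluated
searcher.search(query, new EarlyTerminatingSortingCollector(topDocs, new NumericDocValuesSorter("date"), 10));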

Applications for Index Sorting

Besides sorting the index for search early-termination purposes, there are other applications of index sorting, for example compression. Sorting the index by documents' size, type or date attributes can result in a better representation of the documents in the index, leading to a smaller index footprint, which can eventually speed up searches too since more parts of the index fit into RAM. I'm sure that Lucene users will find even more useful applications of index sorting, so stay tuned!

Sunday, January 20, 2013

Lucene Facets just got faster!

Recently, Mike McCandless had some fun with Lucene Facets, and came up with interesting performance numbers and some ideas on how facets can be improved, by e.g. utilizing DocValues to store per-document category ordinals, instead of the payload. This triggered a series of improvements and API changes, which resulted (so far) in 2x speedups to faceted search!

From Facets nightly benchmark

The biggest contribution to the spike came from cutting over facets to DocValues (LUCENE-4602), which got us nearly 70% improvement over term payloads. Not only are DocValues more efficient than payloads, they also let you cache their information in memory for faster access, which we now take advantage of by default during faceted search.

Additional improvements came from specializing the decode logic (LUCENE-4686), as well as moving various components that take part in faceted search from an iterator-style API to a bulk-read (decode, aggregate) API (LUCENE-4620). The latter got us an additional ~20% improvement, which is amazing given that the decoding and aggregation logic wasn't changed - it was all just different software design. This is an important lesson for "hot" code - sometimes in order to make it efficient, you need to let go of some Object Oriented best practices!

While that's a major step forward, there's more to come. There's already work going on (LUCENE-4600 and LUCENE-4610), which will bring even more improvements for the common faceted search use cases (facet counting, no associations, no partitions etc.). Also, I plan to address two points that Mike raised in his blog: (1) making TotalFacetCounts (aka complements) segment-aware (to make it more NRT-friendly and reduce its reloading cost), and (2) experimenting with a more efficient category ordinals cache, to reduce its RAM consumption as well as its loading cost. There's also fun work going on here, which explores an alternative encoding for category ordinals. So stay tuned for more updates!

NOTE: cutting over facets to DocValues breaks existing indexes. You need to either rebuild them, or use FacetsPayloadMigrationReader to do a one-time migration of the payload information to DocValues.

Thursday, January 3, 2013

Facet Associations

Suppose that you work with an automatic categorization system, which given a document and some metadata, outputs Topic categories with confidence levels. For example, an article about the Apache Lucene project might be categorized with Topic/Apache Software Foundation (0.34), Topic/Apache Lucene (0.95) and Topic/Information Retrieval (0.84). An article about Apache Nutch might be categorized with Topic/Apache Software Foundation (0.22), Topic/Apache Nutch (0.93) and Topic/Distributed Crawler (0.87). If you index the articles with those facets, and ignore the confidence level during facet aggregation, you will get Apache Software Foundation as the top category (with count=2). This might give the user a false impression that the result set focuses mainly on the Apache Software Foundation topic, while the documents don't discuss general ASF issues at all. However, if you take the confidence levels into account, you will get Apache Lucene and Apache Nutch as the top categories, while Apache Software Foundation comes last.

Facet Associations

Lucene Facets let you index categories with confidence level very easily, by assigning a CategoryAssociation with each category. The following short code snippet demonstrates how you would index the first article's categories:
FacetFields facetFields = new AssociationsFacetFields(taxoWriter);

// first article's categories with confidence level
CategoryAssociationsContainer article1 = new CategoryAssociationsContainer();
article1.setAssociation(
  new CategoryPath("Topic", "Apache Software Foundation"),
  new CategoryFloatAssociation(0.34f));
article1.setAssociation(
  new CategoryPath("Topic", "Apache Lucene"), 
  new CategoryFloatAssociation(0.95f));
article1.setAssociation(
  new CategoryPath("Topic", "Information Retrieval"), 
  new CategoryFloatAssociation(0.84f));

Document doc = new Document();

// add the facets to the document
facetFields.addFields(doc, article1);

// index the document
indexWriter.addDocument(doc);
Let's take a closer look at the code:
  • CategoryAssociationsContainer holds a mapping from a category to its association.
  • AssociationsFacetFields adds the needed fields (drill-down terms and category list payload) to the document, along with the association values.
  • Finally, the document is indexed with IndexWriter.
Quite simple, huh? After you have indexed both documents like so, you can compute the top categories by summing their confidence levels, as demonstrated in the following code:
CategoryPath cp = new CategoryPath("Topic");
FacetRequest topic = new AssociationFloatSumFacetRequest(cp, 10);
FacetSearchParams fsp = new FacetSearchParams(topic);
FacetsCollector fc = new FacetsCollector(fsp, indexReader, taxoReader);
searcher.search(new MatchAllDocsQuery(), fc);
If you print the top categories (using e.g. the code snippet from here), you will get the following output. Note how the top categories are now those with the higher confidence level, rather than those that appear in more documents:
Topic (0.0)
  Apache Lucene (0.95)
  Apache Nutch (0.93)
  Distributed Crawler (0.87)
  Information Retrieval (0.84)
  Apache Software Foundation (0.56)
Lucene provides two CategoryAssociation implementations for integer and float values, as well as two matching FacetRequests which set the weight of a category as the sum of its association values (AssociationFloatSumFacetRequest used in the above code sample). You can extend facet associations by either implementing a CategoryAssociation (and matching FacetRequest), or implement a FacetRequest which computes a different function over the integer/float association values.
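For completeness, the integer flavor follows the same pattern; a sketch using the class names from the associations package (CategoryIntAssociation and AssociationIntSumFacetRequest):
// associate an integer value (e.g. a vote count) with a category
article1.setAssociation(
  new CategoryPath("Topic", "Apache Lucene"),
  new CategoryIntAssociation(5));

// and request the sum of the integer associations at search time
FacetRequest intSum = new AssociationIntSumFacetRequest(new CategoryPath("Topic"), 10);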

Monday, December 10, 2012

Lucene Facets Under The Hood

In my previous posts I gave an overview of Lucene Facets and showed how to write a simple faceted search program. This time I go through some of the internals of the facets module, in particular the structure of the taxonomy and search index as well as how faceted search is performed.

The Taxonomy Index

The job of the taxonomy is to map hierarchical categories to ordinals (integers), and to provide taxonomy browsing services. The taxonomy is managed as a sidecar Lucene index, through DirectoryTaxonomyWriter and DirectoryTaxonomyReader. Often, I refer to the taxonomy as the "taxonomy index".

When categories are added to the taxonomy, they are broken down into their path elements and an ordinal is assigned to every level in the path. In addition, the taxonomy index guarantees that every node in the taxonomy will have a smaller ordinal than all its children, and also that the node that was added last at level N will have a greater ordinal than all its existing siblings. For example, if you add the category Date/2012/March/20 to an empty taxonomy index, the following entries will be added:
  • Date, with ordinal=1
  • Date/2012, with ordinal=2
  • Date/2012/March, with ordinal=3
  • Date/2012/March/20, with ordinal=4
You can then use the taxonomy to retrieve the ordinal of Date/2012 (and vice versa), its child categories (Date/2012/March) as well as its siblings (categories at the same level in the tree; none in this example).
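For example, here is a minimal sketch of adding that category and then resolving ordinals and paths through the taxonomy reader (taxoDir is the Directory that holds the taxonomy index):
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
int ord = taxoWriter.addCategory(new CategoryPath("Date", "2012", "March", "20")); // returns 4
taxoWriter.commit();

DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir);
int dateOrd = taxoReader.getOrdinal(new CategoryPath("Date", "2012")); // returns 2
CategoryPath path = taxoReader.getPath(ord);                           // Date/2012/March/20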

The taxonomy maintains the tree information in three parallel arrays:
  • Parents: parents[i] denotes the parent of category ordinal i. Note that all root categories (e.g. Author, Date and so forth), often called "dimensions", are put under a virtual root which always has the ordinal 0. Therefore parents[0]=-1 (always), and for any i≠0, parents[i]≠-1. Also, parents[i]=0 for all root categories.
  • Children: children[i] denotes the ordinal of the youngest child of category ordinal i. The youngest child is defined as the one that was added last to the taxonomy (and has a larger ordinal than all its siblings). children[i]=-1 for all leaf nodes.
  • Siblings: siblings[i] denotes an ordinal j, such that j<i and parents[i]=parents[j] (i.e. the two categories are children of the same parent). j denotes the category ordinal that was the former "youngest child" of the parent ordinal. siblings[0] is always -1, and the first child of a parent will always have siblings[i]=-1.
The following diagram gives a visual representation of the information above. Note how the parents, children and siblings structures allow easy and efficient traversal of the taxonomy tree, as well as updating when new categories are added.
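As a sketch of how these arrays are accessed (assuming the ParallelTaxonomyArrays API and the taxoReader from the snippet above), iterating the children of Date looks roughly like this:
ParallelTaxonomyArrays arrays = taxoReader.getParallelTaxonomyArrays();
int[] children = arrays.children();
int[] siblings = arrays.siblings();
int dateOrdinal = taxoReader.getOrdinal(new CategoryPath("Date"));
// start at the youngest child of 'Date' and walk the siblings chain back to the oldest
for (int child = children[dateOrdinal]; child != TaxonomyReader.INVALID_ORDINAL; child = siblings[child]) {
  System.out.println(taxoReader.getPath(child));
}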

The Search Index

When categories are added to a document, they are added as drill-down terms to the search index. In addition, their ordinals are encoded in a special term's payload (called the "fulltree posting"). For example, if you add the category Date/2012/March/20 to a document, then four drill-down terms are added, one for each level, e.g. $facets:Date, $facets:Date/2012 and so forth. Additionally, all four category ordinals are encoded in the fulltree posting's payload, and are used during faceted search to aggregate the respective categories.

Search Time

Now that we understand the role of the taxonomy index and the data structures it provides for traversing the taxonomy, as well as the structure of the facets in the search index, we can look at how faceted search is executed. As mentioned in my previous post, every faceted search consists of a Query which matches indexed documents and one or more FacetRequests for facet aggregation (one instance per requested root). The aggregation is done by FacetsCollector, from which you obtain a FacetResult per FacetRequest.

FacetsCollector computes the top categories in three steps. First, all documents that match the query are marked in a bitset (and depending on the facet request, their scores may also be stored in memory). Then it traverses the fulltree posting, fetches the ordinals of every matching document and aggregates them in an in-memory array, where every entry holds the aggregated weight of that ordinal. After all documents have been visited and their categories aggregated, we use the children[] and siblings[] information from the taxonomy to traverse the children of the root category which was specified by the request, and compute the top categories among them. In the end, the taxonomy index is used to label the top categories.

When you create a FacetRequest, you can set the depth in the taxonomy for which it aggregates categories. The default depth=1 means that top categories are computed among the immediate children of root. An interesting capability is to compute the top categories recursively (in one search) up to depth=N. For example, if you create a FacetRequest over the Date root, and call facetRequest.setDepth(2) as well as facetRequest.setResultMode(ResultMode.PER_NODE_IN_TREE), you will retrieve the top categories of Date, as well as each of their top children, as depicted below:
Date (123)
  2010 (34)
    October  (18)
    March    (8)
  2008 (23)
    November (12)
    October  (7)
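The request behind that output could be set up roughly like this (a sketch; setDepth and setResultMode are the calls mentioned above, and the rest follows the code from the earlier posts):
CountFacetRequest dateRequest = new CountFacetRequest(new CategoryPath("Date"), 10);
dateRequest.setDepth(2);                                // aggregate two levels below 'Date'
dateRequest.setResultMode(ResultMode.PER_NODE_IN_TREE); // return top children per node
FacetSearchParams fsp = new FacetSearchParams(dateRequest);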

Discussion

Global Ordinals
The global ordinals that are encoded in the search index allow very efficient per-segment faceting and aggregation, as all operations are done on ordinals (integers). If ordinals were stored per-segment, i.e. the category Date/2010 would receive different ordinals in different segments, computing the top categories would need to be done on the category strings, or would require in-memory data structures that e.g. map local ordinals to global ones - and that would have significantly affected the scalability of the facets.

Global ordinals do have a downside though. When you want to merge two search indexes which have respective taxonomies, their ordinals need to be translated. That means that you cannot use the efficient IndexWriter.addIndexes(Directory...) API to merge the indexes; instead you must use the more expensive addIndexes(IndexReader...) API, so that the ordinals encoded in the source search index are translated to match the target one.

Two-Phase Aggregation
The two-phase aggregation done by FacetsCollector may seem unnecessary and expensive. Mike McCandless, who has recently added faceted search to Lucene's nightly benchmark, started exploring doing the aggregation in one step, as documents are collected. The two-phase aggregation does have some strengths though. For example, it allows computing facet weights on a sample of the result set, rather than over all matching documents. That is an important capability for large indexes, since the cost of faceted search is proportional to the number of documents that matched the query.

The two-phase aggregation is also useful when you want to compute the facet weights on the complement set of the matching documents (i.e. those that did not match the query). This is quite difficult to do when facets are aggregated as documents are collected. First, you can only aggregate the facets of the complement set retroactively, i.e. only after seeing docs 1 and 17 do you know that docs 2-16 did not match the query; and when out-of-order query matching is done, you can determine the complement set only after all documents have been collected! Second, complements aggregation is useful only when the number of results is more than 50% of the documents in the index, but that too is unknown until the query matching phase is done.

Lucene facets are very powerful and they come with a multitude of extension points to help you customize and optimize them to your application needs. I'm sure that now that facets are part of the nightly benchmark, they will see a lot of improvements, both to their API and more importantly, their performance. Stay tuned for updates!

Wednesday, November 28, 2012

Lucene Facets, Part 2

In my last post I gave a quick introduction to Lucene's facets; this time I'll drill down into some working code, by writing a very simple faceted search program which demonstrates the basic principles of Lucene Facets. The program indexes the two (fantastic !) Lucene books: Lucene in Action and Lucene in Action, Second Edition. For each book it adds Author and Pub Date facets. Since I would like to focus on the aspects of indexing and searching with facets, and in order to keep the examples simple, I index only the facets for each book.
Lucene Facets are packaged with example code that you can use to start building your faceted search application. Examples are provided for simple as well as advanced scenarios (e.g. using sampling and complements).
First, let's define the facets of each book:
List<CategoryPath> book1 = new ArrayList<CategoryPath>();
book1.add(new CategoryPath("Author", "Erik Hatcher"));
book1.add(new CategoryPath("Author", "Otis Gospodnetić"));
book1.add(new CategoryPath("Pub Date", "2004", "December", "1"));

List<CategoryPath> book2 = new ArrayList<CategoryPath>();
book2.add(new CategoryPath("Author", "Michael McCandless"));
book2.add(new CategoryPath("Author", "Erik Hatcher"));
book2.add(new CategoryPath("Author", "Otis Gospodnetić"));
book2.add(new CategoryPath("Pub Date", "2010", "July", "28"));

Note how each category is initialized as a CategoryPath, which holds the category hierarchy. It can be initialized with the category path components passed separately to the constructor (as in the example above), or by passing the full hierarchy string with a delimiter, e.g. new CategoryPath("Author/Erik Hatcher", '/').

Facets Indexing

Next, we need to initialize some components that are required for indexing the books and their facets. I previously mentioned that Lucene manages a hierarchical taxonomy of categories. That management is done by DirectoryTaxonomyWriter, which is responsible for adding categories to the taxonomy. Let's take a look at the following code:
Directory indexDir = new RAMDirectory();
Directory taxoDir = new RAMDirectory();
IndexWriter indexWriter = new IndexWriter(indexDir,
  new IndexWriterConfig(Version.LUCENE_50, new KeywordAnalyzer()));
DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir);
FacetFields facetFields = new FacetFields(taxoWriter);
  • Note that the taxonomy and search index use their own Directory instances. This is currently mandatory and cannot be avoided.
  • FacetFields is the helper class which takes care of adding the categories to the taxonomy (via DirectoryTaxonomyWriter), as well as adding the needed fields to the search index. It is usually used either before or after you add the normal search fields to the Document.
We can now index the books. The following code indexes the categories of a book, and is executed for each book:
Document bookDoc = new Document();

// now you will normally add all the fields that your application
// needs to index / store, e.g. title, content, price etc.

// add the categories to the taxonomy and the needed fields to the document
facetFields.addFields(bookDoc, bookCategories);
indexWriter.addDocument(bookDoc);

Faceted Search

In order to execute faceted search, we need to initialize some search components:
DirectoryReader indexr = DirectoryReader.open(indexWriter, false);
IndexSearcher searcher = new IndexSearcher(indexr);
DirectoryTaxonomyReader taxor = new DirectoryTaxonomyReader(taxoWriter);
DirectoryReader and IndexSearcher are needed for executing queries on the search index. DirectoryTaxonomyReader is used to fetch data from the taxonomy. We open the readers on their respective writers in order to get NRT behavior, which allows the readers to see the changes made by the writers without those changes being committed first (notice that the code doesn't call commit()).

Faceted search aggregates the requested facets over all documents that match a query, via FacetsCollector. You first define a FacetRequest per root node for which you want to aggregate top categories. The root node can be any node in the taxonomy tree, e.g. Pub Date/2004. You can also specify how deep in the taxonomy tree the aggregation should go. I.e., level=1 (the default) means that you are interested in aggregating the immediate children of the root node, while level=2 means that you want the top categories of the immediate children of the root, as well as each of their top categories. This is a powerful capability of Lucene Facets, which allows you to do recursive top-categories aggregations.

The following code initializes FacetSearchParams to aggregate the top 10 immediate categories of the Author and Pub Date facets, and finally executes a MatchAllDocsQuery to aggregate facets on all indexed documents (books):
FacetSearchParams fsp = new FacetSearchParams();
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Author"), 10));
fsp.addFacetRequest(new CountFacetRequest(new CategoryPath("Pub Date"), 10));
FacetsCollector facetsCollector = FacetsCollector.create(fsp, indexr, taxor);
searcher.search(new MatchAllDocsQuery(), facetsCollector);
NOTE: in a normal search you will usually execute a different Query, e.g. one that was parsed from the user's request, and wrap FacetsCollector together with e.g. a TopDocsCollector in a MultiCollector, to retrieve both the top-ranking documents for the query and the top categories for the entire result set.
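A sketch of that combination (userQuery stands for whatever query was parsed from the user's request):
TopScoreDocCollector topDocsCollector = TopScoreDocCollector.create(10, true);
// collect the top-10 hits and aggregate the facets in a single pass over the matching documents
searcher.search(userQuery, MultiCollector.wrap(topDocsCollector, facetsCollector));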
We can print the top categories and their weight using the following code:
for (FacetResult fres : facetsCollector.getFacetResults()) {
  FacetResultNode root = fres.getFacetResultNode();
  // the weights are doubles, so print them with %s (e.g. "2.0")
  System.out.println(String.format("%s (%s)", root.label, root.value));
  for (FacetResultNode cat : root.getSubResults()) {
    System.out.println(String.format("  %s (%s)", cat.label.components[1],
                       cat.value));
  }
}
Let's take a moment to review the code. FacetsCollector returns a list of FacetResult, with one item in the list per requested facet. FacetResult exposes a tree-like API and the root node corresponds to the one that was specified in the request. The root will have as many children as were requested (or fewer, as in this case). The traversal starts by getting the root and then its children (which denote the top categories for this request).

The code prints the label and weight of each category that it visits. Note that for the root node, the weight denotes the aggregated weight of all of its children (even those that did not make it to the top list). This can quickly tell you how many documents in the result set are not associated with the root node at all. If you execute the code, you will get the following print:
Author (2.0)
  Otis Gospodnetić (2.0)
  Erik Hatcher (2.0)
  Michael McCandless (1.0)
Pub Date (2.0)
  2010 (1.0)
  2004 (1.0)
NOTE: the label of each child node is actually the full path from the root, e.g. Author/Erik Hatcher. For brevity, I excluded the level of the root from the print (see the use of label.components[1] above).

Drill-down / Narrowing on facet

One of the operations that users will probably want to do with the returned facets is to use them as a filter on the query, to narrow down the result set. This can be easily achieved with the DrillDownQuery helper class, as follows:
Query base = new MatchAllDocsQuery();
DrillDownQuery ddq = new DrillDownQuery(FacetIndexingParams.DEFAULT, base);
ddq.add(new CategoryPath("Author", "Michael McCandless"));
This code returns all documents that matched the original query and are associated with the category Author/Michael McCandless. If we execute that query by calling searcher.search(ddq, 10) (i.e., return the top 10 documents), we'll receive only the second book, since only it is associated with that category.

That's it ! You now know the basics of indexing and searching with Lucene Facets. You can use these code examples (and the ones that are packaged with Lucene) to start building your faceted search application.