Wednesday, April 23, 2014

Updatable DocValues Under the Hood

We recently committed updatable numeric and binary doc-values fields, which lets you update the value of such fields in a document without re-indexing the whole document. This allows using such fields for e.g. boosting documents relevancy to a query, especially when the boosting factor is frequently changing (like date, ratings, counters etc.). In this post I describe the under-the-hood mechanics of updating those fields.

A brief overview on DocValues

DocValues have been around since Lucene 4.0, and while they have gone through several changes and improvements since then, the basic concept remained the same – they let you record a value per-document per-field in each segment, and allow for very efficient random-access retrieval of those values. Lucene offers several doc-values types: numeric for single-valued numeric fields (date, ratings, counters), sorted/sortedset for single/multi-valued string (mostly, but can be any BytesRef) fields used for sorting, faceting, grouping etc., and binary fields for storing arbitrary byte[] per-document.

Also, Lucene provides several implementations for writing and reading DocValues: Lucene45 (the current default) stores the values on disk, but keeps some information in-memory for fast lookups. Memory loads the values into compressed arrays, while Direct loads the values into uncompressed arrays (faster lookups, but higher memory consumption). If your application runs in a resource constrained environment, you can use Disk which keeps absolutely everything on disk, at the cost of slower lookups.

DocValues record a value for every document in every field and segment even when a document has no value associated with a field. In most cases this is acceptable because you usually store a value for most, if not all, documents in a field (e.g. scoring factors, facets). But even if you index very sparse doc-values fields, the default encoding is very efficient at indexing blocks of "missing" values. Still, if you index hundreds of millions of documents with few dozen doc-values fields, but only a small percentage of them contain a value, writing your own sparse doc-values encoding may be beneficial.

Document deletions are a form of in-place document updates!

The mechanism by which Lucene updates DocValues fields is very similar to how document deletions are handled, so let’s look at how that’s done first. When you issue e.g. a deleteDocument(term) request, Lucene caches the request until an updated SegmentReader is needed, or the index is committed. It then resolves all the issued deletes to determine which documents are affected, both in already flushed segments as well as documents that were in IndexWriter’s RAM buffer when the delete was issued.
Up-to-date SegmentReaders are needed in two cases: when a merge kicks off (SegmentMerger uses SegmentReaders to merge segments) and when the application reopens its NRT (near-real-time) reader.
If an updated reader is needed, Lucene carries the deletes in memory by turning off a bit in the in-memory bit vector of each affected segment, and avoids flushing the bit vector to the index directory, which may reside on a slow storage device (e.g. NFS share). If the application also has an NRT reader open, the bit vector is cloned (because the app’s reader should not be made aware of the new deletes until it’s reopened), leading to double RAM consumption by those vectors. However, once that cost is paid once, it is "fixed" (1 bit/doc, or maxDoc/8 bytes overall), no matter how many documents are later deleted. Also, as soon as the application releases its NRT reader (usually after reopening it), one of the vectors is released and can be reclaimed by the GC.

Updatable DocValues

Like document deletions, when the application issues e.g. an updateNumericDocValue(term, field, value) request, Lucene caches the request until an updated reader is needed or the index is committed, at which point it resolves the updates against all affected documents. Unlike deletes though, doc-values consume much more RAM, which makes carrying the updates in-memory not feasible for many applications: numeric doc-values may potentially consume 8 bytes per-document, while binary values are usually even more expensive.

Therefore, when the updates are resolved, Lucene currently rewrites the entire field that was updated to the index directory. So for example, if you update a numeric doc-values field of a document in a 1 million documents segment, Lucene rewrites 1 million numeric values, or 8MB (at most). Because Lucene segments are write-once, the updated field is written to a gen’d doc-values file of the segment. A reader then reads the values of a field from the respective gen’d doc-values file. While this has a highish cost during indexing, it does not affect search at all, because the values are read from a single file, and often this is a tradeoff that most applications are willing to make.

To illustrate, the code below indexes several documents with "date" and "ratings" NumericDocValuesFields and then lists the files in the index directory:

IndexWriterConfig conf = new IndexWriterConfig(...);
conf.setCodec(new SimpleTextCodec());
IndexWriter writer = new IndexWriter(dir, conf);
ReaderManager manager = new ReaderManager(writer, true);
for (int i = 0; i < 5; i++) {
  Document doc = new Document();
  doc.add(new StringField("id", ""+i, Store.NO));
  doc.add(new NumericDocValuesField("date", i));
  doc.add(new NumericDocValuesField("ratings", i*10));

manager.maybeRefresh(); // triggers a flush
// update the 'date' field's value of document '0'
  new Term("id", "0"), "date", System.currentTimeMillis());
manager.maybeRefresh(); // triggers applying the update
You will see the following print:

Notice how after the "date" field was updated and NRT reader re-opened, the index directory contains two sets of doc-values files: _0.dat stores both the "date" and "ratings" values and _0_1.dat stores just the "date" field’s values (the _1 part in the file name is the doc-values generation for the "date" field). Since the code uses SimpleTextCodec, we can easily print the contents of each file. Below is such print, clipped to show just the header of the files:
file: _0_1.dat
field date
  type NUMERIC

file: _0.dat
field date
  type NUMERIC
field ratings
  type NUMERIC
If we update the value of the "date" field for the same or any other document and then list the doc-values files in the index, we will see:
Notice how _0_1.dat no longer exists in the index - that’s because it only contains values for the "date" field, which was entirely rewritten by the second update (under generation 2), and so _0_1.dat can be deleted.

Stacked Segments

On LUCENE-4258 we’ve started to implement field updates taking a stacked-segments approach. A stacked segment can be thought of as a layer/filter onto an existing segment, which indexes only the updates to that segment. When the full view of the segment is required (e.g. during search), the segment "collapses" all the updates down, so that the most up-to-date view of the segment is presented to the application.

The approach taken to implement updatable doc-values is similar in concept, only the doc-values "stacks" index the values of all documents. If we were to implement the approach taken on LUCENE-4258, then the doc-values stacks would index only the values of the updated documents. While this could have improved the indexing performance (writing a single value ought to be faster than writing 1 million values, no matter how fast your IO system is), it would affect search time since in order to resolve the value of a document, we would need to look for its most up-to-date value in any of the stacks. There are efficient ways to do this, but as usual, each comes with its own tradeoffs.

The key point is that stacked segments have a tradeoff between indexing and search and it is important to measure that tradeoff, as in Lucene, we often prefer to tradeoff in favor of search. I will publish in a separate post some benchmark numbers of the current approach, which will also serve as a baseline for measuring alternative approaches in the future.

Sunday, September 22, 2013

Boosting Documents in Lucene

In Information Retrieval, a document's relevance to a search is measured by how similar it is to the query. There are several similarity models implemented in Lucene, and you can implement your own by extending the Similarity class and using the index statistics that Lucene saves. Documents can also be assigned a static score to denote their importance in the overall corpus, irrespective of the query that is being executed, e.g their popularity, ratings or PageRank.

Prior to Lucene 4.0, you could assign a static score to a document by calling document.setBoost. Internally, the boost was applied to every field of the document, by multiplying the field's boost factor with the document's. However, as explained here, this has never worked correctly and depending on the type of query executed, might not affect the document's rank at all.

With the addition of DocValues to Lucene, boosting documents is as easy as adding a NumericDocValuesField and use it in a CustomScoreQuery, which multiplies the computed score by the value of the 'boost' field. The code example below illustrates how to achieve that:
// add two documents to the index
Document doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 1L));

doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 2L));

// search for 'test' while boosting by field 'boost'
Query baseQuery = new TermQuery(new Term("f", "test"));
Query boostQuery = new FunctionQuery(new LongFieldSource("boost"));
Query q = new CustomScoreQuery(baseQuery, boostQuery);, 10);

The new Expressions module can also be used for boosting documents by writing a simple formula, as depicted below. While it's more verbose than using CustomScoreQuery, it makes boosting by computing more complex formulas trivial, e.g. sqrt(_score) + ln(boost).
Expression expr = JavascriptCompiler.compile("_score * boost");
SimpleBindings bindings = new SimpleBindings();    
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("boost", SortField.Type.LONG));
Sort sort = new Sort(expr.getSortField(bindings, true));, null, 10, sort);

Now that Lucene allows updating NumericDocValuesFields without re-indexing the documents, you can incorporate frequently changing fields (popularity, ratings, price, last-modified time...) in the boosting factor without re-indexing the document every time any one of them changes.

Tuesday, May 14, 2013

The Replicator

No, as much as I want to, I don't (yet) have a true Replicator at my disposal to blog about. Nor does it look like I will have one in the near future, even though scientists are making great progress towards it. I have recently made my contribution to the global replication effort though, by adding an index replication module to Lucene. It does not convert energy to mass, nor mass to mass, so I'm not sure if scientifically it qualifies as a Replicator at all, but if you are into search indexes replication with Lucene, read on!

In computer science, replication is defined as "sharing information so as to ensure consistency between redundant resources ... to improve reliability, fault-tolerance, or accessibility" (Wikipedia). When you design a search engine, especially at large scales, replication is one approach to achieve these. Index replicas can replace primary nodes that become unavailable (e.g due to planned maintenance or severe hardware failures), as well as to support higher query loads by load-balancing search requests across them. Even if you are not building a large-scale search engine, you can use replication to take hot backups of your primary index, while searches are running on it.

Lucene's replication framework implements a primary/backup approach, where the primary process performs all indexing operations (updating the index, merging segments etc.), and the backup/replica processes copy over the changes in an asynchronous manner. The framework consists of few key components that control the replication actions:
  • Replicator mediates between the clients and server. The primary process publishes Revisions while the replica processes update to the most recent revision following their own. When a new Revision is published, the previous one is released (unless it is currently being replicated), so that the resources it consumes can be reclaimed (e.g. remove unneeded files).

  • Revision describes a list of files and their metadata. The Revision is responsible to ensure that the files are available as long as clients copy its files. For example, IndexRevision takes a snapshot on the index using SnapshotDeletionPolicy, to guarantee the files are not deleted until the snapshot is released.

  • ReplicationClient performs the replication operation on the replica side. It first copies the needed files from the primary server (e.g. the files that the replica is missing) and then invokes the ReplicationHandler to act on the copied files. If any errors occur during the copy process, it restarts the replication session. That way, when the handler is invoked, it is guaranteed that all files were copied safely to the local storage.

  • ReplicationHandler acts on the copied files. IndexReplicationHandler copies the files over to the index directory and then syncs them to ensure the files are on stable storage. If any errors occur, it aborts the replication session and cleans up the index directory. Only after successfully copying and syncing the files, it notifies a callback that the index has been updated, so that e.g. the application can refresh its SearcherManager or perform whatever tasks it needs on the updated index.
The replication framework supports replicating any type of files. It offers built-in support for replicating a single index (as described above), as well as an index and taxonomy pair (for faceted search) via IndexAndTaxonomyRevision and IndexAndTaxonomyReplicationHandler. To replicate other types of files, you need to implement Revision and ReplicationHandler. But be advised, implementing a handler properly is ... tricky!

Following example code shows how to use the replicator on the server side:
// publish a new revision on the server
IndexWriter indexWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));
Using the replicator on the client side is a bit more involved but hey, that's where the important stuff happens!
// client can replicate either via LocalReplicator (e.g. for backups)
// or HttpReplicator (e.g. when the server is located on a different
// node).
Replicator replicator;

// refresh SearcherManager after the index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public void call() throws Exception {
    // index was updated, refresh manager

// initialize a matching handler for the published revisions
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);

// where should the client store the files before invoking the handler
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);

ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
client.startUpdateThread(30000); // check for updates every 30 seconds

So What's Next?

The replication framework currently restarts a replication session upon failures. It would be great if it supported resuming a session by e.g. copying only the files that weren't already successfully copied. This can be done by having both server and client e.g. compute a checksum on the files, so the client knows which ones were successfully copied and which weren't. Patches welcome!

Another improvement would be to support resume at the file level, i.e. don't copy parts of the file that were already copied safely. This can be implemented by modifying ReplicationClient to discard the first N bytes of the file it already copied, but would be better if it's supported by the server as well, so that it only sends the required parts. Here too, patches are welcome!

Yet another improvement is to support peer-to-peer replication. Today, the primary server is the bottleneck of the replication process since all clients access it for pulling files. This can be somewhat mitigated by e.g. having a tree network topology, where the primary node is accessed by only a few nodes, which are then accessed by other nodes and so forth. Such topology is however harder to maintain as well as very sensitive to the root node going out of action. In a peer-to-peer replication however, there is no single node from which all clients replicate, but rather each server can broadcast its current revision as well as choose any server that has a newer revision available to replicate from. An example of such implementation over Apache Solr is described here. Hmmm ... did I say patches are welcome?

Lucene has a new replication module. It's only 1 days-old and can already do so much. You are welcome to use it and help us teach it new things!

Sunday, April 28, 2013

Index Sorting with Lucene

When you index documents with Lucene, you often index them in some arbitrary order, usually by a first-come first-served manner. Most applications may not give a second thought about the order of the documents in the index, but some applications could greatly benefit if they could control that. For example, an application which always displays search results sorted by date could benefit if the index kept the documents sorted like that, since then it could traverse exactly the first K documents that match the query, and not process the entire result set to determine the first K.

Application-level Sort

Applications can try to control the order of the documents in the index by e.g. sorting them beforehand. This will work if the index is created once, from a single batch of documents. However, indexes are often created incrementally, where documents are indexed as they are encountered. Therefore, even if the application sorts each batch, the index will quickly get out-of-order as more batches of documents are indexed.
Actually, as long as the application does not use a MergePolicy, each segment will be sorted by itself, and it could process the first K documents from each segment to determine the global top K documents. However, as soon as segments are merged, they will lose their internal sort order.
If the application indexes in multiple threads (which significantly improves indexing performance), then keeping the index and segments sorted becomes an even harder task.

Static Index Sorting

In the past, Lucene offered an IndexSorter utility which allowed you to statically sort an index. It was dropped following the move to Lucene 4.0.0 APIs, but recently brought back as SortingAtomicReader, which exposes a sorted view of the index. It uses a Sorter which returns a permutation on the documents, so when it is asked for document X, it returns document Y, where perm(X)=Y. To sort an index with it, you should pass it to IndexWriter.addIndexes().

Sorting Segments During Merge

Static index sorting is less useful to applications that incrementally build their indexes. As explained above, one challenge is to keep the documents sorted in a segment that is the result of a merge. SortingMergePolicy (introduced in LUCENE-4752) wraps segments that are picked for merge with SortingAtomicReader, and guarantees that merged segments are also sorted.
NOTE: SortingMergePolicy was measured to be slower by x2-3 times than other merge policies, which is not so bad considering the task at hand. If your application's characteristics justify the use of this merge policy, it is important to compare the effects it has on indexing slowdown vs search speedup for your application!

Search Early Termination

Having an incrementally built index kept globally sorted, so that the first K documents you traverse are the ones to return as search results, is very hard and a solution to this is not yet offered by Lucene (patches are welcome!). However, SortingMergePolicy guarantees that merged segments are always sorted, and so we can apply early-termination search logic at each segment level.
EarlyTerminatingSortingCollector (introduced in LUCENE-4858) helps you do that, by early-terminating document collection on sorted segments, while doing full evaluation on unsorted ones (e.g. newly indexed documents or segments in an existing index).
NOTE: in a real-time indexing application, newly flushed segments will usually contain a small number of documents anyway, and therefore fully evaluating them is not likely to be very costly.

Applications for Index Sorting

Besides sorting the index for search early termination purposes, there are other applications to index sorting, for example compression. Sorting the index by documents' size, type or date attributes, can result in better representation of the documents in the index, leading to a smaller index footprint, which can eventually speed up searches too since more parts of the index can fit into RAM. I'm sure that Lucene users will find even more useful applications to index sorting, so stay tuned!

Sunday, January 20, 2013

Lucene Facets just got faster!

Recently, Mike McCandless had some fun with Lucene Facets, and came up with interesting performance numbers and some ideas on how facets can be improved, by e.g. utilizing DocValues to store per-document category ordinals, instead of the payload. This triggered a series of improvements and API changes, which resulted (so far) in 2x speedups to faceted search!

From Facets nightly benchmark

The biggest contribution to the spike came from cutting over facets to DocValues (LUCENE-4602), which got us nearly 70% improvements over term payloads. Not only are DocValues more efficient than payloads, they also let you cache their information in-memory, for faster access, which we now take advantage by default during faceted search.

Additional improvements came from specializing the decode logic (LUCENE-4686), as well as moving various components that take part during faceted search from an iterator-style API to a bulk-read (decode, aggregate) API (LUCENE-4620). The latter got us an additional ~20% improvements, which is amazing given that decoding and aggregation logic weren't changed - it was all just different software design. This is an important lesson for "hot" code - sometimes in order to make it efficient, you need to let go of some Object Oriented best practices!

While that's a major step forward, there's more to come. There's already work going on (LUCENE-4600 and LUCENE-4610), which will bring even more improvements for the common faceted search use cases (facet counting, no associations, no partitions etc.). Also, I plan to address two points that Mike raised in his blog: (1) making TotalFacetCounts (aka complements) segment-aware (to make it more NRT-friendly and reduce its reloading cost), and (2) experiment with a more efficient category ordinals cache, to reduce its RAM consumption as well as loading speed. There's also fun work going on here, which explores an alternative encoding for category ordinals. So stay tuned for more updates!

NOTE: cutting over facets to DocValues breaks existing indexes. You need to either rebuild them, or use FacetsPayloadMigrationReader to do a one-time migration of the payload information to DocValues.

Thursday, January 3, 2013

Facet Associations

Suppose that you work with an automatic categorization system, which given a document and some metadata, outputs Topic categories with confidence level. For example, an article about the Apache Lucene project might be categorized with Topic/Apache Software Foundation (0.34), Topic/Apache Lucene (0.95) and Topic/Information Retrieval (0.84). An article about Apache Nutch might be categorized with Topic/Apache Software Foundation (0.22), Topic/Apache Nutch (0.93) and Topic/Distributed Crawler (0.87). If you index the articles with those facets, and ignore the confidence level during facet aggregation, you will get Apache Software Foundation as the top category (with count=2). This might give the user a false impression as if the result set focuses mainly on the Apache Software Foundation topic, while the documents don't discuss general ASF issues at all. However, if you take the confidence level into account, you will get Apache Lucene and Apache Nutch as the top categories, while Apache Software Foundation would come last.

Facet Associations

Lucene Facets let you index categories with confidence level very easily, by assigning a CategoryAssociation with each category. The following short code snippet demonstrates how you would index the first article's categories:
FacetFields facetFields = new AssociationsFacetFields(taxoWriter);

// first article's categories with confidence level
CategoryAssociationsContainer article1 = new CategoryAssociationsContainer();
  new CategoryPath("Topic", "Apache Software Foundation"),
  new CategoryFloatAssociation(0.34f));
  new CategoryPath("Topic", "Apache Lucene"), 
  new CategoryFloatAssociation(0.95f));
  new CategoryPath("Topic", "Information Retrieval"), 
  new CategoryFloatAssociation(0.84f));

Document doc = new Document();

// add the facets to the document
facetFields.addFields(doc, associations);

// index the document
Let's take a closer look at the code:
  • CategoryAssociationsContainer holds a mapping from a category to its association.
  • AssociationsFacetFields adds the needed fields (drill-down terms and category list payload) to the document, along with the associations values.
  • Finally, the document is indexed with IndexWriter.
Quite simple, ha? After you have indexed both documents like so, you can compute the top categories by summing their confidence level, as demonstrated in the following code:
CategoryPath cp = new CategoryPath("Topic");
FacetRequest topic = new AssociationFloatSumFacetRequest(cp, 10);
FacetSearchParams fsp = new FacetSearchParams(topic);
FacetsCollector fc = new FacetsCollector(fsp, indexReader, taxoReader); MatchAllDocsQuery(), fc);
If you print the top categories (using e.g. the code snippet from here), you will get the following output. Note how the top categories are now those with the higher confidence level, rather than those that appear in more documents:
Topic (0.0)
  Apache Lucene (0.95)
  Apache Nutch (0.93)
  Distributed Crawler (0.87)
  Information Retrieval (0.84)
  Apache Software Foundation (0.56)
Lucene provides two CategoryAssociation implementations for integer and float values, as well as two matching FacetRequests which set the weight of a category as the sum of its association values (AssociationFloatSumFacetRequest used in the above code sample). You can extend facet associations by either implementing a CategoryAssociation (and matching FacetRequest), or implement a FacetRequest which computes a different function over the integer/float association values.

Monday, December 10, 2012

Lucene Facets Under The Hood

In my previous posts I gave an overview of Lucene Facets and showed how to write a simple faceted search program. This time I go through some of the internals of the facets module, in particular the structure of the taxonomy and search index as well as how faceted search is performed.

The Taxonomy Index

The job of the taxonomy is to map hierarchical categories to ordinals (integers), and provide taxonomy browsing services. The taxonomy is managed through a sidecar Lucene index through DirectoryTaxonomyWriter and DirectoryTaxonomyReader. Often, I refer to the taxonomy as the "taxonomy index".

When categories are added to the taxonomy, they are broken down to their path elements and an ordinal is assigned to every level in the path. In addition, the taxonomy index guarantees that every node in the taxonomy will have a smaller ordinal than all its children, and also that the node that was added last at level N will have a greater ordinal than all its existing siblings. For example, if you add the category Date/2012/March/20 to an empty taxonomy index, the following entries will be added:
  • Date, with ordinal=1
  • Date/2012, with ordinal=2
  • Date/2012/March, with ordinal=3
  • Date/2012/March/20, with ordinal=4
You can then use the taxonomy to retrieve the ordinal of Date/2012 (and vice versa), its child categories (Date/2012/March) as well as its siblings (categories at the same level in the tree; none in this example).

The taxonomy maintains the tree information in three parallel arrays:
  • Parents: parents[i] denotes the parent of category ordinal i. Note that all root categories (e.g. Author, Date and so forth), often called "dimensions", are put under a virtual root which always has the ordinal 0. Therefore parents[0]=-1 (always), and for any i≠0, parents[i]≠-1. Also, parents[i]=0 for all root categories.
  • Children: children[i] denotes the ordinal of the youngest child of category ordinal i. The youngest child is defined as the one that was added last to the taxonomy (and has a larger ordinal than all its siblings). children[i]=-1 for all leaf nodes.
  • Siblings: siblings[i] denotes an ordinal j, such that j<i and parents[i]=parents[j] (i.e. the two categories are children of the same parent). j denotes the category ordinal that was the former "youngest child" of the parent ordinal. siblings[0] is always -1, and the first child of a parent will always have siblings[i]=-1.
The following diagram gives a visual representation of the information above. Note how the parents, children and siblings structures allow easy and efficient traversal of the taxonomy tree, as well as updating when new categories are added.

The Search Index

When categories are added to a document, they are added as drill-down terms to the search index. In addition, their ordinals are encoded in a special term's payload (called the "fulltree posting"). For example, if you add the category Date/2012/March/20 to a document, then four drill-down terms are added, one for each level, e.g. $facets:Date, $facets:Date/2012 and so forth. Additionally, all four category ordinals are encoded in the fulltree posting's payload, and are used during faceted search to aggregate the respective categories.

Search Time

Now that we understand the role of the taxonomy index and the data structures it provides for traversing the taxonomy, as well as the structure of the facets in the search index, we can now look at how faceted search is executed. As mentioned in my previous post, every faceted search consists of a Query which matches indexed documents and one or more FacetRequests for facets aggregation (one instance per requested root). The aggregation is done by FacetsCollector, from which you obtain a FacetResult per FacetRequest.

FacetsCollector computes the top categories in three steps. First, all documents that match the query are marked in a bitset (and depends on the facet request, their scores may also be stored in memory). Then it traverses the fulltree posting, fetches the ordinals of every matching document and aggregates them in an in-memory array, where every entry holds the aggregated weight of that ordinal. After all documents have been visited and their categories aggregated, we use the children[] and siblings[] information from the taxonomy to traverse the children of the root category which was specified by the request, and compute the top categories among them. In the end, the taxonomy index is used to label the top categories.

When you create a FacetRequest, you can set the depth in the taxonomy for which it aggregates categories. The default depth=1 means that top categories are computed among the immediate children of root. An interesting capability is to compute the top categories recursively (in one search) up to depth=N. For example, if you create a FacetRequest over the Date root, and call facetRequest.setDepth(2) as well as facetRequest.setResultMode(ResultMode.PER_NODE_IN_TREE), you will retrieve the top categories of Date, as well as each of their top children, as depicted below:
Date (123)
  2010 (34)
    October  (18)
    March    (8)
  2008 (23)
    November (12)
    October  (7)


Global Ordinals
The global ordinals that are encoded in the search index allow very efficient per-segment faceting and aggregation, as all operations are done on ordinals (integers). If the ordinals were stored per-segment, i.e. the category Date/2010 would receive different ordinals in different segments, computing the top categories would need to be done on the category string, or hold some in-memory data structures that e.g. map local ordinals to global ones - that would have affected the scalability of the facets significantly.

Global ordinals do have a downside though. When you want to merge two search indexes which have respective taxonomies, their ordinals need to be translated. That means that you cannot use the efficient IndexWriter.addIndexes(Directory...) API to merge the indexes, but rather the .addIndexes(IndexReader...) expensive API, in order to translate the ordinals that are encoded in the source search index, so they match the target one.

Two-Phase Aggregation
The two-phase aggregation done by FacetsCollector may seem unnecessary and expensive. Mike McCandless, who has recently added faceted search to Lucene's nightly benchmark, started exploring doing the aggregation in one step, as documents are collected. The two-phase aggregation does have some strengths though. For example, it allows to compute facet weights on a sample set of the results, rather than over all documents. That is an important capability for large indexes, since the cost of doing faceted search is proportional to the number of documents that matched the query.

The two-phase aggregation is also useful when you want to compute the facet weights on the complement set of the matching documents (i.e. those that did not match the query). This is quite difficult to do when facets are aggregated as documents are collected. First, you can only aggregate the facets of the complement set retroactively, i.e., only after seeing docs 1 and 17 you know that docs 2-16 did not match the query. But, when out of order query matching is done, you can tell the complement set only after all documents have been collected! Second, complements aggregation is useful only when the number of results is more than 50% of the documents in the index, but that is also unknown until the query matching phase is done.

Lucene facets are very powerful and they come with a multitude of extension points to help you customize and optimize them to your application needs. I'm sure that now that facets are part of the nightly benchmark, they will see a lot of improvements, both to their API and more importantly, their performance. Stay tuned for updates!