The Taxonomy Index
The job of the taxonomy is to map hierarchical categories to ordinals (integers) and to provide taxonomy browsing services. The taxonomy is managed through a sidecar Lucene index, via DirectoryTaxonomyWriter and DirectoryTaxonomyReader. Often, I refer to the taxonomy as the "taxonomy index".
When categories are added to the taxonomy, they are broken down into their path elements and an ordinal is assigned to every level in the path. In addition, the taxonomy index guarantees that every node in the taxonomy will have a smaller ordinal than all its children, and also that the node that was added last at level N will have a greater ordinal than all its existing siblings. For example, if you add the category Date/2012/March/20 to an empty taxonomy index, the following entries will be added:
- Date, with ordinal=1
- Date/2012, with ordinal=2
- Date/2012/March, with ordinal=3
- Date/2012/March/20, with ordinal=4
The taxonomy index also lets you look up the ordinal of a category such as Date/2012 (and vice versa), and retrieve its child categories (Date/2012/March) as well as its siblings (categories at the same level in the tree; none in this example).
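To make this concrete, here is a minimal sketch of adding that category and reading back the ordinals. It assumes a recent Lucene (5.x or later), where a category path is expressed as a FacetLabel (older versions used CategoryPath); the class name and directory path are mine, not from the post.

import java.nio.file.Paths;

import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TaxonomyOrdinalsDemo {
  public static void main(String[] args) throws Exception {
    Directory taxoDir = FSDirectory.open(Paths.get("/tmp/taxo-demo")); // hypothetical location

    // Adding the deepest path also adds every ancestor, parents before children.
    try (DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir)) {
      taxoWriter.addCategory(new FacetLabel("Date", "2012", "March", "20"));
      taxoWriter.commit();
    }

    // Each level of the path received its own ordinal (1..4 in an empty taxonomy).
    try (DirectoryTaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir)) {
      System.out.println(taxoReader.getOrdinal(new FacetLabel("Date")));                        // 1
      System.out.println(taxoReader.getOrdinal(new FacetLabel("Date", "2012")));                // 2
      System.out.println(taxoReader.getOrdinal(new FacetLabel("Date", "2012", "March")));       // 3
      System.out.println(taxoReader.getOrdinal(new FacetLabel("Date", "2012", "March", "20"))); // 4
    }
  }
}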
The taxonomy maintains the tree information in three parallel arrays:
- Parents: parents[i] denotes the parent of category ordinal i. Note that all root categories (e.g. Author, Date and so forth), often called "dimensions", are put under a virtual root which always has ordinal 0. Therefore parents[0]=-1 (always), and for any i≠0, parents[i]≠-1. Also, parents[i]=0 for all root categories.
- Children: children[i] denotes the ordinal of the youngest child of category ordinal i. The youngest child is defined as the one that was added last to the taxonomy (and has a larger ordinal than all its siblings). children[i]=-1 for all leaf nodes.
- Siblings: siblings[i] denotes an ordinal j, such that j<i and parents[i]=parents[j] (i.e. the two categories are children of the same parent). j is the category ordinal that was the former "youngest child" of the parent ordinal. siblings[0] is always -1, and the first child of a parent will always have siblings[i]=-1. A short traversal sketch over these arrays follows below.
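Here is a small sketch of walking the children of a category using these arrays. It assumes the ParallelTaxonomyArrays accessors (children() and siblings() returning int[]) exposed by TaxonomyReader in the 4.x-8.x line; the helper class and method are hypothetical.

import org.apache.lucene.facet.taxonomy.FacetLabel;
import org.apache.lucene.facet.taxonomy.ParallelTaxonomyArrays;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;

public final class TaxonomyTreeWalk {
  // Prints the children of the given category, youngest (last added) first.
  static void printChildren(TaxonomyReader taxoReader, FacetLabel parent) throws Exception {
    ParallelTaxonomyArrays arrays = taxoReader.getParallelTaxonomyArrays();
    int[] children = arrays.children();
    int[] siblings = arrays.siblings();
    int parentOrd = taxoReader.getOrdinal(parent);
    // children[parentOrd] is the youngest child; siblings[] chains back to older siblings until -1.
    for (int ord = children[parentOrd]; ord != TaxonomyReader.INVALID_ORDINAL; ord = siblings[ord]) {
      System.out.println(taxoReader.getPath(ord) + " (ordinal=" + ord + ")");
    }
  }
}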
The Search Index
When categories are added to a document, they are added as drill-down terms to the search index. In addition, their ordinals are encoded in a special term's payload (called the "fulltree posting"). For example, if you add the category Date/2012/March/20 to a document, then four drill-down terms are added, one for each level, e.g. $facets:Date, $facets:Date/2012 and so forth. Additionally, all four category ordinals are encoded in the fulltree posting's payload, and are used during faceted search to aggregate the respective categories.
Search Time
Now that we understand the role of the taxonomy index and the data structures it provides for traversing the taxonomy, as well as the structure of the facets in the search index, we can look at how faceted search is executed. As mentioned in my previous post, every faceted search consists of a Query which matches indexed documents and one or more FacetRequests for facet aggregation (one instance per requested root). The aggregation is done by FacetsCollector, from which you obtain a FacetResult per FacetRequest.
FacetsCollector computes the top categories in three steps. First, all documents that match the query are marked in a bitset (and depending on the facet request, their scores may also be stored in memory). Then it traverses the fulltree posting, fetches the ordinals of every matching document and aggregates them in an in-memory array, where every entry holds the aggregated weight of that ordinal. After all documents have been visited and their categories aggregated, we use the children[] and siblings[] information from the taxonomy to traverse the children of the root category specified by the request, and compute the top categories among them. In the end, the taxonomy index is used to label the top categories.
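The FacetRequest/FacetResult classes used here belong to the older facet API; with the current API the same two-phase flow looks roughly like the sketch below (the helper method, parameter names and topN values are mine):

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public final class FacetedSearch {
  static FacetResult topDates(IndexSearcher searcher, TaxonomyReader taxoReader,
                              FacetsConfig config, Query query) throws Exception {
    // Step 1: run the query; the collector remembers which documents matched.
    FacetsCollector fc = new FacetsCollector();
    FacetsCollector.search(searcher, query, 10, fc); // 10 is the topN for the returned hits

    // Step 2: aggregate the matched documents' ordinals into an in-memory count array.
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);

    // Step 3: walk the children of the requested root and label the top ones via the taxonomy.
    return facets.getTopChildren(10, "Date");
  }
}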
When you create a FacetRequest, you can set the depth in the taxonomy for which it aggregates categories. The default depth=1 means that top categories are computed among the immediate children of the root. An interesting capability is to compute the top categories recursively (in one search) up to depth=N. For example, if you create a FacetRequest over the Date root, and call facetRequest.setDepth(2) as well as facetRequest.setResultMode(ResultMode.PER_NODE_IN_TREE), you will retrieve the top categories of Date, as well as each of their top children, as depicted below:

Date (123)
  2010 (34)
    October (18)
    March (8)
  2008 (23)
    November (12)
    October (7)
Discussion
Global Ordinals
The global ordinals that are encoded in the search index allow very efficient per-segment faceting and aggregation, as all operations are done on ordinals (integers). If the ordinals were stored per segment, i.e. the category Date/2010 received different ordinals in different segments, computing the top categories would have to be done on the category strings, or would require in-memory data structures that e.g. map local ordinals to global ones, which would have hurt the scalability of the facets significantly.
Global ordinals do have a downside though. When you want to merge two search indexes which have their own respective taxonomies, their ordinals need to be translated. That means you cannot use the efficient IndexWriter.addIndexes(Directory...) API to merge the indexes, but rather the more expensive IndexWriter.addIndexes(IndexReader...) API, so that the ordinals encoded in the source search index can be translated to match the target taxonomy.
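If I recall the facet module correctly, recent versions ship a TaxonomyMergeUtils helper (together with DirectoryTaxonomyWriter's OrdinalMap) that performs the reader-based merge and ordinal translation in one call; the sketch below follows the 5.x/6.x signature as I remember it, so verify it against the version you are on:

import java.io.IOException;

import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.taxonomy.TaxonomyMergeUtils;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter.MemoryOrdinalMap;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;

public final class FacetedIndexMerge {
  // Appends one source pair (search index + taxonomy) to the destination pair, translating
  // the ordinals encoded in the source documents so they match the destination taxonomy.
  static void addSource(Directory srcIndexDir, Directory srcTaxoDir,
                        IndexWriter destIndexWriter, DirectoryTaxonomyWriter destTaxoWriter,
                        FacetsConfig srcConfig) throws IOException {
    TaxonomyMergeUtils.merge(srcIndexDir, srcTaxoDir, new MemoryOrdinalMap(),
                             destIndexWriter, destTaxoWriter, srcConfig);
  }
}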
Two-Phase Aggregation
The two-phase aggregation done by FacetsCollector may seem unnecessary and expensive. Mike McCandless, who has recently added faceted search to Lucene's nightly benchmark, started exploring doing the aggregation in one step, as documents are collected. The two-phase aggregation does have some strengths though. For example, it allows computing facet weights on a sample set of the results, rather than over all documents. That is an important capability for large indexes, since the cost of faceted search is proportional to the number of documents that match the query.
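As a sketch of the sampling idea with the current API: later versions of the facet module include a RandomSamplingFacetsCollector that aggregates over a random subset of the matches. The sample size, helper method and dimension name below are my assumptions.

import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.RandomSamplingFacetsCollector;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public final class SampledFacets {
  static FacetResult sampledTopDates(IndexSearcher searcher, TaxonomyReader taxoReader,
                                     FacetsConfig config, Query query) throws Exception {
    // Collect as usual, but aggregate over at most 10,000 randomly sampled matches.
    RandomSamplingFacetsCollector fc = new RandomSamplingFacetsCollector(10_000);
    FacetsCollector.search(searcher, query, 10, fc);
    Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
    return facets.getTopChildren(10, "Date");
  }
}

Because only the sampled documents are aggregated, the returned counts are approximations of the full counts.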
The two-phase aggregation is also useful when you want to compute the facet weights on the complement set of the matching documents (i.e. those that did not match the query). This is quite difficult to do when facets are aggregated as documents are collected. First, you can only aggregate the facets of the complement set retroactively: only after seeing docs 1 and 17 do you know that docs 2-16 did not match the query, and when documents are matched out of order, you know the complement set only after all documents have been collected. Second, complement aggregation pays off only when the number of results is more than 50% of the documents in the index, which is also unknown until the query matching phase is done.
Lucene facets are very powerful and they come with a multitude of extension points to help you customize and optimize them to your application needs. I'm sure that now that facets are part of the nightly benchmark, they will see a lot of improvements, both to their API and more importantly, their performance. Stay tuned for updates!
I am using Lucene version 6.5, so I am asking this question in the current context.
My application creates a sidecar index (a taxonomy index separate from the main index).
I have around 125 million documents, so I wanted to index them in 5 slots (25 million each).
For merging the main indexes, can I use writer.addIndexes(Directory...)?
For merging the taxonomy indexes, can I use writer.addIndexes(IndexReader...)?
Am I correct on this?
MainIndex (I1, I2, I3, I4, I5) = I (merged main index, writer.addIndexes(Directory...))
TaxonomyIndex (T1, T2, T3, T4, T5) = T (merged taxonomy index, writer.addIndexes(IndexReader...))
At the end, will I get I and T compatible, i.e. will the final ordinals in the search index correlate with T?
Will this approach work?