Sunday, January 20, 2013

Lucene Facets just got faster!

Recently, Mike McCandless had some fun with Lucene Facets, and came up with interesting performance numbers and some ideas on how facets can be improved, for example by using DocValues instead of payloads to store per-document category ordinals. This triggered a series of improvements and API changes, which have resulted (so far) in a 2x speedup to faceted search!

[Figure: chart from the Facets nightly benchmark]

The biggest contribution to the spike came from cutting over facets to DocValues (LUCENE-4602), which got us a nearly 70% improvement over term payloads. Not only are DocValues more efficient than payloads, they also let you cache their information in memory for faster access, which we now take advantage of by default during faceted search.

Additional improvements came from specializing the decode logic (LUCENE-4686), as well as moving various components that take part in faceted search from an iterator-style API to a bulk-read (decode, aggregate) API (LUCENE-4620). The latter got us an additional ~20% improvement, which is amazing given that the decoding and aggregation logic wasn't changed - it was all just different software design. This is an important lesson for "hot" code - sometimes, in order to make it efficient, you need to let go of some object-oriented best practices!
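
To make the design difference concrete, here is a minimal sketch in plain Java. It is not the Lucene code: the class, the OrdinalIterator interface and both methods are hypothetical names invented for illustration, and the "decode" step is just an array copy standing in for real decoding. It only shows the shape of the two styles: one call per ordinal versus decoding a whole document into a reusable buffer and then aggregating in a tight loop.

```java
import java.util.Arrays;

final class BulkVsIterator {

    /** Hypothetical iterator-style API: one interface call per category ordinal. */
    interface OrdinalIterator {
        boolean hasNext();
        int next();
    }

    /** Iterator style: pull the ordinals of each document one at a time. */
    static int[] countWithIterator(int[][] docOrdinals, int numCategories) {
        int[] counts = new int[numCategories];
        for (final int[] ordinals : docOrdinals) {
            OrdinalIterator it = new OrdinalIterator() {
                private int pos = 0;
                public boolean hasNext() { return pos < ordinals.length; }
                public int next() { return ordinals[pos++]; }
            };
            while (it.hasNext()) {
                counts[it.next()]++;
            }
        }
        return counts;
    }

    /** Bulk style: decode all ordinals of a document into a buffer, then aggregate. */
    static int[] countBulk(int[][] docOrdinals, int numCategories) {
        int[] counts = new int[numCategories];
        int[] buffer = new int[8]; // reused across documents, grown on demand
        for (int[] ordinals : docOrdinals) {
            // "decode" step (here simply a copy into the shared buffer)
            if (buffer.length < ordinals.length) {
                buffer = new int[ordinals.length];
            }
            System.arraycopy(ordinals, 0, buffer, 0, ordinals.length);
            int length = ordinals.length;
            // "aggregate" step: a plain loop over a primitive array
            for (int i = 0; i < length; i++) {
                counts[buffer[i]]++;
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // three documents, four categories (ordinals 0..3)
        int[][] docs = { {0, 1}, {0, 2}, {1, 2, 3} };
        System.out.println(Arrays.toString(countWithIterator(docs, 4)));
        System.out.println(Arrays.toString(countBulk(docs, 4)));
    }
}
```

Both methods compute the same counts; the only difference is how the ordinals reach the aggregation loop, which is the kind of change described above.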

While that's a major step forward, there's more to come. There's already work going on (LUCENE-4600 and LUCENE-4610) which will bring even more improvements for the common faceted search use cases (facet counting, no associations, no partitions etc.). Also, I plan to address two points that Mike raised in his blog: (1) making TotalFacetCounts (aka complements) segment-aware, to make it more NRT-friendly and reduce its reloading cost, and (2) experimenting with a more efficient category ordinals cache, to reduce its RAM consumption and improve its loading speed. There's also fun work going on that explores an alternative encoding for category ordinals. So stay tuned for more updates!

NOTE: cutting over facets to DocValues breaks existing indexes. You need to either rebuild them, or use FacetsPayloadMigrationReader to do a one-time migration of the payload information to DocValues.

Thursday, January 3, 2013

Facet Associations

Suppose that you work with an automatic categorization system which, given a document and some metadata, outputs Topic categories with a confidence level. For example, an article about the Apache Lucene project might be categorized with Topic/Apache Software Foundation (0.34), Topic/Apache Lucene (0.95) and Topic/Information Retrieval (0.84). An article about Apache Nutch might be categorized with Topic/Apache Software Foundation (0.22), Topic/Apache Nutch (0.93) and Topic/Distributed Crawler (0.87). If you index the articles with those facets and ignore the confidence level during facet aggregation, you will get Apache Software Foundation as the top category (with count=2). This might give the user the false impression that the result set focuses mainly on the Apache Software Foundation topic, while the documents don't discuss general ASF issues at all. However, if you take the confidence level into account, you will get Apache Lucene and Apache Nutch as the top categories, while Apache Software Foundation comes last.
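
The arithmetic behind that claim can be checked with a small, self-contained sketch in plain Java. It does not use Lucene at all; the class and method names are hypothetical and exist only for this illustration. It ranks the topics of the two example articles once by document count and once by summed confidence.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CountVsConfidence {

    /** Builds one article's topic -> confidence map, keeping insertion order. */
    static Map<String, Double> article(String[] topics, double[] confidences) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < topics.length; i++) {
            result.put(topics[i], confidences[i]);
        }
        return result;
    }

    /** The two articles from the example above. */
    static List<Map<String, Double>> sampleArticles() {
        List<Map<String, Double>> articles = new ArrayList<>();
        articles.add(article(
            new String[] {"Apache Software Foundation", "Apache Lucene", "Information Retrieval"},
            new double[] {0.34, 0.95, 0.84}));
        articles.add(article(
            new String[] {"Apache Software Foundation", "Apache Nutch", "Distributed Crawler"},
            new double[] {0.22, 0.93, 0.87}));
        return articles;
    }

    /** Plain facet counting: every occurrence of a topic weighs 1. */
    static Map<String, Integer> countTopics(List<Map<String, Double>> articles) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Double> a : articles) {
            for (String topic : a.keySet()) {
                counts.merge(topic, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Confidence-aware aggregation: every occurrence weighs its confidence. */
    static Map<String, Double> sumConfidence(List<Map<String, Double>> articles) {
        Map<String, Double> sums = new LinkedHashMap<>();
        for (Map<String, Double> a : articles) {
            for (Map.Entry<String, Double> e : a.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sums;
    }

    /** Topics ordered by descending weight (ties keep insertion order). */
    static List<String> rank(Map<String, ? extends Number> weights) {
        List<String> topics = new ArrayList<>(weights.keySet());
        topics.sort(Comparator.comparingDouble(
            (String t) -> weights.get(t).doubleValue()).reversed());
        return topics;
    }

    public static void main(String[] args) {
        List<Map<String, Double>> articles = sampleArticles();
        System.out.println("By count:      " + rank(countTopics(articles)));
        System.out.println("By confidence: " + rank(sumConfidence(articles)));
    }
}
```

Ranked by count, Apache Software Foundation comes first because it is the only topic that appears in both articles; ranked by summed confidence it comes last, behind Apache Lucene and Apache Nutch.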

Lucene Facets let you index categories with a confidence level very easily, by assigning a CategoryAssociation to each category. The following short code snippet demonstrates how you would index the first article's categories:
FacetFields facetFields = new AssociationsFacetFields(taxoWriter);

// first article's categories with confidence level
CategoryAssociationsContainer article1 = new CategoryAssociationsContainer();
article1.setAssociation(
  new CategoryPath("Topic", "Apache Software Foundation"),
  new CategoryFloatAssociation(0.34f));
article1.setAssociation(
  new CategoryPath("Topic", "Apache Lucene"), 
  new CategoryFloatAssociation(0.95f));
article1.setAssociation(
  new CategoryPath("Topic", "Information Retrieval"), 
  new CategoryFloatAssociation(0.84f));

Document doc = new Document();

// add the facets to the document
facetFields.addFields(doc, article1);

// index the document
indexWriter.addDocument(doc);
Let's take a closer look at the code:
  • CategoryAssociationsContainer holds a mapping from a category to its association.
  • AssociationsFacetFields adds the needed fields (drill-down terms and category list payload) to the document, along with the association values.
  • Finally, the document is indexed with IndexWriter.
Quite simple, huh? After you have indexed both documents this way, you can compute the top categories by summing their confidence levels, as demonstrated in the following code:
CategoryPath cp = new CategoryPath("Topic");
FacetRequest topic = new AssociationFloatSumFacetRequest(cp, 10);
FacetSearchParams fsp = new FacetSearchParams(topic);
FacetsCollector fc = new FacetsCollector(fsp, indexReader, taxoReader);
searcher.search(new MatchAllDocsQuery(), fc);
If you print the top categories (using e.g. the code snippet from here), you will get the following output. Note how the top categories are now those with the higher confidence level, rather than those that appear in more documents:
Topic (0.0)
  Apache Lucene (0.95)
  Apache Nutch (0.93)
  Distributed Crawler (0.87)
  Information Retrieval (0.84)
  Apache Software Foundation (0.56)
Lucene provides two CategoryAssociation implementations, for integer and float values, as well as two matching FacetRequests which set the weight of a category to the sum of its association values (the code sample above uses AssociationFloatSumFacetRequest). You can extend facet associations either by implementing a CategoryAssociation (and a matching FacetRequest), or by implementing a FacetRequest which computes a different function over the integer/float association values.
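
As a rough idea of what "a different function" could mean, the sketch below aggregates the same association values once with a sum and once with a maximum. It is plain Java and deliberately avoids the Lucene classes: the class name and the aggregate method are hypothetical, and a real extension would have to be written against the FacetRequest API rather than like this.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.DoubleBinaryOperator;

final class AssociationAggregation {

    /**
     * Folds the association values of each category with the given function.
     * categories[i] and values[i] describe one (category, association) pair.
     */
    static Map<String, Double> aggregate(String[] categories, double[] values,
                                         DoubleBinaryOperator combine) {
        Map<String, Double> result = new LinkedHashMap<>();
        for (int i = 0; i < categories.length; i++) {
            // first occurrence stores the value, later ones are combined with it
            result.merge(categories[i], values[i], combine::applyAsDouble);
        }
        return result;
    }

    public static void main(String[] args) {
        // association values of the two example articles, flattened
        String[] categories = {
            "Apache Software Foundation", "Apache Lucene", "Information Retrieval",
            "Apache Software Foundation", "Apache Nutch", "Distributed Crawler"
        };
        double[] values = {0.34, 0.95, 0.84, 0.22, 0.93, 0.87};

        // sum: the behavior described for the built-in sum requests
        System.out.println(aggregate(categories, values, Double::sum));
        // max: an alternative weight, the highest confidence seen per category
        System.out.println(aggregate(categories, values, Math::max));
    }
}
```

With the maximum, Apache Software Foundation would weigh 0.34 instead of the summed 0.56, so the choice of function directly changes how much repeated low-confidence categories count.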