Sunday, September 22, 2013

Boosting Documents in Lucene

In Information Retrieval, a document's relevance to a search is measured by how similar it is to the query. There are several similarity models implemented in Lucene, and you can implement your own by extending the Similarity class and using the index statistics that Lucene saves. Documents can also be assigned a static score to denote their importance in the overall corpus, irrespective of the query that is being executed, e.g their popularity, ratings or PageRank.

Prior to Lucene 4.0, you could assign a static score to a document by calling document.setBoost. Internally, the boost was applied to every field of the document, by multiplying the field's boost factor with the document's. However, as explained here, this has never worked correctly and depending on the type of query executed, might not affect the document's rank at all.

With the addition of DocValues to Lucene, boosting documents is as easy as adding a NumericDocValuesField and use it in a CustomScoreQuery, which multiplies the computed score by the value of the 'boost' field. The code example below illustrates how to achieve that:
// add two documents to the index
Document doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 1L));
writer.addDocument(doc);

doc = new Document();
doc.add(new TextField("f", "test document", Store.NO));
doc.add(new NumericDocValuesField("boost", 2L));
writer.addDocument(doc);

// search for 'test' while boosting by field 'boost'
Query baseQuery = new TermQuery(new Term("f", "test"));
Query boostQuery = new FunctionQuery(new LongFieldSource("boost"));
Query q = new CustomScoreQuery(baseQuery, boostQuery);
searcher.search(q, 10);

The new Expressions module can also be used for boosting documents by writing a simple formula, as depicted below. While it's more verbose than using CustomScoreQuery, it makes boosting by computing more complex formulas trivial, e.g. sqrt(_score) + ln(boost).
Expression expr = JavascriptCompiler.compile("_score * boost");
SimpleBindings bindings = new SimpleBindings();    
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("boost", SortField.Type.LONG));
Sort sort = new Sort(expr.getSortField(bindings, true));
searcher.search(baseQuery, null, 10, sort);

Now that Lucene allows updating NumericDocValuesFields without re-indexing the documents, you can incorporate frequently changing fields (popularity, ratings, price, last-modified time...) in the boosting factor without re-indexing the document every time any one of them changes.

Tuesday, May 14, 2013

The Replicator

No, as much as I want to, I don't (yet) have a true Replicator at my disposal to blog about. Nor does it look like I will have one in the near future, even though scientists are making great progress towards it. I have recently made my contribution to the global replication effort though, by adding an index replication module to Lucene. It does not convert energy to mass, nor mass to mass, so I'm not sure if scientifically it qualifies as a Replicator at all, but if you are into search indexes replication with Lucene, read on!

In computer science, replication is defined as "sharing information so as to ensure consistency between redundant resources ... to improve reliability, fault-tolerance, or accessibility" (Wikipedia). When you design a search engine, especially at large scales, replication is one approach to achieve these. Index replicas can replace primary nodes that become unavailable (e.g due to planned maintenance or severe hardware failures), as well as to support higher query loads by load-balancing search requests across them. Even if you are not building a large-scale search engine, you can use replication to take hot backups of your primary index, while searches are running on it.

Lucene's replication framework implements a primary/backup approach, where the primary process performs all indexing operations (updating the index, merging segments etc.), and the backup/replica processes copy over the changes in an asynchronous manner. The framework consists of few key components that control the replication actions:
  • Replicator mediates between the clients and server. The primary process publishes Revisions while the replica processes update to the most recent revision following their own. When a new Revision is published, the previous one is released (unless it is currently being replicated), so that the resources it consumes can be reclaimed (e.g. remove unneeded files).

  • Revision describes a list of files and their metadata. The Revision is responsible to ensure that the files are available as long as clients copy its files. For example, IndexRevision takes a snapshot on the index using SnapshotDeletionPolicy, to guarantee the files are not deleted until the snapshot is released.

  • ReplicationClient performs the replication operation on the replica side. It first copies the needed files from the primary server (e.g. the files that the replica is missing) and then invokes the ReplicationHandler to act on the copied files. If any errors occur during the copy process, it restarts the replication session. That way, when the handler is invoked, it is guaranteed that all files were copied safely to the local storage.

  • ReplicationHandler acts on the copied files. IndexReplicationHandler copies the files over to the index directory and then syncs them to ensure the files are on stable storage. If any errors occur, it aborts the replication session and cleans up the index directory. Only after successfully copying and syncing the files, it notifies a callback that the index has been updated, so that e.g. the application can refresh its SearcherManager or perform whatever tasks it needs on the updated index.
The replication framework supports replicating any type of files. It offers built-in support for replicating a single index (as described above), as well as an index and taxonomy pair (for faceted search) via IndexAndTaxonomyRevision and IndexAndTaxonomyReplicationHandler. To replicate other types of files, you need to implement Revision and ReplicationHandler. But be advised, implementing a handler properly is ... tricky!

Following example code shows how to use the replicator on the server side:
// publish a new revision on the server
IndexWriter indexWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));
Using the replicator on the client side is a bit more involved but hey, that's where the important stuff happens!
// client can replicate either via LocalReplicator (e.g. for backups)
// or HttpReplicator (e.g. when the server is located on a different
// node).
Replicator replicator;

// refresh SearcherManager after the index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public void call() throws Exception {
    // index was updated, refresh manager
    searcherManager.maybeRefresh();
  }
}

// initialize a matching handler for the published revisions
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);

// where should the client store the files before invoking the handler
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);

ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
client.startUpdateThread(30000); // check for updates every 30 seconds

So What's Next?

The replication framework currently restarts a replication session upon failures. It would be great if it supported resuming a session by e.g. copying only the files that weren't already successfully copied. This can be done by having both server and client e.g. compute a checksum on the files, so the client knows which ones were successfully copied and which weren't. Patches welcome!

Another improvement would be to support resume at the file level, i.e. don't copy parts of the file that were already copied safely. This can be implemented by modifying ReplicationClient to discard the first N bytes of the file it already copied, but would be better if it's supported by the server as well, so that it only sends the required parts. Here too, patches are welcome!

Yet another improvement is to support peer-to-peer replication. Today, the primary server is the bottleneck of the replication process since all clients access it for pulling files. This can be somewhat mitigated by e.g. having a tree network topology, where the primary node is accessed by only a few nodes, which are then accessed by other nodes and so forth. Such topology is however harder to maintain as well as very sensitive to the root node going out of action. In a peer-to-peer replication however, there is no single node from which all clients replicate, but rather each server can broadcast its current revision as well as choose any server that has a newer revision available to replicate from. An example of such implementation over Apache Solr is described here. Hmmm ... did I say patches are welcome?

Lucene has a new replication module. It's only 1 days-old and can already do so much. You are welcome to use it and help us teach it new things!

Sunday, April 28, 2013

Index Sorting with Lucene

When you index documents with Lucene, you often index them in some arbitrary order, usually by a first-come first-served manner. Most applications may not give a second thought about the order of the documents in the index, but some applications could greatly benefit if they could control that. For example, an application which always displays search results sorted by date could benefit if the index kept the documents sorted like that, since then it could traverse exactly the first K documents that match the query, and not process the entire result set to determine the first K.

Application-level Sort

Applications can try to control the order of the documents in the index by e.g. sorting them beforehand. This will work if the index is created once, from a single batch of documents. However, indexes are often created incrementally, where documents are indexed as they are encountered. Therefore, even if the application sorts each batch, the index will quickly get out-of-order as more batches of documents are indexed.
Actually, as long as the application does not use a MergePolicy, each segment will be sorted by itself, and it could process the first K documents from each segment to determine the global top K documents. However, as soon as segments are merged, they will lose their internal sort order.
If the application indexes in multiple threads (which significantly improves indexing performance), then keeping the index and segments sorted becomes an even harder task.

Static Index Sorting

In the past, Lucene offered an IndexSorter utility which allowed you to statically sort an index. It was dropped following the move to Lucene 4.0.0 APIs, but recently brought back as SortingAtomicReader, which exposes a sorted view of the index. It uses a Sorter which returns a permutation on the documents, so when it is asked for document X, it returns document Y, where perm(X)=Y. To sort an index with it, you should pass it to IndexWriter.addIndexes().

Sorting Segments During Merge

Static index sorting is less useful to applications that incrementally build their indexes. As explained above, one challenge is to keep the documents sorted in a segment that is the result of a merge. SortingMergePolicy (introduced in LUCENE-4752) wraps segments that are picked for merge with SortingAtomicReader, and guarantees that merged segments are also sorted.
NOTE: SortingMergePolicy was measured to be slower by x2-3 times than other merge policies, which is not so bad considering the task at hand. If your application's characteristics justify the use of this merge policy, it is important to compare the effects it has on indexing slowdown vs search speedup for your application!

Search Early Termination

Having an incrementally built index kept globally sorted, so that the first K documents you traverse are the ones to return as search results, is very hard and a solution to this is not yet offered by Lucene (patches are welcome!). However, SortingMergePolicy guarantees that merged segments are always sorted, and so we can apply early-termination search logic at each segment level.
EarlyTerminatingSortingCollector (introduced in LUCENE-4858) helps you do that, by early-terminating document collection on sorted segments, while doing full evaluation on unsorted ones (e.g. newly indexed documents or segments in an existing index).
NOTE: in a real-time indexing application, newly flushed segments will usually contain a small number of documents anyway, and therefore fully evaluating them is not likely to be very costly.

Applications for Index Sorting

Besides sorting the index for search early termination purposes, there are other applications to index sorting, for example compression. Sorting the index by documents' size, type or date attributes, can result in better representation of the documents in the index, leading to a smaller index footprint, which can eventually speed up searches too since more parts of the index can fit into RAM. I'm sure that Lucene users will find even more useful applications to index sorting, so stay tuned!

Sunday, January 20, 2013

Lucene Facets just got faster!

Recently, Mike McCandless had some fun with Lucene Facets, and came up with interesting performance numbers and some ideas on how facets can be improved, by e.g. utilizing DocValues to store per-document category ordinals, instead of the payload. This triggered a series of improvements and API changes, which resulted (so far) in 2x speedups to faceted search!

From Facets nightly benchmark

The biggest contribution to the spike came from cutting over facets to DocValues (LUCENE-4602), which got us nearly 70% improvements over term payloads. Not only are DocValues more efficient than payloads, they also let you cache their information in-memory, for faster access, which we now take advantage by default during faceted search.

Additional improvements came from specializing the decode logic (LUCENE-4686), as well as moving various components that take part during faceted search from an iterator-style API to a bulk-read (decode, aggregate) API (LUCENE-4620). The latter got us an additional ~20% improvements, which is amazing given that decoding and aggregation logic weren't changed - it was all just different software design. This is an important lesson for "hot" code - sometimes in order to make it efficient, you need to let go of some Object Oriented best practices!

While that's a major step forward, there's more to come. There's already work going on (LUCENE-4600 and LUCENE-4610), which will bring even more improvements for the common faceted search use cases (facet counting, no associations, no partitions etc.). Also, I plan to address two points that Mike raised in his blog: (1) making TotalFacetCounts (aka complements) segment-aware (to make it more NRT-friendly and reduce its reloading cost), and (2) experiment with a more efficient category ordinals cache, to reduce its RAM consumption as well as loading speed. There's also fun work going on here, which explores an alternative encoding for category ordinals. So stay tuned for more updates!

NOTE: cutting over facets to DocValues breaks existing indexes. You need to either rebuild them, or use FacetsPayloadMigrationReader to do a one-time migration of the payload information to DocValues.

Thursday, January 3, 2013

Facet Associations

Suppose that you work with an automatic categorization system, which given a document and some metadata, outputs Topic categories with confidence level. For example, an article about the Apache Lucene project might be categorized with Topic/Apache Software Foundation (0.34), Topic/Apache Lucene (0.95) and Topic/Information Retrieval (0.84). An article about Apache Nutch might be categorized with Topic/Apache Software Foundation (0.22), Topic/Apache Nutch (0.93) and Topic/Distributed Crawler (0.87). If you index the articles with those facets, and ignore the confidence level during facet aggregation, you will get Apache Software Foundation as the top category (with count=2). This might give the user a false impression as if the result set focuses mainly on the Apache Software Foundation topic, while the documents don't discuss general ASF issues at all. However, if you take the confidence level into account, you will get Apache Lucene and Apache Nutch as the top categories, while Apache Software Foundation would come last.

Facet Associations

Lucene Facets let you index categories with confidence level very easily, by assigning a CategoryAssociation with each category. The following short code snippet demonstrates how you would index the first article's categories:
FacetFields facetFields = new AssociationsFacetFields(taxoWriter);

// first article's categories with confidence level
CategoryAssociationsContainer article1 = new CategoryAssociationsContainer();
article1.setAssociation(
  new CategoryPath("Topic", "Apache Software Foundation"),
  new CategoryFloatAssociation(0.34f));
article1.setAssociation(
  new CategoryPath("Topic", "Apache Lucene"), 
  new CategoryFloatAssociation(0.95f));
article1.setAssociation(
  new CategoryPath("Topic", "Information Retrieval"), 
  new CategoryFloatAssociation(0.84f));

Document doc = new Document();

// add the facets to the document
facetFields.addFields(doc, associations);

// index the document
indexWriter.addDocument(doc);
Let's take a closer look at the code:
  • CategoryAssociationsContainer holds a mapping from a category to its association.
  • AssociationsFacetFields adds the needed fields (drill-down terms and category list payload) to the document, along with the associations values.
  • Finally, the document is indexed with IndexWriter.
Quite simple, ha? After you have indexed both documents like so, you can compute the top categories by summing their confidence level, as demonstrated in the following code:
CategoryPath cp = new CategoryPath("Topic");
FacetRequest topic = new AssociationFloatSumFacetRequest(cp, 10);
FacetSearchParams fsp = new FacetSearchParams(topic);
FacetsCollector fc = new FacetsCollector(fsp, indexReader, taxoReader);
searcher.search(new MatchAllDocsQuery(), fc);
If you print the top categories (using e.g. the code snippet from here), you will get the following output. Note how the top categories are now those with the higher confidence level, rather than those that appear in more documents:
Topic (0.0)
  Apache Lucene (0.95)
  Apache Nutch (0.93)
  Distributed Crawler (0.87)
  Information Retrieval (0.84)
  Apache Software Foundation (0.56)
Lucene provides two CategoryAssociation implementations for integer and float values, as well as two matching FacetRequests which set the weight of a category as the sum of its association values (AssociationFloatSumFacetRequest used in the above code sample). You can extend facet associations by either implementing a CategoryAssociation (and matching FacetRequest), or implement a FacetRequest which computes a different function over the integer/float association values.