In computer science, replication is defined as "sharing information so as to ensure consistency between redundant resources ... to improve reliability, fault-tolerance, or accessibility" (Wikipedia). When you design a search engine, especially at large scale, replication is one approach to achieving these goals. Index replicas can replace primary nodes that become unavailable (e.g. due to planned maintenance or severe hardware failures), and they can support higher query loads by load-balancing search requests across them. Even if you are not building a large-scale search engine, you can use replication to take hot backups of your primary index while searches are still running on it.
Lucene's replication framework implements a primary/backup approach, where the primary process performs all indexing operations (updating the index, merging segments, etc.), and the backup/replica processes copy over the changes asynchronously. The framework consists of a few key components that control the replication actions:
Replicator mediates between the clients and server. The primary process publishes Revisions while the replica processes update to the most recent revision following their own. When a new Revision is published, the previous one is released (unless it is currently being replicated), so that the resources it consumes can be reclaimed (e.g. unneeded files can be removed).

Revision describes a list of files and their metadata. The Revision is responsible for ensuring that the files remain available for as long as clients are copying them. For example, IndexRevision takes a snapshot of the index using SnapshotDeletionPolicy, to guarantee that the files are not deleted until the snapshot is released (see the writer setup sketched just after these descriptions).

ReplicationClient performs the replication operation on the replica side. It first copies the needed files from the primary server (e.g. the files that the replica is missing) and then invokes the ReplicationHandler to act on the copied files. If any errors occur during the copy process, it restarts the replication session. That way, when the handler is invoked, it is guaranteed that all files were copied safely to local storage.

ReplicationHandler acts on the copied files. IndexReplicationHandler copies the files over to the index directory and then syncs them to ensure the files are on stable storage. If any errors occur, it aborts the replication session and cleans up the index directory. Only after successfully copying and syncing the files does it notify a callback that the index has been updated, so that e.g. the application can refresh its SearcherManager or perform whatever tasks it needs on the updated index.
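Because IndexRevision snapshots commits, the IndexWriter whose revisions you publish must be configured with a SnapshotDeletionPolicy up front. Here is a minimal setup sketch; the index path and analyzer are placeholders, and the constructors shown are the modern forms (4.x-era releases took a Version argument and File paths):

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.KeepOnlyLastCommitDeletionPolicy;
import org.apache.lucene.index.SnapshotDeletionPolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// IndexRevision requires the writer's deletion policy to be a
// SnapshotDeletionPolicy: each published revision snapshots a commit,
// which keeps its files alive until the revision is released.
Directory indexDir = FSDirectory.open(Paths.get("/path/to/index")); // placeholder path
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setIndexDeletionPolicy(
    new SnapshotDeletionPolicy(new KeepOnlyLastCommitDeletionPolicy()));
IndexWriter indexWriter = new IndexWriter(indexDir, config);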
The framework also offers built-in support for replicating an index and taxonomy pair (used for faceted search) via IndexAndTaxonomyRevision and IndexAndTaxonomyReplicationHandler. To replicate other types of files, you need to implement Revision and ReplicationHandler. But be advised, implementing a handler properly is ... tricky!
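To give a feel for what a custom implementation involves, here is a rough sketch of the Revision contract. The method names are paraphrased from the replicator module's javadocs from memory, so verify them against your Lucene version before relying on this:

import java.io.IOException;
import java.io.InputStream;
import java.util.List;
import java.util.Map;
import org.apache.lucene.replicator.RevisionFile; // file name + metadata descriptor

// Rough sketch of the Revision contract, not a verbatim copy. A
// revision names its version, lists its files (grouped by source,
// e.g. "index" or "taxonomy"), serves each file as a stream while
// clients copy it, and releases its resources when no longer needed.
public interface Revision extends Comparable<Revision> {
  int compareTo(String version);                    // order against a version string
  String getVersion();                              // unique, comparable version id
  Map<String, List<RevisionFile>> getSourceFiles(); // files to replicate, per source
  InputStream open(String source, String fileName) throws IOException;
  void release() throws IOException;                // let resources be reclaimed
}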
The following example code shows how to use the replicator on the server side:
// publish a new revision on the server
IndexWriter indexWriter; // the writer used for indexing
Replicator replicator = new LocalReplicator();
replicator.publish(new IndexRevision(indexWriter));

Using the replicator on the client side is a bit more involved, but hey, that's where the important stuff happens!
// client can replicate either via LocalReplicator (e.g. for backups)
// or HttpReplicator (e.g. when the server is located on a different
// node).
Replicator replicator;

// refresh SearcherManager after the index is updated
Callable<Boolean> callback = new Callable<Boolean>() {
  public Boolean call() throws Exception {
    // index was updated, refresh manager
    searcherManager.maybeRefresh();
    return true;
  }
};

// initialize a matching handler for the published revisions
ReplicationHandler handler = new IndexReplicationHandler(indexDir, callback);

// where should the client store the files before invoking the handler
SourceDirectoryFactory factory = new PerSessionDirectoryFactory(workDir);

ReplicationClient client = new ReplicationClient(replicator, handler, factory);

client.updateNow(); // invoke client manually
// -- OR --
client.startUpdateThread(30000, "update-thread"); // check for updates every 30 seconds
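When the application shuts down, remember to stop the update thread and close the client, so that the in-flight session (if any) is aborted and its resources are released. A minimal sketch, assuming the client was started with startUpdateThread as above:

// on shutdown: stop polling for updates and release the client's resources
client.stopUpdateThread();
client.close();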
So What's Next?
The replication framework currently restarts a replication session upon failures. It would be great if it supported resuming a session, e.g. by copying only the files that weren't already copied successfully. This can be done by having both server and client compute a checksum on each file, so the client knows which files were copied successfully and which weren't. Patches welcome!
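To illustrate the idea, here is a hypothetical helper (not part of the framework) that both sides could use to compute a per-file CRC32 checksum; a restarted session would then skip files whose checksums already match:

import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.CRC32;

// hypothetical helper: compute a CRC32 checksum for a file, so that a
// restarted session can skip files whose checksums already match on
// both server and client
static long checksumOf(Path file) throws IOException {
  CRC32 crc = new CRC32();
  byte[] buf = new byte[8192];
  try (InputStream in = Files.newInputStream(file)) {
    int n;
    while ((n = in.read(buf)) != -1) {
      crc.update(buf, 0, n);
    }
  }
  return crc.getValue();
}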
Another improvement would be to support resume at the file level, i.e. not copying parts of a file that were already copied safely. This can be implemented by modifying ReplicationClient to discard the first N bytes of a file it already partially copied, but it would be better if the server supported it too, so that it only sends the required parts. Here too, patches are welcome!
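A hypothetical client-side sketch of that approach (the names are illustrative, not framework API): skip the bytes that already made it to local storage, then append the rest:

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// hypothetical sketch: resume copying a file by discarding the first
// bytesAlreadyCopied bytes of the source stream, then appending the
// remainder to the partially copied local file
static void resumeCopy(InputStream in, OutputStream out, long bytesAlreadyCopied)
    throws IOException {
  long toSkip = bytesAlreadyCopied;
  while (toSkip > 0) {
    long skipped = in.skip(toSkip);
    if (skipped <= 0) {
      throw new IOException("could not skip past already-copied bytes");
    }
    toSkip -= skipped;
  }
  byte[] buf = new byte[8192];
  int n;
  while ((n = in.read(buf)) != -1) {
    out.write(buf, 0, n);
  }
}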
Yet another improvement would be to support peer-to-peer replication. Today, the primary server is the bottleneck of the replication process, since all clients access it to pull files. This can be somewhat mitigated by e.g. a tree network topology, where the primary node is accessed by only a few nodes, which are in turn accessed by other nodes, and so forth. Such a topology, however, is harder to maintain and is very sensitive to the root node going out of action. With peer-to-peer replication, there is no single node from which all clients replicate; rather, each server can broadcast its current revision as well as choose any server that has a newer revision to replicate from. An example of such an implementation over Apache Solr is described here. Hmmm ... did I say patches are welcome?
Lucene has a new replication module. It's only one day old and can already do so much. You are welcome to use it and help us teach it new things!