Sunday, April 27, 2014

Expressions with Lucene

Lucene's expressions module allows computing values for documents using arbitrary mathematical expressions. Among other things, expressions allow a very easy and powerful way to e.g. sort search results and geospatial faceting. The module parses String expressions in JavaScript notation, and returns an object which computes a value for each requested document. Expressions are built of literals, variables and functions. For example, in the expression 5*sqrt(votes), 5 is a literal, sqrt is a function and votes is a variable.

The JavaScript parser recognizes a handful of useful functions, such as max, min, sqrt etc. The javadocs contain the full list of functions as well as operators that the parser recognizes. The variables are resolved by binding their name to a ValueSource. For example, in order to parse and execute the above expression, you need to write code similar to this:

Expression expr = JavascriptCompiler.compile("5*sqrt(votes)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("votes", SortField.Type.INT));
The votes variable is bound to the numeric field "votes" (usually, a NumericDocValuesField). SimpleBindings let you bind a variable to a SortField, however internally it is bound to a ValueSource. Expression itself returns a ValueSource, and when your application asks for the value of a document, it computes it based on the formula and the bounded variables' ValueSources.

Customizing expressions

JavascriptCompiler lets you pass a mapping of custom functions, where each function is implemented by a public and static Method, which takes up to 256 double parameters and returns a double. You can see a good example here. That, together with variables, provides great customization capabilities of expressions.

Sometimes customizing expressions is not so straightforward. For example, someone recently asked on the Lucene user-list how to use a multi-valued field in an expression, for the purpose of computing different functions on it (max, sum etc.). At first, it looks like a custom function, e.g. maxAll() (max() is already taken), which can be embedded in an expression like 5*maxVal(data). However, since data is a variable, and variables are bound to ValueSources (which return a single value for a document), we cannot pass all values of data to maxAll().

We can implement a ValueSource though, which returns the maximum value of the field data, and bind a variable to it. Since this isn't a true function, we cannot use a more natural notation like max(data). Perhaps one day Lucene will have built-in support for multi-valued numeric fields, and the expressions module will auto-detect such fields and pass all values to the function (you're welcome to contribute patches!). Until then though, you need to implement a ValueSource for each such function, but fortunately it is quite trivial. I wrote some prototype code below which demonstrates how to do that.

▶ Multi-valued expressions demo

Nested expressions

Suppose that your documents contain longitude and latitude fields and you want to use them to compute a document's relevance, by computing the haversine formula. You can easily do that by compiling an expression such as haversin(40.7143528,-74.0059731,latitude,longitude).

Now, what if you want to use the result of that expression in another expression, such as _score + 1/(1+haversin(40.7143528,-74.0059731,latitude,longitude))? It starts to become somewhat longish to read. Wouldn't it be better if we can encapsulate that in a distance variable and read the expression _score + 1/(1+distance) instead? Fortunately, we can do that easily with nested/sub expressions:

Expression distance = JavascriptCompiler.compile(
Expression expr = JavascriptCompiler.compile("_score + 1/(1+distance)");
SimpleBindings bindings = new SimpleBindings();
bindings.add(new SortField("_score", SortField.Type.SCORE));
bindings.add(new SortField("latitude", SortField.Type.LONG));
bindings.add(new SortField("longitude", SortField.Type.LONG));
bindings.add("distance", distance);
Since a variable can be bound to any ValueSource, and expressions are in fact ValueSources, we can bind the result of an expression to a variable in another expression. Nested expressions also cache their result in case you use them e.g. in multiple clauses of another expression. This could be useful if you have a very expensive expression which needs to be evaluated multiple times per document.

The expressions module is very powerful and lets you code custom ranking/sorting/faceting formulas easily. With custom functions, variables and ValueSources, it makes it trivial to extend and build upon even further. If you implement useful functions, you're welcome to contribute them!

1 comment:

  1. Nice post. I'm wondering what's the performance involved in using a Lucene Function? Is Lucene doing a full-scan over all results and applying the Function to each value?