Introduction
One of the most interesting Solr features is definitely “Function Queries”. They allow you to dynamically alter the ranking of your documents by applying functions on existing index fields. More details and a complete list of the available functions are documented at https://solr.apache.org/guide/8_9/function-queries.html
Use case - Boost recent documents
A pretty common requirement in search results ranking is to combine the absolute relevancy of the results (be it tf/idf, BM25 or whatever) with a boost on the most recent documents. Going through the list of available functions, one can see that the recip function (https://solr.apache.org/guide/8_9/function-queries.html#recip-function) is well suited for this. In fact, one of the syntax examples in the documentation does just that:
`recip(rord(creationDate),1,1000,1000)`
In order to combine the output of this function with the default relevancy score, you have to get a bit more complex and put something along these lines in your sort parameter:

`product(query($q),recip(rord(creationDate),1,1000,1000)) desc`
If you split it into its two parts, `query($q)` for the relevancy score (x) and `recip(rord(creationDate),1,1000,1000)` for the recency score (y), it can be translated to:
Calculate the query relevancy score (x), calculate the recency score (y), multiply them together (product(x,y)) and sort results in descending order.
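As an illustration, a minimal SolrJ sketch of such a request could look like the one below. The base URL, collection name and query are made up; the interesting part is the sort expression, where query($q) stands for the relevancy score of the main query.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RecencyBoostExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr instance and collection
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        SolrQuery query = new SolrQuery("title:solr");
        // Multiply the relevancy score of the main query with the recency
        // score produced by recip/rord and sort on the product
        query.setSort(SolrQuery.SortClause.desc(
                "product(query($q),recip(rord(creationDate),1,1000,1000))"));

        QueryResponse response = client.query(query);
        response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("creationDate")));

        client.close();
    }
}
```

The same expression can of course be passed directly as the sort request parameter of a plain HTTP query.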
Diving into the details
Let’s take a step back though and check how recip works. Quoting from the official Solr documentation page, we have:
recip(x,m,a,b) performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
When a and b are equal, and x>=0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve.
These properties can make this an ideal function for boosting more recent documents when x is rord(datefield).
The above gives us enough information on why rord(creationDate) is required in recip(rord(creationDate),1,1000,1000). The output of recip decreases as x (or, in our case, rord(creationDate)) increases. Since we want to penalize older dates, we need them to end up with bigger values, and this is exactly what rord, which stands for reverse ordering, ensures. Therefore, the older the date, the greater the value of rord() and the smaller the value of recip, which always stays in the [0,1] range.
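For example, with recip(x, 1, 1000, 1000) the formula becomes 1000/(x+1000): x=1 gives roughly 0.999, x=1000 gives 0.5 and x=10000 gives roughly 0.09, so the score fades out smoothly as x grows.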
If we have 3 documents in the index with 3 different dates, we’d have the following output for rord:

| creationDate | rord(creationDate) |
|---|---|
| 20210101 | 1 |
| 20190101 | 2 |
| 19900101 | 3 |
Unpleasant surprise
We silently assumed that all of the above documents were stored in a single index. But what happens if we have a 3-shard setup and each of the 3 documents is stored in a different shard? The above table would then turn into the following:
| shard | creationDate | rord(creationDate) |
|---|---|---|
| shard1 | 20210101 | 1 |
| shard2 | 20190101 | 1 |
| shard3 | 19900101 | 1 |
The rord() function is applied on each shard separately. Therefore, since we have a single document in each shard, each of the documents gets the value 1 assigned. The bottom line is that we don’t get comparable values across shards. In practice this might not be an issue, because one can assume a similar distribution of dates across shards. Still, I would not consider this a bullet-proof approach: we might have to deal with shards that are skewed in terms of the dates they contain, or with cases where documents are routed to shards based on creationDate.
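If you want to verify this behaviour yourself, one way is to fire a non-distributed query at a single core and return the rord value as a pseudo-field. The sketch below uses SolrJ and assumes direct access to one replica of a hypothetical articles collection.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PerShardRordExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; point it at one replica of your own collection
        HttpSolrClient core = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles_shard1_replica_n1").build();

        SolrQuery query = new SolrQuery("*:*");
        query.set("distrib", "false");                              // stay on this core, do not fan out
        query.setFields("creationDate", "ord:rord(creationDate)");  // rord as computed on this shard only
        query.setRows(10);

        core.query(query).getResults().forEach(doc ->
                System.out.println(doc.getFieldValue("creationDate") + " -> " + doc.getFieldValue("ord")));

        core.close();
    }
}
```

Running the same query against each core separately makes it clear that the values start from 1 again on every shard, so they are only comparable within a single shard.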
Going in another direction
In order to have a valid solution we need comparable values across shards. One way to do this is described in Solr’s ReciprocalFloatFunction (which is the code behind the recip function query). Quoting from the JavaDoc:
These properties make this an ideal function for boosting more recent documents.
Example: recip(ms(NOW,mydatefield),3.16e-11,1,1)
A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1, a date a year in the past will get a multiplier of about 1/(1+1) or 1/2, and a date two years old will yield 1/(2+1) or 1/3.
It looks like we could make use of the ms function to get the difference between two dates and pass it as input to the recip() function. But…
Unpleasant Surprise…again
The above function requires dates to be stored in date-specific index field types (DatePointField, TrieDateField), as described in https://solr.apache.org/guide/8_9/function-queries.html#ms-function. What happens if, for some strange reason, you are stuck with dates represented as yyyyMMdd strings? One option is to change the field type and re-index everything, but this is not always straightforward, especially in production.
Another way to go about this is to make use of the strdist function, as described at https://solr.apache.org/guide/8_9/function-queries.html#strdist-function, and provide a custom distance-measure implementation that returns the difference in number of days from the current date.
The example call would be something like:

`recip(strdist(creationDate, "20181010", org.custom.solr.DateDistance), 0.0027, 1000, 1000)`
DateDistance returns the difference between the two dates in days. The 0.0027 is roughly 1/365, since we are now converting days to years instead of milliseconds to years, as described in the JavaDoc. A sample DateDistance implementation follows.
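The sketch below assumes the dates come in as yyyyMMdd strings and implements Lucene’s org.apache.lucene.search.spell.StringDistance interface; it approximates months as 30 days and years as 365 days, which is good enough for a recency boost and roughly in line with the numbers shown further down.

```java
package org.custom.solr;

import java.time.LocalDate;
import java.time.Period;
import java.time.format.DateTimeFormatter;

import org.apache.lucene.search.spell.StringDistance;

/**
 * Treats both strings as yyyyMMdd dates and returns (approximately)
 * how many days apart they are.
 */
public class DateDistance implements StringDistance {

    @Override
    public float getDistance(String date1, String date2) {
        LocalDate d1 = LocalDate.parse(date1, DateTimeFormatter.BASIC_ISO_DATE);
        LocalDate d2 = LocalDate.parse(date2, DateTimeFormatter.BASIC_ISO_DATE);
        LocalDate earlier = d1.isBefore(d2) ? d1 : d2;
        LocalDate later = d1.isBefore(d2) ? d2 : d1;

        // Approximate the gap in days: years count as 365 days, months as 30
        Period gap = Period.between(earlier, later);
        return gap.getYears() * 365 + gap.getMonths() * 30 + gap.getDays();
    }
}
```

The compiled class just needs to be on Solr’s classpath (e.g. packaged as a jar and placed in a lib directory referenced by the core) so that it can be looked up by its fully qualified name.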
We’ve left out the last two params of recip, for which we’ve been using the value 1000 so far. These values control how heavy the penalty on older documents is: the bigger the constant, the smoother the fade-out effect, meaning older dates are not penalized as heavily. In the following examples assume that the current date is 20181010 and all dates are compared to that. In parentheses you can see the values used for the last two params of the recip function; for the last column, where the constant used is 15, we’d have recip(strdist(creationDate, "20181010", org.custom.solr.DateDistance), 0.0027, 15, 15):
| creationDate | Score (1,1) | Score (5,5) | Score (15,15) |
|---|---|---|---|
| 20181010 | 1.0 | 1.0 | 1.0 |
| 20180505 | 0.70497006 | 0.9227646 | 0.9728573 |
| 20171010 | 0.50365144 | 0.8353521 | 0.9383504 |
| 20100505 | 0.10749799 | 0.3758692 | 0.64370775 |
| 20001010 | 0.05336464 | 0.21988654 | 0.4581692 |
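As a quick sanity check, 20171010 is 365 days before 20181010, so the (5,5) column gives 5/(0.0027*365+5) = 5/5.9855 ≈ 0.835, matching the table.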
A new door is now wide open
It turns out you can compare almost anything using a custom StringDistance implementation. Say you want your search engine to give you back documents similar to the one you are viewing now, or in other words content-based recommendations. You apply your machine learning magic and produce a vector representation for each document, which is stored in the index in JSON format. This is usually an array of floats or doubles. Then, at query time, you apply cosine similarity between the vectors in order to find the ones most closely related to the original. One possible implementation of StringDistance could be the following:
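Again a purely illustrative sketch: the class name is made up, the vectors are expected to be plain JSON arrays of numbers of equal length (e.g. "[0.1, 0.2, 0.3]"), and the JSON parsing is kept deliberately naive.

```java
package org.custom.solr;

import org.apache.lucene.search.spell.StringDistance;

/**
 * Treats both strings as JSON arrays of numbers and returns the
 * cosine similarity of the two vectors.
 */
public class CosineSimilarityDistance implements StringDistance {

    @Override
    public float getDistance(String vector1, String vector2) {
        double[] v1 = parse(vector1);
        double[] v2 = parse(vector2);

        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        return (float) (dot / (Math.sqrt(norm1) * Math.sqrt(norm2)));
    }

    // Naive parser for a flat JSON array of numbers; a real implementation
    // would use a proper JSON library and validate its input
    private double[] parse(String json) {
        String[] tokens = json.replace("[", "").replace("]", "").split(",");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i].trim());
        }
        return values;
    }
}
```

Such a class could then be used in a sort like strdist(embedding, "[0.1, 0.2, 0.3]", org.custom.solr.CosineSimilarityDistance) desc, where embedding is a hypothetical string field holding the document vector and the second argument is the vector of the document currently being viewed.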
A couple of unit tests can verify that the implementation behaves as expected.
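A minimal test sketch, assuming JUnit 4 and the illustrative CosineSimilarityDistance above:

```java
package org.custom.solr;

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CosineSimilarityDistanceTest {

    private final CosineSimilarityDistance similarity = new CosineSimilarityDistance();

    @Test
    public void identicalVectorsScoreOne() {
        assertEquals(1.0, similarity.getDistance("[1.0, 2.0, 3.0]", "[1.0, 2.0, 3.0]"), 0.0001);
    }

    @Test
    public void orthogonalVectorsScoreZero() {
        assertEquals(0.0, similarity.getDistance("[1.0, 0.0]", "[0.0, 1.0]"), 0.0001);
    }

    @Test
    public void oppositeVectorsScoreMinusOne() {
        assertEquals(-1.0, similarity.getDistance("[1.0, 2.0]", "[-1.0, -2.0]"), 0.0001);
    }
}
```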
Keep in mind that native support for dense vector search is now part of Lucene/Solr 9.0 (https://sease.io/2022/01/apache-solr-neural-search.html) and future solutions should be based on that before trying anything else. I haven’t personally tested it yet though.
Takeaways
StringDistance seems like a small door on one side that opens up infinite possibilities on the other. In the paragraphs above we’ve seen:
- A custom StringDistance implementation that boosts more recent documents
- A custom StringDistance implementation that gives us back content-based recommendations
But it must be used with care: the function will be evaluated for every document matching the query, so it would not come as a surprise if many use cases suffered performance-wise. Always measure performance for your own use case as part of the proof of concept!