Introduction
One of the most interesting Solr features is definitely “Function Queries”. They allow you to dynamically alter the ranking of your documents by applying functions on existing index fields. More details and a complete list of the available functions are documented at https://solr.apache.org/guide/8_9/function-queries.html
Use case - Boost recent documents
A pretty common requirement in search results ranking is to combine the absolute relevancy of the results (be it tf/idf, BM25 or whatever) with a boost on the most recent documents. Going through the list of available functions, one can see that the recip function (https://solr.apache.org/guide/8_9/function-queries.html#recip-function) is well suited for this. In fact, one of the syntax examples in the documentation does just that:
`recip(rord(creationDate),1,1000,1000)`
In order to combine the output of this function with the default relevancy score, you have to get a bit more complex and put something along these lines in your sort parameter:

`product(query($q),recip(rord(creationDate),1,1000,1000)) desc`
If you split it into its two parts, `query($q)` for the relevancy score (x) and `recip(rord(creationDate),1,1000,1000)` for the recency score (y), it can be translated to:
Calculate the query relevancy score (x), calculate the recency score (y), multiply them together (product(x,y)) and sort results in descending order.
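As an illustration, a minimal SolrJ sketch of such a request could look like the one below. The base URL, collection name and query are made up; the interesting part is the sort expression, where query($q) stands for the relevancy score of the main query.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RecencyBoostExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical Solr instance and collection
        HttpSolrClient client = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        SolrQuery query = new SolrQuery("title:solr");
        // Multiply the relevancy score of the main query with the recency
        // score produced by recip/rord and sort on the product
        query.setSort(SolrQuery.SortClause.desc(
                "product(query($q),recip(rord(creationDate),1,1000,1000))"));

        QueryResponse response = client.query(query);
        response.getResults().forEach(doc -> System.out.println(doc.getFieldValue("creationDate")));

        client.close();
    }
}
```

The same expression can of course be passed directly as the sort request parameter of a plain HTTP query.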
Diving into the details
Let’s take a step back though and check how recip works. Quoting from the official Solr documentation page, we have:
recip(x,m,a,b) performs a reciprocal function with recip(x,m,a,b) implementing a/(m*x+b) where m,a,b are constants, and x is any arbitrarily complex function.
When a and b are equal, and x>=0, this function has a maximum value of 1 that drops as x increases. Increasing the value of a and b together results in a movement of the entire function to a flatter part of the curve.
These properties can make this an ideal function for boosting more recent documents when x is rord(datefield).
The above gives us enough information on why rord(creationDate) is required in recip(rord(creationDate),1,1000,1000). The output of recip decreases as x (or, in our case, rord(creationDate)) increases. Since we want to penalize older dates, we need them to end up with bigger values, and this is exactly what rord, which stands for reverse ordering, ensures. Therefore, the older the date, the greater the value of rord() and the smaller the value of recip, which always stays in the [0,1] range.
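For example, with recip(x, 1, 1000, 1000) the formula becomes 1000/(x+1000): x=1 gives roughly 0.999, x=1000 gives 0.5 and x=10000 gives roughly 0.09, so the score fades out smoothly as x grows.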
If we have 3 documents in the index with 3 different dates, we’d have the following output for rord:

| creationDate | rord(creationDate) |
|---|---|
| 20210101 | 1 |
| 20190101 | 2 |
| 19900101 | 3 |
Unpleasant surprise
We silently assumed that all of the above documents were stored in a single index. But what happens if we have a 3-shard setup and each of the 3 documents is stored in a different shard? The above table would then turn into the following:
| shard | creationDate | rord(creationDate) |
|---|---|---|
| shard1 | 20210101 | 1 |
| shard2 | 20190101 | 1 |
| shard3 | 19900101 | 1 |
The rord() function is applied on each shard separately. Therefore, since we have a single document in each shard, each of the documents gets the value 1 assigned. The bottom line is that we don’t get comparable values across shards. In practice this might not be an issue, because one can assume a similar distribution of dates across shards. Still, I would not consider this a bullet-proof approach: we might have to deal with shards that are skewed in terms of the dates they contain, or with cases where documents are routed to shards based on creationDate.
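If you want to verify this behaviour yourself, one way is to fire a non-distributed query at a single core and return the rord value as a pseudo-field. The sketch below uses SolrJ and assumes direct access to one replica of a hypothetical articles collection.

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class PerShardRordExample {

    public static void main(String[] args) throws Exception {
        // Hypothetical core URL; point it at one replica of your own collection
        HttpSolrClient core = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/articles_shard1_replica_n1").build();

        SolrQuery query = new SolrQuery("*:*");
        query.set("distrib", "false");                              // stay on this core, do not fan out
        query.setFields("creationDate", "ord:rord(creationDate)");  // rord as computed on this shard only
        query.setRows(10);

        core.query(query).getResults().forEach(doc ->
                System.out.println(doc.getFieldValue("creationDate") + " -> " + doc.getFieldValue("ord")));

        core.close();
    }
}
```

Running the same query against each core separately makes it clear that the values start from 1 again on every shard, so they are only comparable within a single shard.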
Going in another direction
In order to have a valid solution we need comparable values across shards. One way to do this is described in Solr’s ReciprocalFloatFunction (which is the code behind the recip function query). Quoting from the JavaDoc:
These properties make this an ideal function for boosting more recent documents.
Example: recip(ms(NOW,mydatefield),3.16e-11,1,1)
A multiplier of 3.16e-11 changes the units from milliseconds to years (since there are about 3.16e10 milliseconds per year). Thus, a very recent date will yield a value close to 1/(0+1) or 1, a date a year in the past will get a multiplier of about 1/(1+1) or 1/2, and a date two years old will yield 1/(2+1) or 1/3.
It looks like we could make use of the ms function to get the difference between two dates and pass it as input to the recip() function. But…
Unpleasant Surprise…again
The above function requires dates to be stored in date-specific index field types (DatePointField, TrieDateField), as described in https://solr.apache.org/guide/8_9/function-queries.html#ms-function. What happens if, for some strange reason, you are stuck with dates represented as yyyyMMdd strings? One option is to change the field type and re-index everything, but this is not always straightforward, especially in production.
Another way to go about this is to make use of the strdist function, as described at https://solr.apache.org/guide/8_9/function-queries.html#strdist-function, and provide a custom distance-measure implementation that returns the difference in number of days from the current date.
The example call would be something like:

`recip(strdist(creationDate, "20181010", org.custom.solr.DateDistance), 0.0027, 1000, 1000)`
DateDistance returns the difference between the two dates in days. The 0.0027 is roughly 1/365, since we are now converting days to years instead of milliseconds to years, as described in the JavaDoc. A sample DateDistance implementation follows.
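The sketch below assumes the dates come in as yyyyMMdd strings and implements Lucene’s org.apache.lucene.search.spell.StringDistance interface; it approximates months as 30 days and years as 365 days, which is good enough for a recency boost and roughly in line with the numbers shown further down.

```java
package org.custom.solr;

import java.time.LocalDate;
import java.time.Period;
import java.time.format.DateTimeFormatter;

import org.apache.lucene.search.spell.StringDistance;

/**
 * Treats both strings as yyyyMMdd dates and returns (approximately)
 * how many days apart they are.
 */
public class DateDistance implements StringDistance {

    @Override
    public float getDistance(String date1, String date2) {
        LocalDate d1 = LocalDate.parse(date1, DateTimeFormatter.BASIC_ISO_DATE);
        LocalDate d2 = LocalDate.parse(date2, DateTimeFormatter.BASIC_ISO_DATE);
        LocalDate earlier = d1.isBefore(d2) ? d1 : d2;
        LocalDate later = d1.isBefore(d2) ? d2 : d1;

        // Approximate the gap in days: years count as 365 days, months as 30
        Period gap = Period.between(earlier, later);
        return gap.getYears() * 365 + gap.getMonths() * 30 + gap.getDays();
    }
}
```

The compiled class just needs to be on Solr’s classpath (e.g. packaged as a jar and placed in a lib directory referenced by the core) so that it can be looked up by its fully qualified name.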
We’ve left out the last two params of recip, for which we’ve been using the value 1000 so far. These values control how heavy the penalty on older documents is: the bigger the constant, the smoother the fade-out effect, meaning older dates are not penalized as heavily. In the following examples assume that the current date is 20181010 and all dates are compared to that. In parentheses you can see the values used for the last two params of the recip function; for the last column, where the constant used is 15, we’d have recip(strdist(creationDate, "20181010", org.custom.solr.DateDistance), 0.0027, 15, 15):
| creationDate | Score (1,1) | Score (5,5) | Score (15,15) |
|---|---|---|---|
| 20181010 | 1.0 | 1.0 | 1.0 |
| 20180505 | 0.70497006 | 0.9227646 | 0.9728573 |
| 20171010 | 0.50365144 | 0.8353521 | 0.9383504 |
| 20100505 | 0.10749799 | 0.3758692 | 0.64370775 |
| 20001010 | 0.05336464 | 0.21988654 | 0.4581692 |
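As a quick sanity check, 20171010 is 365 days before 20181010, so the (5,5) column gives 5/(0.0027*365+5) = 5/5.9855 ≈ 0.835, matching the table.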
A new door is now wide open
It turns out you can compare almost anything using a custom StringDistance implementation. Say you want your search engine to give you back documents similar to the one you are viewing now, or in other words content-based recommendations. You apply your machine learning magic and produce a vector representation for each document, which is stored in the index in JSON format. This is usually an array of floats or doubles. Then, at query time, you apply cosine similarity between the vectors in order to find the ones most closely related to the original. One possible implementation of StringDistance could be the following:
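Again a purely illustrative sketch: the class name is made up, the vectors are expected to be plain JSON arrays of numbers of equal length (e.g. "[0.1, 0.2, 0.3]"), and the JSON parsing is kept deliberately naive.

```java
package org.custom.solr;

import org.apache.lucene.search.spell.StringDistance;

/**
 * Treats both strings as JSON arrays of numbers and returns the
 * cosine similarity of the two vectors.
 */
public class CosineSimilarityDistance implements StringDistance {

    @Override
    public float getDistance(String vector1, String vector2) {
        double[] v1 = parse(vector1);
        double[] v2 = parse(vector2);

        double dot = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;
        for (int i = 0; i < v1.length; i++) {
            dot += v1[i] * v2[i];
            norm1 += v1[i] * v1[i];
            norm2 += v2[i] * v2[i];
        }
        return (float) (dot / (Math.sqrt(norm1) * Math.sqrt(norm2)));
    }

    // Naive parser for a flat JSON array of numbers; a real implementation
    // would use a proper JSON library and validate its input
    private double[] parse(String json) {
        String[] tokens = json.replace("[", "").replace("]", "").split(",");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i].trim());
        }
        return values;
    }
}
```

Such a class could then be used in a sort like strdist(embedding, "[0.1, 0.2, 0.3]", org.custom.solr.CosineSimilarityDistance) desc, where embedding is a hypothetical string field holding the document vector and the second argument is the vector of the document currently being viewed.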
A couple of unit tests can verify that the implementation behaves as expected.
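A minimal test sketch, assuming JUnit 4 and the illustrative CosineSimilarityDistance above:

```java
package org.custom.solr;

import static org.junit.Assert.assertEquals;

import org.junit.Test;

public class CosineSimilarityDistanceTest {

    private final CosineSimilarityDistance similarity = new CosineSimilarityDistance();

    @Test
    public void identicalVectorsScoreOne() {
        assertEquals(1.0, similarity.getDistance("[1.0, 2.0, 3.0]", "[1.0, 2.0, 3.0]"), 0.0001);
    }

    @Test
    public void orthogonalVectorsScoreZero() {
        assertEquals(0.0, similarity.getDistance("[1.0, 0.0]", "[0.0, 1.0]"), 0.0001);
    }

    @Test
    public void oppositeVectorsScoreMinusOne() {
        assertEquals(-1.0, similarity.getDistance("[1.0, 2.0]", "[-1.0, -2.0]"), 0.0001);
    }
}
```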
Keep in mind that native support for dense vector search is now part of Lucene/Solr 9.0 (https://sease.io/2022/01/apache-solr-neural-search.html) and future solutions should be based on that before trying anything else. I haven’t personally tested it yet though.
Takeaways
StringDistance seems like a small door on one side that opens up infinite possibilities on the other. In the paragraphs above we’ve seen:
- A custom StringDistance implementation that boosts more recent documents
- A custom StringDistance implementation that gives us back content-based recommendations
But it must be used with care: the function will be evaluated for every document matching the query, so it would not come as a surprise if many use cases suffered performance-wise. Always measure performance for your own use case as part of the proof of concept!