Introduction
In this post we’ll focus on SolrCloud stability and search performance. Most of the lessons have been learned the hard way, fighting these issues in production. The list is nowhere near complete, and rest assured that there are more Solr performance-related stories to tell. Take note that the ones discussed below are not covered in great detail, as that would make for a very long post. Some of them are even worth their own individual post.
Potential issues causing cluster instability
The following points fall under the “abusing Solr” category. Just as the JOIN functionality provided by RDBMSs does not mean that every JOIN query you can imagine will be performant, there are some usages of Solr that are not ideal performance-wise.
Avoid search calls using .setRows(Integer.MAX_VALUE)
Avoid setting Integer.MAX_VALUE or other very high values (usually when doing paging with the start and rows params) in the .setRows() call when preparing a SolrQuery. It can wreak havoc on the Solr cluster: Solr basically tries to allocate as much memory as indicated by the value passed to .setRows(), which is not a good thing. I haven’t seen it happen with my own eyes, but others have. Check the Solr Wiki and Statsbiblioteket.
Alternatives:
- For real-time requests, first get the # of docs in the index and pass this to .setRows() instead. For calls where you know what to expect, just pass the expected number of results.
- For “next-page” functionality and offline procedures, cursorMark can be utilized (see the sketch below).
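Here is a minimal SolrJ sketch of cursor-based paging, assuming a node reachable at a placeholder URL, a collection named mycollection and a uniqueKey field named id (all illustrative):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.params.CursorMarkParams;

    public class CursorPaging {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                SolrQuery query = new SolrQuery("*:*");
                query.setRows(500);                            // a sane page size instead of Integer.MAX_VALUE
                query.setSort(SolrQuery.SortClause.asc("id")); // cursorMark requires a sort on the uniqueKey
                String cursorMark = CursorMarkParams.CURSOR_MARK_START;
                boolean done = false;
                while (!done) {
                    query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursorMark);
                    QueryResponse rsp = client.query(query);
                    rsp.getResults().forEach(doc -> { /* process each document */ });
                    String nextCursorMark = rsp.getNextCursorMark();
                    // When the cursor stops advancing, all results have been consumed
                    done = cursorMark.equals(nextCursorMark);
                    cursorMark = nextCursorMark;
                }
            }
        }
    }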
Avoid using deleteByQuery
Mixing indexing calls with deleteByQuery() commands is just asking for trouble. If you witness seemingly random replica crashes and/or replicas getting stuck in recovery mode, it might be related to deleteByQuery calls interspersed with indexing requests.
Alternatives:
- Make use of deleteById() directly if possible.
- Try searching first using the query that would be passed to deleteByQuery(), collect the IDs and call deleteById() (see the sketch below).
- Get rid of your deletes altogether by making use of atomic updates. This comes with its own limitations, but you can check whether it fits your use case.
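As an illustration of the second alternative, here is a minimal SolrJ sketch; the example query string and the uniqueKey field name id are assumptions:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrDocument;

    public class DeleteByIds {
        // Replaces client.deleteByQuery(deleteQuery) with a search followed by deleteById
        static void deleteMatching(SolrClient client, String deleteQuery) throws Exception {
            SolrQuery query = new SolrQuery(deleteQuery); // e.g. the hypothetical "category:obsolete"
            query.setFields("id");   // only the uniqueKey is needed
            query.setRows(500);      // for very large result sets, page with cursorMark instead

            QueryResponse rsp = client.query(query);
            List<String> ids = new ArrayList<>();
            for (SolrDocument doc : rsp.getResults()) {
                ids.add(String.valueOf(doc.getFieldValue("id")));
            }
            if (!ids.isEmpty()) {
                client.deleteById(ids); // by-id deletes don't disrupt concurrent indexing the way deleteByQuery can
                client.commit();
            }
        }
    }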
Improving search throughput
Adjust maxConnectionsPerHost to your needs
The story goes as follows. Search on a single-sharded SolrCloud setup with a 10M-doc/350GB index and high-cardinality fields was slow, especially when faceting was utilized. A simple search from the Solr admin page with faceting enabled could take ~1.5sec. After the index was split into 8 shards, the same query’s response time dropped to ~200msec! That’s great, I thought, problem solved. But it wasn’t to be. Running a load test showed that after a certain rate (>20QPS) performance dropped dramatically. So, what was going on? At first I noticed that having more than 2 shards on the same Solr JVM was causing slowness, but I could not figure out why this was happening. The CPUs on the machine were more than enough to serve requests across multiple shards. It turns out there’s a property that can be placed inside solr.xml under the HttpShardHandlerFactory node. It’s called maxConnectionsPerHost, and increasing it (the default value is 20) gives Solr and inter-shard communication some fresh air: response times are back to the expected values (~200msec) even under heavy load.
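For reference, a sketch of the relevant solr.xml section; the value of 100 below is purely illustrative, not a recommendation:

    <solr>
      <!-- HTTP client used for inter-shard (distributed) requests -->
      <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
        <int name="maxConnectionsPerHost">100</int> <!-- default is 20 -->
      </shardHandlerFactory>
    </solr>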
Check your FilterCache hit ratio (especially in distributed search)
Searching in a distributed index makes heavy use of the FilterCache in a way that is not evident when searching a single shard. Faceting in particular seems to generate and store multiple entries in the FilterCache. In these cases it’s better to configure your FilterCache using maxRamMB, as faceting adds multiple entries of small size.
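A sketch of a filterCache bounded by memory rather than entry count, placed in the query section of solrconfig.xml; the 256MB figure is an arbitrary example, not a recommendation:

    <!-- inside the <query> section of solrconfig.xml -->
    <filterCache class="solr.LRUCache"
                 maxRamMB="256"
                 autowarmCount="0"/>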
Profile indexing/search and pay attention to your custom analyzers/tokenizers/filters
You don’t really need any exotic tools for this one. Tools provided by the JDK (like JVisualVM) can monitor threads/memory activity while you index/search. Look for your custom classes and how much time is spent on them.
If you have GC choking issues, try the G1 collector.
It can do miracles. I witnessed a setup where the default collector used by SolrCloud could not cope with the load after some time; just switching to G1 fixed things. GC tuning is very dependent on each use case, so only the most important parameters are mentioned below (a combined example follows the list):
- -XX:+ParallelRefProcEnabled: enable it when the Ref Proc and Ref Enq phases take too much time. More details at https://docs.oracle.com/javase/9/gctuning/garbage-first-garbage-collector-tuning.htm#JSGCT-GUID-90E30ACA-8040-432E-B3A0-1E0440AB556A
- -XX:MaxGCPauseMillis: the maximum GC pause duration that G1 should try to achieve. Don’t be too optimistic here by setting a very low value, as G1 might struggle trying to keep up.
- -XX:ParallelGCThreads: sets the number of threads used during the parallel phases of the garbage collector. If you have the CPUs, try to increase it; it can have a big impact on performance.
- -XX:G1HeapRegionSize: depends on the total heap space allocated. The target is to have ~2048 G1 regions, according to the official documentation at http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html
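For illustration, a possible GC_TUNE setting in solr.in.sh combining the flags above; the numbers assume a hypothetical node with a ~32GB heap and 8 cores and must be adapted to your own heap size and hardware:

    # solr.in.sh - illustrative values only
    GC_TUNE="-XX:+UseG1GC \
      -XX:+ParallelRefProcEnabled \
      -XX:MaxGCPauseMillis=250 \
      -XX:ParallelGCThreads=8 \
      -XX:G1HeapRegionSize=16m"   # ~32GB heap / 16MB regions gives ~2048 regions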
Utilize facet.threads
In practice, the use of facet.threads has improved search response times when faceting is enabled, but sharding the index will probably still yield better response times for faceting and search in general. There’s no point in setting this parameter to a value higher than the # of fields you are faceting on; it should be at most equal to the # of facet fields (see the example below).
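For example, a SolrJ request faceting on three fields could cap facet.threads at 3; the field names are hypothetical:

    import org.apache.solr.client.solrj.SolrQuery;

    public class FacetThreadsExample {
        static SolrQuery facetedQuery() {
            SolrQuery query = new SolrQuery("*:*");
            query.setFacet(true);
            query.addFacetField("brand", "color", "category"); // three hypothetical facet fields
            query.set("facet.threads", 3);                      // at most the # of facet fields
            return query;
        }
    }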
Use DocValues for faceting
This comes directly from the Solr documentation; although I don’t have any serious metrics, I’ll take their word for it.
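As a reminder of what this looks like, a schema field definition with docValues enabled; the field name and type are illustrative:

    <!-- string facet field with docValues enabled (schema.xml / managed-schema) -->
    <field name="brand" type="string" indexed="true" stored="true" docValues="true"/>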
Pay attention to your deleted documents
Maybe you have a perfectly healthy SolrCloud instance which performs great. Time goes by and after a few months you notice a dramatic degradation in search performance. Code is the same, hardware is the same, traffic is the same, query and indexing patterns are the same. It’s time to have a look at the percentage of the deleted documents in your index. It might be responsible for this degradation and you have to somehow limit it.
Lucene - and therefore Solr - makes use of certain append-only data structures when storing the terms in the index segments. More details on the underlying data structures (SSTables), how they work and what their benefits are can be found in Chapter 3 of the great Designing Data-Intensive Applications book (https://dataintensive.net/).
This means that whenever something is deleted it is just marked as such and is not actually removed from the index. Unsurprisingly, your index size grows as more and more documents get deleted. And if that wasn’t enough, your search queries also get slower, as the reader has to traverse more documents. In the meantime, the so-called merger that runs periodically in the background is responsible for merging segments (and removing the deleted entries) once the deletion ratio goes above a certain threshold. But for some use cases this threshold might prove to be too high, and it’s not configurable.
Segment handling in Lucene/Solr, as well as the handling of deleted documents, is a black art (or black magic for that matter). More info on how merging can be enforced can be found at https://sematext.com/blog/solr-optimize-is-not-bad-for-you-lucene-solr-revolution/. There is an HTTP API that can get you out of this mess (see the sketch below). In later versions, support is coming for configuring this threshold.
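As an illustration (with the caveats from the linked post about when optimizing is worth the cost), a hedged SolrJ sketch of reclaiming deleted documents; the URL, collection name and segment count are placeholders:

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;

    public class ReclaimDeletes {
        public static void main(String[] args) throws Exception {
            try (SolrClient client =
                     new HttpSolrClient.Builder("http://localhost:8983/solr/mycollection").build()) {
                // Force-merge (optimize) down to a target number of segments, dropping deleted docs on the way.
                // HTTP equivalent: /solr/mycollection/update?optimize=true&maxSegments=8
                // This rewrites large parts of the index, so schedule it off-peak.
                client.optimize(true, true, 8); // waitFlush, waitSearcher, maxSegments
            }
            // A lighter-weight option is a commit with expungeDeletes
            // (HTTP: /solr/mycollection/update?commit=true&expungeDeletes=true),
            // which only merges segments whose deleted-document ratio exceeds the merge policy's threshold.
        }
    }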
Note: All of the above were tested on SolrCloud 7.4.1, so some of those hints may not be applicable in recent versions.