Fault tolerant services

Introduction

We live in a software world where distributed applications are common. They usually include lots of independently deployed services that communicate with each other. Sure, there are many benefits from this approach, but it goes without saying that it also introduces some new problems that need our attention.

One of those is fault tolerance. For example, what happens when one of the services degrades in performance? If you don’t take any measures the application that makes the calls to that “problematic” service will start experiencing long timeouts and might even go down if threads accumulate while waiting for a response. You need to have some safeguards in-place.

It’s so common of a problem that there are open source tools available like Hystix (https://github.com/Netflix/Hystrix) and resilience4j (https://github.com/resilience4j/resilience4j) that deal with it. And most of the times taking advantage of such battle-tested tools is the way to go.

Here, we’ll look at what might lurk inside such a fault-tolerant library. Think of it more like an exercise but it can also prove useful if you need something in-place without adding more dependencies in an existing library-bloated application. (Anyone having lived through JAR hell https://en.wikipedia.org/wiki/Java_Classloader#JAR_hell can relate to that)

The basic ideas behind this are described in https://github.com/Netflix/Hystrix/wiki#how-does-hystrix-accomplish-its-goals

Initial take

Imagine a scenario where you have a web application running on an application server like Apache Tomcat. Every click on the website is picked up by a thread from Tomcat’s thread pool. At some point, in order to accomplish a task it has to communicate with a Service through an API call like

1

ServiceClient.serve(req);

The ServiceClient performs HTTP calls to the independent service. If the Service degrades then more and more Tomcat threads are getting piled up waiting for the Service to respond. In a busy web server this behavior can even bring down the server due to the high number of threads. Other symptoms include increased memory usage, increased CPU context switching or even inability to create new threads.

Setting an upper bound

You can put a limit to that by wrapping these requests to the Service in another thread. One way to go about this is by using a bounded thread pool in Java like this

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15


final int COMPUTATION_TIMEOUT = 5;
final int NUM_THREADS = 10;

ExecutorService executor = Executors.newFixedThreadPool(NUM_THREADS);

Supplier<Response> req = () -> {
                try {
                    return ServiceClient.serve(requestParams);
                } catch (Exception e) {
                    throw new CompletionException(e);
                }
            };
CompletableFuture<Response> result = CompletableFuture.supplyAsync(req, executor);

return result.get(COMPUTATION_TIMEOUT, TimeUnit.SECONDS);

If that was written in plain English, it’d go like

"Look, I won't allow your Service to bring me down. You can have at most 10 simultaneous requests to the Service but no more. And each one of these will be allowed a max of 5 sec to complete. If it does not complete in 5sec you'll get a `TimeoutException` and the Tomcat thread will be allowed to return immediately."

This is certainly a step in the right direction but we still have a blind spot here. What happens if there’s a surge of activity and these 10 threads are not enough to handle the extra traffic? Checking the JavaDoc of .newFixedThreadPool() it appears that Java will add those extra tasks into an unbounded queue. The word unbounded just smells trouble. We might end-up putting extreme pressure on the web server by letting this queue go wild. Now, what are our options?

One step further

Depending on your use-case it might be proper to define a different strategy for queuing. One such is rejecting the incoming requests when all threads are busy might be a viable solution. For example, If you click to reserve a room or perform a search operation you can still get a message “Please retry” and issue the command again.

Thankfully you can configure this behavior by introducing a custom ThreadPoolExecutor

1
2
3
4
5
6


ThreadPoolExecutor executor = new ThreadPoolExecutor(
	corePoolSize, 
	maxNumThreads,
    keepAliveTime, keepAliveTimeUnit, 
	queue);

The above example creates a DirectHandOff type pool. Having
corePoolSize=20, maxNumThreads=100, keepAliveTime=1, keepAliveTimeUnit=TimeUnit.MINUTES, queue=new SynchronousQueue<>() the above translates into English as follows

"The pool starts with 20 threads. If all 20 threads are busy and new requests keep coming in, new threads will be spawned till the upper limit of 100 threads is reached. If all 100 threads are busy new requests will be rejected with a `RejectedExecutionException`. If the load decreases the pool will start killing the threads that sit idle for more than 1min until it goes back to 20 threads."

A nice “side-effect” of the above ThreadPoolExecutor is that the size of the pool is adaptable to the web server’s traffic by fluctuating between corePoolSize and maxNumThreads depending on the load.

One last thing. In practice it’s also useful to give names to those threads. You can thank yourself later while debugging and reading thread dumps. A custom ThreadPoolExecutor is all that is needed

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


// ThreadFactoryBuilder is used here for simplicity, it's a Google Guava class
ThreadFactory threadFactory = new ThreadFactoryBuilder()
                                                .setNameFormat(poolName + "-%d")
                                                .build();

ThreadPoolExecutor executor = new ThreadPoolExecutor(
	corePoolSize, 
	maxNumThreads,
    keepAliveTime, keepAliveTimeUnit, 
	new SynchronousQueue<>(),            
	threadFactory);

Linking your work to a circuit breaker

Finally, after all the above are setup properly you can combine your new thread pool with a circuit breaker. A simplified circuit breaker records successes and failures and has 2 corresponding thresholds. When there are FAILURE_THRESHOLD number of consecutive failed requests the circuit is open, meaning new requests are not forwarded to the service. The circuit remains semi-open for DELAY number of TIME_UNIT. Once you get SUCCESS_THRESHOLD number of consecutive successful request the circuit will move from semi-open to close and communication with the service resumes normally.
The only thing that needs care is when to increase the counter of failed requests. Usually certain exceptions should be taken into account, in our case these include RejectedExecutionException and TimeoutException.

1
2
3
4
5


CircuitBreaker<Object> breaker = new CircuitBreaker<>()
                    .withFailureThreshold(FAILURE_THRESHOLD, 20))
                    .withSuccessThreshold(SUCCESS_THRESHOLD, 10))                          .withDelay(Duration.ofMillis(DELAY, 20, TIME_UNIT)));
        }
    }

Downsides

The above approach will work just fine. The main downsides is that the limits defined in ThreadPoolExecutor and CircuitBreaker are predefined and static. You need to understand your system’s capacity and your expected traffic for setting the proper values without hindering your system’s scalability.

In order to overcome this you need to work towards more adaptive implementations that react to an application’s real-time performance.

Takeaways

Communication between services is of critical importance. Especially nowadays with applications having many moving parts, a single failure can cascade to the whole system. We must be very prudent when designing such a system.

One cannot simply deploy a bunch of services that communicate with each other in order to bring something meaningful to the user and just hope for the best. It’s better to assume that something will fail and be sure that most (if not all) of the times it will.

The basic steps that have discussed to achieve basic fault tolerance in your application are

Wrap each request in a new thread
Make sure your thread is part of a bounded thread pool
Set timeouts for each request
Reject additional requests when the threshold is exceeded
Trigger a circuit breaker in case of failures