- Always try to figure out whether code that crashes in a very big environment (i.e., in this case, 100+ asynchronous connection requests) also fails in a much smaller one (i.e., in this case, around 5 synchronous connection requests).
- Try to determine whether you can reproduce the crash on every run of the server/code.
- If the issue reproduces on every run even with a small number of requests, the bug lies in the normal flow of the code.
- If the issue does not reproduce on every run and/or with a small number of sequential requests, then some part of the code is throwing an exception that inhibits the normal flow of the code.
I implemented the requirements stated in module-6 for caching the restaurant queries using Redis and ensured that my code passed all the unit tests. However, when I load-tested my implementation using JMeter, I found that up to a fixed number of queries the server handled them quite easily, but after that every request took indefinitely long.
Only a fixed number of requests were served by the server; after that, all requests were put on wait indefinitely.
- Hypothesis: I hypothesized that the JedisPool was somehow getting exhausted, so requests were waiting indefinitely for a connection, which was creating those timeouts.
- Experiment: Initially, RedisConfiguration.java contained only this pool setting:
poolConfig.setMaxTotal(128);
To confirm my hypothesis and to turn the infinite wait into a visible failure, I added the line below to RedisConfiguration.java:
poolConfig.setMaxWaitMillis(10000);
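For reference, these two settings sit in a pool setup that looks roughly like this (a sketch of the configuration; the host, port, and surrounding wiring are assumptions, only the two setter calls come from my actual code):

```java
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// Sketch only: the actual field/bean wiring in RedisConfiguration.java may differ.
JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setMaxTotal(128);        // cap the pool at 128 connections
poolConfig.setMaxWaitMillis(10000); // fail after 10 s instead of blocking forever
JedisPool jedisPool = new JedisPool(poolConfig, "localhost", 6379);
```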
- Observation: After setting this configuration, I started getting the exception below in the server log:
Servlet.service() for servlet dispatcherServlet in context with path threw exception Request processing failed; nested exception is redis.clients.jedis.exceptions.JedisException: Could not get a resource from the pool with root cause java.util.NoSuchElementException: Timeout waiting for idle object
Another thing I observed was that the number of successful connections (Response Code: 200) was proportional to the total pool size (i.e., the maximum number of connections that could be made through the JedisPool). For example, when I set the pool size to 2048, it handled up to 2046 requests, and when I set the pool size to 100, it handled up to 99 requests from JMeter.
This exception confirmed my hypothesis that the JedisPool was getting exhausted.
- Hypothesis: Thus, I further hypothesized that the Jedis instances were, for some reason, not being returned to the pool.
- Experiment:
To test this hypothesis, I set breakpoints on the lines where I return the Jedis objects to the pool using
jedis.close()
while running JMeter.
- Observation: But I found that the debugger hit that breakpoint every time.
- Experiment Again: I got stuck at this point and didn't know how to proceed. I got some advice that I should check whether I could reproduce the issue with a pool size of around 10 or less.
- Observation Again: I successfully reproduced the issue with pool sizes of 10 and 5: with a pool of 10, JMeter got up to 9 successful connections, and with a pool of 5, up to 4. Thus, I observed that for pool size = n, the server handles n-1 successful connections.
- Hypothesis: This observation made me wonder whether the issue was due to some bug in the initialization of the cache.
- Experiment: I was further advised to reproduce the issue using cURL instead of JMeter, because cURL lets me send the requests sequentially instead of in parallel, which makes the code much easier to debug.
- Observation: I successfully reproduced the issue with cURL as well.
- Experiment Again: My next experiment was to send a request using cURL with pool size = 1.
- Observation Again: It threw an exception with the message
Could not get a resource from the pool
at the line
jedis = pool.getResource();
- Hypothesis: This made it clear that the bug resided somewhere in the code that runs before this line (maybe in the code where I initialize the JedisPool objects).
- Experiment:
Hence, I started checking the methods RedisConfiguration::isCacheAvailable, RedisConfiguration::initCache, and RedisConfiguration::getJedisPool.
- Observation: After analyzing the methods one by one and carefully stepping through the flow of these three methods line by line with breakpoints, I found that the culprit method responsible for leaking Jedis objects was:
Buggy Code:
public boolean isCacheAvailable() {
    try {
        jedisPool.getResource();
        return true;
    } catch (JedisConnectionException e) {
        return false;
    }
}
Here I was calling JedisPool.getResource() without capturing and closing the Jedis object it returns (by calling jedis.close()), so every call leaked one connection from the pool.
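This leak also explains the n-1 pattern observed earlier: each request leaked one connection in the availability check and borrowed (and returned) a second one for the actual query, so with a pool of size n, the n-th request's query found the pool empty. A toy simulation in plain Java reproduces it (a BlockingQueue standing in for the JedisPool; all class and method names here are made up, not real Jedis API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy simulation of the leak: each String in the queue stands for one connection.
public class LeakDemo {
    static BlockingQueue<String> pool;

    // Mirrors the buggy isCacheAvailable(): borrows a connection, never returns it.
    static boolean leakyIsCacheAvailable() throws InterruptedException {
        return pool.poll(50, TimeUnit.MILLISECONDS) != null;
    }

    // One simulated request: availability check, then borrow/return for the query.
    static boolean handleRequest() throws InterruptedException {
        if (!leakyIsCacheAvailable()) return false;
        String conn = pool.poll(50, TimeUnit.MILLISECONDS);
        if (conn == null) return false; // pool exhausted: the real request would hang
        pool.offer(conn);               // the query path does return its connection
        return true;
    }

    // Returns how many of (poolSize + 1) requests succeed.
    static int run(int poolSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) pool.offer("conn-" + i);
        int ok = 0;
        try {
            for (int i = 0; i < poolSize + 1; i++) {
                if (handleRequest()) ok++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(run(5));  // pool of 5 -> only 4 requests succeed
        System.out.println(run(10)); // pool of 10 -> only 9 requests succeed
    }
}
```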
To correct it, I rewrote the above code as follows:
Corrected Code:
public boolean isCacheAvailable() {
    try (Jedis jedis = jedisPool.getResource()) {
        return true;
    } catch (JedisConnectionException e) {
        return false;
    }
}
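The fix works because try-with-resources guarantees that close() is invoked even when the method returns from inside the try block. A minimal illustration with a stand-in AutoCloseable (FakeJedis is hypothetical, not part of the Jedis API):

```java
public class TryWithResourcesDemo {
    static int closedCount = 0;

    // Stand-in for a Jedis connection; only implements AutoCloseable.
    static class FakeJedis implements AutoCloseable {
        @Override
        public void close() { closedCount++; }
    }

    // Same shape as the corrected isCacheAvailable().
    static boolean check() {
        try (FakeJedis jedis = new FakeJedis()) {
            return true; // close() still runs before the value is returned
        }
    }

    public static void main(String[] args) {
        check();
        System.out.println(closedCount); // the resource was returned/closed once
    }
}
```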
In this way, I was able to debug a non-trivial bug that the unit tests had not caught.