- Always try to figure out whether code that crashes in a very big environment (i.e., in this case, 100+ asynchronous connection requests) also fails in a much smaller one (i.e., in this case, around 5 synchronous connection requests).
- Try to determine whether you can reproduce the crash on every run of the server/code.
- If the issue reproduces on every run even with a small number of requests, the bug lies in the normal flow of the code.
- If the issue does not reproduce on every run and/or with a small number of sequential requests, then some part of the code is throwing an exception that inhibits the normal flow of the code.
I implemented the requirements stated in module-6 for caching the restaurant queries using Redis and ensured that my code passed all the unit tests. However, when I load-tested my implementation using JMeter, I found that up to a fixed number of queries the server handled them quite easily, but after that every request took indefinitely long.
Only a fixed number of requests were served by the server; after that, all requests were put on wait indefinitely.
- Hypothesis: I hypothesized that the JedisPool was somehow getting exhausted, so requests were waiting indefinitely for a connection, which was creating those timeouts.
- Experiment: Initially, RedisConfiguration.java contained only this pool setting:
poolConfig.setMaxTotal(128);
To confirm my hypothesis and to turn the infinite wait into a visible failure, I added the line below to RedisConfiguration.java:
poolConfig.setMaxWaitMillis(10000);
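For reference, these two settings sit in a pool setup that looks roughly like this (a sketch of the configuration; the host, port, and surrounding wiring are assumptions, only the two setter calls come from my actual code):

```java
import redis.clients.jedis.JedisPool;
import redis.clients.jedis.JedisPoolConfig;

// Sketch only: the actual field/bean wiring in RedisConfiguration.java may differ.
JedisPoolConfig poolConfig = new JedisPoolConfig();
poolConfig.setMaxTotal(128);        // cap the pool at 128 connections
poolConfig.setMaxWaitMillis(10000); // fail after 10 s instead of blocking forever
JedisPool jedisPool = new JedisPool(poolConfig, "localhost", 6379);
```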
- Observation: After setting this configuration, I started getting the exception below in the server log:
Servlet.service() for servlet dispatcherServlet in context with path threw exception Request processing failed; nested exception is redis.clients.jedis.exceptions.JedisException: Could not get a resource from the pool with root cause java.util.NoSuchElementException: Timeout waiting for idle object
Another thing I observed was that the number of successful connections (Response Code: 200) was proportional to the total pool size (i.e., the maximum number of connections that could be made through the JedisPool). For example, when I set the pool size to 2048, it handled up to 2046 requests, and when I set the pool size to 100, it handled up to 99 requests from JMeter.
This exception confirmed my hypothesis that the JedisPool was getting exhausted.
- Hypothesis: Thus, I further hypothesized that the Jedis instances were, for some reason, not being returned to the pool.
- Experiment:
To test this hypothesis, I set breakpoints on the lines where I return the Jedis objects to the pool using
jedis.close()
while running JMeter.
- Observation: But I found that the debugger hit that breakpoint every time.
- Experiment Again: I got stuck at this point and didn't know how to proceed. I got some advice that I should check whether I could reproduce the issue with a pool size of around 10 or less.
- Observation Again: I successfully reproduced the issue with pool sizes of 10 and 5: with a pool of 10, JMeter got up to 9 successful connections, and with a pool of 5, up to 4. Thus, I observed that for pool size = n, the server handles n-1 successful connections.
- Hypothesis: This observation made me wonder whether the issue was due to some bug in the initialization of the cache.
- Experiment: I was further advised to reproduce the issue using cURL instead of JMeter, because cURL lets me send the requests sequentially instead of in parallel, which makes the code much easier to debug.
- Observation: I successfully reproduced the issue with cURL as well.
- Experiment Again: My next experiment was to send a request using cURL with pool size = 1.
- Observation Again: It threw an exception with the message
Could not get a resource from the pool
at the line
jedis = pool.getResource();
- Hypothesis: This made it clear that the bug resided somewhere in the code that runs before this line (maybe in the code where I initialize the JedisPool objects).
- Experiment:
Hence, I started checking the methods RedisConfiguration::isCacheAvailable, RedisConfiguration::initCache, and RedisConfiguration::getJedisPool.
- Observation: After analyzing the methods one by one and carefully stepping through the flow of these three methods line by line with breakpoints, I found that the culprit method responsible for leaking Jedis objects was:
Buggy Code:
public boolean isCacheAvailable() {
    try {
        jedisPool.getResource();
        return true;
    } catch (JedisConnectionException e) {
        return false;
    }
}
Here I was calling JedisPool.getResource() without capturing and closing the Jedis object it returns (by calling jedis.close()), so every call leaked one connection from the pool.
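This leak also explains the n-1 pattern observed earlier: each request leaked one connection in the availability check and borrowed (and returned) a second one for the actual query, so with a pool of size n, the n-th request's query found the pool empty. A toy simulation in plain Java reproduces it (a BlockingQueue standing in for the JedisPool; all class and method names here are made up, not real Jedis API):

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;

// Toy simulation of the leak: each String in the queue stands for one connection.
public class LeakDemo {
    static BlockingQueue<String> pool;

    // Mirrors the buggy isCacheAvailable(): borrows a connection, never returns it.
    static boolean leakyIsCacheAvailable() throws InterruptedException {
        return pool.poll(50, TimeUnit.MILLISECONDS) != null;
    }

    // One simulated request: availability check, then borrow/return for the query.
    static boolean handleRequest() throws InterruptedException {
        if (!leakyIsCacheAvailable()) return false;
        String conn = pool.poll(50, TimeUnit.MILLISECONDS);
        if (conn == null) return false; // pool exhausted: the real request would hang
        pool.offer(conn);               // the query path does return its connection
        return true;
    }

    // Returns how many of (poolSize + 1) requests succeed.
    static int run(int poolSize) {
        pool = new ArrayBlockingQueue<>(poolSize);
        for (int i = 0; i < poolSize; i++) pool.offer("conn-" + i);
        int ok = 0;
        try {
            for (int i = 0; i < poolSize + 1; i++) {
                if (handleRequest()) ok++;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return ok;
    }

    public static void main(String[] args) {
        System.out.println(run(5));  // pool of 5 -> only 4 requests succeed
        System.out.println(run(10)); // pool of 10 -> only 9 requests succeed
    }
}
```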
To correct it, I rewrote the above code as follows:
Corrected Code:
public boolean isCacheAvailable() {
    try (Jedis jedis = jedisPool.getResource()) {
        return true;
    } catch (JedisConnectionException e) {
        return false;
    }
}
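The fix works because try-with-resources guarantees that close() is invoked even when the method returns from inside the try block. A minimal illustration with a stand-in AutoCloseable (FakeJedis is hypothetical, not part of the Jedis API):

```java
public class TryWithResourcesDemo {
    static int closedCount = 0;

    // Stand-in for a Jedis connection; only implements AutoCloseable.
    static class FakeJedis implements AutoCloseable {
        @Override
        public void close() { closedCount++; }
    }

    // Same shape as the corrected isCacheAvailable().
    static boolean check() {
        try (FakeJedis jedis = new FakeJedis()) {
            return true; // close() still runs before the value is returned
        }
    }

    public static void main(String[] args) {
        check();
        System.out.println(closedCount); // the resource was returned/closed once
    }
}
```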
In this way, I was able to debug a non-trivial bug that the unit tests had not caught.