Fault Tolerance is not the same as Resilency. These two terms are sometimes used interchangeably, but are indeed different.
- Fault Tolerant means the ability of a system to survive (tolerate) when a fault occurs, e.g, surviving a server crash or network partition etc
- There may be some temporary drop in overall performance, however system features are not affected
- Mechanisms such as checkpoint/restore, Replicated State Machines can solve this issue
- The systems usually has the ability to self-detect faults and do failovers
If out of N instances of a microservice sitting behind a reverse proxy like nginx, one instance fails, the service is still available. However, There will be a decrease in throughput . Hence the service is fault-tolerant.
Resiliency is a measure of the system's ability to self-recover from problems
- If we run the above example on Kubernetes, the instance that failed will be automatically brought back online (maynot be the same instance) because Kubenetes automatically maintains the exact number of pods in a replica set. Hence in addition to being fault tolerance this it is also resilient
- Circuit Breaker pattern often used in micro service architecture address resilience issue, wherein you give system time to recover
- In Hadoop or any distributed storage system, if the number of replicas fall below the replication factor, the system creates the number of under replicated copies automatically
You can have a system that is fault tolerant but not resilient . For example, from the first example , if you had to manually bring up an addtional instance
can we write it according tp paper point of view? or is there is any example of it