Fault Tolerance is not the same as Resilency. These two terms are sometimes used interchangeably, but are indeed different.

Fault Tolerance

Fault Tolerant means the ability of a system to survive (tolerate) when a fault occurs, e.g, surviving a server crash or network partition etc
There may be some temporary drop in overall performance, however system features are not affected
Mechanisms such as checkpoint/restore, Replicated State Machines can solve this issue
The systems usually has the ability to self-detect faults and do failovers

Example(s)

If out of N instances of a microservice sitting behind a reverse proxy like nginx, one instance fails, the service is still available. However, There will be a decrease in throughput . Hence the service is fault-tolerant.

Relisiency

Resiliency is a measure of the system's ability to self-recover from problems

Example(s)

If we run the above example on Kubernetes, the instance that failed will be automatically brought back online (maynot be the same instance) because Kubenetes automatically maintains the exact number of pods in a replica set. Hence in addition to being fault tolerance this it is also resilient
- Circuit Breaker pattern often used in micro service architecture address resilience issue, wherein you give system time to recover
In Hadoop or any distributed storage system, if the number of replicas fall below the replication factor, the system creates the number of under replicated copies automatically

You can have a system that is fault tolerant but not resilient . For example, from the first example , if you had to manually bring up an addtional instance

Nj-kol/Fault Tolerance vs Resilency.md

Fault Tolerance

Example(s)

Relisiency

Example(s)

RoomaMaryam commented May 4, 2025

Uh oh!