Characteristics of Distributed Systems 💻

  1. What is a Distributed System?
  2. Characteristics of Distributed Systems
    1. Reliability
    2. Fault Tolerance
    3. Scalability
    4. Maintainability
    5. Extensibility
    6. Availability
    7. Consistency
    8. Concurrency
    9. Transparency
  3. Measuring Performance

What is a Distributed System?

A distributed system is a collection of independent computers or nodes that work together to solve a common problem or perform a task.

Instead of relying on a single central computer, the workload is divided and distributed among multiple nodes. Each node has its own processing power and memory, and each plays its own role in contributing to the overall goal.

This allows for parallel processing, improved performance, fault tolerance, and the ability to handle large-scale workloads.
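To make the division of work concrete, here is a minimal sketch in TypeScript. The `Node` class and its `processChunk` method are hypothetical stand-ins for real machines: a coordinator splits a job into chunks, the nodes work in parallel, and the partial results are combined.

```typescript
// Minimal sketch: a coordinator fans work out to independent "nodes".
// Each node sums its own chunk; the coordinator combines the partial sums.

type Chunk = number[];

class Node {
  constructor(public readonly id: string) {}

  // Simulate a node doing its share of the work (here: summing a chunk).
  async processChunk(chunk: Chunk): Promise<number> {
    await new Promise((r) => setTimeout(r, 10)); // pretend network/compute delay
    return chunk.reduce((acc, n) => acc + n, 0);
  }
}

async function distributedSum(data: number[], nodes: Node[]): Promise<number> {
  const chunkSize = Math.ceil(data.length / nodes.length);
  // Each node works on its own slice of the data, all in parallel.
  const partials = await Promise.all(
    nodes.map((node, i) =>
      node.processChunk(data.slice(i * chunkSize, (i + 1) * chunkSize))
    )
  );
  return partials.reduce((acc, n) => acc + n, 0);
}

const nodes = [new Node("node-1"), new Node("node-2"), new Node("node-3")];
distributedSum([1, 2, 3, 4, 5, 6, 7, 8, 9], nodes).then(console.log); // 45
```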

Table of contents ☝️


Distributed systems exhibit several key characteristics that distinguish them from centralized systems. Here are some of the most important:

Reliability

Reliability refers to the ability of the system to continue to function correctly in the face of failures. A reliable distributed system is one that is unlikely to fail, and that can recover quickly from failures that do occur.

In simple terms, it means that the system can be trusted to work properly and reliably.

Table of contents ☝️

Fault Tolerance

When a part of a system experiences failure but the system as a whole can still continue to operate reliably, we label the system as Fault-Tolerant or Resilient.

Fault tolerance means that the system can continue operating even if individual nodes or components fail.

A system can encounter two types of problems:

  1. System Fault: A fault indicates that something is not functioning as intended or has encountered an error. When a fault occurs, there is a problem within the system, but it does not necessarily result in a complete system failure.

  2. System Failure: A system failure occurs when the system as a whole is unable to perform its intended functions or deliver the expected results. A failure can result from a single critical fault, or from multiple faults that accumulate, go unaddressed, and cause system-wide problems.

There are a number of fault tolerance techniques used in distributed systems:

  • Error Detection and Monitoring: Fault tolerance techniques often include mechanisms to detect errors or failures in the system. This can involve monitoring the health and performance of system components, detecting anomalies, and raising alerts or notifications when deviations from expected behavior occur.

  • Error Recovery and Healing: When failures are detected, fault tolerance techniques aim to recover from those errors automatically. This can involve techniques such as automatic error recovery, where failed components are restarted or replaced, or self-healing mechanisms that detect and repair faults in the system.

  • Checkpointing: This involves periodically saving the state of a system so that it can be restored if a failure occurs. For example, a database might checkpoint its state every few minutes so that it can be restored if a server fails.

  • Failover: This involves automatically switching to a backup system if the primary system fails. For example, a web server might have a backup server that is automatically activated if the primary server fails (a minimal sketch follows this list).

  • Graceful Degradation: This is done by designing the system in a way that allows it to fall back to a less capable mode of operation if necessary. For example, a web application might be designed to use JavaScript to provide some features. However, if a user's browser does not support JavaScript, the application can degrade gracefully and provide those features using only HTML and CSS.
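The failover technique above can be illustrated with a minimal sketch. The two endpoints here are hypothetical and simulated as in-process functions rather than real servers:

```typescript
// Minimal failover sketch: call the primary, and on error switch to the backup.
// `primary` and `backup` are hypothetical endpoints simulated as functions.

type Endpoint = () => Promise<string>;

async function callWithFailover(primary: Endpoint, backup: Endpoint): Promise<string> {
  try {
    return await primary(); // normal path: the primary serves the request
  } catch (err) {
    // Error detected on the primary: fail over to the backup automatically.
    console.warn("primary failed, switching to backup:", err);
    return backup();
  }
}

const primary: Endpoint = async () => { throw new Error("primary down"); };
const backup: Endpoint = async () => "response from backup";

callWithFailover(primary, backup).then(console.log); // "response from backup"
```

A production failover setup would also need health checks and a way to promote the backup, but the core idea is the same: detect the fault, then reroute the work.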

Fault tolerance is an important consideration for any distributed system that needs to:

  • remain highly available.
  • continue operating even when some components fail.

Table of contents ☝️

Scalability

Scalability means that the system can continue to handle larger loads without a significant decline in performance or responsiveness.

A scalable distributed system can effectively handle a higher number of concurrent users or process a larger volume of data without becoming overloaded or experiencing performance bottlenecks.

Scalability can be achieved through three different approaches:

  • Vertical Scaling: Known as scaling up. This involves increasing the resources (e.g., CPU, memory) of individual nodes in the system, allowing them to handle more work. It typically involves upgrading hardware components.

  • Horizontal Scaling: Known as scaling out. This involves adding more machines or nodes to a system. The system is distributed across multiple machines, and the workload is divided among them. Each machine operates independently and handles a portion of the overall workload. It typically involves adding more servers, virtual machines, or containers to the system.

  • Elastic Scaling: This is the ability of the system to automatically adjust its resources based on workload ups and downs. It allows the system to scale up or down dynamically based on demand, ensuring efficient resource utilization.
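As a rough illustration of elastic scaling, here is a minimal sketch of the decision rule an autoscaler might apply. The `rpsPerInstance` capacity figure and the policy shape are assumptions made for the example, not values from any real platform:

```typescript
// Minimal elastic-scaling sketch: derive the desired instance count from the
// observed request rate, clamped to a configured range.

interface ScalingPolicy {
  rpsPerInstance: number; // assumed capacity of a single instance
  minInstances: number;   // never scale below this (keeps redundancy)
  maxInstances: number;   // never scale above this (caps cost)
}

function desiredInstances(currentRps: number, policy: ScalingPolicy): number {
  const needed = Math.ceil(currentRps / policy.rpsPerInstance);
  // Clamp so the system never scales to zero or beyond its budget.
  return Math.min(policy.maxInstances, Math.max(policy.minInstances, needed));
}

const policy: ScalingPolicy = { rpsPerInstance: 500, minInstances: 2, maxInstances: 20 };
console.log(desiredInstances(300, policy));  // 2 (scale in, floor applies)
console.log(desiredInstances(4200, policy)); // 9 (scale out)
```

A real autoscaler would smooth the input signal and add cooldown periods to avoid flapping, but the clamp-and-divide rule above captures the core behavior.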

Load parameters related to system scalability refer to the factors that determine the workload or demand on a system and how well the system can handle that load as it scales.

Here are some common load parameters:

  1. Requests per Second (RPS): It represents the number of incoming requests or transactions that the system needs to process per second. Higher RPS indicates a heavier workload and requires the system to handle a larger volume of requests.

  2. Response Time Requirements (RTR): It defines the expected or desired time it takes for a system to respond to a user's request. It represents the speed at which the system needs to process and provide a response.

  3. Concurrent Users: Concurrent users refer to the number of users simultaneously interacting with the system. More concurrent users typically result in a higher load on the system, as it needs to serve multiple users' requests concurrently.

  4. Data Volume: The amount of data that the system needs to handle and process. It can include factors like the size of incoming data streams, the number of records, or the overall data storage requirements.

  5. Traffic Patterns: Describe how the workload is distributed over time. For example, a website may experience higher traffic during peak hours or specific events, while it may have lower traffic during off-peak times. Understanding traffic patterns helps in designing systems that can handle the expected workload efficiently and scale resources accordingly.

  6. Processing Complexity: The complexity of the processing tasks performed by the system can impact scalability. Some operations or computations may require more computational resources or time, affecting the system's ability to handle a higher load.

When the load parameters increase, a scalable system is expected to maintain its performance.

Table of contents ☝️

Maintainability

Maintainability refers to how easy it is to modify and update existing software over time. It measures how well a system supports changes, updates, bug fixes, and ongoing maintenance tasks.

A maintainable system is designed to be modular, well-documented, and easily understandable, allowing developers to efficiently manage and enhance it over time.

Higher maintainability reduces the effort, time, and cost associated with maintaining and evolving a software system, enabling effective troubleshooting, bug fixing, and feature enhancements.

Table of contents ☝️

Extensibility

Extensibility refers to how easy it is to add new features to existing software. It measures how well a system can be extended or scaled to meet changing requirements or incorporate new capabilities without major redesign or reimplementation.

Similar to maintainability, an extensible system is designed to be modular, well-documented, and easily understandable, allowing developers to efficiently manage and extend it over time.

Extensibility enables the system to evolve, adapt, and grow as new needs arise, providing a foundation for future enhancements and scalability.

Table of contents ☝️

Availability

Availability in distributed systems is the ability of a system to remain operational and accessible. In simple terms, availability is about ensuring that the distributed system is up and running, ready to respond to user requests or perform desired operations at any given time.

We would say the system is 100% available if clients can always interact with the system and perform their intended tasks without ever experiencing downtime or unavailability.

There are a number of techniques that can be used to improve the availability of a distributed system, including:

  • Replication: Replication is the process of creating multiple copies of data or software components. This can help to improve availability by providing a backup in case of a failure.

  • Redundancy: Redundancy focuses on duplicating critical system components to eliminate single points of failure. This can be done at the hardware or software level. For example, a system may have multiple power supplies or multiple servers. Redundancy helps to ensure that the system can continue to function even if one of the components fails.

  • Load Balancing: Load balancing is the process of distributing requests across multiple nodes or replicas. This can help to improve availability by ensuring that no single node is overloaded.
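To illustrate load balancing working together with redundancy, here is a minimal sketch of a round-robin balancer that skips unhealthy replicas. The `Replica` shape and the health flags are illustrative assumptions; real balancers learn health from active checks:

```typescript
// Minimal load-balancing sketch: round-robin over healthy replicas only.

interface Replica {
  name: string;
  healthy: boolean; // in practice maintained by a health-check monitor
}

class RoundRobinBalancer {
  private next = 0;

  constructor(private replicas: Replica[]) {}

  // Pick the next healthy replica, skipping failed ones.
  pick(): Replica {
    for (let i = 0; i < this.replicas.length; i++) {
      const replica = this.replicas[(this.next + i) % this.replicas.length];
      if (replica.healthy) {
        this.next = (this.next + i + 1) % this.replicas.length;
        return replica;
      }
    }
    throw new Error("no healthy replicas available");
  }
}

const lb = new RoundRobinBalancer([
  { name: "server-a", healthy: true },
  { name: "server-b", healthy: false }, // failed: traffic routes around it
  { name: "server-c", healthy: true },
]);

console.log(lb.pick().name); // server-a
console.log(lb.pick().name); // server-c (server-b is skipped)
console.log(lb.pick().name); // server-a
```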

High availability is crucial for critical applications or services where downtime can have severe consequences, such as e-commerce platforms, banking systems, or communication networks. By taking steps to improve the availability of a distributed system, organizations can help to ensure that critical services are always available.

Table of contents ☝️

Consistency

Consistency refers to ensuring that data remains correct and synchronized across multiple copies or nodes. It guarantees that all users or processes observing the system will see a consistent view of the data, regardless of which node or replica they access.

Various consistency models or levels exist, ranging from strong consistency to weak or eventual consistency:

| Model | Strong | Weak / Eventual |
| --- | --- | --- |
| Definition | Immediate synchronization across all replicas or nodes; all nodes always see the latest value of a data item. | There is a delay or lag between updating data and propagating those updates to all nodes. This delay leads to the possibility of nodes observing different states of the data at any given time. |
| Availability | Reduced availability | Better availability |
| Latency | Higher latency | Lower latency |
| Performance | Lower | Better |
| Use Cases | Financial systems, where it is important that all clients always see the latest data. | Systems where availability is more important than consistency, such as social media platforms. |

Here are some examples of systems that use strong consistency:

  • Financial systems
  • Healthcare systems
  • E-commerce systems
  • Real-time trading systems

Here are some examples of systems that use weak consistency:

  • Social media platforms
  • Content management systems
  • Blog platforms
  • File sharing systems
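To see what eventual consistency looks like in practice, here is a minimal sketch of a store whose replica lags its primary by an artificial delay. All names and the 100 ms lag are illustrative:

```typescript
// Minimal eventual-consistency sketch: writes hit the primary immediately but
// reach the replica only after a delay, so replica reads can be stale.

class ReplicatedStore {
  private primary = new Map<string, string>();
  private replica = new Map<string, string>();

  write(key: string, value: string): void {
    this.primary.set(key, value);
    // Propagate asynchronously: the replica catches up "eventually".
    setTimeout(() => this.replica.set(key, value), 100);
  }

  readFromReplica(key: string): string | undefined {
    return this.replica.get(key);
  }
}

const store = new ReplicatedStore();
store.write("user:1", "Alice");
console.log(store.readFromReplica("user:1")); // undefined: stale read window
setTimeout(() => {
  console.log(store.readFromReplica("user:1")); // "Alice": replicas converged
}, 200);
```

A strongly consistent version of `write` would block until every replica acknowledged the update, which is exactly where the availability and latency costs in the table above come from.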

Table of contents ☝️

Concurrency

Concurrency allows multiple tasks to be executed at the same time: different components or nodes within the system can work on separate tasks concurrently.

To understand concurrency in a distributed system, let's consider a simple example of a web server handling multiple client requests. In this scenario, the web server is designed to distribute the incoming requests across multiple worker threads or processes to achieve concurrency.

  1. Sequential Execution: Without concurrency, the web server would process each client request sequentially. It would receive a request, process it, and send back a response before moving on to the next request. In this case, if one request takes a long time to process, it would delay the processing of subsequent requests, potentially leading to poor performance.

  2. Concurrent Execution: With concurrency, the web server can handle multiple requests simultaneously. As requests arrive, the server assigns them to available worker threads or processes. Each worker thread independently processes its assigned request, allowing multiple requests to be processed concurrently.
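The difference is easy to see in a minimal sketch, where three simulated 100 ms requests are handled one at a time and then all at once (the handler and its timings are artificial):

```typescript
// Minimal sketch contrasting sequential and concurrent request handling.
// `handleRequest` simulates a request that takes ~100 ms to process.

const handleRequest = (id: number): Promise<number> =>
  new Promise((resolve) => setTimeout(() => resolve(id), 100));

async function sequential(ids: number[]): Promise<void> {
  const start = Date.now();
  for (const id of ids) await handleRequest(id); // one request at a time
  console.log(`sequential: ${Date.now() - start} ms`); // ~300 ms for 3 requests
}

async function concurrent(ids: number[]): Promise<void> {
  const start = Date.now();
  await Promise.all(ids.map(handleRequest)); // all requests in flight at once
  console.log(`concurrent: ${Date.now() - start} ms`); // ~100 ms for 3 requests
}

sequential([1, 2, 3]).then(() => concurrent([1, 2, 3]));
```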

Concurrency can be a powerful tool for improving the performance of distributed systems. However, it can also be a source of problems if it is not managed properly.

Some of the common issues associated with concurrency include:

  • Race Conditions: A race condition is a bug that can occur in a multithreaded or distributed system. It happens when two or more threads or processes try to access the same data at the same time, and the outcome of the program depends on the order in which they access it. A common scenario of a race condition is when two users try to update the same record in a database at the same time with different values, which can produce incorrect results (see the sketch after this list).

  • Deadlocks: Deadlocks occur when two or more threads or processes are unable to proceed because each is waiting for a resource that another thread/process holds. This results in a situation where all threads/processes are blocked and unable to make progress. Imagine two people, Alice and Bob, who need two items, a pen and a notebook, to complete their respective tasks. However, they both hold one item that the other person needs.

To avoid these problems, proper design, synchronization mechanisms, and concurrency control techniques should be employed to ensure correctness and reliability in the presence of concurrency.
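Here is a minimal sketch of the lost-update race described above, together with a tiny promise-based mutex as one possible synchronization mechanism. The in-memory "database" and its delays are artificial:

```typescript
// Minimal sketch: a lost-update race on a shared balance, then a mutex fix.

let balance = 100;

// Simulated database reads/writes with artificial latency.
const read = async (): Promise<number> => {
  await new Promise((r) => setTimeout(r, 10));
  return balance;
};
const write = async (value: number): Promise<void> => {
  await new Promise((r) => setTimeout(r, 10));
  balance = value;
};

// Unsafe: two concurrent read-modify-write cycles can interleave.
async function depositUnsafe(amount: number): Promise<void> {
  const current = await read();
  await write(current + amount); // may overwrite a concurrent deposit
}

// A tiny promise-chain mutex: each critical section waits for the previous one.
let lock: Promise<void> = Promise.resolve();
function withLock<T>(fn: () => Promise<T>): Promise<T> {
  const result = lock.then(fn);
  lock = result.then(() => undefined, () => undefined);
  return result;
}

async function depositSafe(amount: number): Promise<void> {
  await withLock(async () => {
    const current = await read();
    await write(current + amount);
  });
}

async function main() {
  await Promise.all([depositUnsafe(50), depositUnsafe(50)]);
  console.log(balance); // 150, not 200: one update was lost

  balance = 100;
  await Promise.all([depositSafe(50), depositSafe(50)]);
  console.log(balance); // 200: the mutex serializes the two updates
}
main();
```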

Table of contents ☝️

Transparency

Transparency in a distributed system is the property that hides the system's complexity from the user or application programmer. This means that the user or application programmer does not need to know how the system is implemented or how the different components of the system interact with each other.

In simpler terms, transparency means that users or applications interact with a distributed system as if it were a centralized system, without needing to know about the distribution of resources, network communication, or other low-level details.

There are various types of transparency in distributed systems:

  • Access Transparency: This type of transparency hides the fact that the system is distributed. The user or application programmer can access resources on different computers in the same way that they would access resources on a single computer.

  • Location Transparency: This type of transparency hides the location of resources. The user or application programmer does not need to know where a resource is located in order to access it.

  • Failure Transparency: This type of transparency hides the fact that components of the system may fail. The user or application programmer can continue to use the system even if some of the components fail.

  • Replication Transparency: This type of transparency hides the fact that resources may be replicated. The user or application programmer can access resources through a consistent interface without knowing that multiple copies of the resource exist and are being managed behind the scenes.
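One way to picture location and replication transparency together is a service registry: callers address a service by logical name and never learn which node, or which of several copies, actually served the request. The registry API below is an illustrative assumption, not a real library:

```typescript
// Minimal transparency sketch: a registry hides service locations and replicas.

interface ServiceInstance {
  node: string; // physical location, hidden from the caller
  call: (input: string) => string;
}

class Registry {
  private services = new Map<string, ServiceInstance[]>();

  register(name: string, instance: ServiceInstance): void {
    const list = this.services.get(name) ?? [];
    list.push(instance);
    this.services.set(name, list);
  }

  // Callers use the logical name only; the registry picks a node for them.
  invoke(name: string, input: string): string {
    const instances = this.services.get(name);
    if (!instances || instances.length === 0) {
      throw new Error(`unknown service: ${name}`);
    }
    const chosen = instances[Math.floor(Math.random() * instances.length)];
    return chosen.call(input);
  }
}

const registry = new Registry();
registry.register("greeter", { node: "10.0.0.5", call: (s) => `hello, ${s}` });
registry.register("greeter", { node: "10.0.0.9", call: (s) => `hello, ${s}` });

// The caller neither knows nor cares which node (or which replica) answered.
console.log(registry.invoke("greeter", "world"));
```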

By providing these forms of transparency, distributed systems aim to simplify the development, management, and usage of complex distributed applications.

Table of contents ☝️


Measuring Performance

Measuring performance in a distributed system involves evaluating various aspects of its behavior and effectiveness. Here are some common approaches and metrics:

1. Latency:

Latency is the time it takes for a request to travel from one component (e.g., the client) to another (e.g., the server).

2. Response Time:

Response time is the time it takes for a server to process a request and send back a response. It includes latency along with the other factors that contribute to the overall time taken for a complete response.

For example, if you are playing an online game, the latency is the time it takes for your input to be sent to the game server, and the response time is the time it takes for the game server to process your input and send back the updated game state.

Here is a table that summarizes the key differences between latency and response time:

| Latency | Response Time |
| --- | --- |
| The time it takes for a signal to travel from one point to another. | The time it takes for a system to respond to a request. |
| A measure of how fast data can travel. | A measure of how quickly a system can process requests. |
| Can be affected by factors such as distance, network congestion, and the type of network used. | Can be affected by factors such as the speed of the server, the amount of traffic on the server, and the complexity of the request. |
  • Lower latency is better. However, there are some cases where higher latency may be acceptable, such as when the amount of data being transferred is very large or when the system is not being used heavily.

  • Response time is also important, but latency often sets the floor: because response time includes latency, a system that processes requests quickly can only respond quickly when the latency is low.

  • If the latency is high, even a system that processes requests very quickly may not be able to deliver responses fast enough.

3. Throughput:

The number of requests that the system can handle per unit of time. Higher throughput indicates the system can handle more work in a given time period.
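Here is a minimal sketch of how response-time percentiles and throughput might be measured for a batch of simulated requests. The percentile computation is deliberately naive; real systems rely on histograms or monitoring tools:

```typescript
// Minimal benchmark sketch: measure per-request response times and overall
// throughput for a batch of simulated requests.

const simulateRequest = (): Promise<void> =>
  new Promise((r) => setTimeout(r, 20 + Math.random() * 80)); // 20-100 ms

async function benchmark(totalRequests: number): Promise<void> {
  const responseTimes: number[] = [];
  const start = Date.now();

  await Promise.all(
    Array.from({ length: totalRequests }, async () => {
      const t0 = Date.now();
      await simulateRequest();
      responseTimes.push(Date.now() - t0); // per-request response time
    })
  );

  const elapsedSec = (Date.now() - start) / 1000;
  responseTimes.sort((a, b) => a - b);
  const p = (q: number) => responseTimes[Math.floor(q * (responseTimes.length - 1))];

  console.log(`throughput: ${(totalRequests / elapsedSec).toFixed(1)} req/s`);
  console.log(`p50: ${p(0.5)} ms, p95: ${p(0.95)} ms, p99: ${p(0.99)} ms`);
}

benchmark(200);
```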

4. Resource Utilization:

Monitoring the utilization of system resources such as CPU, memory, disk, and network bandwidth. It helps assess how efficiently resources are being used and whether any bottlenecks exist.

5. Error Rates:

Tracking the occurrence of errors, exceptions, or failures within the system. Lower error rates indicate better performance and reliability.

6. Scalability:

Assessing how well the system can handle increased workloads or user demands by adding more resources or scaling horizontally. It helps determine if the system can maintain performance as it grows.

Measuring performance is crucial for identifying areas of improvement, optimizing system behavior, and ensuring that the system meets performance goals and user expectations. It helps in detecting performance bottlenecks, diagnosing issues, and making informed decisions to enhance the system's efficiency and effectiveness.

Table of contents ☝️


Links:

🕴️ Linkedin: Dragon Slayer 🐲
📝 Articles: All Articles written by D.S
