Distributed System Reading

The papers included in the CSE-586 reading list, followed by summaries and opinions.

Consensus and Paxos

  • Paxos Made Moderately Complex. Robbert Van Renesse and Deniz Altinbuken, ACM Computing Surveys, 2015.
  • Paxos made live - An engineering perspective. Tushar D. Chandra, Robert Griesemer, Joshua Redstone. ACM PODC, Pages: 398 - 407, 2007
  • ZooKeeper: Wait-free coordination for internet-scale systems. P. Hunt, M. Konar, F. P. Junqueira, and B. Reed, USENIX ATC 2010.
  • The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI 2006.
  • In Search of an Understandable Consensus Algorithm. Diego Ongaro, John Ousterhout, USENIX ATC, 2014.
  • WPaxos: Wide Area Network Flexible Consensus. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Tevfik Kosar, IEEE TPDS, 2019.
  • Dissecting the Performance of Strongly-Consistent Replication Protocols. Ailidani Ailijiang, Aleksey Charapko, Murat Demirbas, Sigmod 2019.
  • Viewstamped replication revisited. Barbara Liskov and James Cowling. MIT-CSAIL-TR-2012-021, 2012.
  • Chain Replication for Supporting High Throughput and Availability. Robbert van Renesse and Fred B. Schneider, OSDI 2004.
  • FAWN: A Fast Array of Wimpy Nodes. David G. Andersen, Jason Franklin, Michael Kaminsky, Amar Phanishayee, Lawrence Tan, and Vijay Vasudevan, SOSP 2009.
  • CORFU: A shared log design for Flash clusters. Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, John D. Davis, NSDI'2012.

Failure detectors and fault-tolerance

  • Unreliable Failure Detectors for Reliable Distributed Systems, Tushar Deepak Chandra and Sam Toueg, Journal of the ACM, 1996.
  • Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems. Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm, OSDI 2014.
  • Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. Haryadi S. Gunawi, Mingzhe Hao, Riza O. Suminto, Agung Laksono, Anang D. Satria, Jeffry Adityatama, and Kurnia J. Eliazar, SOCC 2016.
  • Does The Cloud Need Stabilizing? Murat Demirbas, Aleksey Charapko, Ailidani Ailijiang, 2018.
  • TaxDC: A Taxonomy of nondeterministic concurrency bugs in datacenter distributed systems, Tanakorn Leesatapornwongsa, Jeffrey F. Lukman, Shan Lu, Haryadi S. Gunawi, ASPLOS 2016.

Time and snapshots

  • Time, Clocks, and the Ordering of Events in a Distributed System. Leslie Lamport, Commn. of the ACM, 1978.
  • Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases. Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone, 2014. https://cse.buffalo.edu/tech-reports/2014-04.pdf
  • Distributed Snapshots: Determining Global States of a Distributed System. K. Mani Chandy and Leslie Lamport, ACM Transactions on Computer Systems, 1985.

Cloud computing

  • Tail at scale. Jeff Dean, Luiz Andre Barroso, Commn of the ACM, 2013.
  • Lessons from Giant-Scale Services. Eric A. Brewer, IEEE Internet Computing, 2001.
  • Above the Clouds: A Berkeley View of Cloud Computing. Michael Armbrust, Armando Fox, Rean Griffith, Anthony D. Joseph, Randy H. Katz, Andrew Konwinski, Gunho Lee, David A. Patterson, Ariel Rabkin, Ion Stoica, and Matei Zaharia. EECS Department, University of California, Berkeley, Technical Report No. UCB/EECS-2009-28, February 10, 2009.
  • Serverless computing: One step forward, two steps back, UC Berkeley, CIDR 2019.
  • Cloud Programming Simplified: A Berkeley View on Serverless Computing, 2019. https://arxiv.org/abs/1902.03383
  • On designing and deploying Internet scale services, James Hamilton, LISA 2007.

NoSQL and distributed databases

  • Life beyond Distributed Transactions: an Apostate’s Opinion. Pat Helland, CIDR 2007.
  • Optimistic Replication. Yasushi Saito and Marc Shapiro, ACM Computing Surveys, 2005.
  • CAP Twelve Years Later: How the "Rules" Have Changed. Eric Brewer, IEEE Computer, 2012
  • PNUTS: Yahoo!'s Hosted Data Serving Platform. Brian F. Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, Adam Silberstein, Philip Bohannon, Hans-Arno Jacobsen, Nick Puz, Daniel Weaver and Ramana Yerneni, VLDB 2008.
  • Dynamo: Amazon’s Highly Available Key-Value Store. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall and Werner Vogels, ACM SIGOPS 2007.
  • Bigtable: A Distributed Storage System for Structured Data. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, ACM Transactions on Computer Systems, 2008.
  • Spanner: Google’s Globally-Distributed Database. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, and Dale Woodford, ACM Trans. on Computer Systems, 2013.

Big data processing

  • MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat, Commn of the ACM, 2008.
  • Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012.
  • TUX2: Distributed Graph Computation for Machine Learning, Wencong Xiao, Jilong Xue, Youshan Miao, Zhen Li, Cheng Chen and Ming Wu, Wei Li, Lidong Zhou, NSDI 2017.
  • Proteus: agile ML elasticity through tiered reliability in dynamic resource markets. Aaron Harlap, Alexey Tumanov, Andrew Chung, Gregory R. Ganger, Phillip B. Gibbons. EuroSys, 2017.
  • TensorFlow: A system for large-scale machine learning, Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. OSDI 2016.

Decentralized ledgers

  • Practical Byzantine Fault Tolerance, Miguel Castro and Barbara Liskov, OSDI'99.
  • Bitcoin: A Peer-to-Peer Electronic Cash System, Satoshi Nakamoto, 2008.
  • Scalable and Probabilistic Leaderless BFT Consensus through Metastability, Team Rocket, Maofan Yin, Kevin Sekniqi, Robbert van Renesse, and Emin Gun Sirer, 2019.
  • Untangling Blockchain: A Data Processing View of Blockchain Systems. Tien Tuan Anh Dinh, Rui Liu, Meihui Zhang, Gang Chen, Beng Chin Ooi, IEEE Transactions on Knowledge and Data Engineering, 2017.
  • Bridging Paxos and Blockchain Consensus. A. Charapko, A. Ailijiang, M. Demirbas, IEEE Blockchain, 2018. http://www.cse.buffalo.edu/~demirbas/publications/bridging.pdf
  • Blockchains from a distributed computing perspective, Maurice Herlihy, Commn of the ACM, 2017.

Paxos Made Moderately Complex

The paper presents a multi-decree version of Paxos (multi-Paxos). It extends the original Paxos framework, which involves Proposers/Leaders and Acceptors (similar to those in the Paxos Made Live paper), with a new entity, Replicas, and shows how the combined protocol safely replicates a state machine so that the overall system tolerates up to f crash (non-Byzantine) failures. The extension is not stated verbatim in the paper, but becomes apparent once the underlying elements are understood. Correctness is argued not by a formal proof but by showing that the invariants of each entity hold throughout execution.

In essence, the Paxos algorithm in the paper runs between the Leaders and Acceptors and is referred to as the Synod protocol. The protocol uses ballots, which are initiated by the Leader processes; a leader can start as many ballots as required, and ballots are used to decide the values placed in slots (the replication entity). Additionally, the set of processes (Leaders/Acceptors), here called the configuration, can change after some slots have been decided, in which case it is updated. Replication is done by the state machines of the Replicas, where each Replica has a series of slots corresponding to the commands sent by clients. The protocol's main responsibility is to ensure that the ordering of commands in the slots of all replicas is the same. Since the system is concurrent, i.e. clients send commands to replicas concurrently, the chance of replicas assigning commands to slots in different orders is very high, which would defeat the purpose of having replicas in the first place.

The replicated state machines in the different processes are coordinated mainly through the following message flow: clients concurrently send their commands to the Replicas and execute the ordering the Replicas return in response. Having received commands from clients, the Replicas allocate them to slots by proposing a command for a specific slot to the Leaders, and they update their state after receiving decisions back. The Replica invariants ensure that the slot orderings of the replicas do not diverge; in particular, invariant R1 requires the replicas to agree on the order of commands (no two different commands may be decided for the same slot), which cannot be verified within the Replica process alone.

Leaders work in conjunction with Acceptors. The Synod exchange between them works as follows: a Leader first asks the Acceptors to "prepare" by confirming whether they will adopt its ballot; this is done by a thread within the Leader called a scout (Phase 1). If a majority of acceptors promise, the leader proposes its values (commands) to the Acceptors using a commander subprocess (Phase 2). If a majority of Acceptors accept the proposal, the decision for the corresponding slot is sent to the replicas. The key invariant in this part of the protocol is that if an acceptor in some majority has accepted a value for a slot under ballot b, then any commander running for a higher ballot b' must propose the same command for that slot.

The Acceptors receive two types of messages, depending on which kind of Leader thread contacts them, and return the appropriate responses; a quorum of acceptors must agree for a request to succeed. Most invariants in this process are relatively trivial, but the one involved in the Synod protocol mirrors the Leader invariant stated above. The invariants of these processes are shown to hold by inductive reasoning: for the Replica invariant to hold, the Leader invariant must hold, which in turn depends on the Acceptor invariant. The paper then explains how the protocol copes with the dueling-leaders ("ping-pong") problem, i.e. how it accommodates failure detection under some system assumptions. To wrap up, the paper proposes several optimizations for the glaring inefficiencies that are apparent in the basic Synod protocol.
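
For concreteness, here is a minimal, single-threaded sketch of the two Synod phases described above, with in-memory acceptors standing in for separate processes. The names (Acceptor, run_scout, run_commander) and the data layout are illustrative assumptions, not the paper's accompanying code.

```python
# A toy sketch of Synod Phase 1 (scout) and Phase 2 (commander), assuming
# in-memory acceptors instead of real processes and tuple-valued ballots.

class Acceptor:
    def __init__(self):
        self.ballot = None          # highest ballot promised so far
        self.accepted = {}          # slot -> (ballot, command)

    def prepare(self, b):           # Phase 1: adopt b if it is the highest seen
        if self.ballot is None or b > self.ballot:
            self.ballot = b
        return self.ballot, dict(self.accepted)

    def accept(self, b, slot, cmd): # Phase 2: accept only if the promise still holds
        if b == self.ballot:
            self.accepted[slot] = (b, cmd)
        return self.ballot

def run_scout(acceptors, b):
    """Phase 1 (scout): collect promises from a majority of acceptors."""
    pvalues, promised = {}, 0
    for a in acceptors:
        ballot, accepted = a.prepare(b)
        if ballot == b:
            promised += 1
            pvalues.update(accepted)   # previously accepted (ballot, command) pairs
    return promised > len(acceptors) // 2, pvalues

def run_commander(acceptors, b, slot, cmd):
    """Phase 2 (commander): ask a majority to accept <b, slot, cmd>."""
    votes = sum(1 for a in acceptors if a.accept(b, slot, cmd) == b)
    return votes > len(acceptors) // 2

acceptors = [Acceptor() for _ in range(3)]          # tolerates f = 1 crash failure
ok, _ = run_scout(acceptors, b=(1, "leader-1"))
if ok and run_commander(acceptors, (1, "leader-1"), slot=1, cmd="put(x, 5)"):
    print("slot 1 decided: put(x, 5)")
```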

OPINIONS:

The explanation of multi-Paxos is very comprehensive and covers all the aspects of a consensus protocol that need to be considered. It systematically elaborates asynchronous replication using the concurrent processes (replica/leader/acceptor) and focuses on proving correctness by invariants as it breaks the problem down. The protocol does have problems that can be observed at first glance; for example, in the acceptor subprocess, an acceptor returns its full set of accepted values in response to leader requests, so as execution proceeds the size of the acceptor state keeps growing and the overhead it imposes on the protocol becomes apparent. The paper later addresses this problem, along with others, by suggesting ways to make the initial protocol more "pragmatic", for example through state reduction. Similarly, a garbage-collection approach is proposed to reduce the state of the acceptor processes: slots are no longer stored once f+1 acceptors have learned their decisions. The paper also comes with accompanying Python code (matching the pseudocode), which helped in getting a practical perspective on the asynchronous nature of the protocol through a concrete programming construct.

There is one small claim in the paper that is not backed by any reasoning for its inclusion or for why it works; it simply does. This is the WINDOW variable in the Replica. According to the working of the Replica process, multiple slots can be considered simultaneously for their respective commands by the Leader processes, and the number of such slots appears in the invariant predicate logic as WINDOW. The logic says that a replica can only propose commands for slots whose configuration it knows, and those slots have to lie within the range of the last decided slot (slot_out) plus the WINDOW. Although this makes sense, how the value of WINDOW should be determined is not presented.
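
To make the constraint concrete, here is a hypothetical sketch of the replica-side check; the slot names follow the paper, but the fixed WINDOW value and the helper function are assumptions made for illustration.

```python
# Hypothetical replica-side check: only propose for slots within WINDOW of slot_out.

WINDOW = 5  # how far ahead of the last decided slot a replica may propose

def can_propose(slot_in, slot_out):
    """A replica only proposes commands for slots below slot_out + WINDOW."""
    return slot_in < slot_out + WINDOW

# e.g. with slot_out = 10, proposals are allowed only for slots 10..14
print([s for s in range(10, 20) if can_propose(s, slot_out=10)])
```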

Paxos Made Live

The paper describes replacing the database underneath the existing fault-tolerant Chubby system, which used a third-party database (3DB), with a custom database built on top of a Paxos-based (multi-Paxos) fault-tolerant log. This fault-tolerant database is potentially an interface for various clients besides Chubby that are looking for a layer to integrate fault tolerance into their workflow. The paper then explores the challenges the team faced while converting this theoretical solution into a practically feasible system. Among the challenges, the algorithmic ones were solved with various techniques, and the software-engineering ones were addressed, as far as possible, by adopting best practices to make the system robust. The architecture of the proposed fault-tolerant replicas has separate modules: a fault-tolerant replicated log at the base, with a local database on top of it. The execution of the Paxos protocol is conducted mainly at the log level. The paper discusses various states in the execution of such a system; one of particular interest is when a replica has a turnover (dies and comes back), in which case its internal state no longer matches the replicas that continued executing in its absence. Here, the multi-Paxos design, together with the locally persistent log, enables the replica to catch up with the rest by reading operations from the log. The replica catches up passively, i.e. it does not participate in voting until it has observed a complete, successful instance of Paxos. Additionally, the multi-Paxos architecture is optimized by chaining Paxos instances together and selecting a master, reducing the writes incurred across the different Paxos operations (five in total) to one. Of the algorithmic challenges they tackled, master leases and snapshots are particularly notable. Whereas basic multi-Paxos required the expensive step of serializing read operations through the log, the master-lease approach grants a lease of a specific duration to a master, during which no other node can become master; reads can then be served by the master without that serialization, reducing the overall operations and the accompanying overhead. Snapshots of the database, taken periodically by the replicas, prevent the replicated log from growing without bound. These snapshots, which are kept mutually consistent with the logs, are used to recover from failures; snapshots from leading replicas are also used by lagging replicas to catch up.
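
As an illustration of the catch-up path described above, here is a minimal sketch that restores state from a snapshot and then replays the suffix of the replicated log. The Snapshot structure and the key/value operations are assumptions made for the example, not the paper's interfaces.

```python
# A toy recovery path: load the latest snapshot, then replay log entries past it.

from dataclasses import dataclass, field

@dataclass
class Snapshot:
    last_applied: int                 # index of the last log entry reflected in the snapshot
    state: dict = field(default_factory=dict)

def recover(snapshot, log_entries):
    """Rebuild replica state from a snapshot plus the suffix of the replicated log."""
    state = dict(snapshot.state)
    for index, op in log_entries:
        if index <= snapshot.last_applied:
            continue                  # already reflected in the snapshot
        key, value = op
        state[key] = value            # apply the logged operation
    return state

snap = Snapshot(last_applied=2, state={"x": 1, "y": 2})
log = [(1, ("x", 1)), (2, ("y", 2)), (3, ("y", 3)), (4, ("z", 9))]
print(recover(snap, log))             # {'x': 1, 'y': 3, 'z': 9}
```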

In contrast to the existing academic, theoretical treatments of Paxos, the team explains the techniques they used to build an industry-grade fault-tolerant database. First, they express the algorithm in a higher-level specification language and build a compiler to translate that specification into C++ code. The resulting codebase contains various runtime consistency checks. They also stress the importance of writing reproducible test cases that exercise the edge cases of the safety/progress properties of Paxos, of the fault-tolerant log, and of the surrounding concurrency.

In essence, the paper is an equally detailed account of the system design and of the software-engineering effort involved in building and maintaining it. The authors also describe some of the failures, both explained and unexplained, that they encountered over the system's lifetime.

OPINIONS

The paper presents the modular state-machine specification language, and the compiler that turns the specification into a program, as an advantage because it makes the core algorithm easier to update. The downside, although not immediately apparent, shows up when debugging the algorithms, where both layers (the specification and the compiled code) have to be inspected.

Also, as the paper itself admits, the testing framework is designed to test most of the components only in the absence of the multi-threaded aspects of the system. Since the system has to be multi-threaded for best performance, this style of testing does not apply in that setting.

Does the Cloud Need Stabilizing?

This is a position paper whose main contribution is exploring the self-stabilization approach to fault tolerance. Self-stabilization is currently not a sought-after solution for cloud-based systems because it is seen as practically infeasible, but the recent rise in the need for complex recovery techniques might change this in the future. The paper makes its case by first presenting the current practices used to build cloud applications, with examples of real-world web applications, and then detailing the fault-recovery approaches used in different cases.

Currently, and for the past few years, cloud computing has used an infrastructure-first approach to system design: popular software architectures such as microservices implement the main business logic and, along with distributed data stores and data-processing systems, are encapsulated and deployed using containers (Docker) and orchestrated with cluster-management tools (Kubernetes/Borg). The distributed systems in this hierarchy coordinate internally through consensus services (ZooKeeper). This layered, coordinated approach has proven fault-tolerant against many small-scale as well as large-scale faults. The fault-recovery approaches employed, depending on the type of fault, generally range from recovery by reset (Kubernetes pods do this by scaling horizontally on failure and re-electing leaders), to recovery by checkpoint (the crux of deep-learning deployments such as TensorFlow), to recovery by state repair (for corruption caused by timing errors in concurrent systems, mostly resolved offline).

After giving this "lay of the land", the paper gives two examples where the situation demands a more involved fault-tolerance approach than current practice provides. The first example is a transaction-based action that spans two independent systems, where a failure in one should effectively void the action in both. In localized deployments this is handled with a transactional methodology, where all the independent systems update their state only if all of them succeed. However, not all of the systems are internal; third-party services that take part in such transactions may end up updating their state even though one of the internal systems fails at some later point. Additionally, this transactional solution does not scale efficiently in globally distributed deployments. In this kind of coordinated environment, where the coordinator is responsible for correcting the states of the systems, a self-stabilizing protocol is suggested as a possible solution: given some compromises, the invariants of the system could be made eventually consistent.

The other example concerns data-processing systems, which for the most part use the recovery-by-checkpoint model. The example of Twitter's live stream processing is used to show that, by monitoring the performance of the system, low performance can be identified and corrective action initiated to bring the system back to normal, i.e. the system tries to self-stabilize.

OPINIONS:

The paper is particularly interesting because it starts by exploring the potential use cases of a self-stabilizing environment and explaining the current practices in the domain of cloud applications, and then largely negates the feasibility of the idea: implementing such a system would require compromises to be practical and would be too complex, compared with the current approach of simply exploiting the infrastructure, which already handles faults successfully and fairly robustly.

The paper, however, does not go into detail about what kinds of compromises would be acceptable. "Compromise" by itself is a vague and slightly dangerous term when talking about systems involving financial transactions. For a paper exploring the feasibility of a theoretical approach, it should have been more elaborate on this point.

For the first example, it suggests a stabilizing approach called Distributed Sagas for tackling distributed coordination problems, which guarantees "all or nothing" semantics for transaction requests to ensure consistency. The saga pattern grew out of the belief that two-phase commit is not a practical solution for consistency; instead of a linear flow, a saga is structured as a directed acyclic graph of requests, dictating that if one request fails, the other requests eventually have to be cancelled out. This is a good example of a self-stabilizing approach for transactional requests and a good tangential read, given the topic of the paper (a small sketch of the idea follows).
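
A minimal sketch of the compensation idea, simplified from a DAG to a linear sequence of steps; the booking functions are hypothetical and only illustrate the "cancel what already succeeded" behaviour.

```python
# A toy saga: run steps in order; on failure, run the compensations of the
# steps that already completed, in reverse order.

def run_saga(steps):
    """steps: list of (action, compensation) callables. Returns True on success."""
    completed = []
    for action, compensation in steps:
        try:
            action()
            completed.append(compensation)
        except Exception:
            for undo in reversed(completed):   # cancel out the finished steps
                undo()
            return False
    return True

# e.g. book a flight and a hotel; if the hotel booking fails, cancel the flight
def book_flight():  print("flight booked")
def cancel_flight(): print("flight cancelled")
def book_hotel():   raise RuntimeError("no rooms")
def cancel_hotel(): print("hotel cancelled")

ok = run_saga([(book_flight, cancel_flight), (book_hotel, cancel_hotel)])
print("saga committed" if ok else "saga rolled back")
```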

The paper could also have given war stories / real-world examples of the recovery categories while listing them. For example, roughly two years ago a package in NPM (the popular Node.js package manager) was deprecated, leading to cascading failures in thousands of applications around the world, which were eventually fixed with a rollback. Examples like these would have underscored how serious the failures behind each recovery approach can be.

Finally, I believe the paper does not do justice to actually detailing the self-stabilizing use cases; it should have explored more examples beyond coordinated transactions and distributed data systems.

Time, Clocks, and the Ordering of Events in a Distributed System

This is a theoretical paper that discusses what it means for events to be temporally ordered in a distributed system, and the mechanisms that can provide synchronization in such a system. Since the concept of local time is of little use in a distributed system, and synchronizing on a global time has its own issues, the paper discusses synchronization through the ordering of events across different processes in an asynchronous system, in the absence of physical time. The paper describes the notions of local and global ordering of events. Computations occurring within the same process are locally ordered with respect to one another, and the process can maintain this order itself. But in an asynchronous distributed system, local ordering is not sufficient for tasks such as locking shared resources, so a global ordering is needed.

The core contribution of the paper is showing how, using the concept of logical clocks, a partial ordering of events can be maintained when processes communicate. In essence, a logical clock is a counter (tick) that gets incremented on the transition from one event to the next, either within the same process or when processes communicate. The paper also defines a relation called "happens before" (a -> b), which captures that event a precedes event b. The relation is transitive: if a -> b and b -> c, then a -> c. Two events from different processes are partially ordered if a happens-before relation exists between them. The clock condition requires that if event a happens before event b, then the clock value of a is less than the clock value of b.

To achieve this, two implementation rules are used. First, a process increments its counter between any two consecutive events. Second, when processes communicate, the receiver sets its clock to a value greater than the timestamp of the sender and greater than or equal to its own current value. Extending this logic, a total ordering of the system can be achieved by ordering all events using these clocks (which obey the clock condition), with ties broken by an arbitrary ordering of processes.
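
A minimal sketch of these two rules, assuming an illustrative LamportClock class (the names and the simple integer counter are assumptions made for the example).

```python
# A toy Lamport clock: tick on local events, and on receive take the max of the
# local clock and the message timestamp, then tick.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1                      # rule 1: increment between events
        return self.time

    def send(self):
        self.time += 1
        return self.time                    # timestamp carried by the message

    def receive(self, msg_time):
        self.time = max(self.time, msg_time) + 1   # rule 2: jump past the sender
        return self.time

p, q = LamportClock(), LamportClock()
t = p.send()                 # p's clock: 1
q.local_event()              # q's clock: 1
print(q.receive(t))          # q's clock becomes max(1, 1) + 1 = 2
```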

The paper also illustrates the use of total ordering with a version of the mutual-exclusion problem, using resource sharing as the example, where multiple processes try to acquire a shared resource.

OPINIONS

The concept of the logical clock in distributed systems was a pioneering one at the time of publication and provides good insight into how to determine a global ordering of events in a distributed system.

Clocks are critical to asynchronous distributed systems, especially when designing a consistency model. Bypassing the physical clock is a good idea that avoids clock-inaccuracy problems. An additional interesting aspect is that, since there is no master node/process governing the synchronization between requests, i.e. the scheme is decentralized, there is no single point of failure for synchronization.

The logical clock, being a single value that represents relative time, is not enough to capture all the ordering information in an asynchronous system that could have thousands of nodes. The vector clock, an extension of logical clocks in which each process keeps a list of the relative counters of all processes as of that instant, is a much more comprehensive way of ordering events across a distributed system (see the sketch below); vector clocks have been successfully and famously implemented in Amazon's Dynamo. The paper also does not take into consideration, stating it is out of scope, the fact that nodes (processes) and links (channels) are prone to failure, which in real-world implementations may occur frequently.
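
A minimal vector-clock sketch in the spirit of the extension mentioned above; the merge/compare functions and the process identifiers are assumptions made for illustration.

```python
# A toy vector clock: one counter per process, so concurrency (neither event
# "happens before" the other) can be detected, which a single counter cannot.

def vc_merge(local, received, self_id):
    """Clock update when self_id receives a message carrying `received`."""
    merged = {p: max(local.get(p, 0), received.get(p, 0))
              for p in set(local) | set(received)}
    merged[self_id] = merged.get(self_id, 0) + 1   # tick for the receive event
    return merged

def happens_before(a, b):
    keys = set(a) | set(b)
    return (all(a.get(k, 0) <= b.get(k, 0) for k in keys)
            and any(a.get(k, 0) < b.get(k, 0) for k in keys))

a = {"P1": 2, "P2": 0}
b = {"P1": 1, "P2": 3}
print(happens_before(a, b), happens_before(b, a))   # False False -> concurrent events
print(vc_merge(b, a, self_id="P2"))                 # P2's clock after receiving a
```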

Why Does the Cloud Stop Computing?

The paper is a study of outages (unplanned unavailability of web services) that occurred in cloud applications, here 32 diverse and popular internet services, during the period 2009 to 2015. The study is motivated by the fact that, since these services are public-facing and their unavailability can cause revenue loss for the people who use them, the services are accountable and therefore need to be transparent about the factors that led to the outages. The study covers quantitative aspects of outages, such as outage counts, service downtime and uptime, and the effect of service maturity on outage rates, and follows up with a qualitative summary of why the different types of outages happened. The qualitative findings categorize the root causes (13 types) of the outages along with their impacts (6 types) and the procedures (8 types) used to fix them. Another contribution of the paper is COSDB, a database of all outages researched during the study, which is made available to other researchers. The database was generated by collecting news articles and post-mortem reports for the selected services and processing them manually into a set of links and outage-metadata tags.

The paper also explores the possibility of hidden or unknown single points of failure (SPOF). The no-SPOF principle is the design practice of tolerating faults by including redundancy at every level; the paper explains that it is not just about the redundancy itself but also about a reliable failure-recovery chain, which includes failure detection, good failover code, and working backup modules. A flaw anywhere in this chain can cascade failures through the system and create multiple points of failure.

The statistical findings are of some interest: half of the services experienced three or more outages per year, 69% of outages are reported with a downtime, and of the 32 services, 25 on average (27 in their respective worst years) did not reach 99.9% uptime, the industry ideal. These insights confirm what is known in the industry: optimal uptime is very hard to achieve. The qualitative findings comprehensively detail the individual root causes, of which (among the known causes) UPGRADE, NETWORK, and BUGS are the most common. The severity of downtime also depends on the situation; for example, combinations of NATDIS and POWER failures with HUMAN mistakes and BUGS produce the longest downtimes, because all resources are affected and failures cascade. The authors found that upgrading hardware or software is the largest root cause, partly because large production ecosystems are hard to reproduce in test environments; unexpected bugs, in new software, in upgrade scripts, or in both, are responsible for the largest number of outages. Other root causes include misconfiguration, which can occur in ecosystems that depend heavily on configuration; traffic overload, which can cause outages during special events (e.g. New Year's) or due to misconfiguration; cross-service outages caused by interruptions in other services; and security attacks such as malicious applications and DoS attacks, as well as human errors, which can cause serious outages at any time.

OPINIONS

The data-collection and tagging phase is quite comprehensive, and the effort of doing it manually to further research in this domain is highly commendable. By first providing statistical insight into the outages (duration of downtime, uptime, service maturity, etc.), the paper lays good groundwork for the final categorization of outages from the generated metadata, which gives a good qualitative understanding of the study. The creation of COSDB is also a good data source for future studies. However, although news articles are a convenient source to scrape for this task, their content may not be accurate about the specific details of the failures; articles tend to be written in a generalized way for a wide, non-esoteric audience and may be too technically vague to be used for a study like this, which makes some of the tags in the database slightly doubtful. In my opinion, the paper does good work by giving examples of recent outages when categorizing them. In the case of cross-service dependencies, though, it could have mentioned language-specific packages in package managers, where an error in a small package used by almost all popular applications in that language ecosystem has far-reaching consequences on par, in terms of severity, with the unavailability of AWS nodes. A promising extension of this study is suggested by the explosion of deep learning since the paper was published: the manual task of tagging individual outage reports could be done with deep-learning approaches to topic modeling and article summarization. This way, many more sources could be used for each outage to get a better idea of the specifics involved, instead of the guessing that may have been in play when the authors created the sources for COSDB. Additionally, many more internet services could be included in the study to get a wider quantitative perspective on outages, since automation would reduce human intervention, and the pipeline might even be used to incorporate new and ongoing outages as they occur.

Another improvement could be to expand the data sources beyond post-mortems and news articles to include tweets. Most popular services, such as GitHub and Facebook, have official accounts for informing the public about outages, since their high availability is critical to many people. Even in the absence of an official account, engineers from these companies publish links to their post-mortems and company tech blogs, as everyone involved understands the accountability of these services.

Distributed Snapshots: Determining Global States of a Distributed System

The paper describes and proves a global state-detection algorithm (famously referred to as the Chandy-Lamport algorithm) for a fully asynchronous distributed system. Obtaining a global snapshot is challenging because there is no global clock, so the states of all processes cannot be saved reliably at the same moment. Additionally, the state detection has to execute concurrently with the tasks being processed, i.e. it must not interfere with or alter the underlying computation. The motivation for such an algorithm is that it is useful for determining stable properties (once true, always true) of the system, which in turn can be used to detect system properties such as whether the system is deadlocked. The algorithm is also useful in other settings, such as checkpointing (restoring a system to a previously stored state), garbage collection, etc.

A global snapshot is defined as the combination of all local snapshots, i.e. the local state of each process and the state of the channels they use to communicate with one another. Two processes communicate over a pair of unidirectional FIFO channels (incoming and outgoing). The algorithm assumes that processes do not fail, channels have infinite buffers, and messages are not lost, duplicated, or mutilated in transit. The invariant maintained throughout execution is that the messages sent and received over the channels are accounted for consistently: whatever is sent is either received or recorded as part of the channel state in the snapshot. Since there is no synchronization via a global time, the algorithm uses markers to simulate one: a marker is sent by the initiating process and propagated throughout the system, triggering the recording of local snapshots.

The algorithm's rules are defined from two perspectives, sender and receiver: the marker sending rule and the marker receiving rule. The first rule states that a process, after recording its own state, sends a marker along each outgoing channel c before sending any further messages on c. The second rule states that when a process q receives a marker on channel c, it records its own state if it has not already done so, and records the state of c as the messages received along c since it recorded its local state (or as empty, if the marker arrives first). The snapshot algorithm terminates once all processes have received markers and saved their local states and the states of all their incoming channels. Recording states upon receiving markers in this way avoids inconsistencies in the final recorded global state. The paper addresses the problem of non-determinism by showing that the determined global state is a possible state: it is reachable from the initial state and can lead to the same terminal state.
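
A minimal, single-threaded simulation of the two marker rules above, assuming FIFO channels modeled as in-memory lists; the Process class and channel names are illustrative assumptions, not the paper's notation.

```python
# Toy Chandy-Lamport rules: record local state, forward markers, and record
# in-flight messages per incoming channel until that channel's marker arrives.

MARKER = "MARKER"

class Process:
    def __init__(self, name, state):
        self.name, self.state = name, state
        self.recorded_state = None
        self.channel_state = {}     # incoming channel -> recorded in-flight messages
        self.done_channels = set()  # channels on which a marker has arrived

    def start_snapshot(self, out_channels):
        """Marker sending rule: record own state, then send a marker on every
        outgoing channel before any further application messages."""
        self.recorded_state = self.state
        for ch in out_channels:
            ch.append((self.name, MARKER))

    def receive(self, channel_name, msg, out_channels):
        """Marker receiving rule."""
        if msg == MARKER:
            if self.recorded_state is None:           # first marker seen
                self.start_snapshot(out_channels)
            self.done_channels.add(channel_name)      # this channel's state is now final
        else:
            self.state = msg                          # normal application message
            if self.recorded_state is not None and channel_name not in self.done_channels:
                self.channel_state.setdefault(channel_name, []).append(msg)

p = Process("q", state=0)
out = []
p.receive("c1", MARKER, [out])        # first marker: record state, forward markers
p.receive("c2", 42, [out])            # in-flight message recorded for channel c2
print(p.recorded_state, p.channel_state, out)
```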

OPINIONS


The paper does a good job of establishing correctness, which alternatives in this area at the time were unable to do. The use of markers, both to initiate the algorithm and to detect its termination, is described very neatly in the paper, with sufficient examples.

The applicability of the algorithm has grown monumentally beyond determining system stability since it was first published. Global state detection is a huge contribution to the field of databases, especially distributed databases, where checkpointing allows a database to be restored to a previous state after a crash. It can also be used for monitoring nodes in a distributed cluster.

The paper's obvious weakness lies in the assumptions it makes about error-free message flow between processes over the channels connecting them. In real-world applications, outside the domain of theoretical reasoning, these assumptions become non-trivial and make implementing such an algorithm in an asynchronous system, like most internet applications today, almost impossible. The algorithm needs to account for the fact that failures and outages of nodes (processes) and links (channels) are highly likely in such systems; an as-is real-world implementation will not always produce a correct global state, making it unusable for its downstream applications.

Lessons from Giant-Scale Services

The paper is not a traditional theory-oriented paper; it aims to bridge the gap between giant-scale services and the topics of availability and scalability, focusing mainly on the availability aspects of single-site services that serve clients across the internet, using real-world examples. The author therefore describes it as an experience paper: it focuses on the high-level design and the practical challenges of giant-scale data-driven services, suggests simple ways to tackle them, and lays out relevant metrics and rules of thumb. The paper describes the basic model of such large-scale systems, which it refers to throughout: the main components are clients (web browsers); load managers, which map an external service name to the servers' physical names and balance the incoming load across the available servers; servers (CPU, memory, disks) that do the actual computation; and databases that are partitioned and replicated across multiple servers. This model rests on a few assumptions. First, the service provider has limited control over the clients and the IP network. Second, queries drive the service (for protocols such as HTTP, FTP, and RPC). Lastly, read-only queries greatly outnumber updates.

One contribution of the paper is its discussion of load management. It starts from the traditional approach of round-robin DNS, which distributes IP addresses in a rotating fashion; this approach has a glaring flaw in that it does not hide crashed servers, which is made worse by browsers mishandling DNS. The paper then describes the state-of-the-art approaches at the time of publication. One of them, layer-4 (transport-layer) switches, addresses the DNS-mishandling problem of round robin, because the switches can understand and make decisions based on TCP and port-number information. Layer-7 (application-layer) switches go further: they understand HTTP requests and the URLs within them, and are deployed in pairs to monitor open TCP connections, detect node failures, and avoid single points of failure. Alternative load-management approaches are also used, in combination with layer-4 switches.

The main contribution of the paper is the set of metrics it presents for measuring availability. The traditional metric is uptime, the fraction of time a site is handling traffic, with the associated metrics MTBF (mean time between failures) and MTTR (mean time to repair): uptime = (MTBF - MTTR) / MTBF. The paper explains the impact of MTTR and MTBF in this equation: uptime can be improved by reducing MTTR as well as by applying best practices to increase MTBF. Another metric, yield, is the fraction of queries that are completed: yield = (queries completed) / (queries offered). Yield is more useful in practice because it maps correctly to user experience; the key observation is that not all seconds of downtime matter equally, only those that affect users. The harvest metric measures query completeness: harvest = (data available) / (complete data), indicating how much of the database is reflected in an answer. Lastly, the paper introduces the DQ principle: the total amount of data that has to be moved per second, DQ = (data per query) * (queries per second), is roughly constant for a given system. The DQ principle can be used to measure the impact of node failures and to tune the system to handle failures gracefully; because the DQ value scales linearly with the system, it also helps in provisioning for future traffic, and it can inform the choice between replication and partitioning.
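
The metrics above reduce to simple arithmetic; here is a small sketch using the paper's formulas, with made-up numbers for illustration.

```python
# Availability metrics from the paper, as plain functions.

def uptime(mtbf, mttr):
    return (mtbf - mttr) / mtbf

def query_yield(completed, offered):
    return completed / offered

def harvest(data_available, complete_data):
    return data_available / complete_data

def dq(data_per_query, queries_per_second):
    return data_per_query * queries_per_second   # roughly constant for a given system

# e.g. a fault that halves capacity can be absorbed either as lower yield
# (drop half the queries) or as lower harvest (answer from half the data):
print(uptime(mtbf=1000, mttr=10))                                  # 0.99
print(query_yield(completed=900, offered=1000),
      harvest(data_available=50, complete_data=100))               # 0.9 0.5
```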

The paper also discusses replication versus partitioning of the database in terms of availability. To deal with heavy incoming load, the data store may simply be replicated, or it may be partitioned behind a load manager that analyses the content of each request and directs the query to the corresponding server. When a node fails, a replicated system handles it more gracefully than a partitioned system, provided the remaining replicas can support the additional load. The paper also discusses the evolution of such systems, which is an important contribution to distributed-systems practice: it quantifies the trade-offs between upgrade techniques (fast reboot, rolling upgrade, and big flip) in terms of availability (DQ value). Software upgrades are done either through a fast reboot with some downtime, through a rolling upgrade that has to tolerate version differences, or through a big flip that upgrades 50% of the nodes at a time.

OPINIONS


Strengths: The paper presents a concise summary of many techniques widely used to address the practical issues of availability and evolution, and it remains very relevant to today's large-scale systems, however dated it may be. Overall, I believe the paper is a very good analysis of highly available, scalable, and fault-tolerant giant-scale systems and of their high-level design, and is therefore applicable to anyone building a large service; some of its inferences, such as optimizing MTTR, are not at all obvious to the untrained eye.

Criticisms: Some assumptions made in the paper, such as the system being driven by queries and the majority of queries being read-only, were reasonable at the time, but the paper does not delve into the applicability of its principles to systems where a large percentage of requests are computations or updates, and where consistency concerns would affect the creation of replicas. Although not an apparent flaw at the time, the paper also does not address security, database management, or database replication/partitioning strategies; a discussion of designing a large-scale system today would not be possible without them, as they are equally critical to maintaining the CAP properties, of which A (availability) is one.

Improvements: Although the paper declares them out of scope for the topic of availability, it does not dive into infrastructure-related issues such as latency, server load management, fault tolerance, and data recovery/migration, yet it is practically impossible to talk about a highly available system without taking these factors into consideration. An improvement would simply have been to take these factors into account and discuss availability with them in mind, giving an idea of the true scope. Aspects of state-of-the-art practice such as autoscaling with increased/decreased load and service monitoring could also have underlined how broad the field of service availability is.

Dynamo: Amazon's Highly Available Key-value Store

This paper presents the design and implementation details of Dynamo, Amazon's highly available key-value store. In the ten years since its inception it has had a lasting impact and has been a catalyst for a broader movement towards non-relational databases in large-scale internet applications. Its highly available design trades off immediate consistency for high scalability under certain failure scenarios, and it makes extensive use of object versioning and application-assisted conflict resolution while satisfying Amazon's Service Level Agreements, which guarantee that the vast majority of requests complete within a certain time bound.

Dynamo was designed with some fundamental goals in mind besides high availability. First, the system should be incrementally scalable: adding a machine to the cluster should improve overall performance accordingly. Second, it was designed with no leader process, so there is no single point of failure (SPOF) to bottleneck the system or make it harder to scale. Lastly, data should be replicated, and instead of incurring write-time costs to ensure correctness throughout the system, inconsistencies should be resolved at read time, by the clients themselves if necessary. To achieve these goals, some assumptions were made in the design: only simple read/write operations on data items identified by a unique key, and weak ACID properties for the sake of higher availability, with relaxed consistency and no isolation guarantees.

The main contribution of the paper is its account of how Dynamo combines several core distributed-systems techniques. Dynamo partitions data to satisfy the incremental-scalability requirement: it uses consistent hashing to partition keys, treating the output range of the hash function as a circular "ring", and it maps each physical node to a set of virtual nodes on the ring, mainly to spread data evenly across the system and balance load (sketched below). Data is also replicated on N hosts, with each key assigned to a coordinator node that is responsible for its replicas. The list of nodes responsible for storing a particular key is called its preference list; the list contains more than N nodes to account for failures.
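
A minimal consistent-hashing sketch in the spirit of the partitioning scheme described above, with virtual nodes; the hash function, node names, and vnode count are illustrative assumptions, not Dynamo's actual implementation.

```python
# Toy consistent-hashing ring with virtual nodes and a preference list.

import hashlib
from bisect import bisect_right

def h(key):
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes=8):
        # each physical node owns several points ("virtual nodes") on the ring
        self.ring = sorted((h(f"{n}#{i}"), n) for n in nodes for i in range(vnodes))

    def preference_list(self, key, n=3):
        """Walk clockwise from the key's position, collecting N distinct nodes."""
        points = [p for p, _ in self.ring]
        start = bisect_right(points, h(key)) % len(self.ring)
        result = []
        for i in range(len(self.ring)):
            node = self.ring[(start + i) % len(self.ring)][1]
            if node not in result:
                result.append(node)
            if len(result) == n:
                break
        return result

ring = Ring(["A", "B", "C", "D"])
print(ring.preference_list("user:42"))   # coordinator first, then the next N-1 nodes
```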

For data versioning, each object carries a vector clock (a list of <node, counter> pairs) that is used when merging data on reads. When resolving conflicts, causally related versions are merged such that the latest version wins, whereas when there is no causality between versions the conflicting versions are kept and returned to the client. Since the vector clock of an object can grow too large, a truncation scheme is used: when the number of pairs reaches a threshold, the oldest pair is removed from the clock. This could in principle be ineffective, but according to the paper it had not caused problems in production at the time of publication. Sloppy quorums are used to increase availability: R (or W) nodes must participate in a successful read (or write), drawn from the first N healthy nodes of the preference list.

To handle temporary node or network failures, Dynamo uses a hinted-handoff mechanism to preserve the "always writable" property. The basic idea is to select another node to temporarily stand in for the failed node and receive its replicas; those replicas are labelled with a hint and stored in a separate local datastore that is scanned periodically, and when the failed node recovers the replicas are sent back to it. To handle permanent failures and keep replicas synchronized, Dynamo uses an anti-entropy scheme based on Merkle trees: a tree that keeps the hashes of individual keys as leaves and the hash of the children in each parent node, so that inconsistencies can be detected efficiently while minimizing the amount of data transferred. A gossip-based protocol guarantees an eventually consistent view of ring membership: every node keeps track of what it thinks the cluster looks like, i.e. which nodes are reachable, what key ranges they are responsible for, and so on. Decentralized failure detection uses a similar protocol that lets each node learn about the arrival and departure of other nodes.
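
A minimal Merkle-tree sketch to illustrate the anti-entropy idea above: two replicas compare root hashes first and only descend into subtrees that differ. This toy builds a tree over a sorted list of key/value pairs; it is an assumption-laden illustration, not Dynamo's implementation.

```python
# Toy Merkle root over a key range: equal roots mean the range is in sync.

import hashlib

def h(data):
    return hashlib.sha256(data.encode()).hexdigest()

def merkle_root(items):
    """items: sorted list of (key, value) pairs."""
    level = [h(f"{k}={v}") for k, v in items] or [h("")]
    while len(level) > 1:
        if len(level) % 2:
            level.append(level[-1])           # duplicate the last node on odd levels
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica_a = [("k1", "v1"), ("k2", "v2")]
replica_b = [("k1", "v1"), ("k2", "v2-stale")]
# differing roots mean only the diverging subtree needs further comparison/transfer
print(merkle_root(replica_a) == merkle_root(replica_b))   # False
```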

In the software implementation of this design, each storage node has three main components: request coordination, membership and failure detection, and a local persistence engine. The request-coordination component is built on top of an event-driven messaging layer in which the message-processing pipeline is split into multiple stages.

The last contribution of the paper is its method for measuring performance: rather than using the mean or median, the authors look at latency at the 99.9th percentile. The paper also contributes practical results about how these techniques can be applied in a production system.

OPINIONS

The paper is written with only Amazon's use cases in mind, but following the publication Amazon released DynamoDB as a service, which is used to this day for a variety of use cases by companies such as Lyft, Tinder, and Redfin. I found the 99.9th-percentile principle particularly interesting: to provide a good experience for essentially all customers, SLAs are expressed and measured at the 99.9th percentile of the latency distribution. Finally, I think this is a well-structured paper because it lays out Dynamo's design principles first, so that it can then explain why it adopts features like versioning, failure detection, etc.

Strengths: Dynamo achieves high scalability and availability with a slightly weaker consistency model that can generally be handled by the application concerned. The ability of the system to scale incrementally, so that client applications can scale up and down according to need and demand, is a huge advantage for PaaS/SaaS systems. Dynamo also preserves symmetry and avoids a centralized registry for membership and node-liveness information by employing a gossip-based membership protocol and failure detection. The use of Merkle trees for replica synchronization appears to be a big improvement because it minimizes the amount of data that needs to be transferred for synchronization.

Criticisms/Drawbacks: By design, Dynamo resolves version conflicts during reads so that all write operations can be accepted. In a system where reads happen more frequently than writes, it would be preferable, for the customer experience, to make reads efficient instead. Pushing conflict resolution to clients essentially shifts the responsibility and the cost of increased software development onto them, and could potentially expose users' private data to client applications, which is a data-security concern. Additionally, Dynamo lists heterogeneity as one of its base principles, but the experiments were all carried out on homogeneous machines, so the evaluation may not be as comprehensive as presented.

MapReduce: Simplified Data Processing on Large Clusters

This paper introduces MapReduce, a programming model used to process and generate large data sets. The paper, authored by researchers at Google, addresses the problem of running conceptually straightforward computations over very large inputs by distributing them across hundreds or thousands of commodity machines, and it presents the associated implementation for processing and generating large data sets on large clusters. The major contributions are both the concept and the implementation of the MapReduce interface, where map and reduce are user-defined functions. MapReduce benefits from the simplicity of these functions: the map function processes a key/value pair to generate a set of intermediate key/value pairs, while the reduce function merges all intermediate values associated with the same intermediate key. The paper provides examples such as counting word occurrences in a large collection of documents, distributed grep, reverse web-link graph, distributed sort, and inverted index, to highlight the utility of the model across technical domains.

The system proposed in the paper comprises a cluster of machines, a distributed file system (such as GFS or HDFS), and a scheduling system. The master node (part of the cluster) is the most important entity: it orchestrates and schedules all the actions in the system and keeps track of the state (idle, in-progress, completed) and identity of each map and reduce task, as well as the locations of buffered intermediate data on the map workers, which is useful for failure handling. Execution begins with the framework partitioning the input data into M splits, to be processed by different machines. After the input is split, multiple copies of the MapReduce program are started on a cluster of worker machines, one of which acts as the master. The master assigns one of the M map tasks to each idle worker, which reads the contents of its input split, processes it using the user-defined map function, and periodically buffers intermediate key/value pairs to local disk. The locations of these results are passed back to the master, which forwards them to the workers running the reduce function. After receiving a reduce task, a worker reads the intermediate data from the map workers via RPC, sorts it by key, and then iteratively calls the user's reduce function for each key; the output of the reduce function is appended to a final output file for that partition of the data.

To deal with failures, the master pings each worker periodically; if no response is received, the worker is deemed to have failed, and any map tasks associated with it are reset to the initial state and reassigned to other workers. In the case of master failure, the MapReduce operation is simply aborted, though the authors argue that since there is only one master, the probability of its failure is low; periodic checkpoints of the master's data structures could also be used to restart the computation from the last checkpoint. For task granularity, the paper proposes subdividing the map phase into M pieces and the reduce phase into R pieces, where M and R should ideally be much larger than the number of worker machines. The master must make O(M + R) scheduling decisions and keep O(M * R) state in memory. Having each worker perform many different tasks improves dynamic load balancing and also speeds up recovery when a worker fails. The paper mentions several refinements, such as the data-partitioning function, ordering guarantees, combiner functions, and local execution, most of which aim to reduce the amount of data sent across the network, since network bandwidth is a scarce resource. From the experiments, the performance of MapReduce is impressive.
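
A minimal, in-memory sketch of the word-count example cited above, assuming user-defined map/reduce functions and a toy driver that performs the shuffle step; a real deployment would run these over GFS/HDFS splits across many workers.

```python
# Toy MapReduce driver: map, shuffle (group by key), then reduce per key.

from collections import defaultdict

def map_fn(doc_name, text):
    for word in text.split():
        yield word, 1                      # intermediate key/value pairs

def reduce_fn(word, counts):
    yield word, sum(counts)                # merge all values for one key

def run_mapreduce(inputs, map_fn, reduce_fn):
    intermediate = defaultdict(list)
    for name, text in inputs:              # "map phase" over the M input splits
        for k, v in map_fn(name, text):
            intermediate[k].append(v)      # shuffle: group values by key
    output = {}
    for k in sorted(intermediate):         # "reduce phase" over sorted keys
        for key, value in reduce_fn(k, intermediate[k]):
            output[key] = value
    return output

docs = [("d1", "the quick brown fox"), ("d2", "the lazy dog and the fox")]
print(run_mapreduce(docs, map_fn, reduce_fn))
```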

OPINIONS

Although MapReduce was initially implemented and deployed on large clusters with only Google's purposes in mind, people can now leverage any computing resources in a similar fashion, and the model makes their problems easier to solve because it makes parallel and distributed computing easy to realize.

Strengths: It shows that restricting the programming model makes parallel and distributed computing easy to realize. Because the paradigm hides the details of parallelization, fault tolerance, locality optimization, and load balancing, the model is approachable even for users inexperienced with parallel and distributed systems. Also, as the examples in the paper show, a large variety of problems are easily expressible as MapReduce computations. Of the refinements listed in the paper, the partitioning function (partition by hash(key) mod R) and the ordering guarantees (sorted keys, which make it easy to produce a sorted output file per partition) strike me as particularly effective from a real-world engineering point of view.

Criticisms/Drawbacks At a high level, there are not many disadvantages to MapReduce as a concept. Still, because worker nodes perform only one type of task during a MapReduce job, it is quite likely that even with the best scheduling the potential computing power of the system is not fully harnessed: there are periods where map or reduce workers sit idle waiting on their respective inputs. MapReduce also does not run map and reduce jobs as threads; they are processes, which are heavyweight compared to threads. Although the experiments highlight the power of MapReduce on its own, they do not compare MapReduce against other distributed programming models; such experiments could have shown how well MapReduce holds up. Finally, a large amount of intermediate output is written to the distributed filesystem and then read back by the next job, which in aggregate is a lot of I/O overhead.

Improvements Since the entire MapReduce operation is aborted if the master fails, potentially losing a large amount of computation progress, it would be better to implement some form of replication so that a standby node can take over as master when the original fails. This might introduce some performance overhead but would make the system more robust. The Spark paradigm addresses several of MapReduce's downsides around parallelism, CPU utilization, and extensive reads and writes: it runs jobs as threads, is not limited to map and reduce functions, and keeps data in memory between transformations. These advantages have helped Spark evolve into a strong contender to MapReduce.

Bitcoin: A Peer-to-Peer Electronic Cash System.md

As the abstract suggests, this paper presents a peer-to-peer version of electronic cash for online payments that can be sent directly from one party to another without going through a third party such as a financial institution. The transfer of coins is done with public-key cryptography: one person signs a transaction over to the next owner, and whoever holds the corresponding private key controls that money. All transactions are therefore based on cryptographic proof rather than trust, sidestepping the trust issues that come with relying on financial institutions. Bitcoin operates as a peer-to-peer network with no central point of authority, and all transactions are kept in a public ledger called the blockchain. The double-spending problem, in which an individual spends the same money twice, is solved by making participants aware of all transactions in this public ledger at a given point in time, which is accomplished with a distributed timestamp server. The overall solution, together with the explanation of the interlocking components of the system, is the paper's major contribution.

An electronic coin is defined as a chain of digital signatures: an owner transfers the coin by digitally signing a hash of the previous transaction together with the public key of the next owner. The payee can verify the signatures to confirm the chain of ownership, but to rule out double-spending it must be aware of all transactions, which in the absence of a trusted third party requires a single, publicly announced history of transactions. If everyone in the system has this public record, they need to reach consensus on the order of the transactions. Bitcoin achieves this with a distributed timestamp server that chains blocks of transactions together, i.e., a blockchain. A timestamp server works by taking the hash of a block of items to be timestamped and widely publishing that hash. Each timestamp includes the previous timestamp in its hash, forming a chain in which each additional timestamp reinforces the ones before it. Each block is mined by miners solving a hard mathematical puzzle that costs a certain amount of CPU power.
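
The hash-chaining idea behind the timestamp server can be sketched in a few lines. The block layout and field names below are invented for illustration and are not Bitcoin's actual block format; the point is only that each block commits to the hash of its predecessor, so altering an old block invalidates everything after it.

```python
import hashlib
import json
import time

def block_hash(block):
    # hash a canonical serialization of the block's contents
    return hashlib.sha256(json.dumps(block, sort_keys=True).encode()).hexdigest()

def make_block(prev_hash, transactions):
    return {
        "prev_hash": prev_hash,        # commits to the previous block
        "timestamp": time.time(),
        "transactions": transactions,
    }

genesis = make_block(prev_hash="0" * 64, transactions=[])
b1 = make_block(block_hash(genesis), ["alice -> bob: 1 coin"])
b2 = make_block(block_hash(b1), ["bob -> carol: 1 coin"])
# Tampering with b1 changes block_hash(b1), which no longer matches
# b2["prev_hash"], so the rest of the chain would have to be redone.
```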

To implement this distributed timestamp server, the paper uses a proof-of-work system, which involves scanning for a value (mining) that, when hashed with SHA-256, produces a hash beginning with a required number of zero bits. For the timestamp network, the proof-of-work is implemented by incrementing a nonce (a counter of sorts) in the block until a value satisfying the requirement is found; once the CPU effort has been expended, the block cannot be changed without redoing the work, and changing an earlier block would require redoing the proof-of-work for all blocks after it. Proof-of-work also solves the problem of determining representation in majority decision making: it is essentially one-CPU-one-vote, and the majority decision is represented by the longest chain.
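
A toy version of that mining loop is below, assuming a simplified header and a difficulty expressed directly as a number of leading zero bits (real Bitcoin uses double SHA-256 over a binary header and a compact difficulty target, so this is only a sketch).

```python
import hashlib

def mine(header: bytes, difficulty_bits: int) -> int:
    # require the hex digest to start with difficulty_bits/4 zero characters
    target_prefix = "0" * (difficulty_bits // 4)
    nonce = 0
    while True:
        digest = hashlib.sha256(header + nonce.to_bytes(8, "big")).hexdigest()
        if digest.startswith(target_prefix):
            return nonce  # this nonce satisfies the proof-of-work
        nonce += 1        # otherwise keep incrementing and rehashing

nonce = mine(b"prev_hash|merkle_root|timestamp", difficulty_bits=16)
print("nonce found:", nonce)
```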

The network operates as follows: new transactions are broadcast to all nodes; each node collects new transactions into a block; each node works on finding a difficult proof-of-work for its block; when a node finds a proof-of-work, it broadcasts the block to all nodes; nodes accept the block only if all transactions in it are valid and not already spent; and nodes express their acceptance by working on the next block, using the hash of the accepted block as the previous hash. Consensus is reached by always considering the longest chain to be the correct one and continuing to extend it. If two nodes broadcast different versions of the next block simultaneously, nodes work on the first one they received but save the other branch in case it becomes longer; the tie is broken when the next proof-of-work is found and one branch becomes longer, at which point nodes that were working on the shorter branch switch to the longer one. The paper then lays out the payoffs (incentives) nodes receive for participating in a system with no central party, and how the structure of these payoffs makes honesty an evolutionarily stable strategy. There are three incentive considerations: first, the 'block reward' a node receives for creating a block of valid transactions and adding it to the longest chain; second, transaction fees as an additional incentive; and lastly, the observation that in practice this incentive structure makes fair play, i.e., extending the chain with valid transactions, more rewarding than foul play.
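
The fork-resolution rule itself is simple enough to sketch. Representing each branch as a plain list of block identifiers is a deliberate simplification; it only shows the "longest valid branch wins" behavior described above.

```python
def best_branch(branches):
    # nodes consider the longest branch they know about to be correct;
    # validity checks on the blocks are assumed to have happened already
    return max(branches, key=len)

current = ["genesis", "a1", "a2"]          # the branch a node was extending
competing = ["genesis", "a1", "b2", "b3"]  # a saved branch that grew longer
print(best_branch([current, competing]))   # the node switches to the longer branch
```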

The use of a pruned Merkle tree allows the Bitcoin blockchain to be compacted over time, which supports the longevity and broad adoption of the system. In the example provided, only one of four transactions needs to be carried forward directly once the Merkle tree is pruned, and only the root hash needs to be included in the block, which is far more efficient than carrying all four transactions around in perpetuity.
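
A short sketch of computing a Merkle root over transaction hashes shows why pruning works: a block only ever needs the root (plus a short branch per transaction it still cares about). Duplicating the last hash on odd levels mirrors Bitcoin's convention, but the rest is illustrative.

```python
import hashlib

def h(data: bytes) -> bytes:
    # double SHA-256, as Bitcoin uses for its Merkle tree
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()

def merkle_root(tx_hashes):
    level = list(tx_hashes)
    while len(level) > 1:
        if len(level) % 2:                 # odd count: duplicate the last hash
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

txs = [h(f"tx{i}".encode()) for i in range(4)]
root = merkle_root(txs)
# Verifying one transaction later needs only its hash, a log-sized branch of
# sibling hashes, and this root -- not all four transactions.
print(root.hex())
```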

Building on this compacting property of Merkle trees, the paper also shows how easily a user can confirm that a transaction has been accepted into the Bitcoin blockchain without running a full network node. There is, however, a large cumulative social benefit to running a full node: doing so increases the reliability of the Bitcoin network by helping ensure the integrity of its chain and preserves its decentralized structure. The paper notes that although the incentives for running a full node are likely to be greater for commercial entities, ideally it should be the default behavior.

Privacy is an important part of the paper and foundational to Bitcoin. Open decentralization is not incompatible with privacy: the privacy a trusted third party would provide can be recreated on a decentralized network by breaking the flow of information in a different place and keeping public keys anonymous. The public can see that someone is sending an amount to someone else, but without information linking the transaction to any particular identity. As an additional firewall, a new key pair should be used for each transaction to keep transactions from being linked to a common owner. The paper also recognizes that it remains possible to reverse-engineer transaction patterns and attribute public keys to their owners. It compares the Bitcoin privacy model with that of traditional banks and emphasizes the elegance of achieving a similar degree of privacy without an intermediary.

The paper concludes with an interesting evaluation of the race between two competing chains when an attacker tries to build an alternate, adversarial chain. Modeling the situation as a binomial random walk simplifies the problem, because one only needs to consider the probability of the attacker's chain gaining or losing a one-block advantage at each step and then play the race out over time. The calculation shows that the probability of an attacker catching up from a given deficit decreases exponentially as the deficit grows.
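
The paper spells out this calculation; the sketch below is a Python rendering of it, combining a Poisson estimate of the attacker's hidden progress with the gambler's-ruin catch-up probability (q/p)^(z-k), where q is the attacker's share of hash power and z is the confirmation depth. The printed values should be read as an approximation of the paper's table rather than an independent result.

```python
from math import exp, factorial

def attacker_success_probability(q: float, z: int) -> float:
    p = 1.0 - q                  # honest nodes' share of hash power
    lam = z * (q / p)            # expected attacker progress while z blocks are mined
    prob = 1.0
    for k in range(z + 1):
        poisson = exp(-lam) * lam ** k / factorial(k)
        prob -= poisson * (1 - (q / p) ** (z - k))
    return prob

for z in (0, 2, 5, 10):
    print(f"q=0.1, z={z}: {attacker_success_probability(0.1, z):.7f}")
# The probability falls off exponentially as z grows.
```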

OPINIONS

This paper is a pioneer of the field and gave rise to the many blockchain "coins" that followed its publication. The aspects of the Bitcoin architecture are genuinely interesting and worth exploring from a decentralized-systems perspective. Several of the paper's ideas stand out. First, it describes an electronic cash system that provides a public history of transactions and is impervious to the double-spending attack, the Achilles heel of most online electronic cash platforms. The proof-of-work mechanism, which gives honest nodes an incentive to control a majority of the computing power, is also an important point to note. Features such as the network requiring little coordination, encouraging overall best effort, allowing nodes to exit and enter at will, and voting by proxy of computational power further provide a strong base environment for the Bitcoin network that the paper outlines.

Criticisms/Drawbacks The practical aspects of implementing a blockchain feel lacking, although the paper does a great job of explaining the underlying concepts and motivations for building such a platform. Finer points, such as the ideal capacity of a block (how many transactions it should hold), how the system would scale as the overall chain grows, and the performance overhead when the amount of data increases drastically, are missing. Also, as the paper itself mentions, it offers no way to reverse transactions once they are accepted, however difficult such a mechanism might be to build. The paper also does not provide a mechanism for recovering from a 51% attack, which means, by extension, that it does not provide a way to protect the inherently decentralized nature of the system, which is the crux of Bitcoin.

Improvements Since the paper presents a high-level view of a system built on blockchain technology, there is not much to suggest in the way of improvements, except that a case study of a real-world deployment would have brought a number of implementation factors to the fore, giving a more holistic understanding, both theoretical and practical.
