GFS in Python. Simple example
""" | |
A brief summary of GFS is as follows. GFS consists of three components: a client,
a master, and one or more chunkservers. The client is the only user-visible,
that is, programmer-accessible, part of the system. It functions similarly to a
standard POSIX file library. The master is a single server that holds all metadata
for the filesystem. By metadata we mean the information about each file, its
constituent pieces called chunks, and the location of these chunks on the various
chunkservers. The chunkservers are where the actual data is stored; the vast
majority of network traffic takes place between the client and the chunkservers,
which keeps the master from becoming a bottleneck. We give more detailed
descriptions below by going through the GFS client, master, and chunkserver as
implemented in Python classes, and close with a test script and its output.
The client class is the only user-visible portion of the GFS library. It mediates
all requests between the application, which wants filesystem access, and the
master and chunkservers, which store and retrieve the data. It is important to
note that GFS appears very familiar to programmers of normal filesystems; no
knowledge of the distribution is required, since all of it is abstracted away
behind the client implementation. Of course there are some exceptions, such as
the localized chunk knowledge used to allocate processing of files most
efficiently, as in the MapReduce algorithm, but we have avoided such complexity
in this implementation. What is most important to note is how the normal read,
write, append, exists, and delete calls are available in their common forms, and
how these are implemented by the client class; we also simplify open, close, and
create by subsuming them under the previous methods. The gist of each method is
the same: ask the master for the metadata, including chunk IDs and chunk
locations on the chunkservers, then update any necessary metadata with the
master, and finally transact the actual data flow only with the chunkservers.
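The split-and-reassemble at the heart of these methods can be sketched on its
own; the 10-byte chunk size and the sample string here are made-up values for
illustration:

```python
# Stand-alone sketch of the client's chunking: data is cut into
# fixed-size pieces on write and rejoined in order on read.
chunksize = 10
data = "The quick brown fox jumps over the lazy dog."

# write path: split into chunksize-byte pieces
chunks = [data[i:i + chunksize] for i in range(0, len(data), chunksize)]
print(len(chunks))              # 5 chunks for 44 bytes

# read path: concatenate in the order the master's metadata lists them
print("".join(chunks) == data)  # True
```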
The master class simulates a GFS master server. This is where all the metadata
is stored; it is the core node of the entire system. Client requests initiate
with the master, and once the metadata is retrieved, clients talk directly to
the individual chunkservers. This keeps the master from becoming a bottleneck,
as the metadata is small and quick to serve. The metadata is implemented as a
series of dictionaries, although in a real system you would back the dicts with
persistent storage. Notification of chunkservers becoming available and
unavailable via heartbeats, chunkserver authentication, and localization info
for efficient storage are all simplified here, so the master itself allocates
chunkservers. However, we still preserve the direct client read/write to the
chunkservers, bypassing the master, to show how the distributed system works.
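The master's bookkeeping boils down to two dictionaries plus a round-robin
counter. A minimal sketch of that allocation scheme (the chunk size, server
count, and filename are assumed values, not part of the class below):

```python
# Sketch of the master's metadata dicts and round-robin chunk placement.
import uuid

chunksize = 10
num_servers = 3
filetable = {}   # filename -> ordered list of chunk ids
chunktable = {}  # chunk id -> chunkserver index
robin = 0

def alloc(filename, data):
    global robin
    num_chunks = -(-len(data) // chunksize)  # ceiling division
    ids = []
    for _ in range(num_chunks):
        cid = uuid.uuid1()
        chunktable[cid] = robin              # place on next server in turn
        robin = (robin + 1) % num_servers
        ids.append(cid)
    filetable[filename] = ids
    return ids

ids = alloc("/tmp/hello.txt", "x" * 25)  # 25 bytes -> 3 chunks
print(len(ids))                          # 3
print([chunktable[c] for c in ids])      # [0, 1, 2]
```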
The chunkserver class is the smallest in this project. It represents an actual
distinct machine running in a massive datacenter, connected to a network
reachable by the master and client. In GFS, the chunkservers are relatively
"dumb" in that they know only about chunks, that is, the file data broken up
into pieces. They don't see the whole picture of the entire file, where it sits
across the whole filesystem, the associated metadata, and so on. We implement
this class as simple local storage, which you can examine after running the
test code by looking under the directory path "/tmp/gfs/chunks". In a real
system you would want persistent storage of the chunk info for backup.
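A chunkserver's storage is just one file per chunk, named by the chunk's UUID.
A stand-alone sketch of that layout (using a temporary directory instead of
/tmp/gfs/chunks, so it does not collide with the real class below):

```python
# Each chunk lives in its own <uuid>.gfs file under a per-server directory.
import os
import uuid
import tempfile

root = os.path.join(tempfile.mkdtemp(), "chunks", "0")
os.makedirs(root, exist_ok=True)

def write_chunk(cid, data):
    with open(os.path.join(root, str(cid) + ".gfs"), "w") as f:
        f.write(data)

def read_chunk(cid):
    with open(os.path.join(root, str(cid) + ".gfs")) as f:
        return f.read()

cid = uuid.uuid1()
write_chunk(cid, "hello chunk")
print(read_chunk(cid))  # hello chunk
```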
We use main() as a test of all the client methods, including exceptions. We
first create a master and a client object, then write a file. The write is
performed by the client in the same way as in the real GFS: first it gets the
chunk metadata from the master, then writes chunks directly to each
chunkserver. Append works similarly. Delete is handled in the GFS fashion: the
file is renamed into a hidden namespace and left for later garbage collection.
A dump displays the metadata content. Note that this is a single-threaded test,
as this demonstration program does not support concurrency, although that could
be added with appropriate locks around the metadata.
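The delete-as-rename idea can be shown in isolation; the filetable contents
and chunk names here are made up for the sketch:

```python
# Sketch of GFS-style delete: the chunks are kept, but the file's entry
# moves to a hidden namespace keyed by deletion time, so a garbage
# collector can reclaim the space later.
import time

filetable = {"/usr/python/readme.txt": ["chunk-1", "chunk-2"]}

def delete(filename):
    chunks = filetable.pop(filename)
    hidden = "/hidden/deleted/" + repr(time.time()) + filename
    filetable[hidden] = chunks  # still listed, but under the hidden name
    return hidden

hidden = delete("/usr/python/readme.txt")
print("/usr/python/readme.txt" in filetable)  # False
print(hidden.startswith("/hidden/deleted/"))  # True
```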
Putting it all together, here is the output of the test script run from the
Python interpreter. Pay special attention to the master metadata dump at the
end, where you can see how the chunks are spread across the chunkservers in
jumbled order, only to be reassembled by the client in the order specified by
the master metadata. Of course we are lacking some of the complexities of GFS
necessary for a fully functional system: metadata locking, chunk leases,
replication, master failover, localization of data, chunkserver heartbeats,
and deleted-file garbage collection. But what we have here demonstrates the
gist of GFS and should give you a better understanding of the basics. It can
also be a starting point for your own explorations into more detailed
distributed filesystem code in Python.
Source: http://clouddbs.blogspot.com/2010/11/gfs-google-file-system-in-199-lines-of.html
"""
import uuid
import os
import time
import operator
class GFSClient:
    def __init__(self, master):
        self.master = master

    def write(self, filename, data):  # filename is full namespace path
        if self.exists(filename):  # if already exists, overwrite
            self.delete(filename)
        num_chunks = self.num_chunks(len(data))
        chunkuuids = self.master.alloc(filename, num_chunks)
        self.write_chunks(chunkuuids, data)

    def write_chunks(self, chunkuuids, data):
        chunks = [data[x:x + self.master.chunksize]
                  for x in range(0, len(data), self.master.chunksize)]
        chunkservers = self.master.get_chunkservers()
        for i in range(len(chunkuuids)):  # write to each chunkserver
            chunkuuid = chunkuuids[i]
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunkservers[chunkloc].write(chunkuuid, chunks[i])

    def num_chunks(self, size):
        return (size // self.master.chunksize) \
            + (1 if size % self.master.chunksize > 0 else 0)

    def write_append(self, filename, data):
        if not self.exists(filename):
            raise Exception("append error, file does not exist: "
                            + filename)
        num_append_chunks = self.num_chunks(len(data))
        append_chunkuuids = self.master.alloc_append(filename,
                                                     num_append_chunks)
        self.write_chunks(append_chunkuuids, data)

    def exists(self, filename):
        return self.master.exists(filename)

    def read(self, filename):  # get metadata, then read chunks direct
        if not self.exists(filename):
            raise Exception("read error, file does not exist: "
                            + filename)
        chunks = []
        chunkuuids = self.master.get_chunkuuids(filename)
        chunkservers = self.master.get_chunkservers()
        for chunkuuid in chunkuuids:
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunk = chunkservers[chunkloc].read(chunkuuid)
            chunks.append(chunk)
        return "".join(chunks)  # reassemble in order
class GFSMaster:
    def __init__(self):
        self.num_chunkservers = 5
        self.max_chunkservers = 10
        self.max_chunksperfile = 100
        self.chunksize = 10
        self.chunkrobin = 0
        self.filetable = {}  # file to chunk mapping
        self.chunktable = {}  # chunkuuid to chunkloc mapping
        self.chunkservers = {}  # loc id to chunkserver mapping
        self.init_chunkservers()

    def init_chunkservers(self):
        for i in range(self.num_chunkservers):
            self.chunkservers[i] = GFSChunkserver(i)

    def get_chunkservers(self):
        return self.chunkservers

    def alloc(self, filename, num_chunks):  # return ordered chunkuuid list
        chunkuuids = self.alloc_chunks(num_chunks)
        self.filetable[filename] = chunkuuids
        return chunkuuids

    def alloc_chunks(self, num_chunks):
        chunkuuids = []
        for i in range(num_chunks):
            chunkuuid = uuid.uuid1()
            chunkloc = self.chunkrobin
            self.chunktable[chunkuuid] = chunkloc
            chunkuuids.append(chunkuuid)
            self.chunkrobin = (self.chunkrobin + 1) % self.num_chunkservers
        return chunkuuids

    def alloc_append(self, filename, num_append_chunks):  # append chunks
        chunkuuids = self.filetable[filename]
        append_chunkuuids = self.alloc_chunks(num_append_chunks)
        chunkuuids.extend(append_chunkuuids)
        return append_chunkuuids

    def get_chunkloc(self, chunkuuid):
        return self.chunktable[chunkuuid]

    def get_chunkuuids(self, filename):
        return self.filetable[filename]

    def exists(self, filename):
        return filename in self.filetable

    def delete(self, filename):  # rename for later garbage collection
        chunkuuids = self.filetable[filename]
        del self.filetable[filename]
        timestamp = repr(time.time())
        deleted_filename = "/hidden/deleted/" + timestamp + filename
        self.filetable[deleted_filename] = chunkuuids
        print("deleted file: " + filename + " renamed to "
              + deleted_filename + " ready for gc")

    def dump_metadata(self):
        print("Filetable:")
        for filename, chunkuuids in self.filetable.items():
            print(filename, "with", len(chunkuuids), "chunks")
        print("Chunkservers:", len(self.chunkservers))
        print("Chunkserver Data:")
        for chunkuuid, chunkloc in sorted(self.chunktable.items(),
                                          key=operator.itemgetter(1)):
            chunk = self.chunkservers[chunkloc].read(chunkuuid)
            print(chunkloc, chunkuuid, chunk)
class GFSChunkserver:
    def __init__(self, chunkloc):
        self.chunkloc = chunkloc
        self.chunktable = {}
        self.local_filesystem_root = "/tmp/gfs/chunks/" + repr(chunkloc)
        if not os.access(self.local_filesystem_root, os.W_OK):
            os.makedirs(self.local_filesystem_root)

    def write(self, chunkuuid, chunk):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "w") as f:
            f.write(chunk)
        self.chunktable[chunkuuid] = local_filename

    def read(self, chunkuuid):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "r") as f:
            return f.read()

    def chunk_filename(self, chunkuuid):
        return self.local_filesystem_root + "/" + str(chunkuuid) + ".gfs"
def main():
    # test script for filesystem

    # setup
    master = GFSMaster()
    client = GFSClient(master)

    # test write, exist, read
    print("\nWriting...")
    client.write("/usr/python/readme.txt", """
This file tells you all about python that you ever wanted to know.
Not every README is as informative as this one, but we aim to please.
Never yet has there been so much information in so little space.
""")
    print("File exists?", client.exists("/usr/python/readme.txt"))
    print(client.read("/usr/python/readme.txt"))

    # test append, read after append
    print("\nAppending...")
    client.write_append("/usr/python/readme.txt",
                        "I'm a little sentence that just snuck in at the end.\n")
    print(client.read("/usr/python/readme.txt"))

    # test delete
    print("\nDeleting...")
    client.delete("/usr/python/readme.txt")
    print("File exists?", client.exists("/usr/python/readme.txt"))

    # test exceptions
    print("\nTesting Exceptions...")
    try:
        client.read("/usr/python/readme.txt")
    except Exception as e:
        print("This exception should be thrown:", e)
    try:
        client.write_append("/usr/python/readme.txt", "foo")
    except Exception as e:
        print("This exception should be thrown:", e)

    # show structure of the filesystem
    print("\nMetadata Dump...")
    master.dump_metadata()


if __name__ == "__main__":
    main()