@creotiv
Last active August 25, 2021 17:34
GFS in Python. Simple example
"""
A brief summary of GFS is as follows. GFS consists of three components: a client,
a master, and one or more chunkservers. The client is the only user-visible,
that is programmer-accessible, part of the system. It functions similarly to a
standard POSIX file library. The master is a single server that holds all metadata
for the filesystem. By metadata we mean the information about each file, its
constituent components called chunks, and the location of these chunks on various
chunkservers. The chunkservers are where the actual data is stored, and the
vast majority of network traffic takes place between the client and the chunkservers,
to keep the master from becoming a bottleneck. We give more detailed descriptions
below by walking through the GFS client, master, and chunkserver as implemented
in Python classes, and close with a test script.
The client class is the only user-visible portion of the GFS library. It mediates
all requests between the user, who wants filesystem access, and the master and
chunkservers, which handle data storage and retrieval. It is important to note that
GFS appears very familiar to programmers of normal filesystems: no knowledge of the
distributed machinery is required, because it is all abstracted away behind the
client implementation. (There are exceptions, such as the localized chunk knowledge
used to schedule processing of files most efficiently, as in the MapReduce
algorithm, but we have avoided such complexity in this implementation.) What is most
critical is how the normal read, write, append, exists, and delete calls are
available in their common forms, and how the client class implements them; we also
simplify open, close, and create by subsuming them under the previous methods.
The gist of each method is the same: ask the master for the metadata, including
chunk IDs and chunk locations on the chunkservers; update any necessary metadata
with the master; and finally exchange the actual file data only with the
chunkservers.
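To make the chunk math concrete, here is a standalone sketch of the same
ceiling-division and slicing logic the client uses below (the chunksize of 10
matches the value GFSMaster configures; the helper names are ours):

```python
def num_chunks(size, chunksize=10):
    # ceiling division: one extra chunk for any remainder,
    # mirroring GFSClient.num_chunks with GFSMaster's chunksize of 10
    return (size // chunksize) + (1 if size % chunksize > 0 else 0)

def split_chunks(data, chunksize=10):
    # the same slicing GFSClient.write_chunks performs
    return [data[x:x + chunksize] for x in range(0, len(data), chunksize)]
```

A 25-byte payload therefore becomes three chunks: two full 10-byte chunks and
one 5-byte remainder.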
The master class simulates a GFS master server. This is where all the metadata
is stored, the core node of the entire system. Client requests initiate with the
master, then after metadata is retrieved, they talk directly to the individual
chunkservers. This avoids the master becoming a bottleneck, since metadata requests
are small and fast. The metadata is implemented as a series of dictionaries,
although a real system would back the dicts with persistent storage. Chunkserver
heartbeats (the notifications that chunkservers have become available or
unavailable), chunkserver authentication, and localization info for efficient
storage are all simplified here, so the master itself allocates chunkservers.
However, we still preserve the direct client read/write to the chunkservers,
bypassing the master, to show how the distributed system works.
The chunkserver class is the smallest in this project. This represents an actual
distinct box running in a massive datacenter, connected to a network reachable by
the master and client. In GFS, the chunkservers are relatively "dumb" in that they
know only about chunks, that is, the file data broken up into pieces. They don't
see the whole picture of the entire file, where it is across the whole filesystem,
the associated metadata, etc. We implement this class as simple local storage,
which you can examine after running the test code by looking under the directory
path "/tmp/gfs/chunks". In a real system you'd want persistent, backed-up storage
of the chunk data.
We use main() as a test for all the client methods, including exceptions. We
first create a master and client object, then write a file. This write is
performed by the client in the same way as the real GFS: first it gets the chunk
metadata from the master, then writes chunks directly to each chunkserver. Append
functions similarly. Delete is handled in the GFS fashion, where it renames the
file to a hidden namespace and leaves it for later garbage collection. A dump
displays metadata content. Note that this is a single-threaded test, as this
demonstration program does not support concurrency, although that could be added
with appropriate locks around the metadata.
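A minimal sketch of what such locking could look like (LockedMaster is a
hypothetical illustration, not part of this code):

```python
import threading

class LockedMaster:
    # Hypothetical thread-safe variant of the master's metadata handling:
    # a single lock guards every read-modify-write of the dictionaries.
    def __init__(self):
        self.filetable = {}
        self.metalock = threading.Lock()

    def alloc(self, filename, chunkuuids):
        with self.metalock:  # atomic metadata update
            self.filetable[filename] = list(chunkuuids)

    def exists(self, filename):
        with self.metalock:
            return filename in self.filetable
```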
Putting it all together, the main() function at the bottom runs the test script;
execute the file to see the output. Pay special attention to the master metadata
dump at the end, where you can see how the chunks are spread across chunkservers in
jumbled order, only to be reassembled by the client in the order specified by the
master metadata.
Now of course we are lacking some of the complexities of GFS necessary for a fully
functional system: metadata locking, chunk leases, replication, master failover,
localization of data, chunkserver heartbeats, deleted file garbage collection.
But what we have here demonstrates the gist of GFS and will help give you a better
understanding of the basics. It can also be a starting point for your own
explorations into more detailed distributed filesystem code in Python.
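As one example of such an exploration, the garbage collection that delete() sets up
(by renaming files into the /hidden/deleted/ namespace with a timestamp) could be
sketched like this; collect_garbage is our own illustrative helper operating on a
plain filetable dict:

```python
import time

DELETED_PREFIX = "/hidden/deleted/"

def collect_garbage(filetable, grace_period=3600.0):
    # Scan the filetable for entries that delete() renamed into the hidden
    # namespace, and drop those older than grace_period seconds.
    now = time.time()
    removed = []
    for name in list(filetable):
        if not name.startswith(DELETED_PREFIX):
            continue
        # delete() builds the name as prefix + repr(time.time()) + original
        # path, and the original path always starts with "/"
        timestamp = float(name[len(DELETED_PREFIX):].split("/", 1)[0])
        if now - timestamp > grace_period:
            removed.append(name)
            del filetable[name]  # a real GC would also free the chunks
    return removed
```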
Source: http://clouddbs.blogspot.com/2010/11/gfs-google-file-system-in-199-lines-of.html
"""
import uuid
import os
import time
import operator
class GFSClient:
    def __init__(self, master):
        self.master = master

    def write(self, filename, data):  # filename is full namespace path
        if self.exists(filename):  # if already exists, overwrite
            self.delete(filename)
        num_chunks = self.num_chunks(len(data))
        chunkuuids = self.master.alloc(filename, num_chunks)
        self.write_chunks(chunkuuids, data)

    def write_chunks(self, chunkuuids, data):
        chunks = [data[x:x + self.master.chunksize]
                  for x in range(0, len(data), self.master.chunksize)]
        chunkservers = self.master.get_chunkservers()
        for i in range(len(chunkuuids)):  # write to each chunkserver
            chunkuuid = chunkuuids[i]
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunkservers[chunkloc].write(chunkuuid, chunks[i])

    def num_chunks(self, size):
        # ceiling division: one extra chunk for any remainder
        return (size // self.master.chunksize) \
            + (1 if size % self.master.chunksize > 0 else 0)

    def write_append(self, filename, data):
        if not self.exists(filename):
            raise Exception("append error, file does not exist: " + filename)
        num_append_chunks = self.num_chunks(len(data))
        append_chunkuuids = self.master.alloc_append(filename, num_append_chunks)
        self.write_chunks(append_chunkuuids, data)

    def exists(self, filename):
        return self.master.exists(filename)

    def read(self, filename):  # get metadata, then read chunks direct
        if not self.exists(filename):
            raise Exception("read error, file does not exist: " + filename)
        chunks = []
        chunkuuids = self.master.get_chunkuuids(filename)
        chunkservers = self.master.get_chunkservers()
        for chunkuuid in chunkuuids:
            chunkloc = self.master.get_chunkloc(chunkuuid)
            chunk = chunkservers[chunkloc].read(chunkuuid)
            chunks.append(chunk)
        return "".join(chunks)  # reassemble in order

    def delete(self, filename):
        self.master.delete(filename)
class GFSMaster:
    def __init__(self):
        self.num_chunkservers = 5
        self.max_chunkservers = 10
        self.max_chunksperfile = 100
        self.chunksize = 10
        self.chunkrobin = 0
        self.filetable = {}     # file to chunk mapping
        self.chunktable = {}    # chunkuuid to chunkloc mapping
        self.chunkservers = {}  # loc id to chunkserver mapping
        self.init_chunkservers()

    def init_chunkservers(self):
        for i in range(self.num_chunkservers):
            self.chunkservers[i] = GFSChunkserver(i)

    def get_chunkservers(self):
        return self.chunkservers

    def alloc(self, filename, num_chunks):  # return ordered chunkuuid list
        chunkuuids = self.alloc_chunks(num_chunks)
        self.filetable[filename] = chunkuuids
        return chunkuuids

    def alloc_chunks(self, num_chunks):
        chunkuuids = []
        for _ in range(num_chunks):
            chunkuuid = uuid.uuid1()
            chunkloc = self.chunkrobin  # round-robin chunkserver selection
            self.chunktable[chunkuuid] = chunkloc
            chunkuuids.append(chunkuuid)
            self.chunkrobin = (self.chunkrobin + 1) % self.num_chunkservers
        return chunkuuids

    def alloc_append(self, filename, num_append_chunks):  # append chunks
        chunkuuids = self.filetable[filename]
        append_chunkuuids = self.alloc_chunks(num_append_chunks)
        chunkuuids.extend(append_chunkuuids)
        return append_chunkuuids

    def get_chunkloc(self, chunkuuid):
        return self.chunktable[chunkuuid]

    def get_chunkuuids(self, filename):
        return self.filetable[filename]

    def exists(self, filename):
        return filename in self.filetable

    def delete(self, filename):  # rename for later garbage collection
        chunkuuids = self.filetable[filename]
        del self.filetable[filename]
        timestamp = repr(time.time())
        deleted_filename = "/hidden/deleted/" + timestamp + filename
        self.filetable[deleted_filename] = chunkuuids
        print("deleted file: " + filename + " renamed to "
              + deleted_filename + " ready for gc")

    def dump_metadata(self):
        print("Filetable:")
        for filename, chunkuuids in self.filetable.items():
            print(filename, "with", len(chunkuuids), "chunks")
        print("Chunkservers:", len(self.chunkservers))
        print("Chunkserver Data:")
        for chunkuuid, chunkloc in sorted(self.chunktable.items(),
                                          key=operator.itemgetter(1)):
            chunk = self.chunkservers[chunkloc].read(chunkuuid)
            print(chunkloc, chunkuuid, chunk)
class GFSChunkserver:
    def __init__(self, chunkloc):
        self.chunkloc = chunkloc
        self.chunktable = {}
        self.local_filesystem_root = "/tmp/gfs/chunks/" + repr(chunkloc)
        if not os.access(self.local_filesystem_root, os.W_OK):
            os.makedirs(self.local_filesystem_root)

    def write(self, chunkuuid, chunk):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "w") as f:
            f.write(chunk)
        self.chunktable[chunkuuid] = local_filename

    def read(self, chunkuuid):
        local_filename = self.chunk_filename(chunkuuid)
        with open(local_filename, "r") as f:
            return f.read()

    def chunk_filename(self, chunkuuid):
        return self.local_filesystem_root + "/" + str(chunkuuid) + ".gfs"
def main():
    # test script for filesystem

    # setup
    master = GFSMaster()
    client = GFSClient(master)

    # test write, exist, read
    print("\nWriting...")
    client.write("/usr/python/readme.txt", """
This file tells you all about python that you ever wanted to know.
Not every README is as informative as this one, but we aim to please.
Never yet has there been so much information in so little space.
""")
    print("File exists?", client.exists("/usr/python/readme.txt"))
    print(client.read("/usr/python/readme.txt"))

    # test append, read after append
    print("\nAppending...")
    client.write_append("/usr/python/readme.txt",
                        "I'm a little sentence that just snuck in at the end.\n")
    print(client.read("/usr/python/readme.txt"))

    # test delete
    print("\nDeleting...")
    client.delete("/usr/python/readme.txt")
    print("File exists?", client.exists("/usr/python/readme.txt"))

    # test exceptions
    print("\nTesting Exceptions...")
    try:
        client.read("/usr/python/readme.txt")
    except Exception as e:
        print("This exception should be thrown:", e)
    try:
        client.write_append("/usr/python/readme.txt", "foo")
    except Exception as e:
        print("This exception should be thrown:", e)

    # show structure of the filesystem
    print("\nMetadata Dump...")
    master.dump_metadata()


if __name__ == "__main__":
    main()