First of all, we need to make some assumptions about the whole scenario. Considering that Quake 3 is a popular game that serves hundreds or maybe thousands of users, we can assume that a log file can reach gigabytes or even terabytes in size. Building a CLI tool that runs on a simple personal computer may therefore lead to an unstable solution. To address this, I decided that our log file will live on a Hadoop Distributed File System (HDFS) instance, i.e. a cluster of computers whose main function is to store big amounts of data in a scalable and stable way. However, we still need to manage our log data in order to solve the essay requirements.
Although Apache Hadoop MapReduce has historically been the Big Data processor of choice, a few other Apache projects seem to outperform it. A famous choice is Apache Spark, which is often reported to process data about 10x faster on disk and up to 100x faster in memory compared to MapReduce. Bear in mind that we will still use HDFS to store our Big Data. I could use a whole Apache Hadoop distribution to both store and process the data, but instead I decided to use Spark to process our log file, because of its speed, popularity and easy-to-learn (and easy-to-code) client API libraries available in Python, Scala and Java.
Although I will not deliver a complete implementation, only small illustrative sketches, bear in mind that the whole essay is solved from a Python perspective; that is, every class/module mentioned relates to this language. So first, we need to understand our log data. Assuming that we are dealing with a plain text file, a game's info could look like a line containing a start-timestamp event, followed by a series of lines where each one contains the killer, the killed player and the gun type (or the cause of death in case a player dies by himself), and finished by a line containing an end-timestamp event. Whether we are dealing with big data or not, retrieving data from disk every time we run a query on the same dataset is not a good solution. So, in order to speed up query time on a big file, I decided to use the Parquet file format, since well-known benchmarks show that processing a Parquet file is many times faster than processing SQL tables, JSON, CSV or TXT files.
Ok, but now we need to dive a little bit into how to create a command line tool in Python. A very common choice among programmers is the docopt library, which is based on a convention that is very easy to write: the usage guide is just a plain string (like a man page) describing the tool. What docopt does is parse this big string and internally create all the code necessary to parse the user input on the shell. So, our CLI will have a main command called quake3 and three subcommands (each with its own options): remove, summary and rank.
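As an illustration only (the exact flags are assumptions of mine, and the web-related options discussed later would also be listed here), the usage string at the top of cli.py could look like:

    """Quake3 log analyzer.

    Usage:
      quake3 summary [-s] [-p] <filename>
      quake3 rank [-s] [-p] <filename>
      quake3 remove <filename>
      quake3 (-h | --help)

    Options:
      -h --help    Show this screen.
      -s --save    Save (or reuse) the parsed log as a parquet file.
      -p           Update a previously saved parquet file.
    """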
You may be wondering why we have a command like remove. It is very common for a server admin to perform multiple queries on the same dataset. To avoid heavy load on our disk I/O system, I decided that an admin will have the choice to store repeatedly used data as a Parquet file in order to perform faster queries. So, our remove command just drops a saved Parquet file when its filename is passed as an option. Our CLI package will keep a list of all the user-saved data, so repeated filenames must not be provided in order to avoid any collisions.
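As a tiny sketch of that behavior (the folder name is an assumption of mine, and note that Spark actually writes a parquet "file" as a directory of part files):

    import os
    import shutil

    PARQUET_DIR = "parquets"   # hypothetical folder inside our package

    def remove_parquet(name):
        path = os.path.join(PARQUET_DIR, name + ".parquet")
        if not os.path.isdir(path):
            raise FileNotFoundError("no saved parquet named " + name)
        shutil.rmtree(path)    # drop the whole parquet directory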
In order to explain in more detail how a well-structured Python package could look, I sketched a hypothetical folder hierarchy for our CLI tool. Look at the layout below.
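Roughly, it could look like this (the exact names are assumptions of mine):

    quake3-cli/
    ├── setup.cfg
    ├── setup.py
    └── quake3/
        ├── __init__.py
        ├── cli.py
        ├── parquets/
        └── commands/
            ├── __init__.py
            ├── load.py
            ├── remove.py
            ├── summary.py
            └── rank.py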
As you can see, at the top level there are two setup files: one that tells the Python build tool how to build our program on all platforms, and one that installs our package on the system. Respectively, we are talking about setup.cfg and setup.py. Bear in mind that setup.py is responsible for installing the scripts necessary to run our commands on the shell. Every time a user types $ quake3 <command>, the installed script redirects the program to the main function in the quake3/cli.py file, which is responsible for parsing the user options and executing our subcommands.
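A minimal setup.py along these lines (package name, version and dependencies are placeholders) could declare that console script:

    from setuptools import setup, find_packages

    setup(
        name="quake3-cli",              # hypothetical package name
        version="0.1.0",
        packages=find_packages(),
        install_requires=["docopt", "pyspark"],
        entry_points={
            "console_scripts": [
                # installs a `quake3` executable that calls main() in quake3/cli.py
                "quake3 = quake3.cli:main",
            ]
        },
    )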
The commands folder contains all the code necessary to fulfill our essay requirements: each file that matches a subcommand name is responsible for performing that action. In addition, we have a few extra files which will be explained later. In other words, each of these files is a Python Spark API client that performs the desired actions for our use cases. We should not forget to mention the __init__.py files, which turn the quake3 and commands folders into packages, i.e. importable libraries used by each command handler. The parquets folder is where all the saved files will reside.
However, all the parsing logic still needs to be explained. At the beginning of the cli.py file we write our CLI usage rules, as I mentioned earlier: a big string describing all subcommands, including their options, help flags and descriptions. When cli.py's main function is called by our installed script, docopt parses the usage guide we wrote, which contains all the information needed to interpret what the user typed on the shell. For instance, when a user types $ quake3 remove "filename", docopt generates a dictionary containing all the commands (and options) typed. From this, cli.py infers which command file in the commands folder matches the name remove. Once remove.py is identified, its class is instantiated with the parsed options as an argument. Finally, cli.py calls its run method, which interacts with our Spark cluster to perform the desired processing.
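A rough sketch of that dispatch logic in cli.py (class and module names are assumptions of mine) might be:

    import importlib

    from docopt import docopt

    USAGE = __doc__   # the usage string shown earlier sits at the top of cli.py

    def main():
        args = docopt(USAGE)
        # docopt returns a dict like {"remove": True, "summary": False, "<filename>": "..."}
        for name in ("remove", "summary", "rank"):
            if args.get(name):
                module = importlib.import_module("quake3.commands." + name)
                command = module.Command(args)   # hypothetical class name
                command.run()                    # talks to our Spark cluster
                return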
Before we read more about my use-case strategies, let's discuss a little bit how we save a log file. As I mentioned, Parquet files are a fast way to perform recurrent queries on big datasets. A log file, which I assume to be a plain text document containing all the games' information, must be structured in order to be transformed into a Parquet file. Spark's Python API provides a series of abstractions whose main purpose is to hide the detailed cluster implementation. These abstractions are exposed as classes that we can instantiate and call actions on.
In this scenario, the most important one is the SparkContext class, which creates the main context for our operations. From this main context, we can read popular file formats and divide their contents into chunks that are distributed across the cluster to be processed. The top-level abstraction over a file/data structure is called a DataFrame, and Spark provides ways to convert a DataFrame object into a Parquet file. So, to perform all of this, we need to parse each game's raw text data into a Python dictionary, giving it a formal structure. Then we can convert our structured game records into a DataFrame object and, finally, into a Parquet file.
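A sketch of that loading step (the kill-line format, field names and paths below are assumptions on my part, and per-game grouping is left out for brevity) could look like this in PySpark:

    from pyspark.sql import Row, SparkSession

    spark = SparkSession.builder.appName("quake3").getOrCreate()

    def parse_kill(line):
        # assumed simplified format: "Kill: <killer> killed <victim> by <cause>"
        body = line.split("Kill:", 1)[1].strip()
        killer, rest = body.split(" killed ", 1)
        victim, cause = rest.rsplit(" by ", 1)
        return Row(killer=killer, victim=victim, cause=cause)

    def parse_log(spark, path):
        # the SparkContext splits the text file into chunks across the cluster
        lines = spark.sparkContext.textFile(path)
        kills = lines.filter(lambda l: "Kill:" in l).map(parse_kill)
        return spark.createDataFrame(kills)

    df = parse_log(spark, "hdfs:///quake3/logs/december_17.txt")
    df.write.mode("overwrite").parquet("parquets/december_17.parquet")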
From this, every command class (remove, summary and rank) will have access to the parquets folder in order to perform its tasks. So let's discuss an example: a user types something like $ quake3 summary -s “december_17”. Our summary.py code will detect that the user typed the save flag, denoted by “-s” (or “--save”). The summary object will then verify whether this file is already saved in the parquets folder. If not, it will instantiate load.py in order to transform the log file located on our HDFS server into Parquet format.
Bear in mind that the load object will look for december_17.txt at a pre-defined folder path in the HDFS storage system, something like “hdfs:///quake3/logs/december_17.txt”. Every subsequent query, such as $ quake3 rank -s “december_17”, will then be executed against the december_17.parquet file, which is many times faster to query than a plain file on disk. You may be wondering what happens if “-s” is missing: by default the object reads the file from our storage system without transforming it into Parquet format.
Sometimes a log file may change periodically, so an admin has the option to update a saved Parquet by typing $ quake3 summary -s -p “december_17”. In any case, if a user runs a query against a nonexistent file on our HDFS system, our CLI tool will throw an exception. Don't forget that the remove command deletes a file from the parquets folder based on the supplied filename. After this whole introduction, we can read more about my use-case approaches.
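A simplified version of what load.py could decide (paths and helper names are assumptions; parse_log is the parsing sketch shown earlier):

    import os

    PARQUET_DIR = "parquets"                 # hypothetical local folder
    HDFS_LOG_DIR = "hdfs:///quake3/logs"     # pre-defined HDFS path from above

    def load_games(spark, name, save=False, update=False):
        parquet_path = os.path.join(PARQUET_DIR, name + ".parquet")
        if save and os.path.isdir(parquet_path) and not update:
            return spark.read.parquet(parquet_path)      # reuse the saved parquet
        df = parse_log(spark, HDFS_LOG_DIR + "/" + name + ".txt")
        if save:
            df.write.mode("overwrite").parquet(parquet_path)
        return df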
As mentioned earlier, a game is described by its start and end events, as well as by the players' killings. A full log file may contain a large amount of text describing all the games that occurred during some period. In these kill lines, a duplicated name (the killer and the killed being the same player) means that the player died by himself. Since a user may have typed a previous query using the “--save” flag, we may already have the log pre-loaded as a Parquet file. The summary.py file represents a Spark client that runs queries against this structured file.
Because a Parquet file is a column-structured dataset, it can be queried through the Spark SQL context or through a DataFrame; both provide optimized ways to scan big data. So, the first step is to identify all the players in a game. While our code iterates over a game's records, at every kill occurrence our algorithm stores the killer and the killed player in an auxiliary dataframe and updates their scores (+1 for killing someone and -1 for committing suicide). After one game's iteration, we will have all players' scores and names, as well as the total number of kills; for this last count we can leave out all the suicide occurrences. Last but not least, after our algorithm has looped through the games, it displays their respective outputs.
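Ignoring the per-game grouping for brevity, a DataFrame sketch of that scoring (assuming the killer/victim columns from the parsing sketch, and tracking only players that appear in kill lines) could be:

    from pyspark.sql import functions as F

    def summarize(df):
        # a suicide is a kill line where the killer and the victim are the same player
        is_suicide = F.col("killer") == F.col("victim")
        total_kills = df.filter(~is_suicide).count()   # suicides left out of the total
        scores = (
            df.withColumn("points", F.when(is_suicide, -1).otherwise(1))
              .groupBy("killer").agg(F.sum("points").alias("score"))
        )
        return total_kills, scores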
Examples of CLI input:
$ quake3 summary “august_2017”
$ quake3 summary -s “august_2017”
$ quake3 summary -s -p “august_2017”
To perform this task, we need to iterate over the games. First, we can create an auxiliary dataframe that stores all the players identified at every kill occurrence, along with their kill counts. In other words, when a kill appears, we first check whether the killer already exists in our auxiliary dataframe: if so, we just count one more kill, otherwise we add the player to our dataframe with his first kill. At the end, our Spark rank.py client displays a list of all the players in descending order.
I didn't mention it here, but this task will surely use functional operations like map, forEach or maybe reduce. Keep in mind that Spark performs those transformations lazily; that is, all the dataframes mentioned here will only be materialized after an action method such as count, show or reduce is called. Like our previous example, this CLI command can run against a saved Parquet. For instance, a user may have previously typed a command like $ quake3 summary -s “august_17”. In that case, all the ranking operations that we discussed will run faster than our previous command when we deal with “august_17” again. Remember that this will only happen if we type $ quake3 rank -s “august_17” (pay attention to the “-s” flag).
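Putting that into DataFrame terms, a sketch of the descending rank (reusing the assumed killer/victim columns from the parsing sketch) could be:

    from pyspark.sql import functions as F

    def rank_players(df):
        # count kills per player, ignore self-kills, and sort in descending order
        return (
            df.filter(F.col("killer") != F.col("victim"))
              .groupBy("killer").agg(F.count("*").alias("kills"))
              .orderBy(F.col("kills").desc())
        )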
Examples of CLI input:
$ quake3 rank “august_2017”
$ quake3 rank -s “august_2017”
$ quake3 rank -s -p “august_2017”
In this case I will not repeat the whole explanation about ranking the players, since it was already covered. This time we will perform some operations that I haven't introduced yet. To display our ranking information on a web page, we need to persist our data so that another engine can process this information and generate user-friendly template views. We can do this by storing the ranking data in a database. There are tons of options across the industry, but MongoDB seems to outperform many other NoSQL database engines, and it is commonly used by architects at large companies when they deal with big data and seek scalability.
So, in order to perform this task we have to connect our rank.py Spark client to our MongoDB engine. Now that we are persisting our data, we need a way to deal with HTML pages, routes, port numbers and CSS (remember our user-friendly requirement). There are popular frameworks, proven to be scalable, that provide secure and stable ways to build web applications. We could use Ruby on Rails, Express.js or Django, which offers a Python interface; since we are “coding” from a Python perspective, I decided to use the latter. Django provides a lot of middleware that allows us to connect our web application to third-party engines such as MongoDB.
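One hedged way for rank.py to hand the ranking over would be the plain pymongo client (the connection string, database and collection names below are assumptions of mine):

    from pymongo import MongoClient

    def persist_rank(rank_df, log_name, uri="mongodb://localhost:27017"):
        collection = MongoClient(uri)["quake3"]["ranks"]   # hypothetical db/collection
        # the ranking is small after aggregation, so collecting it to the driver is fine
        docs = [{**row.asDict(), "log": log_name} for row in rank_df.collect()]
        collection.delete_many({"log": log_name})          # replace any previous ranking
        if docs:
            collection.insert_many(docs)

There is also an official Spark–MongoDB connector that can write DataFrames directly, which would avoid collecting to the driver if the results grew larger.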
Django is built to give users all the benefits of the MVC architecture (Model-View-Controller). In a quick overview, the Model component is responsible for modeling all the data resources that will be available in our database; in other words, Django provides a Pythonic way to build a model that represents a player with his respective information, such as his name, number of games played, number of kills and so on. Controller components are responsible for building all the routes a web application may provide. Finally, View components offer ways to deliver template views (i.e. web pages) in response to a user request. So, in order to create user-friendly views, we can connect our Django application to Twitter Bootstrap's CSS tools, which are widely used in industry; we can achieve this by installing a Bootstrap Django middleware.
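In Django terms, the player model could be sketched like this (field names are assumptions of mine, and wiring Django's ORM to MongoDB would additionally require a connector such as djongo):

    from django.db import models

    class PlayerRank(models.Model):
        # hypothetical model: one row per player in a saved ranking
        log_name = models.CharField(max_length=100)   # e.g. "august_17"
        player = models.CharField(max_length=100)
        kills = models.IntegerField(default=0)
        games_played = models.IntegerField(default=0)

        class Meta:
            ordering = ["-kills"]   # ranked views come out in descending order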
An admin will have the same options that our previous commands provide, such as “-s” or “-p”. But this time, since we are dealing with another data engine, we must provide tools so a user does not perform repeated I/O operations. When a user types “-w”, besides running the rank command, rank.py saves all the ranking data into our MongoDB instance, and then a web page can be served at a route like “//localhost:<port_number>/quake3/rank/august_17.html”. In other words, our rank.py client contacts a Django application, starting its web server at localhost on the given port number. An admin also has the option of not rebuilding the whole database every time a rank command is issued; this can be done using the “-ws” flag.
Recapitulating, our web rank command works as follows: an admin types $ quake3 rank -s -w <port_number> “august_17”, so our CLI package calls its rank.py Spark client and processes the “august_17” Parquet file (remember that if this Parquet does not exist our program retrieves the original file) to obtain the ranking of players. Then, our client code stores this data in a pre-defined database, which will be MongoDB. After that, our program contacts a Django application to display the log's rank results at the given port number. Our Django application will be previously configured to retrieve data from our MongoDB instance.
Examples of CLI input:
$ quake3 rank -w 3000 “august_2017”
$ quake3 rank -s -w 3000 “august_2017”
$ quake3 rank -s -p -ws 4200 “august_2017”
You may be wondering why I decided to solve this essay from a Big Data perspective. I could have written a solution simply assuming that we would deal with small amounts of data, but I believed it would be a good challenge to think outside my comfort zone. In most cases we don't even worry about whether our solution is scalable. In an industry that is heavily data-driven, students and engineers must be challenged to create solutions for scenarios that provide ever more data and complexity.
We could use more advanced technologies such as Amazon S3 to store our log files. For this essay, I believe HDFS is enough to show that my intention was to scale our data, as well as the tools to process it.
Perhaps there are more trivial ways to display our information on a web page. But, looking forward and thinking as a server admin, it would be necessary to create a robust web application that could display more data and provide more services. Creating an application with a framework like Django gives us all the tools to build stable and scalable solutions.
I know that we can cache repeated data on the fly using PySpark. However, I don't know how to integrate different jobs in order to manipulate this cached data, so I decided to use Parquet, since it is very efficient.
I assumed that our log file would arrive perfectly filled: no blank values or missing lines would exist in the dataset. However, if any of these cases happened, consider that our code would not add extra content.
If you have any questions, suggestions or advice, feel free to contact me. At this very beginning of my professional career, it is very important for me to learn and break new things.
Thank you.
