Aditya Soni (AdityaSoni19031997)
@Dissimilis
Dissimilis / CircularBufferPeriodCounter.cs
Last active January 18, 2024 11:52
Efficient circular buffer counter for calculating events per unit of time
/*
Implementation of a circular buffer counter for counting events per unit of time.
It is thread safe, with high increment/add performance and good enough avg/count performance.
Usage is as simple as:
var counter = new CircularBufferCounter(TimeSpan.FromSeconds(1), TimeSpan.FromHours(2));
counter.Increment();
var avgPerMinuteLastHour = counter.Avg(TimeSpan.FromMinutes(1), TimeSpan.FromHours(1));
*/
@karpathy
karpathy / min-char-rnn.py
Last active July 16, 2025 02:33
Minimal character-level language model with a Vanilla Recurrent Neural Network, in Python/numpy
"""
Minimal character-level Vanilla RNN model. Written by Andrej Karpathy (@karpathy)
BSD License
"""
import numpy as np
# data I/O
data = open('input.txt', 'r').read() # should be simple plain text file
chars = list(set(data))
data_size, vocab_size = len(data), len(chars)
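The preview ends here; the full gist continues by building the character/index lookup tables used to encode and decode text, along the lines of:

char_to_ix = { ch:i for i,ch in enumerate(chars) }  # character -> integer index
ix_to_char = { i:ch for i,ch in enumerate(chars) }  # integer index -> character, for sampling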
@vasanthk
vasanthk / System Design.md
Last active July 18, 2025 05:13
System Design Cheatsheet

System Design Cheatsheet

Picking the right architecture = Picking the right battles + Managing trade-offs

Basic Steps

  1. Clarify and agree on the scope of the system
  • Use cases (descriptions of sequences of events that, taken together, lead to a system doing something useful)
    • Who is going to use it?
    • How are they going to use it?
@mikkelam
mikkelam / sklearn_vw.py
Created March 30, 2016 16:34
Extends Vowpal Wabbit's sklearn interface to support multiclass classification via the ECT (error-correcting tournament) reduction
# -*- coding: utf-8 -*-
# pylint: disable=line-too-long, unused-argument, invalid-name, too-many-arguments, too-many-locals
"""
Utilities to support integration of Vowpal Wabbit and scikit-learn
"""
import numpy as np
import sklearn
from pyvw import vw
import re
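Only the imports survive in the preview. For orientation, a minimal sketch of what driving VW's ECT multiclass reduction through pyvw can look like (the labels and features below are illustrative, not taken from the gist):

model = vw(ect=3, quiet=True)  # --ect 3: three-class error-correcting tournament
model.learn('1 | height:0.2 weight:0.9')  # multiclass labels are 1-based integers
model.learn('2 | height:0.9 weight:0.1')
model.learn('3 | height:0.5 weight:0.5')
print(model.predict('| height:0.6 weight:0.4'))  # predicted class label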
@nlehrer
nlehrer / best_path.py
Created May 16, 2016 23:46
Best path through a grid of point values
# Nathan Lehrer
def get_best_path(grid):
    # Finds the best path through an M x N grid of point values, and that path's score
    # Input: grid = grid of point values = M x N list of lists
    # Returns: best_score = best possible score = int, path = best possible path = string
    M, N = len(grid), len(grid[0])
    scores = {(0, 0): grid[0][0]}  # best score for a path to each cell; score of (0,0) is its grid value
    trace = {}  # whether we optimally come from up ('U') or left ('L') into each cell
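The preview cuts the function off here. A minimal sketch of how the dynamic program could be completed, assuming the intended recurrence is "each cell's best score is its value plus the better of the scores above and to the left", with a backtrack through trace to recover the path:

    for i in range(M):
        for j in range(N):
            if (i, j) == (0, 0):
                continue
            # best predecessor: from above ('U') or from the left ('L')
            up = scores[(i - 1, j)] if i > 0 else float('-inf')
            left = scores[(i, j - 1)] if j > 0 else float('-inf')
            best, trace[(i, j)] = max((up, 'U'), (left, 'L'))
            scores[(i, j)] = best + grid[i][j]
    # walk back from the bottom-right corner to reconstruct the move string
    path, (i, j) = '', (M - 1, N - 1)
    while (i, j) != (0, 0):
        path = trace[(i, j)] + path
        i, j = (i - 1, j) if trace[(i, j)] == 'U' else (i, j - 1)
    return scores[(M - 1, N - 1)], path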
@jakevdp
jakevdp / PythonCpp.ipynb
Last active January 8, 2020 03:44
Hack to make Python look like C++
@nvictus
nvictus / runlength.py
Last active March 25, 2025 20:47
NumPy run-length encoding / decoding
"""Run Length Encoding utilities for NumPy arrays.
Authors
-------
- Nezar Abdennur
- Anton Goloborodko
"""
from __future__ import division, print_function
import numpy as np
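The module body is cut off in the preview. The core of such a utility is a vectorized run-length encoder; a minimal sketch of the idea (not necessarily the gist's exact implementation) is:

def rlencode(x):
    """Run-length encode a 1-D array; returns (run starts, run lengths, run values)."""
    x = np.asarray(x)
    n = len(x)
    if n == 0:
        empty = np.array([], dtype=int)
        return empty, empty, x
    # each position where the value changes starts a new run
    starts = np.r_[0, np.flatnonzero(x[1:] != x[:-1]) + 1]
    lengths = np.diff(np.r_[starts, n])
    return starts, lengths, x[starts]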
@dmyersturnbull
dmyersturnbull / groupyby_parallel.py
Last active February 6, 2024 00:43
Performs a Pandas groupby operation in parallel
import pandas as pd
import itertools
import time
import multiprocessing
from typing import Callable, Tuple, Union
def groupby_parallel(
groupby_df: pd.core.groupby.DataFrameGroupBy,
func: Callable[[Tuple[str, pd.DataFrame]], Union[pd.DataFrame, pd.Series]],
num_cpus: int = multiprocessing.cpu_count() - 1,
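The signature is truncated by the preview. A self-contained sketch of the same idea, my reconstruction rather than the gist's exact body, is to fan the (name, group) pairs out over a multiprocessing.Pool and concatenate the results:

def groupby_parallel_sketch(groupby_df, func, num_cpus=max(1, multiprocessing.cpu_count() - 1)):
    # func must be a module-level function so multiprocessing can pickle it
    with multiprocessing.Pool(num_cpus) as pool:
        results = pool.map(func, [(name, group) for name, group in groupby_df])
    return pd.concat(results)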
@smartnose
smartnose / spark-internals-through-code.md
Last active October 29, 2024 06:03
Spark internal notes

Spark internals through code

Nothing gives you more detail about Spark internals than actually reading its source code. In addition, you get to learn many design techniques and improve your Scala coding skills. These are the random notes I make while reading the Spark code. The best way to comprehend the notes is to load the Spark code into an IDE, e.g. IntelliJ, and navigate the code alongside.

Genesis - creation of a spark cluster

The scripts for creating a Spark cluster are start-master.sh and start-slave.sh. Read them carefully and you can see that both scripts are very similar except for the value of the $CLASS variable. For start-master.sh, the value is CLASS="org.apache.spark.deploy.master.Master", while the value for start-slave.sh is shown below with more context.

# NOTE: This exact class name is matched downstream by SparkSubmit.
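The assignment the comment refers to is elided by the preview; for reference, in Spark's sbin/start-slave.sh the line that follows sets the worker class:

CLASS="org.apache.spark.deploy.worker.Worker"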
@akiross
akiross / Convolutional Arithmetic.ipynb
Last active October 24, 2024 07:04
A few experiments on how convolution and transposed convolution (deconvolution) should work in TensorFlow.