Dapangmao dapangmao

Chapter 2. Downloading and Getting Started

In this chapter we will walk through the process of downloading and running Spark in local mode on a single computer. This chapter was written for anybody who is new to Spark, including both data scientists and engineers.

Spark can be used from Python, Java or Scala. To benefit from this book, you don’t need to be an expert programmer, but we do assume that you are comfortable with the basic syntax of at least one of these languages. We will include examples in all languages wherever possible.

Spark itself is written in Scala and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 (or newer). If you wish to use the Python API, you will also need a Python interpreter (version 2.6 or newer). Spark does not yet work with Python 3.

Chapter 1. Introduction to Data Analysis with Spark

This chapter provides a high-level overview of what Apache Spark is. If you are already familiar with Apache Spark and its components, feel free to jump ahead to Chapter 2.

What is Apache Spark?

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets as it means the difference between exploring data interactively and waiting minutes between queries, or waiting hours to run your program versus minutes. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also faster than MapReduce for complex applications running on disk.

On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, i

@dapangmao
dapangmao / q.md
Last active August 29, 2015 14:08
Data scientist / Machine Learning Engineer interview questions (reposted)
  1. You are given a coin and don’t know whether it is fair. You throw it 6 times and get 1 tail and 5 heads. Determine whether it is fair or not. What is your confidence value?

  2. Given Amazon data, how would you predict which users are going to be top shoppers this holiday season?

  3. Which regression methods are you familiar with? How do you evaluate a regression result?
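
For question 1, one possible approach (an assumption, since the list gives no answers) is an exact two-sided binomial test under the hypothesis that the coin is fair. The helper name `binomial_two_sided_p` is made up for this sketch.

```python
from math import comb


def binomial_two_sided_p(n, k, p=0.5):
    """Exact two-sided binomial p-value.

    Sums the probability of every outcome that is no more likely
    than the observed count k under success probability p.
    """
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = pmf[k]
    # The small tolerance guards against floating-point ties.
    return sum(q for q in pmf if q <= observed * (1 + 1e-9))


# 6 throws, 5 heads (equivalently 1 tail)
p_value = binomial_two_sided_p(n=6, k=5)
print(p_value)  # 0.21875
```

With a p-value of about 0.22, six throws are not enough to reject fairness at the usual 5% level; this says the data are inconclusive, not that the coin is proven fair.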

dapangmao / Squeeze complexity by Spark.md
Last active November 14, 2017 10:59
Minimize complexity by Spark.md

There is always a trade-off between time complexity and space complexity in computer programs. Decreasing the time cost usually increases the space cost, and vice versa. The ideal solution is to parallelize the program across multiple cores if a multi-core computer is available, or even to scale it out to multiple machines across a cluster, which can reduce both the running time and the memory needed on each machine.

Spark is currently the hottest platform for cluster computing on top of Hadoop. Its Python interface provides map, reduce and many other methods, which make it straightforward to express a MapReduce job and therefore easy to migrate an algorithm from a single machine to a cluster of many machines.

  • Minimize space complexity

One question asks for the only unpaired number in an array where every other number appears twice.

Single Number
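
A minimal sketch of a constant-space approach to this problem, assuming the usual XOR trick (the preview does not show the author's own solution). It uses `functools.reduce` so it runs without a cluster. Because XOR is associative and commutative, the same operator could in principle be handed to a distributed reduce such as PySpark's `RDD.reduce`, which may combine partial results in any order.

```python
from functools import reduce
from operator import xor


def single_number(nums):
    """Return the element that appears once when all others appear twice."""
    # x ^ x == 0 and x ^ 0 == x, so every paired value cancels out
    # and only the unpaired value survives the reduction.
    return reduce(xor, nums, 0)


print(single_number([4, 1, 2, 1, 2]))  # 4
```

Only one running integer is kept, so the extra space is O(1) no matter how long the array is.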

dapangmao / post.md
Last active August 29, 2015 14:07
Automated testing by pytest

The hardest part of testing is writing test cases, which is time-consuming and error-prone. Fortunately, besides Python built-in modules such as doctest and unittest, there are quite a few third-party packages that can help with automated testing. My favorite is pytest, which has a proven track record and offers convenient syntactic sugar.

### Step 1: test-driven development

For example, there is a coding challenge on Leetcode:

Find Minimum in Rotated Sorted Array

Suppose a sorted array is rotated at some pivot unknown to you beforehand. (i.e., 0 1 2 4 5 6 7 might become 4 5 6 7 0 1 2). Find the minimum element.
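
As an illustration of the test-first workflow, here is a sketch of one possible solution together with pytest-style tests. The function name `find_min` and the test names are assumptions for this example; it also assumes the array is non-empty and has no duplicates.

```python
def find_min(nums):
    """Return the minimum of a rotated ascending array without duplicates."""
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if nums[mid] > nums[hi]:
            lo = mid + 1  # the minimum lies strictly to the right of mid
        else:
            hi = mid  # the minimum is at mid or to its left
    return nums[lo]


# pytest collects functions named test_*; plain assert statements suffice.
def test_rotated():
    assert find_min([4, 5, 6, 7, 0, 1, 2]) == 0


def test_not_rotated():
    assert find_min([0, 1, 2, 4, 5, 6, 7]) == 0


def test_single_element():
    assert find_min([3]) == 3
```

Saved to a file whose name starts with `test_`, these tests can be run with the `pytest` command; writing them before the function body is the test-driven part.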

http://www.sorting-algorithms.com/

#-------------------------------------------------------------------------------
# Name:        Methods of sorting
# Purpose:     implements the sortings mentioned by Robert Sedgewick and
#               Kevin Wayne, Algorithms 4ed
#
#-------------------------------------------------------------------------------

def selection_sort(a):
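
The preview stops at the function signature. For orientation only, a selection sort could look like the sketch below; this is not the author's implementation, and the name `selection_sort_sketch` is hypothetical.

```python
def selection_sort_sketch(a):
    """Return a sorted copy of a using selection sort (O(n^2) comparisons)."""
    a = list(a)  # work on a copy so the caller's list is untouched
    for i in range(len(a) - 1):
        # Find the index of the smallest element in the unsorted tail a[i:].
        smallest = min(range(i, len(a)), key=a.__getitem__)
        # Move it to the front of the unsorted tail.
        a[i], a[smallest] = a[smallest], a[i]
    return a


print(selection_sort_sketch([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```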
dapangmao / _2question.md
Last active August 29, 2015 14:07
Bit manipulation
  • Digits save space
#-------------------------------------------------------------------------------
# Name:        Single Number
# Purpose:
#
#        Given an array of integers, every element appears twice except for one.
#        Find that single one.
#
#        Note:
  • Use yield
#-------------------------------------------------------------------------------
# Name:        Generate Parentheses
# Purpose:
#
#            Given n pairs of parentheses, write a function to
#            generate all combinations of well-formed parentheses.
#
#            For example, given n = 3, a solution set is:
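
The preview ends before the solution set is listed. To illustrate the "use yield" idea, here is a sketch of a recursive generator, assuming a simple backtracking strategy; the helper name `build` is made up for this example.

```python
def generate_parentheses(n):
    """Lazily yield every well-formed string of n parenthesis pairs."""

    def build(prefix, opened, closed):
        if len(prefix) == 2 * n:
            yield prefix
            return
        if opened < n:
            # An opening bracket is allowed while any remain unused.
            yield from build(prefix + "(", opened + 1, closed)
        if closed < opened:
            # A closing bracket is allowed only if it matches an open one.
            yield from build(prefix + ")", opened, closed + 1)

    yield from build("", 0, 0)


for combo in generate_parentheses(3):
    print(combo)
```

Because the function is a generator, combinations are produced one at a time instead of being collected in a list first, which keeps memory use low for larger n.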
  • Use iterations
#-------------------------------------------------------------------------------
# Name: Pascal's Triangle II
# Purpose:
#
# Given an index k, return the kth row of the Pascal's triangle.
#
# For example, given k = 3,
# Return [1,3,3,1].
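
A minimal iterative sketch for this problem, assuming rows are indexed from 0 as in the example above (k = 3 gives [1,3,3,1]). The name `pascal_row` is an assumption for illustration.

```python
def pascal_row(k):
    """Return the kth row (0-indexed) of Pascal's triangle."""
    row = [1]
    for _ in range(k):
        # Each inner entry is the sum of two neighbours in the previous row.
        row = [1] + [row[i] + row[i + 1] for i in range(len(row) - 1)] + [1]
    return row


print(pascal_row(3))  # [1, 3, 3, 1]
```

Only the previous row is kept at any time, so the extra space is O(k) rather than the O(k^2) needed to store the whole triangle.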
  • Two iterations: forward and backward
#-------------------------------------------------------------------------------
# Name:        Trapping Rain Water
# Purpose:
#
# Given n non-negative integers representing an elevation map where the
# width of each bar is 1, compute how much water it is
# able to trap after raining.
# For example,
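
The preview is cut off before the example. The sketch below shows one way the "forward and backward" idea could be applied: a forward pass records the tallest bar seen so far from the left, a backward pass does the same from the right, and the water above each bar is bounded by the lower of the two. The function name `trap` and the sample input are assumptions, not taken from the source.

```python
def trap(heights):
    """Return the units of water trapped between bars of width 1."""
    n = len(heights)
    if n == 0:
        return 0

    left_max = [0] * n
    right_max = [0] * n

    running = 0
    for i in range(n):  # forward pass: tallest bar at or left of i
        running = max(running, heights[i])
        left_max[i] = running

    running = 0
    for i in range(n - 1, -1, -1):  # backward pass: tallest bar at or right of i
        running = max(running, heights[i])
        right_max[i] = running

    # Water above bar i reaches the lower of its two surrounding walls.
    return sum(min(l, r) - h for l, r, h in zip(left_max, right_max, heights))


print(trap([3, 0, 2]))  # 2
```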