@dapangmao
dapangmao / _2question.md
Last active August 29, 2015 14:07
Bit manipulation
  • Digits save space
#-------------------------------------------------------------------------------
# Name:        Single Number
# Purpose:
#
#        Given an array of integers, every element appears twice except for one.
#        Find that single one.
#
#        Note:
#        Your algorithm should have a linear runtime complexity.
#        Could you implement it without using extra memory?

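# A minimal sketch of the usual XOR approach to this problem (an illustration,
# not necessarily the gist's original solution): since a ^ a == 0 and
# a ^ 0 == a, XOR-ing every element leaves only the unpaired number,
# in O(n) time and O(1) extra space.
from functools import reduce

def single_number(nums):
    return reduce(lambda x, y: x ^ y, nums, 0)

assert single_number([4, 1, 2, 1, 2]) == 4
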
http://www.sorting-algorithms.com/

#-------------------------------------------------------------------------------
# Name:        Methods of sorting
# Purpose:     implements the sorting algorithms described by Robert Sedgewick
#               and Kevin Wayne in Algorithms, 4th ed.
#
#-------------------------------------------------------------------------------

def selection_sort(a):
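    # (The gist preview cuts off here; the body below is a hypothetical
    # completion of textbook selection sort, not the gist's original code.)
    # Repeatedly select the smallest remaining element and swap it into place.
    n = len(a)
    for i in range(n):
        min_idx = i
        for j in range(i + 1, n):
            if a[j] < a[min_idx]:
                min_idx = j
        a[i], a[min_idx] = a[min_idx], a[i]
    return a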
@dapangmao
dapangmao / post.md
Last active August 29, 2015 14:07
Automated testing by pytest

The hardest part of testing is writing test cases, which is time-consuming and error-prone. Fortunately, besides Python's built-in modules such as doctest and unittest, there are quite a few third-party packages that help with automated testing. My favorite is pytest, which has a proven track record and pleasant syntactic sugar.

### Step 1: test-driven development

For example, there is a coding challenge on Leetcode:

Find Minimum in Rotated Sorted Array

Suppose a sorted array is rotated at some pivot unknown to you beforehand (i.e., 0 1 2 4 5 6 7 might become 4 5 6 7 0 1 2). Find the minimum element.
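In the test-driven spirit, a minimal sketch of the pytest side could look like the following; the `find_min` binary-search body and the file name are only an illustration, not necessarily the post's original solution:

```python
# test_find_min.py -- run with: py.test test_find_min.py
import pytest

def find_min(nums):
    # Binary search: the minimum always lies in the unsorted half of the rotation.
    lo, hi = 0, len(nums) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if nums[mid] > nums[hi]:
            lo = mid + 1
        else:
            hi = mid
    return nums[lo]

@pytest.mark.parametrize('array, expected', [
    ([4, 5, 6, 7, 0, 1, 2], 0),
    ([1], 1),
    ([2, 1], 1),
])
def test_find_min(array, expected):
    assert find_min(array) == expected
```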

@dapangmao
dapangmao / Squeeze complexity by Spark.md
Last active November 14, 2017 10:59
Minimize complexity by Spark.md

There is always a trade-off between time complexity and space complexity in computer programs: decreasing the time cost increases the space cost, and vice versa. The ideal solution is to parallelize the program across multiple cores if a multi-core computer is available, or even to scale it out to multiple machines across a cluster, which can eventually reduce both time cost and space cost.

Spark is currently the hottest platform for cluster computing on top of Hadoop, and its Python interface provides map, reduce and many other methods, which allow a MapReduce job to be written in a straightforward way and therefore make it easy to migrate an algorithm from a single machine to a cluster of many machines.

  • Minimize space complexity

There is a classic question that looks for the single unpaired number in an array where every other number appears in pairs.

Single Number
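A minimal PySpark sketch of this idea (the input list is illustrative; it assumes a local SparkContext):

```python
from operator import xor
from pyspark import SparkContext

sc = SparkContext('local', 'single-number')

# Each partition holds only a slice of the array, so no single worker needs
# the whole input in memory; paired values cancel out under XOR.
nums = sc.parallelize([4, 1, 2, 1, 2, 7, 7])
print(nums.reduce(xor))  # -> 4
sc.stop()
```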

@dapangmao
dapangmao / q.md
Last active August 29, 2015 14:08
Data scientist / Machine Learning Engineer interview questions (repost)
  1. Given a coin, you don't know whether it's fair or unfair. Throw it 6 times and get 1 tail and 5 heads. Determine whether it's fair or not. What's your confidence value? (A quick calculation for this one is sketched after the list.)

  2. Given Amazon data, how would you predict which users are going to be top shoppers this holiday season?

  3. Which regression methods are you familiar with? How do you evaluate regression results?
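For question 1, one common framing is a two-sided binomial test of the fair-coin hypothesis; a minimal sketch (this framing is my own, not part of the reposted questions):

```python
from math import comb  # Python 3.8+

# H0: the coin is fair (p = 0.5). Observed: 5 heads out of 6 throws.
n = 6
# Two-sided p-value: probability under H0 of a result at least this extreme
# (5 or 6 heads, or symmetrically 0 or 1 heads).
p_value = sum(comb(n, i) for i in (0, 1, 5, 6)) / 2 ** n
print(p_value)  # 14/64 ~= 0.22, so at the usual 5% level we cannot reject fairness
```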

Chapter 1. Introduction to Data Analysis with Spark

This chapter provides a high level overview of what Apache Spark is. If you are already familiar with Apache Spark and its components, feel free to jump ahead to Chapter 2.

What is Apache Spark?

Apache Spark is a cluster computing platform designed to be fast and general-purpose.

On the speed side, Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. Speed is important in processing large datasets as it means the difference between exploring data interactively and waiting minutes between queries, or waiting hours to run your program versus minutes. One of the main features Spark offers for speed is the ability to run computations in memory, but the system is also faster than MapReduce for complex applications running on disk.

On the generality side, Spark is designed to cover a wide range of workloads that previously required separate distributed systems, including batch applications, iterative algorithms, interactive queries, and streaming.

Chapter 2. Downloading and Getting Started

In this chapter we will walk through the process of downloading and running Spark in local mode on a single computer. This chapter was written for anybody that is new to Spark, including both Data Scientists and Engineers.

Spark can be used from Python, Java or Scala. To benefit from this book, you don’t need to be an expert programmer, but we do assume that you are comfortable with the basic syntax of at least one of these languages. We will include examples in all languages wherever possible.

Spark itself is written in Scala, and runs on the Java Virtual Machine (JVM). To run Spark on either your laptop or a cluster, all you need is an installation of Java 6 (or newer). If you wish to use the Python API you will also need a Python interpreter (version 2.6 or newer). Spark does not yet work with Python 3.
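As a rough sketch of what such a first session looks like (assuming Spark has been downloaded and unpacked, and using the pyspark shell, where the SparkContext is already available as sc):

```python
# Launched from the Spark directory with: bin/pyspark
lines = sc.textFile('README.md')  # create an RDD from a local text file
print(lines.count())              # action: number of lines in the file
print(lines.first())              # action: the first line of the file
```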

Chapter 3. Programming with RDDs

This chapter introduces Spark’s core abstraction for working with data, the Resilient Distributed Dataset (RDD). An RDD is simply a distributed collection of elements. In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.

Both Data Scientists and Engineers should read this chapter, as RDDs are the core concept in Spark. We highly recommend that you try some of these examples in an interactive shell (see Introduction to Spark’s Python and Scala Shells). In addition, all code in this chapter is available in the book’s GitHub repository.
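A minimal sketch of that create / transform / act cycle (the log file name is illustrative):

```python
# In the pyspark shell (or any program with a SparkContext named sc):
lines = sc.textFile('log.txt')                   # create an RDD from a file
errors = lines.filter(lambda l: 'ERROR' in l)    # transformation: lazily evaluated
print(errors.count())                            # action: triggers the computation
print(errors.take(3))                            # action: fetch a few matching lines
```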

Chapter 4. Working with Key-Value Pairs

This chapter covers how to work with RDDs of key-value pairs, which are a common data type required for many operations in Spark. Key-value RDDs expose new operations such as aggregating data items by key (e.g., counting up reviews for each product), grouping together data with the same key, and grouping together two different RDDs. Oftentimes, to work with data records in Spark, you will need to turn them into key-value pairs and apply one of these operations.

We also discuss an advanced feature that lets users control the layout of pair RDDs across nodes: partitioning. Using controllable partitioning, applications can sometimes greatly reduce communication costs, by ensuring that data that will be accessed together will be on the same node. This can provide significant speedups. We illustrate partitioning using the PageRank algorithm as an example. Choosing the right partitioning for a distributed dataset is similar to choosing the right data structure for a local one; in both cases, data layout can greatly affect performance.
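A minimal sketch of the review-counting style of aggregation mentioned above (the data is made up for illustration; it assumes a live SparkContext named sc):

```python
# Count reviews per product with a key-value (pair) RDD.
reviews = sc.parallelize([('book', 1), ('toy', 1), ('book', 1), ('game', 1)])
counts = reviews.reduceByKey(lambda a, b: a + b)  # aggregate values by key
print(counts.collect())  # e.g. [('toy', 1), ('book', 2), ('game', 1)]
```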