Sample.txt
Requirements:
1. Separate valid SSNs from invalid SSNs.
2. Count the number of valid SSNs.
402-94-7709
283-90-3049
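A minimal sketch of one way to do this in Python, assuming Sample.txt contains one SSN per line and that "valid" means the usual AAA-GG-SSSS rules (area not 000, 666, or 900-999; group not 00; serial not 0000); the file name and the validity rules are assumptions, so adjust them to whatever the exercise intends.

```python
import re

# Assumed validity rules: area not 000/666/9xx, group not 00, serial not 0000.
SSN_RE = re.compile(r"^(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}$")

def split_ssns(lines):
    valid, invalid = [], []
    for line in lines:
        ssn = line.strip()
        if not ssn:
            continue
        (valid if SSN_RE.match(ssn) else invalid).append(ssn)
    return valid, invalid

with open("Sample.txt") as f:          # assumes one SSN per line
    valid, invalid = split_ssns(f)

print("valid:", valid)
print("invalid:", invalid)
print("number of valid SSNs:", len(valid))
```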
Given the friend pairs in the sample text below (each line contains two people who are friends), find the stranger who shares the most friends with me, where a stranger is anyone who is not already my friend.
sample.txt
me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John
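A minimal sketch in plain Python, assuming sample.txt holds one space-separated pair per line, "me" is the literal token for myself, and ties are broken by whichever candidate is seen first.

```python
from collections import defaultdict

# Build an undirected friendship graph from space-separated pairs.
friends = defaultdict(set)
with open("sample.txt") as f:          # assumed file name
    for line in f:
        a, b = line.split()
        friends[a].add(b)
        friends[b].add(a)

me = "me"                              # assumed label for myself
best, best_count = None, -1
for person in friends:
    if person == me or person in friends[me]:
        continue                       # skip myself and people who are already my friends
    shared = len(friends[person] & friends[me])
    if shared > best_count:
        best, best_count = person, shared

print(best, best_count)                # for the sample data: John with 2 shared friends (Alice, Jane)
```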
Sorry for the delay. I originally wanted to write something systematic, but once I started I realized my level falls far short of that and I can't really handle it, and I don't have the time or energy right now either. So I'll probably just jot down some scattered impressions. Please read them casually and don't expect too much; apologies in advance. There are many very capable people on this board, and the real experts probably have no time to post at all, so I'm just showing my modest skills in front of the masters, in the spirit of giving back to the community.
Over the past few years I've changed jobs several times, and the information on the job board helped me a great deal each time, so I hope to give a little of what I've learned back here. Everything below is just my own shallow view; it may well be wrong or not fit other people's actual situations. Take it purely as a reference.
From this experience, the points below left the deepest impression on me, and I'd like to recommend them as lessons learned.
# Mining Massive Datasets
## Week 1
### Distributed File Systems
Node failures
A single server can stay up for 3 years (1000 days)
1000 servers in cluster => 1 failure/day
1M servers in cluster => 1000 failures/day
MapReduce addresses these challenges of cluster computing, in part by storing data redundantly across multiple nodes.
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: wordcount <file>")
    # The fragment stopped here; completed as the standard word-count example, which the imports and argv check suggest (an assumption).
    sc = SparkContext(appName="WordCount")
    counts = sc.textFile(sys.argv[1]).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)
    print(counts.collect())
import string

def printWst(s):
    lookup = string.ascii_uppercase  # "ABC...Z"
    # The rest of the fragment was missing from the original notes; the intended behavior of printWst is unclear.
This chapter introduces a variety of advanced Spark programming features that we didn’t get to cover in the previous chapters. We introduce two types of shared variables, accumulators to aggregate information and broadcast variables to efficiently distribute large values. Building on our existing transformations on RDDs, we introduce batch operations for tasks with high setup costs, like querying a database. To expand the range of tools accessible to us, we cover Spark’s methods for interacting with external programs, such as scripts written in R.
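As a quick illustration (not the book's own example), here is a minimal PySpark sketch of both shared-variable types; the phone-number data and the country-code lookup table are invented.

```python
from pyspark import SparkContext

sc = SparkContext(appName="SharedVariablesSketch")

# Accumulator: workers add to it, only the driver reads the final value.
blank_lines = sc.accumulator(0)

# Broadcast variable: a read-only value shipped efficiently to every node.
country_codes = sc.broadcast({"1": "US", "44": "UK", "86": "CN"})  # made-up lookup table

def parse(line):
    if not line.strip():
        blank_lines.add(1)
        return []
    prefix = line.split("-")[0]
    return [country_codes.value.get(prefix, "unknown")]

lines = sc.parallelize(["1-5551234", "", "44-2071234", "86-1051234", ""])
countries = lines.flatMap(parse).collect()

print(countries)           # ['US', 'UK', 'CN']
print(blank_lines.value)   # 2 -- reliable only after an action has run
```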
sudo docker pull ubuntu:12.04
sudo docker images
sudo docker run -t -i --name new_container ubuntu:12.04 /bin/bash
This chapter covers how to work with RDDs of key-value pairs, which are a common data type required for many operations in Spark. Key-value RDDs expose new operations such as aggregating data items by key (e.g., counting up reviews for each product), grouping together data with the same key, and grouping together two different RDDs. Oftentimes, to work with data records in Spark, you will need to turn them into key-value pairs and apply one of these operations.
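A minimal sketch of a few of these pair-RDD operations in PySpark; the product and price records are invented.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PairRDDSketch")

# Made-up (product, rating) and (product, price) records.
reviews = sc.parallelize([("phone", 4), ("laptop", 5), ("phone", 3), ("laptop", 4)])
prices  = sc.parallelize([("phone", 599), ("laptop", 1299)])

# Aggregate by key: count reviews per product.
counts = reviews.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)

# Group values that share the same key.
grouped = reviews.groupByKey().mapValues(list)

# Combine two different RDDs by key.
joined = reviews.join(prices)

print(counts.collect())    # e.g. [('phone', 2), ('laptop', 2)]
print(grouped.collect())   # e.g. [('phone', [4, 3]), ('laptop', [5, 4])]
print(joined.collect())    # e.g. [('phone', (4, 599)), ('phone', (3, 599)), ...]
```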
We also discuss an advanced feature that lets users control the layout of pair RDDs across nodes: partitioning. Using controllable partitioning, applications can sometimes greatly reduce communication costs by ensuring that data that will be accessed together is on the same node. This can provide significant speedups. We illustrate partitioning using the PageRank algorithm as an example. Choosing the right partitioning for a distributed dataset is similar to choosing the right data structure for a local one: in both cases, data layout can greatly affect performance.
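A minimal sketch of controllable partitioning in PySpark (a simpler example than PageRank); the user and event records and the partition count are made up.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PartitioningSketch")

# A pair RDD of made-up (userID, profile) records.
users = sc.parallelize([(i, "profile-%d" % i) for i in range(100)])

# Hash-partition into 4 partitions and cache, so later joins against this RDD
# can reuse the known layout instead of reshuffling it each time.
partitioned = users.partitionBy(4).cache()

# A smaller pair RDD joined against the partitioned one; because `partitioned`
# already has a partitioner, only `events` needs to move across the network.
events = sc.parallelize([(7, "click"), (42, "purchase"), (7, "view")])
joined = partitioned.join(events)

print(partitioned.getNumPartitions())  # 4
print(joined.take(3))
```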
This chapter introduces Spark’s core abstraction for working with data, the Resilient Distributed Dataset (RDD). An RDD is simply a distributed collection of elements. In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
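A minimal sketch of that create → transform → act workflow in PySpark, using made-up lines of text.

```python
from pyspark import SparkContext

sc = SparkContext(appName="RDDBasicsSketch")

# Create an RDD from an in-memory collection (it could also come from sc.textFile).
lines = sc.parallelize(["spark is fast", "rdds are resilient", "spark uses rdds"])

# Transformations build new RDDs lazily; nothing executes yet.
words   = lines.flatMap(lambda line: line.split())
sparkly = words.filter(lambda w: w.startswith("spark"))

# Actions trigger the distributed computation and return a result to the driver.
print(sparkly.count())    # 2
print(sparkly.collect())  # ['spark', 'spark']
```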
Both Data Scientists and Engineers should read this chapter, as RDDs are the core concept in Spark. We highly recommend that you try some of these examples in an interactive shell (see Introduction to Spark’s Python and Scala Shells). In addition, all code in this chapter is available in the book’s GitHub repository.