dapangmao / chap6.md

Last active March 18, 2017 16:21

Chapter 6. Advanced Spark Programming

Introduction

This chapter introduces a variety of advanced Spark programming features that we didn’t get to cover in the previous chapters. We introduce two types of shared variables, accumulators to aggregate information and broadcast variables to efficiently distribute large values. Building on our existing transformations on RDDs, we introduce batch operations for tasks with high setup costs, like querying a database. To expand the range of tools accessible to us, we cover Spark’s methods for interacting with external programs, such as scripts written in R.

dapangmao / practice.md

Last active August 29, 2015 14:09

Keep worst by group

Other 2 examples

Method 1

import string
def printWst(s):
   lookup = string.ascii_uppercase

dapangmao / spark_py_examples.md

Last active August 29, 2015 14:10

Examples for python and Spark

Link

Word Count

import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
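
The preview above is cut off before the counting logic. As a rough, Spark-free sketch of the same idea, the plain-Python function below splits lines into words and sums a 1 per word, mirroring the flatMap / map / reduceByKey shape commonly used for word count. The name `word_count` and the sample lines are illustrative assumptions, not the gist's own code.

```python
from collections import defaultdict
from operator import add


def word_count(lines):
    """Count whitespace-separated words across an iterable of lines."""
    counts = defaultdict(int)
    for line in lines:              # flatMap step: one line -> many words
        for word in line.split():
            # map each word to 1, then reduce by key with `add`
            counts[word] = add(counts[word], 1)
    return dict(counts)


if __name__ == "__main__":
    sample = ["to be or not to be", "to see or not to see"]  # hypothetical input
    for word, n in sorted(word_count(sample).items()):
        print(word, n)
```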

dapangmao / stanford_class_study.md

Last active August 29, 2015 14:10

# Mining Massive Datasets
## Week 1
### Distributed File Systems

Node failures
A single server can stay up for 3 years (1000 days)
1000 servers in cluster => 1 failure/day
1M servers in cluster => 1000 failures/day
MapReduce addresses the challenges of cluster computing:
Store data redundantly
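
The failure figures above follow from simple arithmetic: if each server fails about once every 1000 days, a cluster of N servers sees roughly N / 1000 failures per day. A minimal sketch, assuming failures are independent and evenly spread over time (the function name is illustrative):

```python
def expected_failures_per_day(num_servers, uptime_days=1000):
    """Expected daily failures if each server fails once per `uptime_days`."""
    return num_servers / uptime_days


for n in (1, 1000, 1_000_000):
    print(n, "servers ->", expected_failures_per_day(n), "failures/day")
```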

dapangmao / from halfsea.md

Last active September 16, 2017 23:47

http://www.mitbbs.com/article_t/JobHunting/32477683.html

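Sorry for the delay. I had meant to write something systematic, but once I started I found that my own level still falls far short and I couldn't handle it, and my time and energy don't allow it right now either. So I will probably only be able to jot down scattered impressions. Please just skim them and don't expect too much; apologies in advance. There are many experts on this board, and the real top people may simply have no time to post, so I will go ahead and show my meager skills in front of the masters, in the spirit of giving back to the community.

Over the past few years I have changed jobs several times, and the information on the job board helped me a great deal each time. So I hope to give back a little of what I have learned here. What follows is just my own limited opinion; it may well be incorrect or may not fit other people's actual circumstances. It is offered for reference only.

The following points left the deepest impression on me during this experience, and I recommend them to everyone as lessons learned.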

dapangmao / test.md

Last active August 29, 2015 14:10

Spark practice (1): find the stranger that shares the most friends with me

Given the friend pairs in the sample text below (each line contains two people who are friends), find the stranger that shares the most friends with me.

sample.txt

me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John
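
One way to approach this, sketched in plain Python rather than Spark: build each person's friend set, then for every person who is not already my friend, count how many of their friends are also mine. The function name and the tie-breaking behavior (first candidate seen wins) are assumptions made for illustration.

```python
from collections import defaultdict


def most_shared_stranger(pairs, me="me"):
    """Return (stranger, shared_count) for the non-friend sharing the most friends with `me`."""
    friends = defaultdict(set)
    for a, b in pairs:              # friendship is symmetric
        friends[a].add(b)
        friends[b].add(a)

    mine = friends[me]
    best = None
    for person, theirs in friends.items():
        if person == me or person in mine:
            continue                # skip myself and existing friends
        shared = len(mine & theirs)
        if best is None or shared > best[1]:
            best = (person, shared)
    return best


sample = """me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John"""

pairs = [tuple(line.split()) for line in sample.splitlines()]
print(most_shared_stranger(pairs))  # ('John', 2)
```

In the sample, John is the only stranger, and he shares Alice and Jane with me.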

dapangmao / ssn.md

Last active August 29, 2015 14:10

Spark practice (3): clean and sort Social Security numbers

Sample.txt

Requirements:
1. separate valid SSNs from invalid SSNs
2. count the number of valid SSNs

402-94-7709 
283-90-3049
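
The requirements above can be sketched in plain Python. The gist preview does not say what counts as valid, so this sketch assumes a valid SSN is exactly three digits, two digits, and four digits separated by hyphens, after trimming surrounding whitespace. The third input value is a made-up invalid entry added only to exercise the invalid branch.

```python
import re

SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")  # assumed definition of "valid"


def split_ssns(lines):
    """Return (sorted valid SSNs, invalid entries) after trimming whitespace."""
    valid, invalid = [], []
    for raw in lines:
        ssn = raw.strip()
        if not ssn:
            continue                # ignore blank lines
        (valid if SSN_PATTERN.fullmatch(ssn) else invalid).append(ssn)
    return sorted(valid), invalid


lines = ["402-94-7709 ", "283-90-3049", "12-345-6789"]  # last value is hypothetical
valid, invalid = split_ssns(lines)
print(len(valid), valid, invalid)
```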

dapangmao / parents.md

Last active August 29, 2015 14:11

find grandchild-grandparent pairs from child-parent pairs

Sample.txt (the first word is child; the second word is parent)

Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
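
This is a self-join: a child's parent is matched against the child column of the same table, and the parent of that row is the grandparent. A minimal plain-Python sketch, using only the rows visible above (the function name is illustrative, and the original gist may use Spark joins instead):

```python
from collections import defaultdict


def grandparent_pairs(child_parent):
    """Self-join child-parent pairs into sorted (grandchild, grandparent) pairs."""
    parents_of = defaultdict(list)
    for child, parent in child_parent:
        parents_of[child].append(parent)

    result = set()
    for child, parent in child_parent:
        # the parent's own parents are the child's grandparents
        for grandparent in parents_of.get(parent, []):
            result.add((child, grandparent))
    return sorted(result)


sample = """Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice"""

relations = [tuple(line.split()) for line in sample.splitlines()]
for grandchild, grandparent in grandparent_pairs(relations):
    print(grandchild, grandparent)
```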

dapangmao / sql.md

Last active August 29, 2015 14:11

query text using SQL

In a class of a few children, use SQL to find those who are male and weigh over 100.

class.txt (including Name Sex Age Height Weight)

Alfred M 14 69.0 112.5 
Alice F 13 56.5 84.0
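
As a rough illustration of the query, the sketch below loads the two visible rows into an in-memory SQLite table and filters on sex and weight. SQLite (via Python's standard `sqlite3` module) is swapped in here for whatever SQL engine the gist uses, and the table and column names are assumptions derived from the header above.

```python
import sqlite3

# The two rows visible in the preview: Name Sex Age Height Weight
rows = [
    ("Alfred", "M", 14, 69.0, 112.5),
    ("Alice", "F", 13, 56.5, 84.0),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE class (name TEXT, sex TEXT, age INTEGER, height REAL, weight REAL)"
)
conn.executemany("INSERT INTO class VALUES (?, ?, ?, ?, ?)", rows)

# Male students weighing over 100
heavy_boys = conn.execute(
    "SELECT name, weight FROM class WHERE sex = 'M' AND weight > 100"
).fetchall()
print(heavy_boys)  # [('Alfred', 112.5)]
conn.close()
```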

dapangmao / link.md

Last active August 29, 2015 14:12

Rediscover flask
