This chapter introduces a variety of advanced Spark programming features that we didn’t get to cover in the previous chapters. We introduce two types of shared variables: accumulators to aggregate information and broadcast variables to efficiently distribute large values. Building on our existing transformations on RDDs, we introduce batch operations for tasks with high setup costs, like querying a database. To expand the range of tools accessible to us, we cover Spark’s methods for interacting with external programs, such as scripts written in R.
import string
def printWst(s):
    lookup = string.ascii_uppercase
- Word Count
import sys
from operator import add
from pyspark import SparkContext
if __name__ == "__main__":
    if len(sys.argv) != 2:
#Mining Massive Datasets
##Week1
###Distributed File Systems
- Node failures
  - A single server can stay up for 3 years (1000 days)
  - 1000 servers in cluster => 1 failure/day
  - 1M servers in cluster => 1000 failures/day
- MapReduce addresses the challenges of cluster computing
  - Store data redundantly
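The failure figures above follow from simple arithmetic: if one server fails roughly once per 1000 days, a cluster of N servers sees about N/1000 failures per day. A minimal sketch of that calculation, where the function name `expected_failures_per_day` is only illustrative and failures are assumed to be independent and evenly spread over time:

```python
def expected_failures_per_day(num_servers, mean_uptime_days=1000):
    """Expected number of failures per day across the cluster.

    Assumes (as in the notes) that each server fails about once
    every `mean_uptime_days` days, independently of the others.
    """
    return num_servers / mean_uptime_days


if __name__ == "__main__":
    for servers in (1, 1000, 1_000_000):
        rate = expected_failures_per_day(servers)
        print(f"{servers} servers: {rate} failures/day")
```

At the scale of a million machines, failure is the normal case rather than the exception, which is why redundant storage comes first in the list.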
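Sorry for the delay. I originally wanted to write something systematic, but once I started I realized my level is still far from good enough to handle it, and my time and energy don't allow for it right now either. So I will probably only be able to write down some scattered impressions. Please just skim it and don't expect too much; apologies in advance. There are many experts on this board, and the truly great ones may simply not have time to post, so I will show my modest skills in front of the masters in the spirit of giving back to the community.
Over the past few years I have changed jobs several times, and the information on the job board helped me a great deal each time. So I hope to give back here with a bit of what I have learned. What follows is only my own humble opinion; it may well be incorrect or not fit other people's actual circumstances. It is offered for reference only.
The following points left the deepest impression on me from this experience, and I recommend them to everyone as lessons learned.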
Given the friend pairs in the sample text below (each line contains two people who are friends), find the stranger that shares the most friends with me.
sample.txt
me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John
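One possible way to approach this exercise, sketched in plain Python rather than Spark so it runs without a cluster: build an undirected friendship graph, then for each person who is neither "me" nor already my friend, count how many friends we have in common. The names `build_graph` and `best_stranger` are illustrative, and ties are broken alphabetically, which is an assumption the exercise does not specify.

```python
from collections import defaultdict

SAMPLE = """me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John"""


def build_graph(lines):
    """Build an undirected friendship graph: person -> set of friends."""
    graph = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = parts
        graph[a].add(b)
        graph[b].add(a)
    return graph


def best_stranger(graph, person="me"):
    """Return (name, shared_count) for the stranger sharing the most
    friends with `person`, or None if no stranger shares any friend."""
    my_friends = graph.get(person, set())
    shared = {}
    for other, friends in graph.items():
        if other == person or other in my_friends:
            continue  # not a stranger
        common = len(friends & my_friends)
        if common:
            shared[other] = common
    if not shared:
        return None
    # max() keeps the first maximal item, so sorting breaks ties alphabetically
    name = max(sorted(shared), key=shared.get)
    return name, shared[name]


if __name__ == "__main__":
    print(best_stranger(build_graph(SAMPLE.splitlines())))
```

On the sample data, my friends are Alice, Henry, and Jane, so John is the only stranger, and he shares Alice and Jane with me.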
Sample.txt
Requirements:
1. Separate the valid SSNs from the invalid SSNs
2. Count the number of valid SSNs
402-94-7709
283-90-3049
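A minimal sketch of both requirements in plain Python. The exercise does not define what "valid" means, so this assumes a purely syntactic rule: exactly three digits, two digits, and four digits separated by hyphens. The helper name `split_ssns` and the third, malformed input line in the usage example are hypothetical additions for illustration only.

```python
import re

# Assumed validity rule: ddd-dd-dddd and nothing else on the line.
SSN_PATTERN = re.compile(r"\d{3}-\d{2}-\d{4}")


def split_ssns(lines):
    """Split lines into (valid, invalid) lists under the assumed pattern."""
    valid, invalid = [], []
    for line in lines:
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        if SSN_PATTERN.fullmatch(candidate):
            valid.append(candidate)
        else:
            invalid.append(candidate)
    return valid, invalid


if __name__ == "__main__":
    # The last entry is a made-up malformed value, not part of the sample.
    sample = ["402-94-7709", "283-90-3049", "12-345-6789"]
    valid, invalid = split_ssns(sample)
    print("valid:", valid)
    print("invalid:", invalid)
    print("valid count:", len(valid))
```

The same predicate could be passed to an RDD `filter` to do the split in Spark; a stricter rule (for example, rejecting certain area numbers) would only change the pattern.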
Sample.txt (the first word is child; the second word is parent)
Tom Lucy
Tom Jack
Jone Lucy
Jone Jack
Lucy Mary
Lucy Ben
Jack Alice
In a class of a few children, use SQL to find those who are male and weigh over 100.
class.txt (columns: Name, Sex, Age, Height, Weight)
Alfred M 14 69.0 112.5
Alice F 13 56.5 84.0
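The query itself is a simple two-condition filter. The sketch below uses Python's built-in `sqlite3` module as a stand-in SQL engine so that it runs anywhere; the exercise was presumably meant for Spark SQL, where the same `SELECT` statement would apply to a registered table. The table name `class`, the column names, and the function `heavy_boys` are assumptions derived from the header line, not from the original solution.

```python
import sqlite3

# The two rows shown in the sample: Name, Sex, Age, Height, Weight
ROWS = [
    ("Alfred", "M", 14, 69.0, 112.5),
    ("Alice", "F", 13, 56.5, 84.0),
]


def heavy_boys(rows, min_weight=100):
    """Return (name, weight) for male children heavier than min_weight."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE class "
            "(name TEXT, sex TEXT, age INTEGER, height REAL, weight REAL)"
        )
        conn.executemany("INSERT INTO class VALUES (?, ?, ?, ?, ?)", rows)
        cursor = conn.execute(
            "SELECT name, weight FROM class "
            "WHERE sex = 'M' AND weight > ? ORDER BY name",
            (min_weight,),
        )
        return cursor.fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    print(heavy_boys(ROWS))
```

Of the two rows shown, only Alfred satisfies both conditions.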