Sample.txt
Requirements:
1. Separate valid SSNs from invalid SSNs.
2. Count the number of valid SSNs.
402-94-7709
283-90-3049
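A minimal sketch of one way to do this in Python, assuming Sample.txt contains one SSN per line and that "valid" means the usual AAA-GG-SSSS rules (area not 000, 666, or 900-999; group not 00; serial not 0000); the file name and the validity rules are assumptions, so adjust them to whatever the exercise intends.

```python
import re

# Assumed validity rules: area not 000/666/9xx, group not 00, serial not 0000.
SSN_RE = re.compile(r"^(?!000|666|9\d\d)\d{3}-(?!00)\d{2}-(?!0000)\d{4}$")

def split_ssns(lines):
    valid, invalid = [], []
    for line in lines:
        ssn = line.strip()
        if not ssn:
            continue
        (valid if SSN_RE.match(ssn) else invalid).append(ssn)
    return valid, invalid

with open("Sample.txt") as f:          # assumes one SSN per line
    valid, invalid = split_ssns(f)

print("valid:", valid)
print("invalid:", invalid)
print("number of valid SSNs:", len(valid))
```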
Given the friend pairs in the sample text below (each line contains two people who are friends), find the stranger who shares the most friends with me, where a stranger is anyone who is not already my friend.
sample.txt
me Alice
Henry me
Henry Alice
me Jane
Alice John
Jane John
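A minimal sketch in plain Python, assuming sample.txt holds one space-separated pair per line, "me" is the literal token for myself, and ties are broken by whichever candidate is seen first.

```python
from collections import defaultdict

# Build an undirected friendship graph from space-separated pairs.
friends = defaultdict(set)
with open("sample.txt") as f:          # assumed file name
    for line in f:
        a, b = line.split()
        friends[a].add(b)
        friends[b].add(a)

me = "me"                              # assumed label for myself
best, best_count = None, -1
for person in friends:
    if person == me or person in friends[me]:
        continue                       # skip myself and people who are already my friends
    shared = len(friends[person] & friends[me])
    if shared > best_count:
        best, best_count = person, shared

print(best, best_count)                # for the sample data: John with 2 shared friends (Alice, Jane)
```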
Sorry for the delay. I originally wanted to write something systematic, but once I started I realized my level falls far short of that and I can't really handle it, and I don't have the time or energy right now either. So I'll probably just jot down some scattered impressions. Please read them casually and don't expect too much; apologies in advance. There are many very capable people on this board, and the real experts probably have no time to post at all, so I'm just showing my modest skills in front of the masters, in the spirit of giving back to the community.
Over the past few years I've changed jobs several times, and the information on the job board helped me a great deal each time, so I hope to give a little of what I've learned back here. Everything below is just my own shallow view; it may well be wrong or not fit other people's actual situations. Take it purely as a reference.
From this experience, the points below left the deepest impression on me, and I'd like to recommend them as lessons learned.
# Mining Massive Datasets
## Week 1
### Distributed File Systems
Node failures
A single server can stay up for 3 years (1000 days)
1000 servers in cluster => 1 failure/day
1M servers in cluster => 1000 failures/day
MapReduce addresses these challenges of cluster computing, in part by storing data redundantly across multiple nodes.
import sys
from operator import add
from pyspark import SparkContext

if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("Usage: wordcount <file>")
    # The fragment stopped here; completed as the standard word-count example, which the imports and argv check suggest (an assumption).
    sc = SparkContext(appName="WordCount")
    counts = sc.textFile(sys.argv[1]).flatMap(lambda l: l.split()).map(lambda w: (w, 1)).reduceByKey(add)
    print(counts.collect())
import string

def printWst(s):
    lookup = string.ascii_uppercase  # "ABC...Z"
    # The rest of the fragment was missing from the original notes; the intended behavior of printWst is unclear.
This chapter introduces a variety of advanced Spark programming features that we didn’t get to cover in the previous chapters. We introduce two types of shared variables, accumulators to aggregate information and broadcast variables to efficiently distribute large values. Building on our existing transformations on RDDs, we introduce batch operations for tasks with high setup costs, like querying a database. To expand the range of tools accessible to us, we cover Spark’s methods for interacting with external programs, such as scripts written in R.
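As a quick illustration (not the book's own example), here is a minimal PySpark sketch of both shared-variable types; the phone-number data and the country-code lookup table are invented.

```python
from pyspark import SparkContext

sc = SparkContext(appName="SharedVariablesSketch")

# Accumulator: workers add to it, only the driver reads the final value.
blank_lines = sc.accumulator(0)

# Broadcast variable: a read-only value shipped efficiently to every node.
country_codes = sc.broadcast({"1": "US", "44": "UK", "86": "CN"})  # made-up lookup table

def parse(line):
    if not line.strip():
        blank_lines.add(1)
        return []
    prefix = line.split("-")[0]
    return [country_codes.value.get(prefix, "unknown")]

lines = sc.parallelize(["1-5551234", "", "44-2071234", "86-1051234", ""])
countries = lines.flatMap(parse).collect()

print(countries)           # ['US', 'UK', 'CN']
print(blank_lines.value)   # 2 -- reliable only after an action has run
```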
sudo docker pull ubuntu:12.04
sudo docker images
sudo docker run -t -i --name new_container ubuntu:12.04 /bin/bash
This chapter covers how to work with RDDs of key-value pairs, which are a common data type required for many operations in Spark. Key-value RDDs expose new operations such as aggregating data items by key (e.g., counting up reviews for each product), grouping together data with the same key, and grouping together two different RDDs. Oftentimes, to work with data records in Spark, you will need to turn them into key-value pairs and apply one of these operations.
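A minimal sketch of a few of these pair-RDD operations in PySpark; the product and price records are invented.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PairRDDSketch")

# Made-up (product, rating) and (product, price) records.
reviews = sc.parallelize([("phone", 4), ("laptop", 5), ("phone", 3), ("laptop", 4)])
prices  = sc.parallelize([("phone", 599), ("laptop", 1299)])

# Aggregate by key: count reviews per product.
counts = reviews.mapValues(lambda _: 1).reduceByKey(lambda a, b: a + b)

# Group values that share the same key.
grouped = reviews.groupByKey().mapValues(list)

# Combine two different RDDs by key.
joined = reviews.join(prices)

print(counts.collect())    # e.g. [('phone', 2), ('laptop', 2)]
print(grouped.collect())   # e.g. [('phone', [4, 3]), ('laptop', [5, 4])]
print(joined.collect())    # e.g. [('phone', (4, 599)), ('phone', (3, 599)), ...]
```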
We also discuss an advanced feature that lets users control the layout of pair RDDs across nodes: partitioning. Using controllable partitioning, applications can sometimes greatly reduce communication costs by ensuring that data that will be accessed together is on the same node. This can provide significant speedups. We illustrate partitioning using the PageRank algorithm as an example. Choosing the right partitioning for a distributed dataset is similar to choosing the right data structure for a local one: in both cases, data layout can greatly affect performance.
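A minimal sketch of controllable partitioning in PySpark (a simpler example than PageRank); the user and event records and the partition count are made up.

```python
from pyspark import SparkContext

sc = SparkContext(appName="PartitioningSketch")

# A pair RDD of made-up (userID, profile) records.
users = sc.parallelize([(i, "profile-%d" % i) for i in range(100)])

# Hash-partition into 4 partitions and cache, so later joins against this RDD
# can reuse the known layout instead of reshuffling it each time.
partitioned = users.partitionBy(4).cache()

# A smaller pair RDD joined against the partitioned one; because `partitioned`
# already has a partitioner, only `events` needs to move across the network.
events = sc.parallelize([(7, "click"), (42, "purchase"), (7, "view")])
joined = partitioned.join(events)

print(partitioned.getNumPartitions())  # 4
print(joined.take(3))
```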
This chapter introduces Spark’s core abstraction for working with data, the Resilient Distributed Dataset (RDD). An RDD is simply a distributed collection of elements. In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Under the hood, Spark automatically distributes the data contained in RDDs across your cluster and parallelizes the operations you perform on them.
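A minimal sketch of that create → transform → act workflow in PySpark, using made-up lines of text.

```python
from pyspark import SparkContext

sc = SparkContext(appName="RDDBasicsSketch")

# Create an RDD from an in-memory collection (it could also come from sc.textFile).
lines = sc.parallelize(["spark is fast", "rdds are resilient", "spark uses rdds"])

# Transformations build new RDDs lazily; nothing executes yet.
words   = lines.flatMap(lambda line: line.split())
sparkly = words.filter(lambda w: w.startswith("spark"))

# Actions trigger the distributed computation and return a result to the driver.
print(sparkly.count())    # 2
print(sparkly.collect())  # ['spark', 'spark']
```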
Both Data Scientists and Engineers should read this chapter, as RDDs are the core concept in Spark. We highly recommend that you try some of these examples in an interactive shell (see Introduction to Spark’s Python and Scala Shells). In addition, all code in this chapter is available in the book’s GitHub repository.