- Parallelization
- Handling single points of failure (node crashes, individual slow nodes)
- Multi-user sharing, with dynamic allocation of compute resources
MapReduce -- a simple, general-purpose, automatically fault-tolerant batch computation model; not well suited to interactive, iterative, or streaming computation.
An RDD is a read-only, partitioned collection of records. RDDs can only be created through deterministic operations on either data in stable storage or other RDDs. Users can control two aspects of RDDs: persistence and partitioning.
RDDs are a distributed-memory abstraction aimed at iterative algorithms and interactive data mining tools.
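A minimal Scala sketch of these two control points (persistence and partitioning), assuming a local SparkContext and an illustrative HDFS path (both are assumptions, not from the original notes):

```scala
import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("rdd-basics").setMaster("local[*]"))

// Create an RDD from data in stable storage (path is illustrative).
val lines = sc.textFile("hdfs:///tmp/input.txt")

// Create a new RDD through a deterministic operation on an existing RDD.
val lengths = lines.map(_.length)

// Persistence: ask Spark to keep this RDD in memory across actions.
lengths.persist(StorageLevel.MEMORY_ONLY)

// Partitioning: hash-partition a key-value RDD into 4 partitions.
val byLength = lengths.map(len => (len, 1)).partitionBy(new HashPartitioner(4))

println(byLength.count())
sc.stop()
```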
RDD generation
Some transformations are more complex and contain multiple sub-transformations, so they generate multiple RDDs.
Transformation | Generated RDDs | Compute() |
---|---|---|
map(func) | MappedRDD | iterator(split).map(f) |
filter(func) | FilteredRDD | iterator(split).filter(f) |
flatMap(func) | FlatMappedRDD | iterator(split).flatMap(f) |
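As an illustration, chaining these transformations builds one RDD per step, which can be inspected with toDebugString. This reuses the SparkContext sc from the sketch above; the path is illustrative, and newer Spark releases report these internal RDDs as MapPartitionsRDD rather than the class names in the table.

```scala
val words = sc.textFile("hdfs:///tmp/input.txt") // illustrative path
  .flatMap(_.split(" ")) // flatMap step
  .filter(_.nonEmpty)    // filter step
  .map(_.toUpperCase)    // map step

// Prints the lineage: one RDD per transformation above.
println(words.toDebugString)
```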
- a single RDD has one or more partitions scattered across multiple nodes
- a single partition is processed on a single node
- a single node can handle multiple partitions (with an optimum of 2-4 partitions per CPU, according to the official documentation); see the sketch below
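A small sketch of inspecting and changing partition counts, again using the SparkContext sc from above. The numbers are illustrative, chosen for a hypothetical 16-core cluster following the 2-4 partitions per CPU rule of thumb:

```scala
// Ask for 8 partitions up front.
val data = sc.parallelize(1 to 1000000, numSlices = 8)
println(data.partitions.length) // 8

// ~2-4 partitions per CPU: for 16 cores, 32-64 partitions is a reasonable target.
val repartitioned = data.repartition(32)
println(repartitioned.partitions.length) // 32
```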
Since Spark supports pluggable resource management, the details of the distribution depend on the cluster manager you use (Standalone, YARN, Mesos).
There are two deploy modes on YARN:
- yarn-client: the driver runs on the machine that submits the Spark job
- yarn-cluster: the driver runs in a YARN container, so before submitting the job you do not know which node the driver will run on

If no parameters are specified, only 2 executors are allocated by default; on some clusters the maximum memory per executor is 8 GB.
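The deploy mode itself is normally chosen at submit time (spark-submit's --deploy-mode flag). As a rough sketch, the executor settings can also be supplied through SparkConf before the SparkContext is created; the values below are illustrative, not recommendations:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("yarn-example")
  .setMaster("yarn")                    // older releases use "yarn-client" / "yarn-cluster"
  .set("spark.executor.instances", "4") // overrides the default of 2 executors
  .set("spark.executor.memory", "8g")   // stay within the cluster's per-executor memory limit
```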
- When adding new nodes, watch out for conflicting environment variables
- When reformatting the namenode, the datanode directories on the slave nodes must be emptied first
- Configured Capacity = Total Disk Space - Reserved Space
- Non DFS Used = (Total Disk Space - Reserved Space) - Remaining Space - DFS Used (see the worked example after this list)
- If the HDFS replication factor for the whole cluster is 1 and a file is uploaded to HDFS from a node outside the cluster, the file's replication factor is taken from the hdfs-site.xml configured on that node
- By default, YARN marks a node as unhealthy when its disk usage exceeds 90%; this threshold can be changed by adding the property yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage to yarn-site.xml (value range 0.0-100.0)
- Disabling the firewall on SUSE: /sbin/SuSEfirewall2 off (permanent), /etc/init.d/SuSEfirewall2_setup stop (temporary)
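As a quick worked example of the two capacity formulas, with assumed numbers (not from any real cluster): if Total Disk Space = 1000 GB, Reserved Space = 100 GB, DFS Used = 300 GB and Remaining Space = 500 GB, then Configured Capacity = 1000 - 100 = 900 GB and Non DFS Used = 900 - 500 - 300 = 100 GB.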
- spark.speculation (Spark configuration value), default false. Setting it to true enables speculative execution of tasks: tasks that are running slowly have a second copy launched on another node. Enabling this can help cut down on straggler tasks in large clusters.

spark.speculation can be configured in three ways.
In code:
val conf = new SparkConf().set("spark.speculation", "false")
Or we can pass the configuration at deploy time via the --conf flag, as follows:
--conf spark.speculation=false
Or we can add it to the properties file, space-delimited, as follows:
spark.speculation false
Spark can dynamically increase and decrease the number of executors for an application if the application's resource requirements change over time. To enable dynamic allocation, set spark.dynamicAllocation.enabled to true. Specify the minimum number of executors that should be allocated to an application with the spark.dynamicAllocation.minExecutors parameter, the maximum number with spark.dynamicAllocation.maxExecutors, and the initial number with spark.dynamicAllocation.initialExecutors. Do not use the --num-executors command-line argument or the spark.executor.instances parameter; they are incompatible with dynamic allocation.
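A sketch of these settings through SparkConf; the executor bounds are illustrative and depend on the cluster:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.dynamicAllocation.enabled", "true")
  .set("spark.dynamicAllocation.minExecutors", "2")   // illustrative lower bound
  .set("spark.dynamicAllocation.maxExecutors", "20")  // illustrative upper bound
  .set("spark.dynamicAllocation.initialExecutors", "4")
// Note: do not also pass --num-executors or set spark.executor.instances.
```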