Spark + Shark on Hadoop YARN

Some points to note

How to create a table in Hive
CREATE EXTERNAL TABLE IF NOT EXISTS roads (Name string, Layer string, Shape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '${env:HOME}/sampledata/roads'; 

The SerDe com.esri.hadoop.hive.serde.JsonSerde used here is defined in the jars shipped with ArcGIS (GIS Tools for Hadoop), so those jars have to be added to the Hive session first:

add jar /home/hduser/gis/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar /home/hduser/gis/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar;

Note that the statement must end with a semicolon; once it runs, the jars are available in the session.

The LOCATION clause, on the other hand, points at a directory, /home/hduser/sampledata/roads. Note that this is an HDFS path, so the data file has to be uploaded to HDFS manually:

$ hadoop fs -mkdir -p /home/hduser/sampledata/roads
$ hadoop fs -put roads.json /home/hduser/sampledata/roads/roads.json

Hive will then read the files under that directory on the fly and parse them into rows.
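
As a quick sanity check (a minimal sketch, assuming the table and paths defined above), confirm that the file landed in HDFS and that Hive can read it through the SerDe:

$ hadoop fs -ls /home/hduser/sampledata/roads
$ hive -e "add jar /home/hduser/gis/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar /home/hduser/gis/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar; select name, layer from roads limit 5;"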

References
  1. How to install
  2. Installation troubleshooting notes 1
  3. Installation troubleshooting notes 2
  4. Installation troubleshooting notes 3

Installing and configuring Spark

Assumptions
  1. A Hadoop + YARN environment is already set up (hadoop-2.3.0-cdh5.0.1)
  2. A Hive environment is already set up (hive-0.10.0-cdh4.6.0)
  3. spark-0.9.1-bin-hadoop2.tgz has been downloaded

Unpack the tarball and set the environment variables
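
A minimal sketch of the unpack step (assuming the tarball sits in the current directory and $HOME/yarn is the install root used throughout these notes):

$ mkdir -p $HOME/yarn
$ tar -zxf spark-0.9.1-bin-hadoop2.tgz -C $HOME/yarn

Then add Spark to the environment: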

export SPARK_HOME=$HOME/yarn/spark-0.9.1-bin-hadoop2
export PATH=$SPARK_HOME/bin:$PATH

The key artifact under $SPARK_HOME is assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar, which bundles everything Spark needs at runtime. Note that it also bundles jackson 1.8.8, which can clash at runtime with newer jackson versions.
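
To see which jackson version is actually bundled (a quick check; the jar path is the one given above), list the assembly's entries:

$ jar tf $SPARK_HOME/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar | grep -i jackson | head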

Configure $SPARK_HOME/conf/spark-env.sh:

export JAVA_HOME=/home/hduser/jdk1.7.0_55
export SCALA_HOME=/home/hduser/scala-2.9.3
export SPARK_WORKER_MEMORY=4g

Spark supports several deployment modes: it can run standalone, or on top of YARN. In standalone mode Spark runs master-slave: each slave runs a Worker process and the master runs a Master process. In YARN mode there is essentially nothing extra to configure; Spark dispatches its tasks to the already running YARN cluster.
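
As a rough illustration of the difference (a hedged sketch; the exact environment variables follow the Spark 0.9.1 Spark-on-YARN docs and may need adjusting for your layout):

# Standalone mode: the shell connects to the running Master
$ MASTER=spark://nassvr:7077 $SPARK_HOME/bin/spark-shell

# YARN client mode: no Spark daemons needed, only the YARN cluster;
# the assembly jar is handed to YARN as the Spark runtime
# (the 0.9.1 docs also set SPARK_YARN_APP_JAR to an application jar)
$ YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop \
  SPARK_JAR=$SPARK_HOME/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar \
  MASTER=yarn-client $SPARK_HOME/bin/spark-shell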

Standalone mode

Configure $SPARK_HOME/conf/slaves:

# A Spark Worker will be started on each of the machines listed below.
ubuntu
taurus
suse

Then sync this configuration to the other three machines:

#!/bin/sh

SLAVES=$HADOOP_CONF_DIR/slaves

for node in $(cat $SLAVES)
do
        echo "sync conf with " $node
        rsync -Pav $HIVE_HOME/ hduser@$node:$HIVE_HOME/
        rsync -Pav $SPARK_HOME/ hduser@$node:$SPARK_HOME/
        rsync -Pav $SHARK_HOME/ hduser@$node:$SHARK_HOME/
        echo "synced with " $node
done

Once the sync has finished, start all the nodes:

$ cd $SPARK_HOME
$ sbin/start-all.sh

You should see output like this:

starting org.apache.spark.deploy.master.Master, logging to /home/hduser/yarn/spark-0.9.1-bin-hadoop2/sbin/../logs/spark-hduser-org.apache.spark.deploy.master.Master-1-nassvr.out
ubuntu: starting org.apache.spark.deploy.worker.Worker, logging to /home/hduser/yarn/spark-0.9.1-bin-hadoop2/sbin/../logs/spark-hduser-org.apache.spark.deploy.worker.Worker-1-ubuntu.out
taurus: starting org.apache.spark.deploy.worker.Worker, logging to /home/hduser/yarn/spark-0.9.1-bin-hadoop2/sbin/../logs/spark-hduser-org.apache.spark.deploy.worker.Worker-1-taurus.out
suse: starting org.apache.spark.deploy.worker.Worker, logging to /home/hduser/yarn/spark-0.9.1-bin-hadoop2/sbin/../logs/spark-hduser-org.apache.spark.deploy.worker.Worker-1-suse.out

Correspondingly, $SPARK_HOME/sbin/stop-all.sh stops the Master and all Worker processes.

Once everything is up, you can check with jps on a worker node:

hduser@ubuntu:~$ jps
9362 Worker
8581 DataNode
9844 Jps
8814 NodeManager

or on the master node:

hduser@nassvr:~/yarn> jps
18080 Master
16960 ResourceManager
19346 SharkCliDriver
16592 NameNode
889 Jps
17244 JobHistoryServer

Installing and configuring Shark

  1. The Hive environment is ready
  2. The Spark environment is ready
  3. Download the Scala runtime

Configure the environment variables:

export SCALA_HOME=$HOME/scala-2.9.3
export PATH=$SCALA_HOME/bin:$PATH
export SHARK_HOME=$HOME/yarn/shark-0.9.1-bin-hadoop2

$SHARK_HOME/conf/shark-env.sh for this setup:

hduser@ubuntu:~$ cat yarn/shark-0.9.1-bin-hadoop2/conf/shark-env.sh
#!/usr/bin/env bash

export SPARK_HOME=$HOME/yarn/spark-0.9.1-bin-hadoop2
export SPARK_MEM=2g

# (Required) Set the master program's memory
export SHARK_MASTER_MEM=1g

export SHARK_HOME=$HOME/yarn/shark-0.9.1-bin-hadoop2

# Only required if running Shark with Spark on YARN
export SHARK_EXEC_MODE=yarn
export SPARK_ASSEMBLY_JAR=$SPARK_HOME/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar
export SHARK_ASSEMBLY_JAR=$SHARK_HOME/target/scala-2.10/shark_2.10-0.9.1.jar

# Java options
# On EC2, change the local.dir to /mnt/tmp
SPARK_JAVA_OPTS=" -Dspark.local.dir=/tmp "
SPARK_JAVA_OPTS+="-Dspark.kryoserializer.buffer.mb=10 "
SPARK_JAVA_OPTS+="-verbose:gc -XX:-PrintGCDetails -XX:+PrintGCTimeStamps "
export SPARK_JAVA_OPTS

HADOOP_VERSION=hadoop-2.3.0-cdh5.0.1
export HADOOP_HOME=$HOME/yarn/$HADOOP_VERSION
export HIVE_HOME=$HOME/yarn/hive-0.10.0-cdh4.6.0
export HIVE_CONF_DIR=$HIVE_HOME/conf
export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop
export MASTER=yarn-client
#export MASTER=spark://nassvr:7077

source $SPARK_HOME/conf/spark-env.sh

Note the value of MASTER here. If it is yarn-client, there is no need to start any Spark daemons; it is enough that the Hadoop/YARN daemons on the cluster nodes are running. If it is spark://nassvr:7077, the Spark Master and Workers must be started first.
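
In other words (a small recap using the same values as above):

# MASTER=yarn-client: only the Hadoop/YARN daemons need to be up
# (NameNode, ResourceManager, NodeManagers); no Spark Master/Worker required.
export MASTER=yarn-client

# MASTER=spark://nassvr:7077: start the standalone Master and Workers first.
export MASTER=spark://nassvr:7077
$SPARK_HOME/sbin/start-all.sh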

Because the jackson version bundled in the Spark assembly is out of step with the one ArcGIS needs, a simple workaround is to:

  1. delete the jackson classes from spark-assembly_2.10-0.9.1-hadoop2.2.0.jar and repackage it (see the sketch below)
  2. add the newer jackson jars to the classpath
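
A sketch of step 1, assuming the usual jackson 1.x package layout (org/codehaus/jackson; verify with jar tf first) and working on a copy of the assembly:

$ cp $SPARK_HOME/assembly/target/scala-2.10/spark-assembly_2.10-0.9.1-hadoop2.2.0.jar spark-assembly-nojackson.jar
$ zip -d spark-assembly-nojackson.jar 'org/codehaus/jackson/*'

If Shark should pick up the trimmed jar, point SPARK_ASSEMBLY_JAR in shark-env.sh at it; step 2 then supplies the newer jackson jars on the classpath, as done below.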

Download the newer jackson jars, save them under $HOME/temp/, then copy them into the Hive environment:

cp temp/jackson-*.jar yarn/hive-0.10.0-cdh4.6.0/lib/

Also copy them into the Hadoop lib directories, and restart all services for the changes to take effect:

cp temp/jackson-*.jar yarn/hadoop-2.3.0-cdh5.0.1/share/hadoop/hdfs/lib/
cp temp/jackson-*.jar yarn/hadoop-2.3.0-cdh5.0.1/share/hadoop/yarn/lib/
cp temp/jackson-*.jar yarn/hadoop-2.3.0-cdh5.0.1/share/hadoop/tools/lib/
cp temp/jackson-*.jar yarn/hadoop-2.3.0-cdh5.0.1/share/hadoop/common/lib/

Starting Shark
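
With shark-env.sh in place, the CLI can be launched with the scripts shipped under $SHARK_HOME/bin (shark-withinfo prints extra log output, which helps while debugging the YARN setup):

$ $SHARK_HOME/bin/shark
# or, with more verbose logging:
$ $SHARK_HOME/bin/shark-withinfo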

To run some queries with Shark:

add jar
  ${env:HOME}/gis/gis-tools-for-hadoop/samples/lib/esri-geometry-api.jar
  ${env:HOME}/gis/gis-tools-for-hadoop/samples/lib/spatial-sdk-hadoop.jar;

create temporary function ST_Point as 'com.esri.hadoop.hive.ST_Point';
create temporary function ST_Contains as 'com.esri.hadoop.hive.ST_Contains';
create temporary function ST_Intersects as 'com.esri.hadoop.hive.ST_Intersects';


CREATE EXTERNAL TABLE IF NOT EXISTS roads (Name string, Layer string, Shape binary)
ROW FORMAT SERDE 'com.esri.hadoop.hive.serde.JsonSerde'
STORED AS INPUTFORMAT 'com.esri.json.hadoop.EnclosedJsonInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION '${env:HOME}/sampledata/roads'; 


select roads.name, count(*) from f_600w
join roads
where st_intersects(st_point(f_600w.longitude, f_600w.latitude), roads.shape)=true
group by roads.name;

This produces output of the form (road name, count):
110辅线 25131
X032    1184
丰善西路        142
于辛庄路        836
京包线  6244
北清路  14321
友谊路  1663
安济桥路        181
定泗路  1276
展思门路        255
工商南街        287
市场东路        5109
朱辛庄路        1074
李庄路  534
柴禾市大街     181
沙阳路  3000
满白路  71
王于路  90
环城北路        273
百沙路  1573
站前路  134
站西路  849
豆各庄路        281
路小路  125