- Install (Coming Soon!)
- Configuration (Coming Soon!)
- Test Hadoop Installation
yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.5.0.jar pi 80 100000000
- Install (Coming Soon!)
- Optimization with Intel MKL (Coming Soon!)
- Download
mkdir dev
cd dev
# Spark 1.3.0, prebuilt for Hadoop 2.4+
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz
- Install
tar xzvf spark-1.3.0*.tgz
ln -s spark-1.3.0-bin-hadoop2.4 spark
rm -f spark-1.3.0-bin-hadoop2.4.tgz
- Configure
cd spark
cp conf/log4j.properties{.template,}
## Change INFO to ERROR
## => log4j.rootCategory=ERROR, console
cp conf/spark-defaults.conf{.template,}
## modify driver-memory
## => spark.driver.memory 2g
- Test Example locally
$ ./bin/run-example SparkPi
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Pi is roughly 3.13956
- Integrate Spark with YARN
/opt/spark/current/bin/pyspark --master yarn-client
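A quick sanity check once the shell comes up on YARN (a minimal sketch; sc is the SparkContext the pyspark shell already provides):
# run inside the pyspark shell started above
sc.parallelize(range(1000)).count()
# should return 1000, with the work running on the YARN executors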
- Run Word Count Test
Coming Soon!
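Until the full walkthrough is written, here is a minimal PySpark word-count sketch (an assumption, not the final test; only the input wiki.txt and output wc-4.txt paths are taken from the steps below):
# wordcount.py -- submit with: /opt/spark/current/bin/spark-submit --master yarn-client wordcount.py
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName('WordCount')
sc = SparkContext(conf=conf)
lines = sc.textFile('hdfs:///user/mark/wiki.txt')
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile('hdfs:///user/mark/wc-4.txt')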
- Check Results
hadoop fs -getmerge hdfs:///user/mark/wc-4.txt/ wc.txt
cat wc.txt |\
awk '{print $2" "$1}' |\
sed "s/u'//g; s/[',()\"]//g" |\
sort -nr |\
head
## Results
100235423 'the'
55909625 'of'
47670343 'and'
40809068 'in'
33128081 'to'
32454903 'a'
19945342 'was'
16815599 'is'
15793144 'The'
13166506 'for'
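The same top-10 check can also be done in PySpark instead of the shell pipeline (a sketch, assuming each output line is the repr of a (word, count) tuple):
# read the word-count output back and take the 10 largest counts
import ast
pairs = sc.textFile('hdfs:///user/mark/wc-4.txt').map(ast.literal_eval)
print(pairs.takeOrdered(10, key=lambda wc: -wc[1]))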
- Integrate Spark with IPython Notebook
export IPYTHON=1
export PYTHONPATH=/opt/spark/current/python/lib/py4j-0.8.2.1-src.zip:/opt/spark/current/python
# optional
# export IPYTHON_OPTS="notebook"
ipython notebook --profile=nbserver
# then edit the profile's startup file (next section)
- Edit Startup file for profile-nbserver
# cat .ipython/profile_nbserver/startup/00-pyspark-setup.py
# Configure the Spark Env
import os
os.environ['SPARK_HOME'] = '/opt/spark/current/'
os.environ['IPYTHON'] = '1'
os.environ['PYSPARK_SUBMIT_ARGS'] = (' --master yarn'
                                     ' --deploy-mode client'
                                     ' --num-executors 24'
                                     ' --executor-memory 6g'
                                     ' --executor-cores 5')
# and Python Path
import sys
sys.path.insert(0, '/opt/spark/current/python')
# PYTHONPATH is a colon-separated string in os.environ, not a list
os.environ['PYTHONPATH'] = ('/opt/spark/current/python:'
                            '/opt/spark/current/python/lib/py4j-0.8.2.1-src.zip')
#! Detect the PySpark URL
#! CLUSTER_URL = open('/opt/spark/current/cluster-url').read.strip()
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
# shell.py above already provides an sc; alternatively, create the context explicitly:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setAppName('SparkNB').setMaster('yarn-client')
sc = SparkContext(conf=conf)
lines = sc.textFile("hdfs:///user/mark/wiki.txt")
tests = lines.filter(lambda line: "test" in line)
# Count all the lines containing "test"
tests.count()
# 759753