Install Hadoop & Hive & iPython & Spark

1. Hadoop

  • Install (Coming Soon!)

  • Configuration (Coming Soon!)

  • Test Hadoop Installation

yarn jar $YARN_EXAMPLES/hadoop-mapreduce-examples-2.5.0.jar pi 80 100000000

2. iPython

  • Install (Coming Soon!)
  • Optimization with Intel MKL (Coming Soon! A quick linkage check is sketched below.)
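In the meantime, a quick way to check whether a NumPy build is linked against MKL (a minimal sketch; it assumes NumPy is already installed and simply inspects its build configuration):

# Look for "mkl" in the BLAS/LAPACK sections of the output.
import numpy as np
np.__config__.show()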

3. Spark

  • Download
mkdir dev
cd dev
# Spark 1.3.0, prebuilt for Hadoop 2.4+
wget http://d3kbcqa49mib13.cloudfront.net/spark-1.3.0-bin-hadoop2.4.tgz
  • Install
tar xzvf spark-1.3.0*.tgz 
ln -s spark-1.3.0-bin-hadoop2.4 spark
rm -f spark-1.3.0-bin-hadoop2.4.tgz
  • Configure
cd spark
cp conf/log4j.properties{.template,}
## Change INFO to ERROR   
##  => log4j.rootCategory=ERROR, console
cp conf/spark-defaults.conf{.template,}
## modify driver-memory
## => spark.driver.memory              2g

  • Test Example locally
$ ./bin/run-example SparkPi
Spark assembly has been built with Hive, including Datanucleus jars on classpath
Pi is roughly 3.13956         

4. Integration

  • Integrate Spark with YARN
/opt/spark/current/bin/pyspark --master yarn-client
  • Run Word Count Test

Coming Soon! A minimal version is sketched below; see also the comments at the end of this page.
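The sketch below assembles a word count from the snippets in the comments; it assumes sc is the SparkContext from the pyspark shell started above, and it writes to the hdfs:///user/mark/wc-4.txt directory that the next step reads back:

# Split each line into words, count every word, and save the (word, count) pairs.
text_file = sc.textFile("hdfs:///user/mark/wiki.txt")
counts = text_file.flatMap(lambda line: line.split(" ")) \
                  .map(lambda word: (word, 1)) \
                  .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs:///user/mark/wc-4.txt")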

  • Check Results
hadoop fs -getmerge hdfs:///user/mark/wc-4.txt/ wc.txt

cat wc.txt |\
  awk '{print $2" "$1}' |\
  sed "s/u'/'/g; s/[(),]//g" |\
  sort -nr |\
  head
  
## Results
100235423 'the'
55909625 'of'
47670343 'and'
40809068 'in'
33128081 'to'
32454903 'a'
19945342 'was'
16815599 'is'
15793144 'The'
13166506 'for'
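Alternatively, the top ten can be computed inside Spark instead of with the shell pipeline above (a sketch, assuming counts is the RDD from the word count job):

# Take the 10 highest counts straight from the RDD, sorted descending.
top10 = counts.takeOrdered(10, key=lambda kv: -kv[1])
for word, count in top10:
    print count, word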

  • Integrate Spark with the IPython notebook
export IPYTHON=1
export PYTHONPATH=/opt/spark/current/python/lib/py4j-0.8.2.1-src.zip:/opt/spark/current/python
# optional: launch the notebook straight from pyspark instead
# export IPYTHON_OPTS="notebook"
ipython notebook --profile=nbserver
  • Edit the startup file for profile nbserver
# cat .ipython/profile_nbserver/startup/00-pyspark-setup.py
# Configure the Spark environment
import os
import sys

os.environ['SPARK_HOME'] = '/opt/spark/current/'
os.environ['IPYTHON'] = '1'

os.environ['PYSPARK_SUBMIT_ARGS'] = '--master yarn \
                    --deploy-mode client \
                    --num-executors 24 \
                    --executor-memory 6g \
                    --executor-cores 5'

# Put PySpark and the bundled py4j on the Python path
spark_home = os.environ.get('SPARK_HOME', None)
sys.path.insert(0, os.path.join(spark_home, 'python'))
sys.path.insert(0, os.path.join(spark_home, 'python/lib/py4j-0.8.2.1-src.zip'))

#! Detect the PySpark URL
#! CLUSTER_URL = open('/opt/spark/current/cluster-url').read().strip()

# Start the PySpark shell inside the kernel; this defines sc
execfile(os.path.join(spark_home, 'python/pyspark/shell.py'))
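With the startup file in place, every notebook opened under this profile should come up with sc already defined; a one-line sanity check (a sketch):

# sc is created by pyspark/shell.py in the startup file above
print sc.parallelize(range(1000)).sum()   # expect 499500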

5. Example

from pyspark import SparkConf, SparkContext
conf = SparkConf().setAppName('SparkNB').setMaster('yarn-client')
sc = SparkContext(conf=conf)
text_file = sc.textFile("hdfs:///user/mark/wiki.txt")
tests = text_file.filter(lambda line: "test" in line)
# Count the lines containing "test"
tests.count()
# 759753
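If several actions will hit the same data, caching the RDD avoids re-reading it from HDFS each time (a sketch using the names above):

text_file.cache()   # keep the RDD in executor memory after the first action
tests.count()       # first action reads from HDFS and populates the cache
text_file.filter(lambda line: "Test" in line).count()  # served from the cache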
max6cn commented Oct 10, 2015

counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://n1/user/spark/wordcount")

max6cn commented Oct 12, 2015

text_file = sc.textFile("hdfs://n1/user/spark/wiki.xml")
counts = text_file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://n1/user/spark/wordcount")
# Peek at the first (word, count) pair
counts.first()
