@tag1216
Created October 6, 2016 04:48
Hadoop Streaming
#!/bin/bash
hadoop="/usr/bin/hadoop"
STREAMING=/opt/cloudera/parcels/CDH/lib/hadoop-mapreduce/hadoop-streaming.jar
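cwd=$(pwd)
# mapper, reducer and uuidfile are defined but not used by the job below.
mapper="${cwd}/getcvlist.py"
reducer="${cwd}/getcvlist.py"
echo "$1" "$2"
uuidfile=$2
# Build the -input arguments; append one "-input <path>" per input directory.
inputdir=""
inputdir="${inputdir} -input /path/to/input "
outputdir="/user/ktaguchi/test"
# Remove any previous output (-f: no error if it does not exist yet),
# because the job fails when the output directory already exists.
${hadoop} fs -rm -r -f "${outputdir}"
# Map-only job (0 reduce tasks): the mapper keeps tab-separated fields 2-6.
# ${inputdir} is left unquoted on purpose so it splits into separate arguments.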
${hadoop} jar ${STREAMING} \
-Dmapred.output.compress=true \
-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.DefaultCodec \
-Dmapred.reduce.tasks=0 \
-mapper "cut -f 2-6" \
${inputdir} \
-output ${outputdir} \
-inputformat 'org.apache.hadoop.mapred.SequenceFileAsTextInputFormat' \
-outputformat 'org.apache.hadoop.mapred.SequenceFileOutputFormat'
#${hadoop} fs -text "${outputdir}/part*" > ./results/uu_list_data_1
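
The mapper in the job above is plain `cut -f 2-6`, so it can be tried locally before submitting anything to the cluster. The sketch below assumes each record reaches the mapper as one line of the form key, tab, value, and that the value is itself tab-separated. The `run_mapper` function name and the sample record are hypothetical and only serve the illustration; they are not part of the original script or its data.

```shell
#!/bin/bash
# Same command as the -mapper argument of the streaming job.
# run_mapper is a hypothetical helper name used only in this sketch.
run_mapper() {
    cut -f 2-6
}

# Hypothetical sample record: field 1 stands in for the SequenceFile key,
# fields 2-7 for a tab-separated value. Not taken from the real input.
printf 'key1\ta\tb\tc\td\te\tf\n' | run_mapper
```

Under these assumptions field 1 (the key) and everything after field 6 are dropped. A line that contains no tab at all is passed through unchanged, because `cut` is called without `-s`.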