@goddoe
Created April 29, 2022 06:19
map per file.sh
# Reference: https://www.ghostar.org/2013/09/mass-gzip-files-inside-hdfs-using-the-power-of-hadoop/
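For context, "Once that's done" below refers to building gzipped.txt, a plain-text list of the HDFS files to compress, one path per line, staged in HDFS so the streaming job can read it. That prep step is not shown in this gist; the following is a minimal sketch under the assumptions that /logs/raw is a placeholder source directory and that the 0.20-era hadoop fs -lsr listing is available.

# Hypothetical prep step: list uncompressed files (skip directories and
# anything already ending in .gz), one HDFS path per line, then stage the
# list in HDFS so the streaming job can use it as -input.
hadoop fs -lsr /logs/raw | awk '$1 !~ /^d/ && $NF !~ /\.gz$/ {print $NF}' > gzipped.txt
hadoop fs -put gzipped.txt ./gzipped.txt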
Once that's done, we want to run the Hadoop streaming job. I ran it like this (the original post also includes the job's console output, omitted here):
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
    -Dmapred.reduce.tasks=0 \
    -mapper gzipit.sh \
    -input ./gzipped.txt \
    -output /user/hcoyote/gzipped.log \
    -verbose \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -file gzipit.sh
Important to note on this command line: org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It tells the job to feed one line of the input list, i.e. one file path, to each map task. This lets gzipit.sh run once per file and gain the parallelism of all available map slots on the cluster. One thing I should have done was turn off speculative execution: since each task creates a specific output file, I saw some tasks fail preemptively because the output had already been produced.
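On that generation of Hadoop, map-side speculative execution could be disabled by adding -Dmapred.map.tasks.speculative.execution=false next to the other -D option; the property name varies across Hadoop versions, so treat that as an assumption to verify against your distribution.

The gzipit.sh mapper itself lives in the referenced blog post and is not part of this gist. The following is a hypothetical reconstruction, assuming each map task receives one line of gzipped.txt on stdin (possibly prefixed by the streaming record key) and that HDFS paths contain no whitespace.

#!/bin/bash
# Hypothetical sketch of gzipit.sh, not the script from the referenced post.
# NLineInputFormat hands each map task one line of gzipped.txt; streaming may
# prefix it with the record key, so keep only the last whitespace-separated
# field as the HDFS path.
while read -r line; do
  src=$(echo "$line" | awk '{print $NF}')
  name=$(basename "$src")
  # Copy the file into the local task directory, compress it, push the .gz
  # back next to the original, then remove the uncompressed source.
  hadoop fs -get "$src" "$name" \
    && gzip "$name" \
    && hadoop fs -put "$name.gz" "${src}.gz" \
    && hadoop fs -rm "$src" \
    && echo "gzipped $src"
done

Whatever the mapper echoes to stdout becomes the job output, which is why the command above points -output at /user/hcoyote/gzipped.log.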