@goddoe
Created April 29, 2022 06:19
map per file.sh
# Reference: https://www.ghostar.org/2013/09/mass-gzip-files-inside-hdfs-using-the-power-of-hadoop/
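For context, "Once that's done" below refers to building gzipped.txt, a plain-text list of the HDFS files to compress, one path per line, staged in HDFS so the streaming job can read it. That prep step is not shown in this gist; the following is a minimal sketch under the assumptions that /logs/raw is a placeholder source directory and that the 0.20-era hadoop fs -lsr listing is available.

# Hypothetical prep step: list uncompressed files (skip directories and
# anything already ending in .gz), one HDFS path per line, then stage the
# list in HDFS so the streaming job can use it as -input.
hadoop fs -lsr /logs/raw | awk '$1 !~ /^d/ && $NF !~ /\.gz$/ {print $NF}' > gzipped.txt
hadoop fs -put gzipped.txt ./gzipped.txt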
Once that's done, we want to run the Hadoop streaming job. I ran it like this (the original post also includes the job's console output, omitted here):
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
    -Dmapred.reduce.tasks=0 \
    -mapper gzipit.sh \
    -input ./gzipped.txt \
    -output /user/hcoyote/gzipped.log \
    -verbose \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -file gzipit.sh
Important to note on this command line: org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It tells the job to feed one line of the input list, i.e. one file path, to each map task. This lets gzipit.sh run once per file and gain the parallelism of all available map slots on the cluster. One thing I should have done was turn off speculative execution: since each task creates a specific output file, I saw some tasks fail preemptively because the output had already been produced.
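On that generation of Hadoop, map-side speculative execution could be disabled by adding -Dmapred.map.tasks.speculative.execution=false next to the other -D option; the property name varies across Hadoop versions, so treat that as an assumption to verify against your distribution.

The gzipit.sh mapper itself lives in the referenced blog post and is not part of this gist. The following is a hypothetical reconstruction, assuming each map task receives one line of gzipped.txt on stdin (possibly prefixed by the streaming record key) and that HDFS paths contain no whitespace.

#!/bin/bash
# Hypothetical sketch of gzipit.sh, not the script from the referenced post.
# NLineInputFormat hands each map task one line of gzipped.txt; streaming may
# prefix it with the record key, so keep only the last whitespace-separated
# field as the HDFS path.
while read -r line; do
  src=$(echo "$line" | awk '{print $NF}')
  name=$(basename "$src")
  # Copy the file into the local task directory, compress it, push the .gz
  # back next to the original, then remove the uncompressed source.
  hadoop fs -get "$src" "$name" \
    && gzip "$name" \
    && hadoop fs -put "$name.gz" "${src}.gz" \
    && hadoop fs -rm "$src" \
    && echo "gzipped $src"
done

Whatever the mapper echoes to stdout becomes the job output, which is why the command above points -output at /user/hcoyote/gzipped.log.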