Created April 29, 2022 06:19
map per file.sh
# Reference: https://www.ghostar.org/2013/09/mass-gzip-files-inside-hdfs-using-the-power-of-hadoop/
Once that's done, we want to run the Hadoop streaming job. I ran it like this. There's a lot of output here; I include it only for reference.
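The `-input` file (`gzipped.txt` below) is simply a list of HDFS paths to compress, one per line. A minimal sketch of building that list, assuming the files live under a placeholder directory like `/logs` and that already-compressed `.gz` files should be skipped (the exact listing command and directory are assumptions, not from the original):

```shell
# Keep only the path column of an `hadoop fs -ls` style listing and
# drop files that are already gzipped. The path is the last field.
list_uncompressed() {
    awk '{print $NF}' | grep -v '\.gz$'
}

# On a real cluster this would be fed from something like:
#   hadoop fs -ls -R /logs | list_uncompressed > gzipped.txt
```

This is a local-friendly sketch; the listing source on the left of the pipe is whatever enumerates your HDFS files.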
$ hadoop jar /usr/lib/hadoop-0.20/contrib/streaming/hadoop-streaming-*.jar \
    -Dmapred.reduce.tasks=0 \
    -mapper gzipit.sh \
    -input ./gzipped.txt \
    -output /user/hcoyote/gzipped.log \
    -verbose \
    -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
    -file gzipit.sh
Important to note on this command line: org.apache.hadoop.mapred.lib.NLineInputFormat is the magic here. It tells the job to feed one file per map task, so gzipit.sh runs once per file and gains the parallelism of all available map slots on the cluster. One thing I should have done was turn off speculative execution: since each task creates a specific output file, I saw some speculative task attempts fail because the output had already been produced.
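The gzipit.sh mapper itself is shown in the linked article, not here. With NLineInputFormat, Hadoop Streaming hands the mapper lines of the form "byte-offset&lt;TAB&gt;filename" on stdin, one line per map task. A hedged local sketch of that per-file loop, using plain gzip on local paths so it can run without a cluster (the real script would use hadoop fs commands to read and write HDFS):

```shell
# Local analogue of a gzipit.sh-style mapper: read "offset<TAB>filename"
# lines from stdin and gzip each named file. With NLineInputFormat each
# map task receives a single such line, so each file is compressed by
# its own task.
gzip_listed_files() {
    while IFS=$'\t' read -r offset file; do
        # Skip blank or malformed input lines defensively.
        [ -n "$file" ] || continue
        # gzip -f writes "$file.gz" and removes the original.
        gzip -f "$file" || echo "failed to gzip $file" >&2
    done
}
```

On a real cluster, the body of the loop would instead stream the HDFS file through gzip and write the `.gz` copy back to HDFS.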