Created
June 11, 2012 18:09
Cascading for the Impatient, Part 1
import java.util.Properties;

import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.pipe.Pipe;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;

public class Main
  {
  public static void main( String[] args )
    {
    // input and output paths are taken from the command line
    String inPath = args[ 0 ];
    String outPath = args[ 1 ];

    // register this class so the planner can locate the application jar
    Properties properties = new Properties();
    AppProps.setApplicationJarClass( properties, Main.class );
    HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

    // create the source tap: tab-delimited text with a header line
    Tap inTap = new Hfs( new TextDelimited( true, "\t" ), inPath );

    // create the sink tap: same format, header line written out
    Tap outTap = new Hfs( new TextDelimited( true, "\t" ), outPath );

    // specify a pipe to connect the taps
    Pipe copyPipe = new Pipe( "copy" );

    // connect the taps, pipes, etc., into a flow
    FlowDef flowDef = FlowDef.flowDef()
      .addSource( copyPipe, inTap )
      .addTailSink( copyPipe, outTap );

    // run the flow and block until it completes
    flowConnector.connect( flowDef ).complete();
    }
  }
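For readers without a Hadoop setup, the effect of the flow above (read tab-delimited lines with a header, write them back out unchanged) can be approximated in plain Java. This is only a rough sketch and not how Cascading implements taps or pipes: the class `TsvCopySketch` and its methods are hypothetical names, and rejecting rows whose field count differs from the header is a choice made for this sketch, not a statement about Cascading's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of this is part of the Cascading API.
class TsvCopySketch
  {
  // Split one tab-delimited line into fields, keeping empty trailing fields.
  static String[] parseLine( String line )
    {
    return line.split( "\t", -1 );
    }

  // Treat the first line as the header, then re-emit the header and every row.
  static List<String> copy( List<String> inputLines )
    {
    List<String> output = new ArrayList<String>();

    if( inputLines.isEmpty() )
      return output;

    String[] header = parseLine( inputLines.get( 0 ) );
    output.add( String.join( "\t", header ) );

    for( String line : inputLines.subList( 1, inputLines.size() ) )
      {
      String[] fields = parseLine( line );

      // sketch-only assumption: a row must have as many fields as the header
      if( fields.length != header.length )
        throw new IllegalArgumentException( "unexpected field count: " + line );

      output.add( String.join( "\t", fields ) );
      }

    return output;
    }

  public static void main( String[] args )
    {
    // small in-memory sample shaped like data/rain.txt
    List<String> input = Arrays.asList(
      "doc_id\ttext",
      "doc01\tA rain shadow is a dry area on the lee back side of a mountainous area." );

    for( String line : copy( input ) )
      System.out.println( line );
    }
  }
```

The Cascading version earns its extra ceremony once the pipe does more than copy: the same taps and flow definition stay in place while operations are added to the pipe.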
-- $inPath and $outPath are supplied on the command line with -p
copyPipe = LOAD '$inPath' USING PigStorage('\t', 'tagsource');
STORE copyPipe INTO '$outPath' USING PigStorage('\t', 'tagsource');
doc_id	text
doc01	A rain shadow is a dry area on the lee back side of a mountainous area.
doc02	This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03	A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04	This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05	Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$ ls -lth
total 32
-rw-r--r-- 1 paco staff 1.7K Jun 28 15:14 build.gradle
-rw-r--r-- 1 paco staff 819B Jun 28 15:14 LICENSE.txt
-rw-r--r-- 1 paco staff 5.2K Jun 27 15:54 README.md
drwxr-xr-x 3 paco staff 102B Jun 26 14:46 src
drwxr-xr-x 3 paco staff 102B Jun 11 10:18 data
bash-3.2$ gradle -version
------------------------------------------------------------
Gradle 1.0
------------------------------------------------------------
Gradle build time: Tuesday, June 12, 2012 12:56:21 AM UTC
Groovy: 1.8.6
Ant: Apache Ant(TM) version 1.8.2 compiled on December 20 2010
Ivy: 2.2.0
JVM: 1.6.0_33 (Apple Inc. 20.8-b03-424)
OS: Mac OS X 10.6.8 x86_64
bash-3.2$ hadoop version
Warning: $HADOOP_HOME is deprecated.
Hadoop 1.0.3
Subversion https://svn.apache.org/repos/asf/hadoop/common/branches/branch-1.0 -r 1335192
Compiled by hortonfo on Tue May 8 20:31:25 UTC 2012
From source with checksum e6b0c1e23dcf76907c5fecb4b832f3be
bash-3.2$ gradle clean jar
:clean UP-TO-DATE
:compileJava
:processResources UP-TO-DATE
:classes
:jar
BUILD SUCCESSFUL
Total time: 16.061 secs
bash-3.2$ hadoop jar ./build/libs/impatient.jar data/rain.txt output/rain
Warning: $HADOOP_HOME is deprecated.
12/06/29 09:01:55 INFO util.HadoopUtil: resolving application jar from found main method on: impatient.Main
12/06/29 09:01:55 INFO planner.HadoopPlanner: using application jar: /Users/paco/src/concur/impatient/part1/./build/libs/impatient.jar
12/06/29 09:01:55 INFO property.AppProps: using app.id: FEE428FA32D899D051AA404BA448DE3A
12/06/29 09:01:55 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/06/29 09:01:55 WARN snappy.LoadSnappy: Snappy native library not loaded
12/06/29 09:01:55 INFO mapred.FileInputFormat: Total input paths to process : 1
12/06/29 09:01:56 INFO util.Version: Concurrent, Inc - Cascading 2.0.1
12/06/29 09:01:56 INFO flow.Flow: [] starting
12/06/29 09:01:56 INFO flow.Flow: [] source: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/06/29 09:01:56 INFO flow.Flow: [] sink: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["output/rain"]"]
12/06/29 09:01:56 INFO flow.Flow: [] parallel execution is enabled: false
12/06/29 09:01:56 INFO flow.Flow: [] starting jobs: 1
12/06/29 09:01:56 INFO flow.Flow: [] allocating threads: 1
12/06/29 09:01:56 INFO flow.FlowStep: [] starting step: (1/1) output/rain
12/06/29 09:01:56 INFO mapred.FileInputFormat: Total input paths to process : 1
12/06/29 09:01:56 INFO flow.FlowStep: [] submitted hadoop job: job_local_0001
12/06/29 09:01:56 INFO mapred.Task: Using ResourceCalculatorPlugin : null
12/06/29 09:01:56 INFO io.MultiInputSplit: current split input path: file:/Users/paco/src/concur/impatient/part1/data/rain.txt
12/06/29 09:01:56 INFO mapred.MapTask: numReduceTasks: 0
12/06/29 09:01:56 INFO hadoop.FlowMapper: sourcing from: Hfs["TextDelimited[['doc_id', 'text']->[ALL]]"]["data/rain.txt"]"]
12/06/29 09:01:56 INFO hadoop.FlowMapper: sinking to: Hfs["TextDelimited[[UNKNOWN]->['doc_id', 'text']]"]["output/rain"]"]
12/06/29 09:01:56 INFO mapred.Task: Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
12/06/29 09:01:56 INFO mapred.LocalJobRunner:
12/06/29 09:01:56 INFO mapred.Task: Task attempt_local_0001_m_000000_0 is allowed to commit now
12/06/29 09:01:56 INFO mapred.FileOutputCommitter: Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/paco/src/concur/impatient/part1/output/rain
12/06/29 09:01:59 INFO mapred.LocalJobRunner: file:/Users/paco/src/concur/impatient/part1/data/rain.txt:0+510
12/06/29 09:01:59 INFO mapred.Task: Task 'attempt_local_0001_m_000000_0' done.
12/06/29 09:02:01 INFO util.Hadoop18TapUtil: deleting temp path output/rain/_temporary
bash-3.2$
bash-3.2$ head output/rain/part-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$
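The `head` listing above shows the same header and five rows as the input file. A minimal sketch of checking that programmatically, rather than by eye, might look like the following. The class `CopyCheckSketch` and its `firstDifference` helper are hypothetical names introduced here, and the two file paths are simply whatever is passed on the command line.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper for comparing a copied file with its source, line by line.
class CopyCheckSketch
  {
  // Returns the 1-based number of the first differing line, or 0 if both lists match.
  static int firstDifference( List<String> expected, List<String> actual )
    {
    int common = Math.min( expected.size(), actual.size() );

    for( int i = 0; i < common; i++ )
      {
      if( !expected.get( i ).equals( actual.get( i ) ) )
        return i + 1;
      }

    // one side has extra lines: report the first line beyond the shared prefix
    return expected.size() == actual.size() ? 0 : common + 1;
    }

  public static void main( String[] args ) throws IOException
    {
    if( args.length != 2 )
      {
      System.err.println( "usage: CopyCheckSketch <input file> <output file>" );
      return;
      }

    List<String> expected = Files.readAllLines( Paths.get( args[ 0 ] ), StandardCharsets.UTF_8 );
    List<String> actual = Files.readAllLines( Paths.get( args[ 1 ] ), StandardCharsets.UTF_8 );

    int diff = firstDifference( expected, actual );
    System.out.println( diff == 0 ? "identical" : "first difference at line " + diff );
    }
  }
```

With the paths from this run, the arguments would be `data/rain.txt` and `output/rain/part-00000`.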
bash-3.2$ pig -version
Warning: $HADOOP_HOME is deprecated.
Apache Pig version 0.10.0 (r1328203)
compiled Apr 19 2012, 22:54:12
bash-3.2$ pig -p inPath=./data/rain.txt -p outPath=./output/rain ./src/scripts/copy.pig
Warning: $HADOOP_HOME is deprecated.
2012-08-27 13:24:21,632 [main] INFO org.apache.pig.Main - Apache Pig version 0.10.0 (r1328203) compiled Apr 19 2012, 22:54:12
2012-08-27 13:24:21,633 [main] INFO org.apache.pig.Main - Logging error messages to: /Users/ceteri/src/concur/Impatient/part1/pig_1346099061629.log
2012-08-27 13:24:21.724 java[69946:1903] Unable to load realm info from SCDynamicStore
2012-08-27 13:24:21,931 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
2012-08-27 13:24:22,261 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2012-08-27 13:24:22,340 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2012-08-27 13:24:22,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2012-08-27 13:24:22,355 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2012-08-27 13:24:22,373 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2012-08-27 13:24:22,384 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2012-08-27 13:24:22,386 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job8693841438339640396.jar
2012-08-27 13:24:26,339 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job8693841438339640396.jar created
2012-08-27 13:24:26,350 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2012-08-27 13:24:26,369 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2012-08-27 13:24:26,377 [Thread-5] WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-08-27 13:24:26,481 [Thread-5] INFO org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2012-08-27 13:24:26,481 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
2012-08-27 13:24:26,489 [Thread-5] WARN org.apache.hadoop.io.compress.snappy.LoadSnappy - Snappy native library not loaded
2012-08-27 13:24:26,492 [Thread-5] INFO org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2012-08-27 13:24:26,674 [Thread-6] INFO org.apache.hadoop.mapred.Task - Using ResourceCalculatorPlugin : null
2012-08-27 13:24:26,686 [Thread-6] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigRecordReader - Current split being processed file:/Users/ceteri/src/concur/Impatient/part1/data/rain.txt:0+510
2012-08-27 13:24:26,717 [Thread-6] INFO org.apache.hadoop.mapred.Task - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
2012-08-27 13:24:26,720 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner -
2012-08-27 13:24:26,720 [Thread-6] INFO org.apache.hadoop.mapred.Task - Task attempt_local_0001_m_000000_0 is allowed to commit now
2012-08-27 13:24:26,722 [Thread-6] INFO org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter - Saved output of task 'attempt_local_0001_m_000000_0' to file:/Users/ceteri/src/concur/Impatient/part1/output/rain
2012-08-27 13:24:26,871 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_local_0001
2012-08-27 13:24:26,871 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2012-08-27 13:24:29,657 [Thread-6] INFO org.apache.hadoop.mapred.LocalJobRunner -
2012-08-27 13:24:29,657 [Thread-6] INFO org.apache.hadoop.mapred.Task - Task 'attempt_local_0001_m_000000_0' done.
2012-08-27 13:24:29,658 [Thread-6] WARN org.apache.hadoop.mapred.FileOutputCommitter - Output path is null in cleanup
2012-08-27 13:24:31,882 [main] WARN org.apache.pig.tools.pigstats.PigStatsUtil - Failed to get RunningJob for job job_local_0001
2012-08-27 13:24:31,885 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2012-08-27 13:24:31,886 [main] INFO org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics:
HadoopVersion PigVersion UserId StartedAt FinishedAt Features
1.0.3 0.10.0 ceteri 2012-08-27 13:24:22 2012-08-27 13:24:31 UNKNOWN
Success!
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTIme AvgMapTime MaxReduceTime MinReduceTime AvgReduceTime Alias Feature Outputs
job_local_0001 1 0 n/a n/a n/a 0 0 0 copyPipe MAP_ONLY file:///Users/ceteri/src/concur/Impatient/part1/output/rain,
Input(s):
Successfully read 0 records from: "file:///Users/ceteri/src/concur/Impatient/part1/data/rain.txt"
Output(s):
Successfully stored 0 records in: "file:///Users/ceteri/src/concur/Impatient/part1/output/rain"
Counters:
Total records written : 0
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0
Job DAG:
job_local_0001
2012-08-27 13:24:31,887 [main] INFO org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
bash-3.2$ cat output/rain/part-m-00000
doc_id text
doc01 A rain shadow is a dry area on the lee back side of a mountainous area.
doc02 This sinking, dry air produces a rain shadow, or area in the lee of a mountain with less rain and cloudcover.
doc03 A rain shadow is an area of dry land that lies on the leeward (or downwind) side of a mountain.
doc04 This is known as the rain shadow effect and is the primary cause of leeward deserts of mountain ranges, such as California's Death Valley.
doc05 Two Women. Secrets. A Broken Land. [DVD Australia]
bash-3.2$