@sheimi
Created November 9, 2014 05:28
code in blog.sheimi.me: 2012-05-17-source-code-02 (2) injector
public static JobConf createJob(Configuration config, Path crawlDb)
    throws IOException {
  // Output goes to a randomly named directory under crawlDb.
  Path newCrawlDb = new Path(crawlDb,
      Integer.toString(new Random().nextInt(Integer.MAX_VALUE)));

  JobConf job = new NutchJob(config);
  job.setJobName("crawldb " + crawlDb);

  // Read the existing crawl db only if it is already present.
  Path current = new Path(crawlDb, CURRENT_NAME);
  if (FileSystem.get(job).exists(current)) {
    FileInputFormat.addInputPath(job, current);
  }
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setMapperClass(CrawlDbFilter.class);
  job.setReducerClass(CrawlDbReducer.class);

  FileOutputFormat.setOutputPath(job, newCrawlDb);
  job.setOutputFormat(MapFileOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(CrawlDatum.class);

  // Do not write a _SUCCESS marker file into the output directory.
  // https://issues.apache.org/jira/browse/NUTCH-1110
  job.setBoolean("mapreduce.fileoutputcommitter.marksuccessfuljobs", false);

  return job;
}
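
The output-path naming in createJob can be tried in isolation. The following is a minimal sketch, not Nutch code: it uses java.nio.file.Path in place of Hadoop's Path, and the class name TempOutputPathSketch, the helper randomChild, and the "crawl/crawldb" location are hypothetical names chosen for illustration. It only shows how a randomly named child directory is derived from a parent directory.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;

public class TempOutputPathSketch {

    // Hypothetical helper: builds a child path whose name is a random
    // non-negative int, the same naming scheme createJob uses for newCrawlDb.
    static Path randomChild(Path parent, Random random) {
        // nextInt(bound) returns a value in [0, bound), so the name is never negative.
        String name = Integer.toString(random.nextInt(Integer.MAX_VALUE));
        return parent.resolve(name);
    }

    public static void main(String[] args) {
        // Assumed example location; any directory works.
        Path crawlDb = Paths.get("crawl", "crawldb");
        Path newCrawlDb = randomChild(crawlDb, new Random());
        System.out.println(newCrawlDb);
    }
}
```

Because the name is random, two jobs started against the same parent directory are unlikely to pick the same output directory, though a collision is not strictly impossible.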