@airawat
airawat / 00-OozieWorkflowJavaMapReduceAction
Last active February 23, 2023 20:19
Oozie workflow application with a Java MapReduce action that parses syslog generated log files and generates a report. Gist includes sample data, all workflow components, Java MapReduce program code, and commands (HDFS and Oozie).
This gist includes components of an Oozie workflow - scripts/code, sample data
and commands; Oozie actions covered: java mapreduce action; Oozie controls
covered: start, kill, end; The java program uses regex to parse the logs, and
also extracts the mapper's input directory path and includes it in the
emitted key.
Note: The reducer can be specified as a combiner as well.
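A reducer can usually double as a combiner when its operation is associative and commutative, as summing counts is. The sketch below is a plain-Python simulation, not the gist's Java code; it assumes the reducer sums per-key counts, and the names sum_reduce and run_job are hypothetical helpers introduced only for illustration.

```python
from collections import defaultdict


def sum_reduce(pairs):
    """Sum the values for each key; stands in for the (assumed) summing reducer."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def run_job(splits, use_combiner):
    """Simulate map output per split, an optional combine step, then the reduce."""
    shuffled = []
    for split in splits:
        # With a combiner, each split is pre-aggregated locally before the shuffle.
        local = sum_reduce(split) if use_combiner else split
        shuffled.extend(local)
    return sum_reduce(shuffled)


if __name__ == "__main__":
    splits = [[("init", 1), ("kernel", 1), ("init", 1)], [("init", 1)]]
    # Both runs produce the same totals; the combiner only shrinks the shuffle.
    print(run_job(splits, use_combiner=False))
    print(run_job(splits, use_combiner=True))
```

The final result is the same with or without the combine step; only the number of pairs crossing the shuffle changes. A reducer that computes something like an average would not have this property without restructuring.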
Usecase
-------
@airawat
airawat / 00-OozieWorkflowSqoopAction
Last active January 20, 2024 07:08
Oozie workflow application with Sqoop action. Pipes data from a Hive table to a MySQL database table. Oozie 3.3.0; Sqoop (1.4.2) with MySQL (5.1.69)
This gist includes components of a simple workflow application (oozie 3.3.0) that
pipes data in a Hive table to mysql;
The sample application includes:
--------------------------------
1. Oozie actions: sqoop action
2. Oozie workflow controls: start, end, and kill.
3. Workflow components: job.properties and workflow.xml
4. Sample data
5. Prep tasks in Hive
@airawat
airawat / 00-OozieWorkflowHdfsAndEmailActions
Last active November 21, 2018 14:33
Oozie workflow application with FS and email actions; Includes sample data, workflow components, commands.
This gist includes components of a simple workflow application that creates a directory and moves files within
HDFS to this directory;
Emails are sent out to notify designated users of success/failure of the workflow. There is a prepare section
to allow re-run of the action; the prepare essentially negates the move done by a potential prior run
of the action. Sample data is also included.
The sample application includes:
--------------------------------
1. Oozie actions: hdfs action and email action
2. Oozie workflow controls: start, end, and kill.
@airawat
airawat / 00-OozieCoordinatorJobWithDatasetCreationAsTrigger
Last active May 21, 2021 16:09
Sample Oozie coordinator job that executes upon availability of a specified dataset. Includes scripts/code, sample data, commands.
This gist includes components of an Oozie coordinator job initiated by dataset availability -
scripts/code, sample data and commands; Oozie actions covered: hdfs action, email action,
sqoop action (mysql database); Oozie controls covered: decision;
Usecase
-------
Pipe report data available in HDFS, to mysql database;
Pictorial overview of job:
--------------------------
@airawat
airawat / 00-OozieCoordinatorJobWithFileAsTrigger
Last active February 12, 2018 10:10
Oozie coordinator job example with a trigger file as the trigger
This gist includes components of an Oozie (trigger file initiated) coordinator job -
scripts/code, sample data and commands; Oozie actions covered: hdfs action, email action,
java main action, hive action; Oozie controls covered: decision, fork-join; The workflow
includes a sub-workflow that runs two hive actions concurrently. The hive table is
partitioned; Parsing uses hive-regex serde, and Java-regex. Also, the java mapper gets
the input directory path and includes part of it in the key.
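The idea of folding part of the input directory path into the key can be sketched as follows. This is an illustration in Python rather than the gist's Java mapper; the paths, the choice of the parent directory as the "part", and the names dir_component and make_key are all assumptions made for the example.

```python
import posixpath
from urllib.parse import urlparse


def dir_component(input_file, levels_up=1):
    """Return the name of the directory `levels_up` levels above the input file."""
    # Drop any scheme and authority (e.g. hdfs://host:port) and keep the path.
    path = urlparse(input_file).path
    for _ in range(levels_up):
        path = posixpath.dirname(path)
    return posixpath.basename(path)


def make_key(input_file, process):
    """Build a composite key from the parent directory name and the process name."""
    return f"{dir_component(input_file)}-{process}"


if __name__ == "__main__":
    # Hypothetical input location; the real layout is defined by the gist's data.
    print(make_key("hdfs://nn:8020/user/demo/logs/host-a/messages", "sshd"))
```

Keying on a directory component lets a single job keep records from different source directories (for example, one per host or per date) separate in the output.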
Usecase
-------
Parse Syslog generated log files to generate reports;
@airawat
airawat / 00-OozieCoordinatorJobWithTimeAsTrigger
Last active October 21, 2017 15:40
Oozie coordinator job example with time as trigger
This gist includes components of an Oozie (time initiated) coordinator application - scripts/code, sample data
and commands; Oozie actions covered: hdfs action, email action, java main action,
hive action; Oozie controls covered: decision, fork-join; The workflow includes a
sub-workflow that runs two hive actions concurrently. The hive table is partitioned;
Parsing uses hive-regex serde, and Java-regex. Also, the java mapper gets the input
directory path and includes part of it in the key.
Usecase: Parse Syslog generated log files to generate reports;
Pictorial overview of job:
@airawat
airawat / 00-OozieWorkflowStreamingMRAction-Python
Last active November 21, 2018 06:24
Sample of an Oozie workflow with streaming action - parses Syslog generated log files using Python regex
This gist includes oozie workflow components (streaming map reduce action) to execute
python mapper and reducer scripts to parse Syslog generated log files using regex;
Usecase: Count the number of occurrences of processes that got logged, by month and process.
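A minimal sketch of this kind of count, assuming the traditional syslog layout "Mon DD HH:MM:SS host process[pid]: message". The regex, the "month-process" key format, and the function names are illustrative assumptions, not the gist's actual mapper and reducer scripts.

```python
import re
from collections import Counter

# Assumed layout: "Mon DD HH:MM:SS host process[pid]: message" (pid is optional).
SYSLOG_RE = re.compile(
    r"^(?P<month>\w{3})\s+(?P<day>\d{1,2})\s+"
    r"(?P<time>\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"(?P<process>[^\s\[:]+)(?:\[\d+\])?:"
)


def map_line(line):
    """Return a ("month-process", 1) pair, or None if the line does not match."""
    m = SYSLOG_RE.match(line)
    if m is None:
        return None
    return (f"{m.group('month')}-{m.group('process')}", 1)


def count_events(lines):
    """Sum the emitted 1s per key; a local stand-in for the reduce step."""
    counts = Counter()
    for line in lines:
        pair = map_line(line)
        if pair is not None:
            counts[pair[0]] += pair[1]
    return dict(counts)


if __name__ == "__main__":
    sample = [
        "May  3 11:52:54 host-a init: tty (/dev/tty6) main process killed",
        "May  3 11:53:31 host-a kernel: registered taskstats version 1",
        "May  4 09:10:02 host-a sshd[1234]: Accepted publickey for demo",
        "not a syslog line",
    ]
    for key, total in sorted(count_events(sample).items()):
        print(f"{key}\t{total}")
```

In a streaming job the mapper would write the tab-separated pairs to standard output and a separate reducer script would do the summing; the two steps are combined here only so the sketch runs on its own.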
Pictorial overview of workflow:
--------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-5-oozie-workflow-with.html
Includes:
---------
@airawat
airawat / 00-OozieWorkflowWithPigAction
Last active January 25, 2022 21:41
Sample of an Oozie workflow with pig action - parses Syslog generated log files using regex.
This gist includes oozie workflow components to run a pig latin script to parse
(Syslog generated) log files using regex;
Usecase: Count the number of occurrences of processes that got logged, by month,
day and process.
Pictorial overview of workflow:
-------------------------------
http://hadooped.blogspot.com/2013/07/apache-oozie-part-7-oozie-workflow-with_3.html
Includes:
@airawat
airawat / 00-LogParserPigLatinNativeMapReduce
Last active December 19, 2015 07:49
There might be situations where you may have to reuse java map reduce programs within a pig program. This blog includes a sample pig script, with associated jars and sample data. The input is Syslog generated log files, and the output is a count of occurrences of processes logged inception to date.
This gist includes a pig latin script to parse Syslog generated log files through a
java mapreduce program that uses regex;
Usecase: Count the number of occurrences of processes that got logged, by month,
day and process.
Related gist that covers the java code - https://gist.github.com/airawat/5915374
Pig version: 0.10.0
This gist includes a pig latin script to parse Syslog generated log files using regex;
Usecase: Count the number of occurrences of processes that got logged, by month,
day and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data and script download: 02-DataAndScriptDownload
Data load commands: 03-HdfsLoadCommands
Pig script: 04-PigLatinScript