asw456’s gists

asw456 / 00-LogParser-JavaMapReduce-Regex

Created May 4, 2014 23:44 — forked from airawat/00-LogParser-JavaMapReduce-Regex

	This gist includes a mapper, reducer and driver in Java that parse log files using
	regex. The combiner code is the same as the reducer code.
	Use case: Count the number of occurrences of processes that got logged, inception to date.

	Includes:
	---------
	Sample data and scripts for download: 01-ScriptAndDataDownload
	Sample data and structure: 02-SampleDataAndStructure
	Mapper: 03-LogEventCountMapper.java
	Reducer: 04-LogEventCountReducer.java
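The regex-based counting described above can be sketched without Hadoop. This is a minimal, hedged illustration rather than the gist's actual mapper or reducer: the class name `LogEventCountSketch`, the pattern, and the sample lines are assumptions made for the example, and the map and reduce steps are collapsed into one in-memory loop.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-in for the mapper/reducer pair: extract the process name
// from each syslog-style line with a regex, then count occurrences per process.
final class LogEventCountSketch {

    // Assumed layout: "<Mon> <day> <hh:mm:ss> <host> <process>[pid]: <message>"
    private static final Pattern SYSLOG = Pattern.compile(
        "^(\\w{3})\\s+(\\d{1,2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\S+)\\s+"
        + "([^\\s:\\[]+)(?:\\[\\d+\\])?:\\s*(.*)$");

    // "Map" step: return the process name, or empty if the line does not match.
    static Optional<String> extractProcess(String line) {
        Matcher m = SYSLOG.matcher(line);
        return m.matches() ? Optional.of(m.group(5)) : Optional.empty();
    }

    // "Reduce" step: sum a count of 1 per extracted process name.
    static Map<String, Integer> countByProcess(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            extractProcess(line).ifPresent(p -> counts.merge(p, 1, Integer::sum));
        }
        return counts;
    }

    public static void main(String[] args) {
        // Made-up sample lines, not the gist's data.
        List<String> lines = Arrays.asList(
            "May  3 11:52:54 host1 init: tty main process ended",
            "May  3 11:53:31 host1 sshd[1234]: Accepted password for user",
            "May  4 09:00:00 host2 init: reloading configuration",
            "this line does not match and is skipped");
        System.out.println(countByProcess(lines)); // prints {init=2, sshd=1}
    }
}
```

In a real job the mapper would emit (process, 1) pairs and the reducer (also usable as the combiner) would sum them; lines that do not match are simply skipped here.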

asw456 / 00-LogParser-Hive-Regex

Created May 4, 2014 23:43 — forked from airawat/00-LogParser-Hive-Regex

	This gist includes HiveQL scripts to create an external partitioned table for Syslog
	generated log files using the regex SerDe.
	Use case: Count the number of occurrences of processes that got logged, by year, month,
	day and process.

	Includes:
	---------
	Sample data and structure: 01-SampleDataAndStructure
	Data download: 02-DataDownload
	Data load commands: 03-DataLoadCommands

asw456 / 00-CreatingSequenceFile

Created May 4, 2014 23:42 — forked from airawat/00-CreatingSequenceFile

	This gist demonstrates how to create a sequence file (compressed and uncompressed) from a text file.

	Includes:
	---------
	1. Input data and script download
	2. Input data-review
	3. Data load commands
	4. Mapper code
	5. Driver code to create the sequence file out of a text file in HDFS
	6. Command to run Java program

asw456 / 00-ReduceSideJoin

Created May 4, 2014 23:42 — forked from airawat/00-ReduceSideJoin

	My blog has an introduction to reduce-side joins in Java MapReduce:
	http://hadooped.blogspot.com/2013/09/reduce-side-join-options-in-java-map.html

asw456 / 00-CombineFileInputFornat

Created May 4, 2014 23:41 — forked from airawat/00-CombineFileInputFornat

	*************************
	Gist
	*************************

	One more gist related to controlling the number of mappers in a mapreduce task.

	Background on Inputsplits
	--------------------------
	An input split is a chunk of the input data allocated to a map task for processing. FileInputFormat
	generates input splits (and divides them into records) - one input split for each file, unless the

asw456 / 00-MapSideJoinLargeDatasets

Created May 4, 2014 23:41 — forked from airawat/00-MapSideJoinLargeDatasets

	**********************
	**Gist
	**********************

	This gist details how to inner join two large datasets on the map side, leveraging the join capability
	in MapReduce. Such a join makes sense if both input datasets are too large to qualify for distribution
	through DistributedCache. It can be implemented only if both input datasets can be joined by the join key
	and both are sorted in the same order, by the join key.
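Why the sort-order requirement matters can be seen in a plain merge join. The following is a hedged sketch of the general idea only, not Hadoop's join implementation: the class name `MergeJoinSketch` is hypothetical, records are modeled as `{key, value}` string pairs, and each key is assumed to appear at most once per input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: inner join of two inputs that are already sorted
// by the join key, done in a single forward pass over both.
final class MergeJoinSketch {

    // Each record is {key, value}. Both lists must be sorted ascending by key,
    // and (an assumption of this sketch) keys are unique within each list.
    static List<String> innerJoin(List<String[]> left, List<String[]> right) {
        List<String> joined = new ArrayList<>();
        int i = 0;
        int j = 0;
        while (i < left.size() && j < right.size()) {
            String[] l = left.get(i);
            String[] r = right.get(j);
            int cmp = l[0].compareTo(r[0]);
            if (cmp < 0) {
                i++;            // left key is smaller: no partner on the right
            } else if (cmp > 0) {
                j++;            // right key is smaller: no partner on the left
            } else {
                joined.add(l[0] + "," + l[1] + "," + r[1]);
                i++;
                j++;
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<String[]> left = Arrays.asList(
            new String[] {"a", "1"}, new String[] {"b", "2"}, new String[] {"d", "4"});
        List<String[]> right = Arrays.asList(
            new String[] {"b", "x"}, new String[] {"c", "y"}, new String[] {"d", "z"});
        System.out.println(innerJoin(left, right)); // prints [b,2,x, d,4,z]
    }
}
```

Because neither input is ever rewound or held in memory, the join works for datasets of any size, which is what makes it viable when neither side fits in DistributedCache.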

	There are two critical pieces to engaging the join behavior:

asw456 / 00-MultipleOutputs

Created May 4, 2014 23:40 — forked from airawat/00-MultipleOutputs

	********************************
	Gist
	********************************

	Motivation
	-----------
	The typical MapReduce job creates files with the prefix "part-", followed by "m" or "r" depending
	on whether it is a map or a reduce output, and then the part number. There are scenarios where we
	may want to create separate files based on criteria such as data keys and/or values. Enter the
	"MultipleOutputs" functionality.
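The naming pattern described above can be sketched as a small helper. This is an illustration only: `OutputNameSketch` and `fileName` are hypothetical names, not part of the Hadoop API, and the five-digit zero padding reflects the usual default output names such as part-r-00000.

```java
// Hypothetical helper that mimics the default output file naming:
// <base>-<m|r>-<zero-padded part number>.
final class OutputNameSketch {

    static String fileName(String base, boolean mapOutput, int partition) {
        return String.format("%s-%s-%05d", base, mapOutput ? "m" : "r", partition);
    }

    public static void main(String[] args) {
        System.out.println(fileName("part", false, 0)); // prints part-r-00000
        System.out.println(fileName("part", true, 12)); // prints part-m-00012
        // With a custom base name, e.g. one chosen per key (name is made up):
        System.out.println(fileName("errors", false, 3)); // prints errors-r-00003
    }
}
```

Writing separate files by criteria then amounts to choosing a different base name per key or value instead of the fixed "part".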

asw456 / gist:764d961de61a10305aaa

Created April 30, 2014 08:13 — forked from anonymous/gist:2912716

	public class ExampleRowKey
	{
	long userId;
	String applicationId;

	public byte[] getBytes() throws IOException
	{
	ByteArrayOutputStream byteOutput = new ByteArrayOutputStream();
	DataOutputStream data = new DataOutputStream(byteOutput);

asw456 / gist:6af8be185511a799621f

Created April 30, 2014 06:22 — forked from anonymous/gist:4227336

	Scan scan = new Scan();
	scan.setFilter(new ProxyFilter(new MyFilter(appId)));

asw456 / gist:2c748257897f595bb3a4

Created April 30, 2014 06:22 — forked from anonymous/gist:4227003

	Scan scan = new Scan();
	scan.setFilter(new MyFilter(appId)); // get only rows for the app with appId
	HTable table = new HTable(config, Bytes.toBytes(tableName)); // for this table
	ResultScanner results = table.getScanner(scan); // apply the scan
