@asw456
Forked from airawat/00-LogParser-Hive-Regex
Created May 4, 2014 23:43
This gist includes Hive QL scripts that create an external, partitioned table over
Syslog-generated log files using the regex SerDe.
Use case: count the occurrences of processes that got logged, by year, month,
day, and process.
Includes:
---------
Sample data and structure: 01-SampleDataAndStructure
Data download: 02-DataDownload
Data load commands: 03-DataLoadCommands
Hive commands: 04-HiveCommands
Sample output: 05-SampleOutput
Sample data
------------
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:31 cdh-dn03 kernel: piix4_smbus 0000:00:07.0: SMBus base address uninitialized - upgrade BIOS or use force_addr=0xaddr
May 3 11:53:31 cdh-dn03 kernel: nf_conntrack version 0.5.0 (7972 buckets, 31888 max)
May 3 11:53:57 cdh-dn03 kernel: hrtimer: interrupt took 11250457 ns
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
Structure
----------
Month = May
Day = 3
Time = 11:52:54
Node = cdh-dn03
Process = init:
Log msg = tty (/dev/tty6) main process (1208) killed by TERM signal
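The field breakdown above can be reproduced with the same regular expression the Hive table definition uses further down. A minimal Python sketch (Python's `re` syntax matches the Java flavor closely enough for this pattern; the HiveQL double backslashes collapse to single ones in a raw string):

```python
import re

# Same pattern as the table's input.regex, with HiveQL string
# escaping ("\\w") reduced to plain regex escapes (r"\w").
SYSLOG_RE = re.compile(
    r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?\:)\s+(.*$)"
)

line = ("May 3 11:52:54 cdh-dn03 init: "
        "tty (/dev/tty6) main process (1208) killed by TERM signal")

# The six capture groups map to the six table columns in order.
month, day, time, node, process, msg = SYSLOG_RE.match(line).groups()
print(month, day, time, node, process, sep=" | ")
```

Running this against the first sample line recovers exactly the Month/Day/Time/Node/Process/Log-msg split shown above.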
Data download
-------------
https://groups.google.com/forum/?hl=en#!topic/hadooped/_tj8w_E-MGY
Directory structure
-------------------
LogParserSampleHive
  logs
    airawat-syslog
      2013
        04
          messages
        05
          messages
Data load commands
------------------
a) Load the data
$ hadoop fs -mkdir LogParserSampleHive
$ hadoop fs -mkdir LogParserSampleHive/logs
$ hadoop fs -put LogParserSampleHive/logs/* LogParserSampleHive/logs/
$ hadoop fs -ls -R LogParserSampleHive/ | awk '{print $8}'
LogParserSampleHive/logs
LogParserSampleHive/logs/airawat-syslog
LogParserSampleHive/logs/airawat-syslog/2013
LogParserSampleHive/logs/airawat-syslog/2013/04
LogParserSampleHive/logs/airawat-syslog/2013/04/messages
LogParserSampleHive/logs/airawat-syslog/2013/05
LogParserSampleHive/logs/airawat-syslog/2013/05/messages
Hive commands
--------------
a) Create external table:
hive> CREATE EXTERNAL TABLE LogParserSample(
month_name STRING,
day STRING,
time STRING,
host STRING,
event STRING,
log STRING)
PARTITIONED BY(year int, month int)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
"input.regex" = "(\\w+)\\s+(\\d+)\\s+(\\d+:\\d+:\\d+)\\s+(\\w+\\W*\\w*)\\s+(.*?\\:)\\s+(.*$)"
)
STORED AS TEXTFILE;
b) Create partitions and load data:
Note: Replace '/user/airawat' with '/user/<your userID>'
hive> ALTER TABLE LogParserSample ADD IF NOT EXISTS PARTITION(year=2013, month=04)
LOCATION '/user/airawat/LogParserSampleHive/logs/airawat-syslog/2013/04/';
hive> ALTER TABLE LogParserSample ADD IF NOT EXISTS PARTITION(year=2013, month=05)
LOCATION '/user/airawat/LogParserSampleHive/logs/airawat-syslog/2013/05/';
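Note that `year` and `month` never appear inside the log lines themselves; Hive derives them entirely from the partition locations registered above. A hypothetical Python helper (illustration only, not part of the gist) showing how the directory path encodes the partition keys:

```python
# Illustration: the year/month partition values live in the HDFS
# directory path (.../<year>/<month>/), not in the file contents.
def partition_keys(path: str) -> dict:
    """Extract {'year', 'month'} from a .../<year>/<month>/ log path."""
    parts = path.rstrip("/").split("/")
    year, month = parts[-2], parts[-1]
    return {"year": int(year), "month": int(month)}

print(partition_keys("LogParserSampleHive/logs/airawat-syslog/2013/04"))
```

This is why a query that filters on `year` or `month` only scans the matching directories (partition pruning), even though those values are absent from the data files.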
Hive query
-----------
hive> set hive.cli.print.header=true;
hive> add jar hadoop-lib/hive-contrib-0.10.0-cdh4.2.0.jar; -- needed here because the contrib SerDe jar is not on the classpath in this environment
hive> SELECT year, month, day, event, COUNT(*) AS occurrence
    > FROM LogParserSample
    > GROUP BY year, month, day, event
    > ORDER BY event DESC, year, month, day;
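As a sanity check, the same group-by count can be emulated in Python over a few of the sample lines (a sketch; the counts here cover only these four lines, not the full dataset in the query output below):

```python
import re
from collections import Counter

# Same pattern as the table's input.regex.
SYSLOG_RE = re.compile(
    r"(\w+)\s+(\d+)\s+(\d+:\d+:\d+)\s+(\w+\W*\w*)\s+(.*?\:)\s+(.*$)"
)

sample = """\
May 3 11:52:54 cdh-dn03 init: tty (/dev/tty6) main process (1208) killed by TERM signal
May 3 11:53:31 cdh-dn03 kernel: registered taskstats version 1
May 3 11:53:31 cdh-dn03 kernel: sr0: scsi3-mmc drive: 32x/32x xa/form2 tray
May 3 11:53:59 cdh-dn03 ntpd_initres[1705]: host name not found: 0.rhel.pool.ntp.org
"""

# Count occurrences grouped by (month, day, event), mirroring the
# GROUP BY month, day, event portion of the Hive query.
counts = Counter()
for line in sample.splitlines():
    month, day, _time, _node, event, _msg = SYSLOG_RE.match(line).groups()
    counts[(month, day, event)] += 1

for key, n in counts.most_common():
    print(key, n)
```

The lazy `(.*?\:)` group is what keeps `kernel:` as the event even when the message itself contains further colons (as in the `sr0:` line).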
Query output
------------
year month day event occurrence
2013 05 7 udevd[361]: 1
2013 04 23 sudo: 1
2013 05 3 sudo: 1
2013 05 3 ntpd_initres[1705]: 144
2013 05 4 ntpd_initres[1705]: 261
2013 05 5 ntpd_initres[1705]: 264
2013 05 6 ntpd_initres[1705]: 123
2013 05 3 kernel: 5
2013 05 6 kernel: 1
2013 05 7 kernel: 52
2013 05 3 init: 5
2013 05 7 init: 18