@thanoojgithub
Last active March 20, 2017 17:03
Word Count in Hive
$ ls -ltr
total 4
-rw-r--r-- 1 thanooj users 3221 Mar 20 05:36 words_count.txt
$ vi words_count.txt
$ cat words_count.txt
Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured and unstructured data sources.
Relational databases are examples of structured data sources with well defined schema for the data they store.
Cassandra, Hbase are examples of semi-structured data sources.
HDFS is an example of unstructured data source that Sqoop can support.
With Sqoop, you can import data from a relational database system or a mainframe into HDFS
For databases, Sqoop will read the table row-by-row into HDFS.
For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS.
The output of this import process is a set of files
The import process is performed in parallel.
These files may be delimited text files (for example, with commas or tabs separating each field)
binary Avro or SequenceFiles containing serialized record data.
$
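The transcript does not show how the words_count table was created. A minimal sketch of one possible setup, assuming a managed table with a single STRING column named word (the table and column names are taken from the query below; the local file path is an assumption, not shown above):

-- Assumed setup, not part of the original session: one line of words_count.txt per row.
USE thanooj;
CREATE TABLE IF NOT EXISTS words_count (word STRING);
-- With the default delimiters, each whole line (spaces included) lands in the single column.
LOAD DATA LOCAL INPATH '/home/thanooj/words_count.txt' OVERWRITE INTO TABLE words_count;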
hive> use thanooj;
OK
Time taken: 0.807 seconds
hive> select words,count(words) words_count from (select explode(split(word, ' ')) words from words_count) i group by words order by words_count desc;
Query ID = thanooj_20170320094519_f766fa57-2419-485b-b2cc-9217bf67603c
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489591891391_121653, Tracking URL = http://localhost:8088/proxy/application_1489591891391_121653/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489591891391_121653
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-03-20 09:45:26,267 Stage-1 map = 0%, reduce = 0%
2017-03-20 09:45:32,454 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.63 sec
2017-03-20 09:45:39,634 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.06 sec
MapReduce Total cumulative CPU time: 11 seconds 60 msec
Ended Job = job_1489591891391_121653
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489591891391_121665, Tracking URL = http://localhost:8088/proxy/application_1489591891391_121665/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489591891391_121665
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-20 09:45:45,211 Stage-2 map = 0%, reduce = 0%
2017-03-20 09:45:51,388 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.34 sec
2017-03-20 09:45:57,529 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1489591891391_121665
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.06 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 840 msec
OK
data 7
of 5
Sqoop 4
is 4
a 4
files 3
or 3
mainframe 3
into 3
import 3
3
from 2
For 2
HDFS 2
HDFS. 2
The 2
are 2
can 2
each 2
examples 2
for 2
process 2
read 2
semi-structured 2
sources. 2
the 2
unstructured 2
will 2
with 2
databases 1
databases, 1
dataset 1
datasets, 1
defined 1
delimited 1
designed 1
that 1
efficiently 1
example 1
example, 1
well 1
field) 1
Sqoop, 1
they 1
SequenceFiles 1
Relational 1
in 1
Hbase 1
Cassandra, 1
Avro 1
may 1
Apache 1
(for 1
output 1
parallel. 1
performed 1
this 1
tool 1
record 1
records 1
relational 1
row-by-row 1
schema 1
transferring 1
separating 1
serialized 1
set 1
source 1
sources 1
you 1
store. 1
structured 1
structured, 1
support. 1
system 1
table 1
With 1
an 1
and 1
tabs 1
be 1
between 1
binary 1
text 1
commas 1
containing 1
These 1
data. 1
database 1
Time taken: 38.662 seconds, Fetched: 89 row(s)
hive>
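The query compiles to two MapReduce jobs: Stage-1 performs the GROUP BY aggregation and Stage-2 performs the global ORDER BY with a single reducer, which is why two job IDs appear in the log above. Note also that the counts treat case variants and trailing punctuation as distinct words (for example, HDFS vs. HDFS., For vs. for), and an empty token appears 3 times, presumably from repeated spaces. Below is a sketch of a variant that normalizes the text before splitting; it is a suggested alternative, not part of the original session, and it assumes punctuation other than word-internal hyphens should be dropped:

-- Hedged sketch: lower-case each line, strip everything except letters, digits,
-- spaces and hyphens, then split, explode, and count the non-empty tokens.
SELECT words, count(*) AS words_count
FROM (
  SELECT explode(split(regexp_replace(lower(word), '[^a-z0-9 -]', ''), ' ')) AS words
  FROM words_count
) i
WHERE words != ''
GROUP BY words
ORDER BY words_count DESC;

With this normalization, HDFS and HDFS. (2 each in the output above) would merge into a single hdfs bucket with a count of 4, and the empty token would be filtered out by the WHERE clause.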