Word Count in Hive
$ ls -ltr
total 4
-rw-r--r-- 1 thanooj users 3221 Mar 20 05:36 words_count.txt
$ vi words_count.txt
$ cat words_count.txt
Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured and unstructured data sources.
Relational databases are examples of structured data sources with well defined schema for the data they store.
Cassandra, Hbase are examples of semi-structured data sources.
HDFS is an example of unstructured data source that Sqoop can support.
With Sqoop, you can import data from a relational database system or a mainframe into HDFS
For databases, Sqoop will read the table row-by-row into HDFS.
For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS.
The output of this import process is a set of files
The import process is performed in parallel.
These files may be delimited text files (for example, with commas or tabs separating each field)
binary Avro or SequenceFiles containing serialized record data.
$
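The word-count query that follows assumes a table named words_count already exists in the thanooj database, holding one line of words_count.txt per row in a single STRING column called word. That setup step is not shown in this session; a minimal sketch of how such a table could be created and loaded (column name, storage format, and file path are assumptions here):

-- assumed setup, not part of the recorded session:
-- one string column holding a full line of the input file per row
create table if not exists thanooj.words_count (
  word string
)
stored as textfile;

-- load the local file listed above into the table
-- (relative path assumed; point it at wherever words_count.txt actually lives)
load data local inpath 'words_count.txt' into table thanooj.words_count;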
hive> use thanooj;
OK
Time taken: 0.807 seconds
hive> select words,count(words) words_count from (select explode(split(word, ' ')) words from words_count) i group by words order by words_count desc;
Query ID = thanooj_20170320094519_f766fa57-2419-485b-b2cc-9217bf67603c
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489591891391_121653, Tracking URL = http://localhost:8088/proxy/application_1489591891391_121653/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489591891391_121653
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-03-20 09:45:26,267 Stage-1 map = 0%, reduce = 0%
2017-03-20 09:45:32,454 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.63 sec
2017-03-20 09:45:39,634 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.06 sec
MapReduce Total cumulative CPU time: 11 seconds 60 msec
Ended Job = job_1489591891391_121653
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1489591891391_121665, Tracking URL = http://localhost:8088/proxy/application_1489591891391_121665/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489591891391_121665
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-20 09:45:45,211 Stage-2 map = 0%, reduce = 0%
2017-03-20 09:45:51,388 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.34 sec
2017-03-20 09:45:57,529 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1489591891391_121665
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.06 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 840 msec
OK
data 7
of 5
Sqoop 4
is 4
a 4
files 3
or 3
mainframe 3
into 3
import 3
 3
from 2
For 2
HDFS 2
HDFS. 2
The 2
are 2
can 2
each 2
examples 2
for 2
process 2
read 2
semi-structured 2
sources. 2
the 2
unstructured 2
will 2
with 2
databases 1
databases, 1
dataset 1
datasets, 1
defined 1
delimited 1
designed 1
that 1
efficiently 1
example 1
example, 1
well 1
field) 1
Sqoop, 1
they 1
SequenceFiles 1
Relational 1
in 1
Hbase 1
Cassandra, 1
Avro 1
may 1
Apache 1
(for 1
output 1
parallel. 1
performed 1
this 1
tool 1
record 1
records 1
relational 1
row-by-row 1
schema 1
transferring 1
separating 1
serialized 1
set 1
source 1
sources 1
you 1
store. 1
structured 1
structured, 1
support. 1
system 1
table 1
With 1
an 1
and 1
tabs 1
be 1
between 1
binary 1
text 1
commas 1
containing 1
These 1
data. 1
database 1
Time taken: 38.662 seconds, Fetched: 89 row(s)
hive>
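Note that split(word, ' ') only breaks on single spaces and keeps punctuation attached, and the count is case-sensitive, which is why the result above lists HDFS and HDFS. (and For/for, The/the) as separate words and includes an empty token counted 3 times, most likely from extra spaces in the file. A possible refinement, not part of the original session, that lower-cases the text, strips punctuation (keeping hyphenated terms such as semi-structured), splits on runs of whitespace, and discards empty tokens:

-- hypothetical variant, not run in the session above
select word, count(*) as words_count
from (
  select explode(split(lower(regexp_replace(word, '[^a-zA-Z0-9\\- ]', '')), '\\s+')) as word
  from thanooj.words_count
) t
where word != ''
group by word
order by words_count desc;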