@thanoojgithub
Last active March 20, 2017 17:03
Word Count in Hive
$ ls -ltr
total 4
-rw-r--r-- 1 thanooj users 3221 Mar 20 05:36 words_count.txt
$ vi words_count.txt
$ cat words_count.txt
Apache Sqoop is a tool designed for efficiently transferring data between structured, semi-structured and unstructured data sources.
Relational databases are examples of structured data sources with well defined schema for the data they store.
Cassandra, Hbase are examples of semi-structured data sources.
HDFS is an example of unstructured data source that Sqoop can support.
With Sqoop, you can import data from a relational database system or a mainframe into HDFS
For databases, Sqoop will read the table row-by-row into HDFS.
For mainframe datasets, Sqoop will read records from each mainframe dataset into HDFS.
The output of this import process is a set of files
The import process is performed in parallel.
These files may be delimited text files (for example, with commas or tabs separating each field)
binary Avro or SequenceFiles containing serialized record data.
$
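The transcript does not show how the words_count table was created. A minimal sketch of one possible setup, assuming a managed table with a single STRING column named word (the table and column names are taken from the query below; the local file path is an assumption, not shown above):

-- Assumed setup, not part of the original session: one line of words_count.txt per row.
USE thanooj;
CREATE TABLE IF NOT EXISTS words_count (word STRING);
-- With the default delimiters, each whole line (spaces included) lands in the single column.
LOAD DATA LOCAL INPATH '/home/thanooj/words_count.txt' OVERWRITE INTO TABLE words_count;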
hive> use thanooj;
OK
Time taken: 0.807 seconds
hive> select words,count(words) words_count from (select explode(split(word, ' ')) words from words_count) i group by words order by words_count desc;
Query ID = thanooj_20170320094519_f766fa57-2419-485b-b2cc-9217bf67603c
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489591891391_121653, Tracking URL = http://localhost:8088/proxy/application_1489591891391_121653/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489591891391_121653
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2017-03-20 09:45:26,267 Stage-1 map = 0%, reduce = 0%
2017-03-20 09:45:32,454 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 9.63 sec
2017-03-20 09:45:39,634 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 11.06 sec
MapReduce Total cumulative CPU time: 11 seconds 60 msec
Ended Job = job_1489591891391_121653
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1489591891391_121665, Tracking URL = http://localhost:8088/proxy/application_1489591891391_121665/
Kill Command = /opt/mapr/hadoop/hadoop-2.7.0/bin/hadoop job -kill job_1489591891391_121665
Hadoop job information for Stage-2: number of mappers: 1; number of reducers: 1
2017-03-20 09:45:45,211 Stage-2 map = 0%, reduce = 0%
2017-03-20 09:45:51,388 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 1.34 sec
2017-03-20 09:45:57,529 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 2.78 sec
MapReduce Total cumulative CPU time: 2 seconds 780 msec
Ended Job = job_1489591891391_121665
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 11.06 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Stage-Stage-2: Map: 1 Reduce: 1 Cumulative CPU: 2.78 sec MAPRFS Read: 0 MAPRFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 13 seconds 840 msec
OK
data 7
of 5
Sqoop 4
is 4
a 4
files 3
or 3
mainframe 3
into 3
import 3
3
from 2
For 2
HDFS 2
HDFS. 2
The 2
are 2
can 2
each 2
examples 2
for 2
process 2
read 2
semi-structured 2
sources. 2
the 2
unstructured 2
will 2
with 2
databases 1
databases, 1
dataset 1
datasets, 1
defined 1
delimited 1
designed 1
that 1
efficiently 1
example 1
example, 1
well 1
field) 1
Sqoop, 1
they 1
SequenceFiles 1
Relational 1
in 1
Hbase 1
Cassandra, 1
Avro 1
may 1
Apache 1
(for 1
output 1
parallel. 1
performed 1
this 1
tool 1
record 1
records 1
relational 1
row-by-row 1
schema 1
transferring 1
separating 1
serialized 1
set 1
source 1
sources 1
you 1
store. 1
structured 1
structured, 1
support. 1
system 1
table 1
With 1
an 1
and 1
tabs 1
be 1
between 1
binary 1
text 1
commas 1
containing 1
These 1
data. 1
database 1
Time taken: 38.662 seconds, Fetched: 89 row(s)
hive>
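The query compiles to two MapReduce jobs: Stage-1 performs the GROUP BY aggregation and Stage-2 performs the global ORDER BY with a single reducer, which is why two job IDs appear in the log above. Note also that the counts treat case variants and trailing punctuation as distinct words (for example, HDFS vs. HDFS., For vs. for), and an empty token appears 3 times, presumably from repeated spaces. Below is a sketch of a variant that normalizes the text before splitting; it is a suggested alternative, not part of the original session, and it assumes punctuation other than word-internal hyphens should be dropped:

-- Hedged sketch: lower-case each line, strip everything except letters, digits,
-- spaces and hyphens, then split, explode, and count the non-empty tokens.
SELECT words, count(*) AS words_count
FROM (
  SELECT explode(split(regexp_replace(lower(word), '[^a-z0-9 -]', ''), ' ')) AS words
  FROM words_count
) i
WHERE words != ''
GROUP BY words
ORDER BY words_count DESC;

With this normalization, HDFS and HDFS. (2 each in the output above) would merge into a single hdfs bucket with a count of 4, and the empty token would be filtered out by the WHERE clause.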