collect_set and collect_list in Hive
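For context, the queries below run against a small table, thanooj.cust_credit. The gist does not include the DDL; a minimal sketch that would produce a table of this shape (column types assumed, not taken from the gist) is:

CREATE TABLE thanooj.cust_credit (
  id INT,
  name STRING,
  doa STRING,      -- date of arrival, stored as yyyy-MM-dd text (assumed)
  location STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';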
hive> select * from thanooj.cust_credit;
OK
cust_credit.id cust_credit.name cust_credit.doa cust_credit.location
1 sriram 2015-10-12 ayodhya
2 seeta 2015-09-12 midhila
3 lakshman 2015-11-12 ayodhya
4 bharata 2015-12-12 ayodhya
5 sathrugna 2015-12-12 ayodhya
6 hanuma 2015-11-12 ayodhya
7 sriram 2015-10-12 ayodhya
8 seeta 2015-09-12 midhila
9 lakshman 2015-11-12 ayodhya
10 bharata 2015-12-12 ayodhya
11 sathrugna 2015-12-12 ayodhya
12 hanuma 2015-11-12 ayodhya
Time taken: 0.085 seconds, Fetched: 12 row(s)
hive> select doa, collect_set(location) as c_location from thanooj.cust_credit group by doa;
Query ID = ubuntu_20160304013923_5b743ed0-5ce2-4663-ad7a-9127fc1b8f65
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1457070465534_0011, Tracking URL = http://ubuntu:8088/proxy/application_1457070465534_0011/
Kill Command = /home/ubuntu/softwares/hadoop-2.7.2/bin/hadoop job -kill job_1457070465534_0011
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-04 01:39:30,505 Stage-1 map = 0%, reduce = 0%
2016-03-04 01:39:49,275 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.28 sec
2016-03-04 01:39:55,563 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.82 sec
MapReduce Total cumulative CPU time: 2 seconds 820 msec
Ended Job = job_1457070465534_0011
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.82 sec HDFS Read: 8357 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 820 msec
OK
doa c_location
2015-09-12 ["midhila"]
2015-10-12 ["ayodhya"]
2015-11-12 ["ayodhya"]
2015-12-12 ["ayodhya"]
Time taken: 33.306 seconds, Fetched: 4 row(s)
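Because collect_set returns an array<string>, it composes with other built-ins. For example, concat_ws can flatten the de-duplicated set into a single delimited string; this variant is a sketch and was not run in the session above:

select doa, concat_ws(',', collect_set(location)) as locations
from thanooj.cust_credit
group by doa;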
hive> select doa, collect_list(location) as c_location from thanooj.cust_credit group by doa;
Query ID = ubuntu_20160304014426_86800032-1c12-4d9d-8b42-b3f14984ef8d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1457070465534_0012, Tracking URL = http://ubuntu:8088/proxy/application_1457070465534_0012/
Kill Command = /home/ubuntu/softwares/hadoop-2.7.2/bin/hadoop job -kill job_1457070465534_0012
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-04 01:44:34,866 Stage-1 map = 0%, reduce = 0%
2016-03-04 01:44:53,565 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.35 sec
2016-03-04 01:45:00,850 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 2.87 sec
MapReduce Total cumulative CPU time: 2 seconds 870 msec
Ended Job = job_1457070465534_0012
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 2.87 sec HDFS Read: 8365 HDFS Write: 140 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 870 msec
OK
doa c_location
2015-09-12 ["midhila","midhila"]
2015-10-12 ["ayodhya","ayodhya"]
2015-11-12 ["ayodhya","ayodhya","ayodhya","ayodhya"]
2015-12-12 ["ayodhya","ayodhya","ayodhya","ayodhya"]
Time taken: 35.462 seconds, Fetched: 4 row(s)
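Note that collect_list keeps duplicates, so wrapping it in size() yields a per-group row count, while size() over collect_set yields the distinct count. A sketch combining the two (not part of the session above):

select doa,
       size(collect_list(location)) as rows_per_doa,
       size(collect_set(location))  as distinct_locations
from thanooj.cust_credit
group by doa;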
hive> select doa, collect_set(location)[0] as c_location from thanooj.cust_credit group by doa;
Query ID = ubuntu_20160304013713_6b5f6836-aea4-4f3d-9db7-8bc48afdc25a
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1457070465534_0009, Tracking URL = http://ubuntu:8088/proxy/application_1457070465534_0009/
Kill Command = /home/ubuntu/softwares/hadoop-2.7.2/bin/hadoop job -kill job_1457070465534_0009
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-04 01:37:21,641 Stage-1 map = 0%, reduce = 0%
2016-03-04 01:37:51,873 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.1 sec
2016-03-04 01:38:07,512 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.0 sec
MapReduce Total cumulative CPU time: 3 seconds 0 msec
Ended Job = job_1457070465534_0009
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.0 sec HDFS Read: 8662 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 0 msec
OK
doa c_location
2015-09-12 midhila
2015-10-12 ayodhya
2015-11-12 ayodhya
2015-12-12 ayodhya
Time taken: 55.03 seconds, Fetched: 4 row(s)
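A caveat: Hive does not guarantee element order in the array returned by collect_set, so indexing with [0] picks an arbitrary element. If a deterministic pick is needed, sort_array (available since Hive 0.9) can fix the order first; a sketch:

select doa, sort_array(collect_set(location))[0] as first_location
from thanooj.cust_credit
group by doa;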
hive> select doa, collect_list(location)[0] as c_location from thanooj.cust_credit group by doa;
Query ID = ubuntu_20160304013821_aadf5d59-2744-4399-a112-6f0b168dc49c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1457070465534_0010, Tracking URL = http://ubuntu:8088/proxy/application_1457070465534_0010/
Kill Command = /home/ubuntu/softwares/hadoop-2.7.2/bin/hadoop job -kill job_1457070465534_0010
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-03-04 01:38:28,539 Stage-1 map = 0%, reduce = 0%
2016-03-04 01:38:42,097 Stage-1 map = 100%, reduce = 0%
2016-03-04 01:38:43,121 Stage-1 map = 0%, reduce = 0%
2016-03-04 01:38:48,278 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.51 sec
2016-03-04 01:38:55,556 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.51 sec
MapReduce Total cumulative CPU time: 1 seconds 510 msec
Ended Job = job_1457070465534_0010
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 3.35 sec HDFS Read: 8670 HDFS Write: 76 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 350 msec
OK
doa c_location
2015-09-12 midhila
2015-10-12 ayodhya
2015-11-12 ayodhya
2015-12-12 ayodhya
Time taken: 36.058 seconds, Fetched: 4 row(s)
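The same ordering caveat applies to collect_list: element order depends on the order in which rows reach the reducer. When the goal is just one representative value per group, a plain aggregate such as min() is simpler and deterministic; a sketch:

select doa, min(location) as c_location
from thanooj.cust_credit
group by doa;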
collect_set(col) returns an array of objects with duplicate elements eliminated.
collect_list(col) returns an array of objects with duplicates retained (as of Hive 0.13.0).
