Skip to content

Instantly share code, notes, and snippets.

@thanoojgithub
Last active May 29, 2018 08:02
Show Gist options
  • Select an option

  • Save thanoojgithub/b899c05451d14a3d79da to your computer and use it in GitHub Desktop.

Select an option

Save thanoojgithub/b899c05451d14a3d79da to your computer and use it in GitHub Desktop.
GroupBy in Hive
hive> select * from raw_bucketed_users;
OK
raw_bucketed_users.id raw_bucketed_users.name raw_bucketed_users.dt
1 sriram 12-10-2015
2 seeta 12-11-2015
3 sriram 12-10-2015
4 seeta 12-11-2015
5 sriram 12-10-2015
6 seeta 12-11-2015
7 sriram 12-10-2015
8 seeta 12-11-2015
9 seeta 12-11-2015
10 seeta 12-11-2015
Time taken: 0.096 seconds, Fetched: 10 row(s)
hive> select * from raw_bucketed_users group by id,name,dt;
Query ID = ubuntu_20160128125208_fe83cd3f-795f-4ff0-9c8c-b44f29adb05c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1453963805097_0002, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0002/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-28 12:52:15,864 Stage-1 map = 0%, reduce = 0%
2016-01-28 12:52:22,522 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2016-01-28 12:52:32,182 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.79 sec
MapReduce Total cumulative CPU time: 1 seconds 790 msec
Ended Job = job_1453963805097_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.79 sec HDFS Read: 7370 HDFS Write: 195 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
1 sriram 12-10-2015
2 seeta 12-11-2015
3 sriram 12-10-2015
4 seeta 12-11-2015
5 sriram 12-10-2015
6 seeta 12-11-2015
7 sriram 12-10-2015
8 seeta 12-11-2015
9 seeta 12-11-2015
10 seeta 12-11-2015
Time taken: 25.29 seconds, Fetched: 10 row(s)
hive> select rbu.id, rbu.name, rbu.dt as doj from (select dt from raw_bucketed_users group by dt) as rbu_s join raw_bucketed_users as rbu on rbu.dt = rbu_s.dt;
Query ID = ubuntu_20160128150618_545a4eca-e7b9-4309-906b-9a598019812f
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
set mapreduce.job.reduces=<number>
Starting Job = job_1453963805097_0018, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0018/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-28 15:06:24,473 Stage-1 map = 0%, reduce = 0%
2016-01-28 15:06:31,107 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2016-01-28 15:06:39,581 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1453963805097_0018
2016-01-28 15:06:44,509 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/ubuntu/ubuntu_20160128150618_545a4eca-e7b9-4309-906b-9a598019812f.log
2016-01-28 15:06:45 Starting to launch local task to process map join; maximum memory = 518979584
2016-01-28 15:06:46 Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-06-18_745_3596905921358697981-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile61--.hashtable
2016-01-28 15:06:46 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-06-18_745_3596905921358697981-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile61--.hashtable (432 bytes)
2016-01-28 15:06:46 End of local task; Time Taken: 1.533 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453963805097_0019, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0019/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0019
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2016-01-28 15:06:53,240 Stage-4 map = 0%, reduce = 0%
2016-01-28 15:06:57,555 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 0.79 sec
MapReduce Total cumulative CPU time: 790 msec
Ended Job = job_1453963805097_0019
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1 Reduce: 1 Cumulative CPU: 1.97 sec HDFS Read: 6332 HDFS Write: 152 SUCCESS
Stage-Stage-4: Map: 1 Cumulative CPU: 0.79 sec HDFS Read: 5791 HDFS Write: 195 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
rbu.id rbu.name doj
1 sriram 12-10-2015
3 sriram 12-10-2015
5 sriram 12-10-2015
7 sriram 12-10-2015
2 seeta 12-11-2015
4 seeta 12-11-2015
6 seeta 12-11-2015
8 seeta 12-11-2015
9 seeta 12-11-2015
10 seeta 12-11-2015
Time taken: 39.915 seconds, Fetched: 10 row(s)
hive>
@thanoojgithub
Copy link
Author

Map-side Aggregation for Group By
hive.map.aggr controls how we do aggregations. The default is false. If it is set to true, Hive will do the first-level aggregation directly in the map task.
This usually provides better efficiency, but may require more memory to run successfully.
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment