GroupBy in Hive
hive> select * from raw_bucketed_users;
OK
raw_bucketed_users.id    raw_bucketed_users.name    raw_bucketed_users.dt
1     sriram    12-10-2015
2     seeta     12-11-2015
3     sriram    12-10-2015
4     seeta     12-11-2015
5     sriram    12-10-2015
6     seeta     12-11-2015
7     sriram    12-10-2015
8     seeta     12-11-2015
9     seeta     12-11-2015
10    seeta     12-11-2015
Time taken: 0.096 seconds, Fetched: 10 row(s)
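The transcript does not show the DDL for raw_bucketed_users, but the name suggests a bucketed table. As a purely hypothetical sketch (the column types, the bucketing column, the bucket count, and the field delimiter are all assumptions made for illustration):

-- Hypothetical DDL; not shown in the original transcript.
CREATE TABLE raw_bucketed_users (
  id INT,
  name STRING,
  dt STRING
)
CLUSTERED BY (id) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';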
hive> select * from raw_bucketed_users group by id,name,dt;
Query ID = ubuntu_20160128125208_fe83cd3f-795f-4ff0-9c8c-b44f29adb05c
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453963805097_0002, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0002/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0002
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-28 12:52:15,864 Stage-1 map = 0%, reduce = 0%
2016-01-28 12:52:22,522 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 0.89 sec
2016-01-28 12:52:32,182 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.79 sec
MapReduce Total cumulative CPU time: 1 seconds 790 msec
Ended Job = job_1453963805097_0002
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 1.79 sec  HDFS Read: 7370  HDFS Write: 195  SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 790 msec
OK
1     sriram    12-10-2015
2     seeta     12-11-2015
3     sriram    12-10-2015
4     seeta     12-11-2015
5     sriram    12-10-2015
6     seeta     12-11-2015
7     sriram    12-10-2015
8     seeta     12-11-2015
9     seeta     12-11-2015
10    seeta     12-11-2015
Time taken: 25.29 seconds, Fetched: 10 row(s)
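Grouping by all three columns returns one row per distinct (id, name, dt) combination; since the ten rows are already unique, the result matches the plain SELECT. Hive implements SELECT DISTINCT as a group-by under the hood, so the same deduplication can be written more directly:

-- Equivalent deduplication; compiles to essentially the same plan
-- as GROUP BY id, name, dt above.
SELECT DISTINCT id, name, dt FROM raw_bucketed_users;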
hive> select rbu.id, rbu.name, rbu.dt as doj from (select dt from raw_bucketed_users group by dt) as rbu_s join raw_bucketed_users as rbu on rbu.dt = rbu_s.dt;
Query ID = ubuntu_20160128150618_545a4eca-e7b9-4309-906b-9a598019812f
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1453963805097_0018, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0018/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0018
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 1
2016-01-28 15:06:24,473 Stage-1 map = 0%, reduce = 0%
2016-01-28 15:06:31,107 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 1.16 sec
2016-01-28 15:06:39,581 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 1.97 sec
MapReduce Total cumulative CPU time: 1 seconds 970 msec
Ended Job = job_1453963805097_0018
2016-01-28 15:06:44,509 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/ubuntu/ubuntu_20160128150618_545a4eca-e7b9-4309-906b-9a598019812f.log
2016-01-28 15:06:45 Starting to launch local task to process map join; maximum memory = 518979584
2016-01-28 15:06:46 Dump the side-table for tag: 1 with group count: 2 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-06-18_745_3596905921358697981-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile61--.hashtable
2016-01-28 15:06:46 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-06-18_745_3596905921358697981-1/-local-10004/HashTable-Stage-4/MapJoin-mapfile61--.hashtable (432 bytes)
2016-01-28 15:06:46 End of local task; Time Taken: 1.533 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453963805097_0019, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0019/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0019
Hadoop job information for Stage-4: number of mappers: 1; number of reducers: 0
2016-01-28 15:06:53,240 Stage-4 map = 0%, reduce = 0%
2016-01-28 15:06:57,555 Stage-4 map = 100%, reduce = 0%, Cumulative CPU 0.79 sec
MapReduce Total cumulative CPU time: 790 msec
Ended Job = job_1453963805097_0019
MapReduce Jobs Launched:
Stage-Stage-1: Map: 1  Reduce: 1  Cumulative CPU: 1.97 sec  HDFS Read: 6332  HDFS Write: 152  SUCCESS
Stage-Stage-4: Map: 1  Cumulative CPU: 0.79 sec  HDFS Read: 5791  HDFS Write: 195  SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 760 msec
OK
rbu.id    rbu.name    doj
1     sriram    12-10-2015
3     sriram    12-10-2015
5     sriram    12-10-2015
7     sriram    12-10-2015
2     seeta     12-11-2015
4     seeta     12-11-2015
6     seeta     12-11-2015
8     seeta     12-11-2015
9     seeta     12-11-2015
10    seeta     12-11-2015
Time taken: 39.915 seconds, Fetched: 10 row(s)
hive>
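Note how Hive planned this query: Stage-1 computes the distinct dt values, and the join is then auto-converted to a map join. A local task builds a hash table of the two dt groups (the "side-table for tag: 1" in the log) and the map-only Stage-4 streams raw_bucketed_users against it, so no shuffle is needed for the join itself. To compare against an ordinary reduce-side join, the auto-conversion can be disabled:

-- Disable automatic map-join conversion to force a shuffle (reduce-side) join.
set hive.auto.convert.join=false;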
Map-side Aggregation for Group By
The configuration property hive.map.aggr controls how aggregations are performed. The default is false. If it is set to true, Hive does the first-level aggregation directly in the map task, which usually improves efficiency but may require more memory to run successfully.
set hive.map.aggr=true;
SELECT COUNT(*) FROM table2;
Note that for versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
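Applied to the table above, map-side aggregation lets each mapper emit partial counts before the shuffle. A minimal sketch (the hash-memory setting is optional tuning; 0.5 is its usual default):

-- Do first-level aggregation in the map task.
set hive.map.aggr=true;
-- Optional: fraction of mapper memory the aggregation hash table may use.
set hive.map.aggr.hash.percentmemory=0.5;
-- Mappers emit partial (dt, count) pairs; the reducer merges them.
SELECT dt, COUNT(*) FROM raw_bucketed_users GROUP BY dt;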