Last active
May 29, 2018 08:02
-
-
Save thanoojgithub/0d9f6abde5db1775ca72 to your computer and use it in GitHub Desktop.
Joins in Hive
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| hive> select * from raw1 r1 join raw2 r2 on r1.id=r2.id where r1.name like "s%" and r2.dt like "%2015"; | |
| Query ID = ubuntu_20160128153623_8773b4d1-5b8a-4f41-b9bf-8dc7ed17fc77 | |
| Total jobs = 1 | |
| 2016-01-28 15:36:25,997 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | |
| Execution log at: /tmp/ubuntu/ubuntu_20160128153623_8773b4d1-5b8a-4f41-b9bf-8dc7ed17fc77.log | |
| 2016-01-28 15:36:26 Starting to launch local task to process map join; maximum memory = 518979584 | |
| 2016-01-28 15:36:27 Dump the side-table for tag: 0 with group count: 2 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-36-23_009_1042588085665505632-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile150--.hashtable | |
| 2016-01-28 15:36:27 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-36-23_009_1042588085665505632-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile150--.hashtable (336 bytes) | |
| 2016-01-28 15:36:27 End of local task; Time Taken: 1.23 sec. | |
| Execution completed successfully | |
| MapredLocal task succeeded | |
| Launching Job 1 out of 1 | |
| Number of reduce tasks is set to 0 since there's no reduce operator | |
| Starting Job = job_1453963805097_0028, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0028/ | |
| Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0028 | |
| Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 | |
| 2016-01-28 15:36:33,080 Stage-3 map = 0%, reduce = 0% | |
| 2016-01-28 15:36:38,417 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec | |
| MapReduce Total cumulative CPU time: 1 seconds 120 msec | |
| Ended Job = job_1453963805097_0028 | |
| MapReduce Jobs Launched: | |
| Stage-Stage-3: Map: 1 Cumulative CPU: 1.12 sec HDFS Read: 7255 HDFS Write: 84 SUCCESS | |
| Total MapReduce CPU Time Spent: 1 seconds 120 msec | |
| OK | |
| r1.id r1.name r1.dt r2.id r2.name r2.dt | |
| 1 sriram 12-10-2015 1 sriram 12-10-2015 | |
| 2 seetamma 12-09-2015 2 seetamma 12-09-2015 | |
| Time taken: 17.597 seconds, Fetched: 2 row(s) | |
| hive> select * from raw1 r1 join raw2 r2 on r1.id=r2.id and r2.id=r1.id where r1.name like "s%" OR r2.dt like "%2015"; | |
| Query ID = ubuntu_20160128155404_62ef2450-43b0-4c88-b864-adddca14f1e0 | |
| Total jobs = 1 | |
| 2016-01-28 15:54:07,366 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | |
| Execution log at: /tmp/ubuntu/ubuntu_20160128155404_62ef2450-43b0-4c88-b864-adddca14f1e0.log | |
| 2016-01-28 15:54:08 Starting to launch local task to process map join; maximum memory = 518979584 | |
| 2016-01-28 15:54:09 Dump the side-table for tag: 0 with group count: 6 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-54-04_825_2683943815301159820-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile190--.hashtable | |
| 2016-01-28 15:54:09 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-54-04_825_2683943815301159820-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile190--.hashtable (497 bytes) | |
| 2016-01-28 15:54:09 End of local task; Time Taken: 1.202 sec. | |
| Execution completed successfully | |
| MapredLocal task succeeded | |
| Launching Job 1 out of 1 | |
| Number of reduce tasks is set to 0 since there's no reduce operator | |
| Starting Job = job_1453963805097_0031, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0031/ | |
| Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0031 | |
| Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 | |
| 2016-01-28 15:54:16,035 Stage-3 map = 0%, reduce = 0% | |
| 2016-01-28 15:54:21,349 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec | |
| MapReduce Total cumulative CPU time: 1 seconds 130 msec | |
| Ended Job = job_1453963805097_0031 | |
| MapReduce Jobs Launched: | |
| Stage-Stage-3: Map: 1 Cumulative CPU: 1.13 sec HDFS Read: 7394 HDFS Write: 258 SUCCESS | |
| Total MapReduce CPU Time Spent: 1 seconds 130 msec | |
| OK | |
| r1.id r1.name r1.dt r2.id r2.name r2.dt | |
| 1 sriram 12-10-2015 1 sriram 12-10-2015 | |
| 2 seetamma 12-09-2015 2 seetamma 12-09-2015 | |
| 3 lakshmana 12-11-2015 3 lakshmana 12-11-2015 | |
| 4 bharatha 12-12-2015 4 bharatha 12-12-2015 | |
| 5 Shatrughna 12--2015 5 Shatrughna 12--2015 | |
| 6 hanuma 12-11-2015 6 hanuma 12-11-2015 | |
| Time taken: 17.619 seconds, Fetched: 6 row(s) | |
| hive> |
Author
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Map-side Joins
If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating
the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.
The joins between stocks and dividends can exploit this optimization, as the dividends data set is small enough to be cached.
hive> set hive.auto.convert.join=true;
It’s false by default
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';
Note that you can also configure the threshold size for table files considered small enough to use this optimization. Here is the default definition of the property (in bytes):
hive> set hive.mapjoin.smalltable.filesize=25000000