Skip to content

Instantly share code, notes, and snippets.

@thanoojgithub
Last active May 29, 2018 08:02
Show Gist options
  • Select an option

  • Save thanoojgithub/0d9f6abde5db1775ca72 to your computer and use it in GitHub Desktop.

Select an option

Save thanoojgithub/0d9f6abde5db1775ca72 to your computer and use it in GitHub Desktop.
Joins in Hive
hive> select * from raw1 r1 join raw2 r2 on r1.id=r2.id where r1.name like "s%" and r2.dt like "%2015";
Query ID = ubuntu_20160128153623_8773b4d1-5b8a-4f41-b9bf-8dc7ed17fc77
Total jobs = 1
2016-01-28 15:36:25,997 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/ubuntu/ubuntu_20160128153623_8773b4d1-5b8a-4f41-b9bf-8dc7ed17fc77.log
2016-01-28 15:36:26 Starting to launch local task to process map join; maximum memory = 518979584
2016-01-28 15:36:27 Dump the side-table for tag: 0 with group count: 2 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-36-23_009_1042588085665505632-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile150--.hashtable
2016-01-28 15:36:27 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-36-23_009_1042588085665505632-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile150--.hashtable (336 bytes)
2016-01-28 15:36:27 End of local task; Time Taken: 1.23 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453963805097_0028, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0028/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0028
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2016-01-28 15:36:33,080 Stage-3 map = 0%, reduce = 0%
2016-01-28 15:36:38,417 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec
MapReduce Total cumulative CPU time: 1 seconds 120 msec
Ended Job = job_1453963805097_0028
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.12 sec HDFS Read: 7255 HDFS Write: 84 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 120 msec
OK
r1.id r1.name r1.dt r2.id r2.name r2.dt
1 sriram 12-10-2015 1 sriram 12-10-2015
2 seetamma 12-09-2015 2 seetamma 12-09-2015
Time taken: 17.597 seconds, Fetched: 2 row(s)
hive> select * from raw1 r1 join raw2 r2 on r1.id=r2.id and r2.id=r1.id where r1.name like "s%" OR r2.dt like "%2015";
Query ID = ubuntu_20160128155404_62ef2450-43b0-4c88-b864-adddca14f1e0
Total jobs = 1
2016-01-28 15:54:07,366 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/ubuntu/ubuntu_20160128155404_62ef2450-43b0-4c88-b864-adddca14f1e0.log
2016-01-28 15:54:08 Starting to launch local task to process map join; maximum memory = 518979584
2016-01-28 15:54:09 Dump the side-table for tag: 0 with group count: 6 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-54-04_825_2683943815301159820-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile190--.hashtable
2016-01-28 15:54:09 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-54-04_825_2683943815301159820-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile190--.hashtable (497 bytes)
2016-01-28 15:54:09 End of local task; Time Taken: 1.202 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1453963805097_0031, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0031/
Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0031
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0
2016-01-28 15:54:16,035 Stage-3 map = 0%, reduce = 0%
2016-01-28 15:54:21,349 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec
MapReduce Total cumulative CPU time: 1 seconds 130 msec
Ended Job = job_1453963805097_0031
MapReduce Jobs Launched:
Stage-Stage-3: Map: 1 Cumulative CPU: 1.13 sec HDFS Read: 7394 HDFS Write: 258 SUCCESS
Total MapReduce CPU Time Spent: 1 seconds 130 msec
OK
r1.id r1.name r1.dt r2.id r2.name r2.dt
1 sriram 12-10-2015 1 sriram 12-10-2015
2 seetamma 12-09-2015 2 seetamma 12-09-2015
3 lakshmana 12-11-2015 3 lakshmana 12-11-2015
4 bharatha 12-12-2015 4 bharatha 12-12-2015
5 Shatrughna 12--2015 5 Shatrughna 12--2015
6 hanuma 12-11-2015 6 hanuma 12-11-2015
Time taken: 17.619 seconds, Fetched: 6 row(s)
hive>
@thanoojgithub
Copy link
Author

Map-side Joins
If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating
the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.
The joins between stocks and dividends can exploit this optimization, as the dividends data set is small enough to be cached.

hive> set hive.auto.convert.join=true;
It’s false by default

hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';

Note that you can also configure the threshold size for table files considered small enough to use this optimization. Here is the default definition of the property (in bytes):
hive> set hive.mapjoin.smalltable.filesize=25000000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment