-
-
Save thanoojgithub/0d9f6abde5db1775ca72 to your computer and use it in GitHub Desktop.
| hive> select * from raw1 r1 join raw2 r2 on r1.id=r2.id where r1.name like "s%" and r2.dt like "%2015"; | |
| Query ID = ubuntu_20160128153623_8773b4d1-5b8a-4f41-b9bf-8dc7ed17fc77 | |
| Total jobs = 1 | |
| 2016-01-28 15:36:25,997 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | |
| Execution log at: /tmp/ubuntu/ubuntu_20160128153623_8773b4d1-5b8a-4f41-b9bf-8dc7ed17fc77.log | |
| 2016-01-28 15:36:26 Starting to launch local task to process map join; maximum memory = 518979584 | |
| 2016-01-28 15:36:27 Dump the side-table for tag: 0 with group count: 2 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-36-23_009_1042588085665505632-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile150--.hashtable | |
| 2016-01-28 15:36:27 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-36-23_009_1042588085665505632-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile150--.hashtable (336 bytes) | |
| 2016-01-28 15:36:27 End of local task; Time Taken: 1.23 sec. | |
| Execution completed successfully | |
| MapredLocal task succeeded | |
| Launching Job 1 out of 1 | |
| Number of reduce tasks is set to 0 since there's no reduce operator | |
| Starting Job = job_1453963805097_0028, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0028/ | |
| Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0028 | |
| Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 | |
| 2016-01-28 15:36:33,080 Stage-3 map = 0%, reduce = 0% | |
| 2016-01-28 15:36:38,417 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.12 sec | |
| MapReduce Total cumulative CPU time: 1 seconds 120 msec | |
| Ended Job = job_1453963805097_0028 | |
| MapReduce Jobs Launched: | |
| Stage-Stage-3: Map: 1 Cumulative CPU: 1.12 sec HDFS Read: 7255 HDFS Write: 84 SUCCESS | |
| Total MapReduce CPU Time Spent: 1 seconds 120 msec | |
| OK | |
| r1.id r1.name r1.dt r2.id r2.name r2.dt | |
| 1 sriram 12-10-2015 1 sriram 12-10-2015 | |
| 2 seetamma 12-09-2015 2 seetamma 12-09-2015 | |
| Time taken: 17.597 seconds, Fetched: 2 row(s) | |
| hive> select * from raw1 r1 join raw2 r2 on r1.id=r2.id and r2.id=r1.id where r1.name like "s%" OR r2.dt like "%2015"; | |
| Query ID = ubuntu_20160128155404_62ef2450-43b0-4c88-b864-adddca14f1e0 | |
| Total jobs = 1 | |
| 2016-01-28 15:54:07,366 WARN [main] util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable | |
| Execution log at: /tmp/ubuntu/ubuntu_20160128155404_62ef2450-43b0-4c88-b864-adddca14f1e0.log | |
| 2016-01-28 15:54:08 Starting to launch local task to process map join; maximum memory = 518979584 | |
| 2016-01-28 15:54:09 Dump the side-table for tag: 0 with group count: 6 into file: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-54-04_825_2683943815301159820-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile190--.hashtable | |
| 2016-01-28 15:54:09 Uploaded 1 File to: file:/tmp/hive/a6180571-652c-4ef6-8150-63c6872bb912/hive_2016-01-28_15-54-04_825_2683943815301159820-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile190--.hashtable (497 bytes) | |
| 2016-01-28 15:54:09 End of local task; Time Taken: 1.202 sec. | |
| Execution completed successfully | |
| MapredLocal task succeeded | |
| Launching Job 1 out of 1 | |
| Number of reduce tasks is set to 0 since there's no reduce operator | |
| Starting Job = job_1453963805097_0031, Tracking URL = http://ubuntu:8088/proxy/application_1453963805097_0031/ | |
| Kill Command = /usr/local/hadoop2/bin/hadoop job -kill job_1453963805097_0031 | |
| Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 0 | |
| 2016-01-28 15:54:16,035 Stage-3 map = 0%, reduce = 0% | |
| 2016-01-28 15:54:21,349 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.13 sec | |
| MapReduce Total cumulative CPU time: 1 seconds 130 msec | |
| Ended Job = job_1453963805097_0031 | |
| MapReduce Jobs Launched: | |
| Stage-Stage-3: Map: 1 Cumulative CPU: 1.13 sec HDFS Read: 7394 HDFS Write: 258 SUCCESS | |
| Total MapReduce CPU Time Spent: 1 seconds 130 msec | |
| OK | |
| r1.id r1.name r1.dt r2.id r2.name r2.dt | |
| 1 sriram 12-10-2015 1 sriram 12-10-2015 | |
| 2 seetamma 12-09-2015 2 seetamma 12-09-2015 | |
| 3 lakshmana 12-11-2015 3 lakshmana 12-11-2015 | |
| 4 bharatha 12-12-2015 4 bharatha 12-12-2015 | |
| 5 Shatrughna 12--2015 5 Shatrughna 12--2015 | |
| 6 hanuma 12-11-2015 6 hanuma 12-11-2015 | |
| Time taken: 17.619 seconds, Fetched: 6 row(s) | |
| hive> |
LEFT SEMI JOIN implements the uncorrelated IN / * EXISTS* subquery semantics in an efficient way. As of Hive 0.13 the IN/NOT IN/EXISTS/NOT EXISTS operators are supported using subqueries so most of these JOINs don't have to be performed manually anymore. The restrictions of using LEFT SEMI JOIN is that the right-hand-side table should only be referenced in the join condition (ON-clause), but not in WHERE- or SELECT-clauses etc.
Note:
The SELECT and WHERE clauses can’t reference columns from the righthand table.
Right semi-joins are not supported in Hive.
The reason semi-joins are more efficient than the more general inner join is as follows. For a given record in the left hand table, Hive can stop looking for matching records in the righthand table as soon as any match is found. At that point, the selected columns from the left hand table record can be projected.
Map-side Joins
If all but one table is small, the largest table can be streamed through the mappers while the small tables are cached in memory. Hive can do all the joining map-side, since it can look up every possible match against the small tables in memory, thereby eliminating
the reduce step required in the more common join scenarios. Even on smaller data sets, this optimization is noticeably faster than the normal join. Not only does it eliminate reduce steps, it sometimes reduces the number of map steps, too.
The joins between stocks and dividends can exploit this optimization, as the dividends data set is small enough to be cached.
hive> set hive.auto.convert.join=true;
It’s false by default
hive> SELECT s.ymd, s.symbol, s.price_close, d.dividend FROM stocks s JOIN dividends d ON s.ymd = d.ymd AND s.symbol = d.symbol WHERE s.symbol = 'AAPL';
Note that you can also configure the threshold size for table files considered small enough to use this optimization. Here is the default definition of the property (in bytes):
hive> set hive.mapjoin.smalltable.filesize=25000000
When joining three or more tables, if every ON clause uses the same join key, a single MapReduce job will be used.