-
-
Save idris75/d6abfc67adb9a782601342c1f7e8fb80 to your computer and use it in GitHub Desktop.
hive tuning hints mapjoin bucketmapjoin - partition-bucket design
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
1.MapJoin: | |
small tables can be loaded in memory and joined with bigger tables. | |
1. use hint /*+ MAPJOIN(table_name) */ | |
2. 'better' option - let hive do automatically by setting these properties: | |
hive.auto.convert.join - true | |
hive.mapjoin.smalltable.filesize = <> default is 25MB | |
2.Partition Design | |
Low cardinality column -eg, regiou, year | |
3.Bucket Design: | |
For high cardinality columns like itineraryIds, and joining tables with the similar Ids. | |
both tables be bucketed on the same column. | |
The number of buckets in both tables be strict multiple of each other. | |
1.set hive.enforce.bucketing=true | |
2.set hive.optimize.bucketmapjoin=true | |
3.use hint - /*+ MAPJOIN(a,b) */ | |
4.BucketSampling: | |
Due to hashing, simiar ids will be grouped in the similar bucket ids. | |
use tablesample clause and pass same bucket number for both tables - only those two buckets will be sampled and they will have related rows as well. | |
select ... from <table1> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>) | |
, <table2> TABLESAMPLE(bucket 30 out of 64 on <bucketed column>) | |
where ... | |
5.BlockSampling: | |
For single table analysis - no buckets | |
can sample with % of bytes, number of bytes, number of rows | |
eg: TABLESAMPLE(1 PERCENT), TABLESAMPLE(10 ROWS), TABLESAMPLE(100M) | |
6. Hive parallel Executeion: | |
Hive genrally breaks down the work into multiple MRs and run them sequentially - set hive.exec.parallel=false | |
But some of these MRs can run independently, they will run in parallel - set hive.exec.parallel = true |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment