Hive and postgres handle where
vs on
clauses differently. Postgres' query engine is smarter: where
and on
clause joins will be handled the same. In Hive, where
clause is more efficient than on
clause.
On
clause: In stage 1, pulls in ~400MM records; takes ~13 minutes to execute
Where
clause: In stage 1, pulls in ~60MM records; takes ~5 minutes to execute
Both queries have the same query plan (see below), and take the same amount of time to execute.
explain
select
w.auction_id
, w.date_wid
, e.event_id
, e.type
, e.sub_type
, e.winning_price
, w.data_cost
, w.win_type
, w.advertiser_id
, w.revenue
from auctions_won w
left join auction_events e on (w.auction_id = e.auction_id)
where w.date_wid = 20170604
;
STAGE DEPENDENCIES:
2 Stage-4 is a root stage , consists of Stage-1
3 Stage-1
4 Stage-0 depends on stages: Stage-1
5
6 STAGE PLANS:
7 Stage: Stage-4
8 Conditional Operator
9
10 Stage: Stage-1
11 Map Reduce
12 Map Operator Tree:
13 TableScan
14 alias: w
15 filterExpr: (date_wid = 20170604) (type: boolean)
16 Statistics: Num rows: 62861427 Data size: 31430713565 Basic stats: COMPLETE Column stats: NONE
17 Reduce Output Operator
18 key expressions: auction_id (type: string)
19 sort order: +
20 Map-reduce partition columns: auction_id (type: string)
21 Statistics: Num rows: 62861427 Data size: 31430713565 Basic stats: COMPLETE Column stats: NONE
22 value expressions: data_cost (type: string), revenue (type: string), win_type (type: string), advertiser_id (type: string)
23 TableScan
24 alias: e
25 Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
26 Reduce Output Operator
27 key expressions: auction_id (type: string)
28 sort order: +
29 Map-reduce partition columns: auction_id (type: string)
30 Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
31 value expressions: event_id (type: string), type (type: string), sub_type (type: string), winning_price (type: string)
32 Reduce Operator Tree:
33 Join Operator
34 condition map:
35 Left Outer Join0 to 1
36 keys:
37 0 auction_id (type: string)
38 1 auction_id (type: string)
39 outputColumnNames: _col0, _col2, _col4, _col9, _col59, _col141, _col143, _col144, _col151
40 Statistics: Num rows: 69147571 Data size: 34573785670 Basic stats: COMPLETE Column stats: NONE
41 Select Operator
42 expressions: _col0 (type: string), 20170604 (type: bigint), _col141 (type: string), _col143 (type: string), _col144 (type: string), _col151 (type: string), _col2 (type: string), _col9 (type: string), _col59 (type: string), _col4 (type: string)
43 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
44 Statistics: Num rows: 69147571 Data size: 34573785670 Basic stats: COMPLETE Column stats: NONE
45 File Output Operator
46 compressed: false
47 Statistics: Num rows: 69147571 Data size: 34573785670 Basic stats: COMPLETE Column stats: NONE
48 table:
49 input format: org.apache.hadoop.mapred.TextInputFormat
50 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
51 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
52
53 Stage: Stage-0
54 Fetch Operator
55 limit: -1
56 Processor Tree:
57 ListSink
58
explain
select
w.auction_id
, w.date_wid
, e.event_id
, e.type
, e.sub_type
, e.winning_price
, w.data_cost
, w.win_type
, w.advertiser_id
, w.revenue
from auctions_won w
left join auction_events e on (
w.auction_id = e.auction_id
and w.date_wid = 20170604)
STAGE DEPENDENCIES:
2 Stage-4 is a root stage , consists of Stage-1
3 Stage-1
4 Stage-0 depends on stages: Stage-1
5
6 STAGE PLANS:
7 Stage: Stage-4
8 Conditional Operator
9
10 Stage: Stage-1
11 Map Reduce
12 Map Operator Tree:
13 TableScan
14 alias: w
15 Statistics: Num rows: 488498286 Data size: 244249144664 Basic stats: COMPLETE Column stats: PARTIAL
16 Reduce Output Operator
17 key expressions: auction_id (type: string)
18 sort order: +
19 Map-reduce partition columns: auction_id (type: string)
20 Statistics: Num rows: 488498286 Data size: 244249144664 Basic stats: COMPLETE Column stats: PARTIAL
21 value expressions: data_cost (type: string), revenue (type: string), win_type (type: string), advertiser_id (type: string), date_wid (type: bigint)
22 TableScan
23 alias: e
24 Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
25 Reduce Output Operator
26 key expressions: auction_id (type: string)
27 sort order: +
28 Map-reduce partition columns: auction_id (type: string)
29 Statistics: Num rows: 1588825 Data size: 794418447 Basic stats: COMPLETE Column stats: NONE
30 value expressions: event_id (type: string), type (type: string), sub_type (type: string), winning_price (type: string)
31 Reduce Operator Tree:
32 Join Operator
33 condition map:
34 Left Outer Join0 to 1
35 filter predicates:
36 0 {(VALUE._col131 = 20170604)}
37 1
38 keys:
39 0 auction_id (type: string)
40 1 auction_id (type: string)
41 outputColumnNames: _col0, _col2, _col4, _col9, _col59, _col132, _col141, _col143, _col144, _col151
42 Statistics: Num rows: 537348126 Data size: 268674064953 Basic stats: COMPLETE Column stats: NONE
43 Select Operator
44 expressions: _col0 (type: string), _col132 (type: bigint), _col141 (type: string), _col143 (type: string), _col144 (type: string), _col151 (type: string), _col2 (type: string), _col9 (type: string), _col59 (type: string), _col4 (type: string)
45 outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7, _col8, _col9
46 Statistics: Num rows: 537348126 Data size: 268674064953 Basic stats: COMPLETE Column stats: NONE
47 File Output Operator
48 compressed: false
49 Statistics: Num rows: 537348126 Data size: 268674064953 Basic stats: COMPLETE Column stats: NONE
50 table:
51 input format: org.apache.hadoop.mapred.TextInputFormat
52 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
53 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
54
55 Stage: Stage-0
56 Fetch Operator
57 limit: -1
58 Processor Tree:
59 ListSink
dm_production=> explain select
dm_production-> w.auction_id
dm_production-> , e.event_id
dm_production-> , e.type
dm_production-> from won_1491319843_part0 w
dm_production-> left join events_20170404 e on (w.auction_id = e.auction_id)
dm_production-> where e.type = 'WIN_EVENT'
dm_production-> ;
QUERY PLAN
----------------------------------------------------------------------------------------------
Hash Join (cost=295492.13..2504549.42 rows=5303291 width=60)
Hash Cond: (e.auction_id = w.auction_id)
-> Seq Scan on events_20170404 e (cost=0.00..2014161.04 rows=5303291 width=60)
Filter: (type = 'WIN_EVENT'::text)
-> Hash (cost=272756.28..272756.28 rows=1175828 width=25)
-> Seq Scan on won_1491319843_part0 w (cost=0.00..272756.28 rows=1175828 width=25)
(6 rows)
dm_production=> explain
dm_production-> select
dm_production-> w.auction_id
dm_production-> , e.event_id
dm_production-> , e.type
dm_production-> from won_1491319843_part0 w
dm_production-> left join events_20170404 e on (
dm_production(> w.auction_id = e.auction_id
dm_production(> and e.type = 'WIN_EVENT')
dm_production-> ;
QUERY PLAN
----------------------------------------------------------------------------------------------
Hash Right Join (cost=295492.13..2504549.42 rows=5303291 width=60)
Hash Cond: (e.auction_id = w.auction_id)
-> Seq Scan on events_20170404 e (cost=0.00..2014161.04 rows=5303291 width=60)
Filter: (type = 'WIN_EVENT'::text)
-> Hash (cost=272756.28..272756.28 rows=1175828 width=25)
-> Seq Scan on won_1491319843_part0 w (cost=0.00..272756.28 rows=1175828 width=25)
(6 rows)