See http://randyzwitch.com/mapd-pjm-electricity-data/ for the data and pythons
script to create hourly_loads.csv
(146MB) and
https://news.ycombinator.com/item?id=16458105 for discussion.
CREATE TABLE hourly_loads (
actual_date date,
zone_name text,
hour_ending int,
mw text
);
COPY hourly_loads FROM '/Users/felixge/code/mapd/data/hourly_loads.csv' WITH (FORMAT csv, HEADER);
COPY 4925112
Make sure we have up-to-date statistics:
VACUUM ANALYZE hourly_loads;
EXPLAIN ANALYZE
SELECT count(*) FROM hourly_loads;
Finalize Aggregate (cost=58804.78..58804.79 rows=1 width=8) (actual time=482.737..482.737 rows=1 loops=1)
-> Gather (cost=58804.57..58804.78 rows=2 width=8) (actual time=482.686..482.728 rows=3 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial Aggregate (cost=57804.57..57804.58 rows=1 width=8) (actual time=478.589..478.589 rows=1 loops=3)
-> Parallel Seq Scan on hourly_loads (cost=0.00..52673.65 rows=2052365 width=0) (actual time=0.048..287.810 rows=1641704 loops=3)
Planning time: 0.099 ms
Execution time: 484.249 ms
--MapD doesn't currently support window functions, so need to precalculate maximum by day
with qry as (select
actual_date,
zone_name,
max(MW) as daily_max_usage
from hourly_loads
where zone_name = 'MIDATL' and actual_date between '2017-06-01' and '2017-09-30'
group by 1,2)
select
hl.actual_date,
hl.zone_name,
hl.hour_ending,
hl.MW
from hourly_loads as hl
inner join qry on qry.actual_date = hl.actual_date and qry.daily_max_usage = hl.mw
order by daily_max_usage desc
limit 10;
2017-07-20 | MIDATL | 17 | 55219.896 |
---|---|---|---|
2017-07-19 | MIDATL | 18 | 53889.273 |
2017-06-13 | MIDATL | 18 | 53700.225 |
2017-07-21 | MIDATL | 18 | 53172.482 |
2017-08-22 | MIDATL | 17 | 52932.059 |
2017-07-13 | MIDATL | 15 | 52210.192 |
2017-06-12 | MIDATL | 18 | 51282.16 |
2017-07-12 | MIDATL | 18 | 50855.295 |
2017-07-18 | MIDATL | 17 | 50718.12 |
2017-06-30 | MIDATL | 17 | 49530.03 |
The same query from above with EXPLAIN ANALYZE
prefix:
Limit (cost=290919.59..290919.60 rows=6 width=52) (actual time=4129.125..4129.127 rows=10 loops=1)
CTE qry
-> Finalize GroupAggregate (cost=69092.13..69271.81 rows=1485 width=40) (actual time=271.653..273.234 rows=122 loops=1)
Group Key: hourly_loads.actual_date, hourly_loads.zone_name
-> Gather Merge (cost=69092.13..69247.66 rows=1240 width=40) (actual time=271.620..272.871 rows=366 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial GroupAggregate (cost=68092.11..68104.51 rows=620 width=40) (actual time=267.232..268.233 rows=122 loops=3)
Group Key: hourly_loads.actual_date, hourly_loads.zone_name
-> Sort (cost=68092.11..68093.66 rows=620 width=16) (actual time=267.176..267.262 rows=976 loops=3)
Sort Key: hourly_loads.actual_date
Sort Method: quicksort Memory: 134kB
-> Parallel Seq Scan on hourly_loads (cost=0.00..68063.35 rows=620 width=16) (actual time=244.850..266.663 rows=976 loops=3)
Filter: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
Rows Removed by Filter: 1640728
-> Sort (cost=221647.78..221647.79 rows=6 width=52) (actual time=4129.124..4129.125 rows=10 loops=1)
Sort Key: hl.mw DESC
Sort Method: top-N heapsort Memory: 26kB
-> Hash Join (cost=184140.50..221647.70 rows=6 width=52) (actual time=3228.383..4128.810 rows=122 loops=1)
Hash Cond: ((qry.actual_date = hl.actual_date) AND (qry.daily_max_usage = hl.mw))
-> CTE Scan on qry (cost=0.00..29.70 rows=1485 width=36) (actual time=271.656..273.315 rows=122 loops=1)
-> Hash (cost=81402.60..81402.60 rows=4925260 width=20) (actual time=2953.151..2953.151 rows=4846290 loops=1)
Buckets: 2097152 Batches: 4 Memory Usage: 80006kB
-> Seq Scan on hourly_loads hl (cost=0.00..81402.60 rows=4925260 width=20) (actual time=0.019..783.327 rows=4925112 loops=1)
Planning time: 0.638 ms
Execution time: 4134.577 ms
CREATE INDEX ON hourly_loads (actual_date, zone_name);
VACUUM ANALYZE hourly_loads;
Limit (cost=100773.84..100773.86 rows=6 width=52) (actual time=116.622..116.624 rows=10 loops=1)
CTE qry
-> GroupAggregate (cost=0.43..5296.77 rows=1423 width=40) (actual time=0.237..17.793 rows=122 loops=1)
Group Key: hourly_loads.actual_date, hourly_loads.zone_name
-> Index Scan using hourly_loads_actual_date_zone_name_idx on hourly_loads (cost=0.43..5271.85 rows=1425 width=16) (actual time=0.054..13.930 rows=2928 loops=1)
Index Cond: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
-> Sort (cost=95477.07..95477.09 rows=6 width=52) (actual time=116.621..116.622 rows=10 loops=1)
Sort Key: hl.mw DESC
Sort Method: top-N heapsort Memory: 26kB
-> Nested Loop (cost=0.43..95477.00 rows=6 width=52) (actual time=0.953..116.080 rows=122 loops=1)
-> CTE Scan on qry (cost=0.00..28.46 rows=1423 width=36) (actual time=0.239..17.941 rows=122 loops=1)
-> Index Scan using hourly_loads_actual_date_zone_name_idx on hourly_loads hl (cost=0.43..67.07 rows=1 width=20) (actual time=0.472..0.802 rows=1 loops=122)
Index Cond: (actual_date = qry.actual_date)
Filter: (qry.daily_max_usage = mw)
Rows Removed by Filter: 911
Planning time: 0.592 ms
Execution time: 116.697 ms
116ms - not bad.
EXPLAIN ANALYZE
SELECT * FROM (
SELECT *, rank() OVER (PARTITION BY actual_date ORDER BY mw DESC)
FROM hourly_loads
WHERE zone_name = 'MIDATL' and actual_date between '2017-06-01' and '2017-09-30'
) q
WHERE rank = 1
ORDER BY mw DESC
LIMIT 10;
Limit (cost=5392.91..5392.93 rows=7 width=28) (actual time=34.918..34.921 rows=10 loops=1)
-> Sort (cost=5392.91..5392.93 rows=7 width=28) (actual time=34.917..34.918 rows=10 loops=1)
Sort Key: q.mw DESC
Sort Method: top-N heapsort Memory: 26kB
-> Subquery Scan on q (cost=5346.50..5392.81 rows=7 width=28) (actual time=30.378..34.593 rows=122 loops=1)
Filter: (q.rank = 1)
Rows Removed by Filter: 2806
-> WindowAgg (cost=5346.50..5375.00 rows=1425 width=28) (actual time=30.376..34.181 rows=2928 loops=1)
-> Sort (cost=5346.50..5350.06 rows=1425 width=20) (actual time=30.368..30.726 rows=2928 loops=1)
Sort Key: hourly_loads.actual_date, hourly_loads.mw DESC
Sort Method: quicksort Memory: 325kB
-> Index Scan using hourly_loads_actual_date_zone_name_idx on hourly_loads (cost=0.43..5271.85 rows=1425 width=20) (actual time=0.060..16.958 rows=2928 loops=1)
Index Cond: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
Planning time: 0.224 ms
Execution time: 34.983 ms
34ms 🎉
After dropping the index we created, we can re-run our new query:
DROP INDEX hourly_loads_actual_date_zone_name_idx;
Limit (cost=69324.61..69324.63 rows=7 width=28) (actual time=275.526..275.528 rows=10 loops=1)
-> Sort (cost=69324.61..69324.63 rows=7 width=28) (actual time=275.524..275.525 rows=10 loops=1)
Sort Key: q.mw DESC
Sort Method: top-N heapsort Memory: 26kB
-> Subquery Scan on q (cost=69278.20..69324.52 rows=7 width=28) (actual time=271.890..275.263 rows=122 loops=1)
Filter: (q.rank = 1)
Rows Removed by Filter: 2806
-> WindowAgg (cost=69278.20..69306.70 rows=1425 width=28) (actual time=271.889..274.931 rows=2928 loops=1)
-> Sort (cost=69278.20..69281.77 rows=1425 width=20) (actual time=271.880..272.182 rows=2928 loops=1)
Sort Key: hourly_loads.actual_date, hourly_loads.mw DESC
Sort Method: quicksort Memory: 325kB
-> Gather (cost=1000.00..69203.56 rows=1425 width=20) (actual time=240.425..261.364 rows=2928 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on hourly_loads (cost=0.00..68061.06 rows=594 width=20) (actual time=236.324..256.842 rows=976 loops=3)
Filter: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
Rows Removed by Filter: 1640728
Planning time: 0.181 ms
Execution time: 275.615 ms