Load the data

See http://randyzwitch.com/mapd-pjm-electricity-data/ for the data and pythons script to create hourly_loads.csv (146MB) and https://news.ycombinator.com/item?id=16458105 for discussion.

CREATE TABLE hourly_loads (
	actual_date date,
	zone_name text,
	hour_ending int,
	mw text
);
COPY hourly_loads FROM '/Users/felixge/code/mapd/data/hourly_loads.csv' WITH (FORMAT csv, HEADER);

COPY 4925112

Make sure we have up-to-date statistics:

VACUUM ANALYZE hourly_loads;

Check time to sequentially scan the table

EXPLAIN ANALYZE
SELECT count(*) FROM hourly_loads;

Finalize Aggregate  (cost=58804.78..58804.79 rows=1 width=8) (actual time=482.737..482.737 rows=1 loops=1)
  ->  Gather  (cost=58804.57..58804.78 rows=2 width=8) (actual time=482.686..482.728 rows=3 loops=1)
        Workers Planned: 2
        Workers Launched: 2
        ->  Partial Aggregate  (cost=57804.57..57804.58 rows=1 width=8) (actual time=478.589..478.589 rows=1 loops=3)
              ->  Parallel Seq Scan on hourly_loads  (cost=0.00..52673.65 rows=2052365 width=0) (actual time=0.048..287.810 rows=1641704 loops=3)
Planning time: 0.099 ms
Execution time: 484.249 ms

Run example query

--MapD doesn't currently support window functions, so need to precalculate maximum by day
with qry as (select
actual_date,
zone_name,
max(MW) as daily_max_usage
from hourly_loads
where zone_name = 'MIDATL' and actual_date between '2017-06-01' and '2017-09-30'
group by 1,2)
select
hl.actual_date,
hl.zone_name,
hl.hour_ending,
hl.MW
from hourly_loads as hl
inner join qry on qry.actual_date = hl.actual_date and qry.daily_max_usage = hl.mw
order by daily_max_usage desc
limit 10;

2017-07-20	MIDATL	17	55219.896
2017-07-19	MIDATL	18	53889.273
2017-06-13	MIDATL	18	53700.225
2017-07-21	MIDATL	18	53172.482
2017-08-22	MIDATL	17	52932.059
2017-07-13	MIDATL	15	52210.192
2017-06-12	MIDATL	18	51282.16
2017-07-12	MIDATL	18	50855.295
2017-07-18	MIDATL	17	50718.12
2017-06-30	MIDATL	17	49530.03

The same query from above with EXPLAIN ANALYZE prefix:

Limit  (cost=290919.59..290919.60 rows=6 width=52) (actual time=4129.125..4129.127 rows=10 loops=1)
  CTE qry
    ->  Finalize GroupAggregate  (cost=69092.13..69271.81 rows=1485 width=40) (actual time=271.653..273.234 rows=122 loops=1)
          Group Key: hourly_loads.actual_date, hourly_loads.zone_name
          ->  Gather Merge  (cost=69092.13..69247.66 rows=1240 width=40) (actual time=271.620..272.871 rows=366 loops=1)
                Workers Planned: 2
                Workers Launched: 2
                ->  Partial GroupAggregate  (cost=68092.11..68104.51 rows=620 width=40) (actual time=267.232..268.233 rows=122 loops=3)
                      Group Key: hourly_loads.actual_date, hourly_loads.zone_name
                      ->  Sort  (cost=68092.11..68093.66 rows=620 width=16) (actual time=267.176..267.262 rows=976 loops=3)
                            Sort Key: hourly_loads.actual_date
                            Sort Method: quicksort  Memory: 134kB
                            ->  Parallel Seq Scan on hourly_loads  (cost=0.00..68063.35 rows=620 width=16) (actual time=244.850..266.663 rows=976 loops=3)
                                  Filter: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
                                  Rows Removed by Filter: 1640728
  ->  Sort  (cost=221647.78..221647.79 rows=6 width=52) (actual time=4129.124..4129.125 rows=10 loops=1)
        Sort Key: hl.mw DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Hash Join  (cost=184140.50..221647.70 rows=6 width=52) (actual time=3228.383..4128.810 rows=122 loops=1)
              Hash Cond: ((qry.actual_date = hl.actual_date) AND (qry.daily_max_usage = hl.mw))
              ->  CTE Scan on qry  (cost=0.00..29.70 rows=1485 width=36) (actual time=271.656..273.315 rows=122 loops=1)
              ->  Hash  (cost=81402.60..81402.60 rows=4925260 width=20) (actual time=2953.151..2953.151 rows=4846290 loops=1)
                    Buckets: 2097152  Batches: 4  Memory Usage: 80006kB
                    ->  Seq Scan on hourly_loads hl  (cost=0.00..81402.60 rows=4925260 width=20) (actual time=0.019..783.327 rows=4925112 loops=1)
Planning time: 0.638 ms
Execution time: 4134.577 ms

Add an index

CREATE INDEX ON hourly_loads (actual_date, zone_name);
VACUUM ANALYZE hourly_loads;

Limit  (cost=100773.84..100773.86 rows=6 width=52) (actual time=116.622..116.624 rows=10 loops=1)
  CTE qry
    ->  GroupAggregate  (cost=0.43..5296.77 rows=1423 width=40) (actual time=0.237..17.793 rows=122 loops=1)
          Group Key: hourly_loads.actual_date, hourly_loads.zone_name
          ->  Index Scan using hourly_loads_actual_date_zone_name_idx on hourly_loads  (cost=0.43..5271.85 rows=1425 width=16) (actual time=0.054..13.930 rows=2928 loops=1)
                Index Cond: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
  ->  Sort  (cost=95477.07..95477.09 rows=6 width=52) (actual time=116.621..116.622 rows=10 loops=1)
        Sort Key: hl.mw DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Nested Loop  (cost=0.43..95477.00 rows=6 width=52) (actual time=0.953..116.080 rows=122 loops=1)
              ->  CTE Scan on qry  (cost=0.00..28.46 rows=1423 width=36) (actual time=0.239..17.941 rows=122 loops=1)
              ->  Index Scan using hourly_loads_actual_date_zone_name_idx on hourly_loads hl  (cost=0.43..67.07 rows=1 width=20) (actual time=0.472..0.802 rows=1 loops=122)
                    Index Cond: (actual_date = qry.actual_date)
                    Filter: (qry.daily_max_usage = mw)
                    Rows Removed by Filter: 911
Planning time: 0.592 ms
Execution time: 116.697 ms

116ms - not bad.

Use window functions

EXPLAIN ANALYZE
SELECT * FROM (
	SELECT *, rank() OVER (PARTITION BY actual_date ORDER BY mw DESC)
	FROM hourly_loads
	WHERE zone_name = 'MIDATL' and actual_date between '2017-06-01' and '2017-09-30'
) q
WHERE rank = 1
ORDER BY mw DESC
LIMIT 10;

Limit  (cost=5392.91..5392.93 rows=7 width=28) (actual time=34.918..34.921 rows=10 loops=1)
  ->  Sort  (cost=5392.91..5392.93 rows=7 width=28) (actual time=34.917..34.918 rows=10 loops=1)
        Sort Key: q.mw DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Subquery Scan on q  (cost=5346.50..5392.81 rows=7 width=28) (actual time=30.378..34.593 rows=122 loops=1)
              Filter: (q.rank = 1)
              Rows Removed by Filter: 2806
              ->  WindowAgg  (cost=5346.50..5375.00 rows=1425 width=28) (actual time=30.376..34.181 rows=2928 loops=1)
                    ->  Sort  (cost=5346.50..5350.06 rows=1425 width=20) (actual time=30.368..30.726 rows=2928 loops=1)
                          Sort Key: hourly_loads.actual_date, hourly_loads.mw DESC
                          Sort Method: quicksort  Memory: 325kB
                          ->  Index Scan using hourly_loads_actual_date_zone_name_idx on hourly_loads  (cost=0.43..5271.85 rows=1425 width=20) (actual time=0.060..16.958 rows=2928 loops=1)
                                Index Cond: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
Planning time: 0.224 ms
Execution time: 34.983 ms

34ms 🎉

New query without index

After dropping the index we created, we can re-run our new query:

DROP INDEX hourly_loads_actual_date_zone_name_idx;

Limit  (cost=69324.61..69324.63 rows=7 width=28) (actual time=275.526..275.528 rows=10 loops=1)
  ->  Sort  (cost=69324.61..69324.63 rows=7 width=28) (actual time=275.524..275.525 rows=10 loops=1)
        Sort Key: q.mw DESC
        Sort Method: top-N heapsort  Memory: 26kB
        ->  Subquery Scan on q  (cost=69278.20..69324.52 rows=7 width=28) (actual time=271.890..275.263 rows=122 loops=1)
              Filter: (q.rank = 1)
              Rows Removed by Filter: 2806
              ->  WindowAgg  (cost=69278.20..69306.70 rows=1425 width=28) (actual time=271.889..274.931 rows=2928 loops=1)
                    ->  Sort  (cost=69278.20..69281.77 rows=1425 width=20) (actual time=271.880..272.182 rows=2928 loops=1)
                          Sort Key: hourly_loads.actual_date, hourly_loads.mw DESC
                          Sort Method: quicksort  Memory: 325kB
                          ->  Gather  (cost=1000.00..69203.56 rows=1425 width=20) (actual time=240.425..261.364 rows=2928 loops=1)
                                Workers Planned: 2
                                Workers Launched: 2
                                ->  Parallel Seq Scan on hourly_loads  (cost=0.00..68061.06 rows=594 width=20) (actual time=236.324..256.842 rows=976 loops=3)
                                      Filter: ((actual_date >= '2017-06-01'::date) AND (actual_date <= '2017-09-30'::date) AND (zone_name = 'MIDATL'::text))
                                      Rows Removed by Filter: 1640728
Planning time: 0.181 ms
Execution time: 275.615 ms

mpneuried/pg-vs-mapd.md

Load the data

Check time to sequentially scan the table

Run example query

Add an index

Use window functions

New query without index