@knewter
Last active December 16, 2015 23:48
discussion on #postgresql where people are trying to help me out with the candle data query
--- Log opened Fri May 03 21:44:08 2013
21:44 --> | jadams [[email protected]] has joined #postgresql
21:44 --- | Users 663 nicks [0 ops, 0 halfops, 0 voices, 663 normal]
21:44 --- | Channel #postgresql was synced in 2 seconds
21:44 < jadams> | hey hey
21:44 < supplicant> | hello
21:45 <-- | trbs2 [~trbs@2001:470:d2ad:1:4a5b:39ff:fe7d:1624] has quit (Remote host closed the connection)
21:46 < jadams> | are there any freelance dbas around these parts? Once I've spent another day or so on a query I might could use a dba to help make it go fast :D
21:46 <-- | nicholasf_ [[email protected]] has quit (Remote host closed the connection)
21:46 < jadams> | at present, I've got this working and it's ok-ish https://gist.github.com/knewter/5515761
21:46 <-- | Kassandry [~kassandry@70.89.177-157-BusName-smpls.oregon.hfc.comcastbusiness.net] has quit (Quit: Leaving.)
21:47 < jadams> | not great though. The data's what I want and it's flexible, but it takes like 10 seconds for the query to respond. I need to get it substantially faster. Part of it's that I'm using an SQL aggregate function (first/last) and I could use a C extension for those. But removing those two aggregates doesn't make the query any faster really
21:48 < supplicant> | pastebin the explain
21:48 < supplicant> | at paste.depesz.com
21:48 < jadams> | okie dokie
21:48 < supplicant> | I suspect five_minute_intervals is rather large
21:49 --> | lchiessi [[email protected]] has joined #postgresql
21:49 < RhodiumToad> | replacing the aggregates with window functions might help
21:50 <-- | ramr [[email protected]] has quit (Quit: Leaving.)
21:50 < RhodiumToad> | the five_minute_intervals thing seems extremely dubious.
21:50 <-- | lchiessi [[email protected]] has quit (Client Quit)
21:50 < jadams> | I keep getting a 'sorry something went wrong' when I try to paste to paste.depesz.com
21:51 < supplicant> | I'm fairly certain that converting your date into a unix timestamp, then dividing by 300 and grouping by that result will probably be faster
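(A minimal sketch of that bucketing suggestion; the column name ts is hypothetical, since, as comes up just below, the real table already stores an integer epoch. The integer division by 300 is the part that matters:)

    select extract(epoch from ts)::bigint / 300 as bucket,  -- 300 s = 5 min
           count(*)                             as trades_in_bucket
    from trades
    group by 1;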
21:51 < RhodiumToad> | explain.depesz.com
21:51 < supplicant> | yeah, my bad on that one
21:51 < jadams> | https://gist.github.com/knewter/5515761
21:51 < jadams> | ahh sorry
21:52 < jadams> | supplicant: I actually have it as a unix timestamp
21:52 < jadams> | in the date field
21:52 < jadams> | the date_as_datetime field is just a conversion of that I do on insert in my client
21:52 < jadams> | because I (foolishly?) thought it would make things faster to use dates
21:53 < jadams> | http://explain.depesz.com/s/Mw3Q
21:53 < RhodiumToad> | are you confusing date and timestamp?
21:53 < jadams> | no, it's called date but it's a timestamp
21:53 < jadams> | I only did that because it mimics the remote service data i'm storing
21:53 < jadams> | they call the column date :-\
21:53 < RhodiumToad> | oh, that plan is going to suck so hard
21:53 < jadams> | it is an epoch integer
21:54 < jadams> | RhodiumToad: did you see the explain I pasted?
21:54 < RhodiumToad> | in pg, you can convert epoch integers to and from timestamp with time zone without loss of information
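(For reference, both directions in stock PostgreSQL; 1367635200 is just an example epoch value:)

    select to_timestamp(1367635200);            -- integer epoch -> timestamptz
    select extract(epoch from now())::integer;  -- timestamptz -> integer epoch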
21:55 < RhodiumToad> | yes, that's what I was referring to as sucking :-)
21:55 < jadams> | ah...I don't even know how to read it to see why it's obvious it sucks
21:55 < RhodiumToad> | an explain analyze would be better
21:55 < jadams> | can do
21:56 <-- | lamneth_ [[email protected]] has left #postgresql ( )
21:57 < RhodiumToad> | anyway, rather than join against a cte for the 5 minute thing, just calculate it
21:57 < jadams> | RhodiumToad: http://explain.depesz.com/s/Mii
21:58 < supplicant> | this plan will also go faster with more work_mem
21:58 < RhodiumToad> | not all that much
21:59 < supplicant> | this is also true
21:59 < RhodiumToad> | again, Amdahl's Law
22:00 < RhodiumToad> | so did I understand correctly that you already have a t.date column with a unixtime?
22:00 < jadams> | you did
22:01 --> | mmitchell [[email protected]] has joined #postgresql
22:01 < RhodiumToad> | so your starting point should be:
22:01 <-- | daemonkeeper [~Arno@debian/developer/arno] has quit (Ping timeout: 245 seconds)
22:01 < RhodiumToad> | select ... from trades t group by t.date/300;
22:01 <-- | booo [[email protected]] has quit (Read error: Operation timed out)
22:01 < RhodiumToad> | the only question mark would be how best to do the first/last stuff
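(Fleshed out, that starting point might look like the following sketch, assuming trades has integer date (epoch seconds), price, and amount columns:)

    select t.date/300    as bucket,  -- integer division: 5-minute bucket number
           min(t.price)  as low,
           max(t.price)  as high,
           sum(t.amount) as volume
    from trades t
    group by t.date/300
    order by 1;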
22:02 < jadams> | there's a C extension for the aggregate first/last functions
22:02 < jadams> | I can take the first/last bits out for now
22:02 < RhodiumToad> | probably not a good idea; ORDER BY for aggs has a nontrivial penalty
22:02 < jadams> | aha
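(The form being warned about here is an aggregate call with an ORDER BY inside it, which forces a per-group sort. Assuming the first()/last() aggregates from the PostgreSQL wiki snippet are installed, it would look like:)

    select date/300                   as bucket,
           first(price order by date) as first_price,
           last(price order by date)  as last_price
    from trades
    group by date/300;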
22:03 --> | _AnK [[email protected]] has joined #postgresql
22:03 * | RhodiumToad wrote that, knows how it works
22:03 <-- | mmitchell [[email protected]] has quit (Remote host closed the connection)
22:04 < RhodiumToad> | so there's a window function solution, and also a self-join solution
22:04 --> | nichola__ [[email protected]] has joined #postgresql
22:04 < RhodiumToad> | I would probably try both before considering C aggregates
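(A sketch of the self-join variant; it assumes date is unique per row, i.e. at most one trade per second:)

    select b.bucket,
           f.price as first_price,
           l.price as last_price
    from (select date/300  as bucket,
                 min(date) as first_date,  -- earliest trade in the bucket
                 max(date) as last_date    -- latest trade in the bucket
          from trades
          group by date/300) b
    join trades f on f.date = b.first_date
    join trades l on l.date = b.last_date;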
22:04 <-- | dcramer_ [[email protected]] has quit (Quit: dcramer_)
22:05 --> | daemonkeeper [~Arno@debian/developer/arno] has joined #postgresql
22:05 --> | booo [[email protected]] has joined #postgresql
22:05 < jadams> | ok. I'm going to try to keep this log in my working notes on this stuff so I can reference back to it later
22:06 --- | nichola__ is now known as nicholasf_
22:07 --- | nicholasf_ is now known as nicholasf
22:10 < jadams> | RhodiumToad: so if I do the group by like that, how do I get the start/end time of the interval?
22:10 < jadams> | NEVERMIND
22:10 < jadams> | I will hush, easy enough to solve
22:12 < jadams> | so the first/last bits aren't TERRIBLY egregious with the t.date/300
22:14 < RhodiumToad> | select start_time, start_time+300, sum(amount) as volume, min(price) as low, max(price) as high, max(first_price) as first, max(last_price) as last from (select date/300 as start_time, price, first_value(price) over w as first_price, last_value(price) over w as last_price from trades window w as (partition by date/300 order by date rows between unbounded preceding and unbounded following)) s group by start_time order by start_time;
22:15 < RhodiumToad> | or something of that ilk
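(The same query laid out readably. Note that the inner select as typed omits amount, which the outer sum(amount) needs; it's added here, and is presumably part of the tweak mentioned below. The start_time+300 label gets corrected further down in the log.)

    select start_time,
           start_time + 300,            -- corrected to start_time + 1 below
           sum(amount)      as volume,
           min(price)       as low,
           max(price)       as high,
           max(first_price) as first,   -- constant per bucket, so max() just picks it
           max(last_price)  as last
    from (select date/300 as start_time,
                 price,
                 amount,                -- not in the line as typed, but required
                 first_value(price) over w as first_price,
                 last_value(price)  over w as last_price
          from trades
          window w as (partition by date/300
                       order by date
                       rows between unbounded preceding and unbounded following)) s
    group by start_time
    order by start_time;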
22:15 <-- | tureba [[email protected]] has quit (Quit: leaving)
22:15 --> | MmikeDOMA [[email protected]] has joined #postgresql
22:17 --> | phil_ [[email protected]] has joined #postgresql
22:18 --> | tureba [[email protected]] has joined #postgresql
22:19 <-- | Mmike [[email protected]] has quit (Ping timeout: 255 seconds)
22:19 --> | dmishe [[email protected]] has joined #postgresql
22:20 <-- | blaa [~alvaro@unaffiliated/blaa] has quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
22:20 <-- | _Tenchi_ [[email protected]] has quit (Ping timeout: 264 seconds)
22:24 < jadams> | RhodiumToad: that's a very fast query, and on the surface looks like it does the right thing (haven't tried to verify the data as correct yet). Output here (and I tweaked it a little) https://gist.github.com/knewter/5515990
22:24 --- | phil_ is now known as _Tenchi_
22:25 < jadams> | RhodiumToad: my only question is, how can I determine what time range a given result was over?
22:25 < jadams> | I have some values for start_time and end_time, but they aren't epochs because they were divided by 300
22:25 <-- | cartan [[email protected]] has quit (Quit: Zzzzzzzzzz)
22:25 < RhodiumToad> | multiply them by 300 again
22:25 < jadams> | fair enough...
22:25 < RhodiumToad> | and use to_timestamp(start_time*300) if you want a timestamptz
22:25 < RhodiumToad> | what does the explain analyze look like?
22:27 < jadams> | I'll grab it in a sec for you. Just pasted the query and output here: https://gist.github.com/knewter/5515990
22:27 < jadams> | the start_time and end_time aren't five minutes apart...
22:27 < jadams> | let me paste the explain analyze
22:27 < RhodiumToad> | oh, sorry
22:27 < RhodiumToad> | start_time*300, start_time*300+300
22:28 < jadams> | http://explain.depesz.com/s/FDR
22:28 < jadams> | RhodiumToad: aha, thanks...derp
22:28 < RhodiumToad> | the start_time+300 in the original should have been start_time+1
22:29 --> | eggyknap_ [[email protected]] has joined #postgresql
22:29 < RhodiumToad> | it's better if you can sort by the original raw start_time rather than the timestamptz version of it
22:29 < RhodiumToad> | saves a sort step
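(Putting those corrections together: a bucket number b covers the half-open interval [b*300, b*300+300) in epoch seconds. For display, to_timestamp(b*300) gives a timestamptz, but the sort should stay on the raw integer:)

    select start_time*300                as bucket_start_epoch,
           start_time*300 + 300          as bucket_end_epoch,
           to_timestamp(start_time*300)  as bucket_start_ts   -- display only
    from (select distinct date/300 as start_time from trades) s
    order by start_time;  -- raw integer sort avoids an extra sort step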
22:29 <-- | eggyknap [~jtolley@unaffiliated/eggyknap] has quit (Quit: leaving)
22:30 --- | eggyknap_ is now known as eggyknap
22:30 <-- | eggyknap [[email protected]] has quit (Changing host)
22:30 --> | eggyknap [~jtolley@unaffiliated/eggyknap] has joined #postgresql
22:30 < RhodiumToad> | anyway, two orders of magnitude speedup, not bad going
22:30 --> | drale2k [[email protected]] has joined #postgresql
22:31 < jadams> | here I fixed the start_time replacement (didn't mean to override the start_time with a tz): http://explain.depesz.com/s/0KC
22:32 < RhodiumToad> | that's the query plan I was aiming for
22:32 < jadams> | awesome. The query's here https://gist.github.com/knewter/5515990
22:32 <-- | dmishe [[email protected]] has quit (Quit: ["Textual IRC Client: www.textualapp.com"])
22:33 --> | mk3548208 [[email protected]] has joined #postgresql
22:33 < jadams> | RhodiumToad: many many thanks on this sir.
22:34 < RhodiumToad> | however, you may have noticed that rows are missing from the output where there's no data in the interval
22:34 < jadams> | yeah
22:34 < jadams> | I'm fairly sure the graphing library can just take care of that
22:34 < jadams> | it's time aware
22:34 < jadams> | I knew that was a consequence of losing the temp table
22:35 --> | an [[email protected]] has joined #postgresql
22:35 < RhodiumToad> | it's possible to fill in the gaps if need be (there's about two different ways of doing so)
22:35 --> | Ja [[email protected]] has joined #postgresql
22:35 <-- | Ja [[email protected]] has left #postgresql ( )
22:35 --- | an is now known as Guest83986
22:35 < RhodiumToad> | one is to do an outer join against a generated time series _after_ the grouping (much more efficient to do it after than before)
22:36 < RhodiumToad> | the other is a little trick with window functions and generate_series in the select list
22:37 < RhodiumToad> | the outer join one is probably easier to understand:
22:37 < RhodiumToad> | select * from (select ... group by start_time) s right join generate_series(...) g(start_time) using (start_time) order by start_time;
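(Expanded into a runnable shape; the epoch bounds 1367600000 and 1367700000 are made-up example values, and the first/last columns are omitted for brevity:)

    select start_time*300 as bucket_start,
           s.volume,             -- NULL for empty intervals; coalesce if needed
           s.low,
           s.high
    from (select date/300    as start_time,
                 sum(amount) as volume,
                 min(price)  as low,
                 max(price)  as high
          from trades
          group by date/300) s
    right join generate_series(1367600000/300,
                               1367700000/300) g(start_time)
           using (start_time)
    order by start_time;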
22:41 --> | josh__ [[email protected]] has joined #postgresql
22:41 --- | josh__ is now known as agliodbs
22:43 <-- | Guest83986 [[email protected]] has quit (Remote host closed the connection)
22:46 <-- | _Tenchi_ [[email protected]] has quit (Ping timeout: 248 seconds)