@knewter
Last active December 16, 2015 23:48
discussion on #postgresql where people are trying to help me out with the candle data query
--- Log opened Fri May 03 21:44:08 2013
21:44 --> | jadams [[email protected]] has joined #postgresql
21:44 --- | Users 663 nicks [0 ops, 0 halfops, 0 voices, 663 normal]
21:44 --- | Channel #postgresql was synced in 2 seconds
21:44 < jadams> | hey hey
21:44 < supplicant> | hello
21:45 <-- | trbs2 [~trbs@2001:470:d2ad:1:4a5b:39ff:fe7d:1624] has quit (Remote host closed the connection)
21:46 < jadams> | are there any freelance dbas around these parts? Once I've spent another day or so on a query I might could use a dba to help make it go fast :D
21:46 <-- | nicholasf_ [[email protected]] has quit (Remote host closed the connection)
21:46 < jadams> | at present, I've got this working and it's ok-ish https://gist.github.com/knewter/5515761
21:46 <-- | Kassandry [~kassandry@70.89.177-157-BusName-smpls.oregon.hfc.comcastbusiness.net] has quit (Quit: Leaving.)
21:47 < jadams> | not great though. The data's what I want and it's flexible, but it takes like 10 seconds for the query to respond. I need to get it substantially faster. Part of it's that I'm using an SQL aggregate function (first/last) and I could use a C extension for those. But removing those two aggregates doesn't make the query any faster really
21:48 < supplicant> | pastebin the explain
21:48 < supplicant> | at paste.depesz.com
21:48 < jadams> | okie dokie
21:48 < supplicant> | I suspect five_minute_intervals is rather large
21:49 --> | lchiessi [[email protected]] has joined #postgresql
21:49 < RhodiumToad> | replacing the aggregates with window functions might help
21:50 <-- | ramr [[email protected]] has quit (Quit: Leaving.)
21:50 < RhodiumToad> | the five_minute_intervals thing seems extremely dubious.
21:50 <-- | lchiessi [[email protected]] has quit (Client Quit)
21:50 < jadams> | I keep getting a 'sorry something went wrong' when I try to paste to paste.depesz.com
21:51 < supplicant> | I'm fairly certain that converting your date into a unix timestamp, then dividing by 300 and grouping by that result will probably be faster
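(A minimal sketch of that bucketing suggestion; the column name ts is hypothetical, since, as comes up just below, the real table already stores an integer epoch. The integer division by 300 is the part that matters:)

    select extract(epoch from ts)::bigint / 300 as bucket,  -- 300 s = 5 min
           count(*)                             as trades_in_bucket
    from trades
    group by 1;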
21:51 < RhodiumToad> | explain.depesz.com
21:51 < supplicant> | yeah, my bad on that one
21:51 < jadams> | https://gist.github.com/knewter/5515761
21:51 < jadams> | ahh sorry
21:52 < jadams> | supplicant: I actually have it as a unix timestamp
21:52 < jadams> | in the date field
21:52 < jadams> | the date_as_datetime field is just a conversion of that I do on insert in my client
21:52 < jadams> | because I (foolishly?) thought it would make things faster to use dates
21:53 < jadams> | http://explain.depesz.com/s/Mw3Q
21:53 < RhodiumToad> | are you confusing date and timestamp?
21:53 < jadams> | no, it's called date but it's a timestamp
21:53 < jadams> | I only did that because it mimics the remote service data i'm storing
21:53 < jadams> | they call the column date :-\
21:53 < RhodiumToad> | oh, that plan is going to suck so hard
21:53 < jadams> | it is an epoch integer
21:54 < jadams> | RhodiumToad: did you see the explain I pasted?
21:54 < RhodiumToad> | in pg, you can convert epoch integers to and from timestamp with time zone without loss of information
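(For reference, both directions in stock PostgreSQL; 1367635200 is just an example epoch value:)

    select to_timestamp(1367635200);            -- integer epoch -> timestamptz
    select extract(epoch from now())::integer;  -- timestamptz -> integer epoch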
21:55 < RhodiumToad> | yes, that's what I was referring to as sucking :-)
21:55 < jadams> | ah...I don't even know how to read it to see why it's obvious it sucks
21:55 < RhodiumToad> | an explain analyze would be better
21:55 < jadams> | can do
21:56 <-- | lamneth_ [[email protected]] has left #postgresql ( )
21:57 < RhodiumToad> | anyway, rather than join against a cte for the 5 minute thing, just calculate it
21:57 < jadams> | RhodiumToad: http://explain.depesz.com/s/Mii
21:58 < supplicant> | this plan will also go faster with more work_mem
21:58 < RhodiumToad> | not all that much
21:59 < supplicant> | this is also true
21:59 < RhodiumToad> | again, Amdahl's Law
22:00 < RhodiumToad> | so did I understand correctly that you already have a t.date column with a unixtime?
22:00 < jadams> | you did
22:01 --> | mmitchell [[email protected]] has joined #postgresql
22:01 < RhodiumToad> | so your starting point should be:
22:01 <-- | daemonkeeper [~Arno@debian/developer/arno] has quit (Ping timeout: 245 seconds)
22:01 < RhodiumToad> | select ... from trades t group by t.date/300;
22:01 <-- | booo [[email protected]] has quit (Read error: Operation timed out)
22:01 < RhodiumToad> | the only question mark would be how best to do the first/last stuff
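(Fleshed out, that starting point might look like the following sketch, assuming trades has integer date (epoch seconds), price, and amount columns:)

    select t.date/300    as bucket,  -- integer division: 5-minute bucket number
           min(t.price)  as low,
           max(t.price)  as high,
           sum(t.amount) as volume
    from trades t
    group by t.date/300
    order by 1;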
22:02 < jadams> | there's a C extension for the aggregate first/last functions
22:02 < jadams> | I can take the first/last bits out for now
22:02 < RhodiumToad> | probably not a good idea; ORDER BY for aggs has a nontrivial penalty
22:02 < jadams> | aha
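(The form being warned about here is an aggregate call with an ORDER BY inside it, which forces a per-group sort. Assuming the first()/last() aggregates from the PostgreSQL wiki snippet are installed, it would look like:)

    select date/300                   as bucket,
           first(price order by date) as first_price,
           last(price order by date)  as last_price
    from trades
    group by date/300;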
22:03 --> | _AnK [[email protected]] has joined #postgresql
22:03 * | RhodiumToad wrote that, knows how it works
22:03 <-- | mmitchell [[email protected]] has quit (Remote host closed the connection)
22:04 < RhodiumToad> | so there's a window function solution, and also a self-join solution
22:04 --> | nichola__ [[email protected]] has joined #postgresql
22:04 < RhodiumToad> | I would probably try both before considering C aggregates
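(A sketch of the self-join variant; it assumes date is unique per row, i.e. at most one trade per second:)

    select b.bucket,
           f.price as first_price,
           l.price as last_price
    from (select date/300  as bucket,
                 min(date) as first_date,  -- earliest trade in the bucket
                 max(date) as last_date    -- latest trade in the bucket
          from trades
          group by date/300) b
    join trades f on f.date = b.first_date
    join trades l on l.date = b.last_date;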
22:04 <-- | dcramer_ [[email protected]] has quit (Quit: dcramer_)
22:05 --> | daemonkeeper [~Arno@debian/developer/arno] has joined #postgresql
22:05 --> | booo [[email protected]] has joined #postgresql
22:05 < jadams> | ok. I'm going to try to keep this log in my working notes on this stuff so I can reference back to it later
22:06 --- | nichola__ is now known as nicholasf_
22:07 --- | nicholasf_ is now known as nicholasf
22:10 < jadams> | RhodiumToad: so if I do the group by like that, how do I get the start/end time of the interval?
22:10 < jadams> | NEVERMIND
22:10 < jadams> | I will hush, easy enough to solve
22:12 < jadams> | so the first/last bits aren't TERRIBLY egregious with the t.date/300
22:14 < RhodiumToad> | select start_time, start_time+300, sum(amount) as volume, min(price) as low, max(price) as high, max(first_price) as first, max(last_price) as last from (select date/300 as start_time, price, first_value(price) over w as first_price, last_value(price) over w as last_price from trades window w as (partition by date/300 order by date rows between unbounded preceding and unbounded following)) s group by start_time order by start_time;
22:15 < RhodiumToad> | or something of that ilk
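(The same query laid out readably. Note that the inner select as typed omits amount, which the outer sum(amount) needs; it's added here, and is presumably part of the tweak mentioned below. The start_time+300 label gets corrected further down in the log.)

    select start_time,
           start_time + 300,            -- corrected to start_time + 1 below
           sum(amount)      as volume,
           min(price)       as low,
           max(price)       as high,
           max(first_price) as first,   -- constant per bucket, so max() just picks it
           max(last_price)  as last
    from (select date/300 as start_time,
                 price,
                 amount,                -- not in the line as typed, but required
                 first_value(price) over w as first_price,
                 last_value(price)  over w as last_price
          from trades
          window w as (partition by date/300
                       order by date
                       rows between unbounded preceding and unbounded following)) s
    group by start_time
    order by start_time;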
22:15 <-- | tureba [[email protected]] has quit (Quit: leaving)
22:15 --> | MmikeDOMA [[email protected]] has joined #postgresql
22:17 --> | phil_ [[email protected]] has joined #postgresql
22:18 --> | tureba [[email protected]] has joined #postgresql
22:19 <-- | Mmike [[email protected]] has quit (Ping timeout: 255 seconds)
22:19 --> | dmishe [[email protected]] has joined #postgresql
22:20 <-- | blaa [~alvaro@unaffiliated/blaa] has quit (Quit: KVIrc 4.2.0 Equilibrium http://www.kvirc.net/)
22:20 <-- | _Tenchi_ [[email protected]] has quit (Ping timeout: 264 seconds)
22:24 < jadams> | RhodiumToad: that's a very fast query, and on the surface looks like it does the right thing (haven't tried to verify the data as correct yet). Output here (and I tweaked it a little) https://gist.github.com/knewter/5515990
22:24 --- | phil_ is now known as _Tenchi_
22:25 < jadams> | RhodiumToad: my only question is, how can I determine what time range a given result was over?
22:25 < jadams> | I have some values for start_time and end_time, but they aren't epochs because they were divided by 300
22:25 <-- | cartan [[email protected]] has quit (Quit: Zzzzzzzzzz)
22:25 < RhodiumToad> | multiply them by 300 again
22:25 < jadams> | fair enough...
22:25 < RhodiumToad> | and use to_timestamp(start_time*300) if you want a timestamptz
22:25 < RhodiumToad> | what does the explain analyze look like?
22:27 < jadams> | I'll grab it in a sec for you. Just pasted the query and output here: https://gist.github.com/knewter/5515990
22:27 < jadams> | the start_time and end_time aren't five minutes apart...
22:27 < jadams> | let me paste the explain analyze
22:27 < RhodiumToad> | oh, sorry
22:27 < RhodiumToad> | start_time*300, start_time*300+300
22:28 < jadams> | http://explain.depesz.com/s/FDR
22:28 < jadams> | RhodiumToad: aha, thanks...derp
22:28 < RhodiumToad> | the start_time+300 in the original should have been start_time+1
22:29 --> | eggyknap_ [[email protected]] has joined #postgresql
22:29 < RhodiumToad> | it's better if you can sort by the original raw start_time rather than the timestamptz version of it
22:29 < RhodiumToad> | saves a sort step
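(Putting those corrections together: a bucket number b covers the half-open interval [b*300, b*300+300) in epoch seconds. For display, to_timestamp(b*300) gives a timestamptz, but the sort should stay on the raw integer:)

    select start_time*300                as bucket_start_epoch,
           start_time*300 + 300          as bucket_end_epoch,
           to_timestamp(start_time*300)  as bucket_start_ts   -- display only
    from (select distinct date/300 as start_time from trades) s
    order by start_time;  -- raw integer sort avoids an extra sort step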
22:29 <-- | eggyknap [~jtolley@unaffiliated/eggyknap] has quit (Quit: leaving)
22:30 --- | eggyknap_ is now known as eggyknap
22:30 <-- | eggyknap [[email protected]] has quit (Changing host)
22:30 --> | eggyknap [~jtolley@unaffiliated/eggyknap] has joined #postgresql
22:30 < RhodiumToad> | anyway, two orders of magnitude speedup, not bad going
22:30 --> | drale2k [[email protected]] has joined #postgresql
22:31 < jadams> | here I fixed the start_time replacement (didn't mean to override the start_time with a tz): http://explain.depesz.com/s/0KC
22:32 < RhodiumToad> | that's the query plan I was aiming for
22:32 < jadams> | awesome. The query's here https://gist.github.com/knewter/5515990
22:32 <-- | dmishe [[email protected]] has quit (Quit: ["Textual IRC Client: www.textualapp.com"])
22:33 --> | mk3548208 [[email protected]] has joined #postgresql
22:33 < jadams> | RhodiumToad: many many thanks on this sir.
22:34 < RhodiumToad> | however, you may have noticed that rows are missing from the output where there's no data in the interval
22:34 < jadams> | yeah
22:34 < jadams> | I'm fairly sure the graphing library can just take care of that
22:34 < jadams> | it's time aware
22:34 < jadams> | I knew that was a consequence of losing the temp table
22:35 --> | an [[email protected]] has joined #postgresql
22:35 < RhodiumToad> | it's possible to fill in the gaps if need be (there's about two different ways of doing so)
22:35 --> | Ja [[email protected]] has joined #postgresql
22:35 <-- | Ja [[email protected]] has left #postgresql ( )
22:35 --- | an is now known as Guest83986
22:35 < RhodiumToad> | one is to do an outer join against a generated time series _after_ the grouping (much more efficient to do it after than before)
22:36 < RhodiumToad> | the other is a little trick with window functions and generate_series in the select list
22:37 < RhodiumToad> | the outer join one is probably easier to understand:
22:37 < RhodiumToad> | select * from (select ... group by start_time) s right join generate_series(...) g(start_time) using (start_time) order by start_time;
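(Expanded into a runnable shape; the epoch bounds 1367600000 and 1367700000 are made-up example values, and the first/last columns are omitted for brevity:)

    select start_time*300 as bucket_start,
           s.volume,             -- NULL for empty intervals; coalesce if needed
           s.low,
           s.high
    from (select date/300    as start_time,
                 sum(amount) as volume,
                 min(price)  as low,
                 max(price)  as high
          from trades
          group by date/300) s
    right join generate_series(1367600000/300,
                               1367700000/300) g(start_time)
           using (start_time)
    order by start_time;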
22:41 --> | josh__ [[email protected]] has joined #postgresql
22:41 --- | josh__ is now known as agliodbs
22:43 <-- | Guest83986 [[email protected]] has quit (Remote host closed the connection)
22:46 <-- | _Tenchi_ [[email protected]] has quit (Ping timeout: 248 seconds)