davetapley · October 25, 2012 21:10
diff --git a/gistfile1.txt b/gistfile1.txt
 (13:27:40) dukedave: Lazy post: Any comments on my question here, the single answer, and my comment on that answer: http://dba.stackexchange.com/questions/25890/does-updating-a-partially-indexed-field-add-or-remove-it-from-an-index-in-postgr
 (13:29:05) RhodiumToad: dukedave: your answer seems correct
 (13:29:19) supplicant: I think dukedave is the commenter?
 (13:29:31) dukedave: Yes, I'm the asker and commenter :)
 (13:29:48) RhodiumToad: dukedave: ok, so the answer you received is correct :-)
 (13:29:50) dukedave: The answer certainly makes sense, but I'm still seeing the partial index grow
 (13:29:56) dukedave: Which seems in consistent 
 (13:30:19) RhodiumToad: dukedave: no, not really. it'll depend on the exact pattern of updates. here's why:
 (13:30:44) dukedave: The purpose of making it partial was that it only has non-archived rows in it, which should be a reasonably fixed number
 (13:30:44) RhodiumToad: dukedave: I'm assuming we're inserting in strictly increasing order of ids, since that's what usually happens
 (13:30:53) dukedave: RhodiumToad: Correct
 (13:31:45) dukedave: cron job runs every 10 minutes and sets all rows older than 3 days to archived = true
 (13:32:06) RhodiumToad: dukedave: all rows without exception?
 (13:32:13) dukedave: RhodiumToad: Yes
 (13:32:18) dukedave: (for now)
 (13:32:26) RhodiumToad: dukedave: how many rows are there in the table so far?
 (13:34:04) RhodiumToad: dukedave: (and covering what timeframe?)
 (13:35:23) dukedave: RhodiumToad: 32660323 total, 2434934  non-archived, from 2012-08-08  to the present
 (13:36:02) RhodiumToad: croaky: because while combining multiple indexes is a whole lot better than doing a seqscan, it's still very slow compared to a good multicolumn index
 (13:36:28) RhodiumToad: croaky: and in this case, the multicolumn index gets you to the correct row in one single probe into one index
 (13:37:56) RhodiumToad: also combining indexes doesn't get you ordering
 (13:39:39) RhodiumToad: dukedave: hm. what does pgstattuple tell you about the index? or the output of vacuum verbose (not full)?
 (13:44:39) dukedave: RhodiumToad: Thanks, VAC VERBOSE output here: https://gist.github.com/3955255#comments
 (13:44:45) dukedave: Along with comment explaining
 (13:44:59) dukedave: Ran that yesterday, but going to run again now
 (13:47:32) RhodiumToad: dukedave: is the index really on a created_at field, or is it on user_id as suggested by the name?
 (13:47:44) dukedave: RhodiumToad: It's on used_id
 (13:48:32) dukedave: RhodiumToad: Ah crap, sorry I forgot the question on SO referenced created_at
 (13:49:01) dukedave: I opted against that after playing with the query planner, it made more sense to index on used_id, then let it quicksort on created_at
 (13:49:43) dukedave: Since our use case is: Get user's (last 3 days worth of) events, do stuff with them
 (13:50:56) RhodiumToad: dukedave: well, the index size looks pretty much as expected
 (13:51:36) dukedave: RhodiumToad: My concern is that the index appears to be growing, despite the fact that it's contents should be stable
 (13:52:40) RhodiumToad: dukedave: the reason is that the rows being deleted from the index are typically going to be scattered over many pages.
 (13:53:13) dukedave: RhodiumToad: Ah I see, so it can't free up until the page is empty 
 (13:53:27) RhodiumToad: dukedave: since index pages aren't coalesced, this means you'll have a lot of nearly empty pages
 (13:54:00) RhodiumToad: dukedave: if you're really worried about index space, then one way to fix that is to index on (user_id,created_at)
 (13:54:29) RhodiumToad: dukedave: but frankly I wouldn't necessarily bother
 (13:54:48) RhodiumToad: dukedave: unless you have a relatively small number of distinct user_ids
 (13:55:06) dukedave: RhodiumToad: Because that'll force consecutive events (which will all be archived at the same time) to be on the same page?
 (13:55:12) RhodiumToad: (there's some benefit in a btree index of not having huge numbers of identical entries)
 (13:55:19) RhodiumToad: dukedave: yes
 (13:55:32) dukedave: But, if I have an double index, won't I need the created_at to find anything?
 (13:55:37) RhodiumToad: dukedave: no
 (13:55:52) dukedave: RhodiumToad: That's, totally not what I ever thought :)
 (13:55:54) RhodiumToad: dukedave: an index on (a,b) is almost as good as an index on (a) alone for queries that don't specify b
 (13:56:10) dukedave: RhodiumToad: Did not know that.
 (13:56:16) RhodiumToad: dukedave: and the only reason it's only "almost" as good is because it's slightly larger
 (13:56:41) RhodiumToad: dukedave: (which might be offset in this case by better page deletion behaviour)
 (13:57:36) RhodiumToad: dukedave: note that order of columns matters - an index on (a,b) is no use for a query that only specifies b
 (13:58:40) RhodiumToad: dukedave: (imagine a hardcopy phonebook as being an index on (surname,initials,address) and imagine looking up everyone with a single surname (easy), compared with looking up everyone with a specific initial (requires searching the whole book))
 (13:59:15) dukedave: RhodiumToad: In answer to your question, 'relatively small number of distinct user_ids': We have around 5000 users worth of events, with 2433690 events; so ~480 events per user in the index.
 (13:59:23) dukedave: RhodiumToad: That's a great analogy :)
 (13:59:49) RhodiumToad: dukedave: that's about at the break-even point for whether including created_at would be helpful.
 (14:00:13) RhodiumToad: dukedave: your stats from vacuum suggest that you're typically getting about 280 index entries per page
 (14:00:42) RhodiumToad: dukedave: so most users will likely be using two pages of index, maybe some with more
 (14:01:18) RhodiumToad: dukedave: (and less than three on average, since you have 13379 pages for 5000 users)
 (14:01:54) RhodiumToad: dukedave: if you had significantly more items per user I'd add the created_at field, but as it is, it's probably a bit marginal
	(13:27:40) dukedave: Lazy post: Any comments on my question here, the single answer, and my comment on that answer: http://dba.stackexchange.com/questions/25890/does-updating-a-partially-indexed-field-add-or-remove-it-from-an-index-in-postgr
	(13:29:05) RhodiumToad: dukedave: your answer seems correct
	(13:29:19) supplicant: I think dukedave is the commenter?
	(13:29:31) dukedave: Yes, I'm the asker and commenter :)
	(13:29:48) RhodiumToad: dukedave: ok, so the answer you received is correct :-)
	(13:29:50) dukedave: The answer certainly makes sense, but I'm still seeing the partial index grow
	(13:29:56) dukedave: Which seems in consistent
	(13:30:19) RhodiumToad: dukedave: no, not really. it'll depend on the exact pattern of updates. here's why:
	(13:30:44) dukedave: The purpose of making it partial was that it only has non-archived rows in it, which should be a reasonably fixed number
	(13:30:44) RhodiumToad: dukedave: I'm assuming we're inserting in strictly increasing order of ids, since that's what usually happens
	(13:30:53) dukedave: RhodiumToad: Correct
	(13:31:45) dukedave: cron job runs every 10 minutes and sets all rows older than 3 days to archived = true
	(13:32:06) RhodiumToad: dukedave: all rows without exception?
	(13:32:13) dukedave: RhodiumToad: Yes
	(13:32:18) dukedave: (for now)
	(13:32:26) RhodiumToad: dukedave: how many rows are there in the table so far?
	(13:34:04) RhodiumToad: dukedave: (and covering what timeframe?)
	(13:35:23) dukedave: RhodiumToad: 32660323 total, 2434934 non-archived, from 2012-08-08 to the present
	(13:36:02) RhodiumToad: croaky: because while combining multiple indexes is a whole lot better than doing a seqscan, it's still very slow compared to a good multicolumn index
	(13:36:28) RhodiumToad: croaky: and in this case, the multicolumn index gets you to the correct row in one single probe into one index
	(13:37:56) RhodiumToad: also combining indexes doesn't get you ordering
	(13:39:39) RhodiumToad: dukedave: hm. what does pgstattuple tell you about the index? or the output of vacuum verbose (not full)?
	(13:44:39) dukedave: RhodiumToad: Thanks, VAC VERBOSE output here: https://gist.github.com/3955255#comments
	(13:44:45) dukedave: Along with comment explaining
	(13:44:59) dukedave: Ran that yesterday, but going to run again now
	(13:47:32) RhodiumToad: dukedave: is the index really on a created_at field, or is it on user_id as suggested by the name?
	(13:47:44) dukedave: RhodiumToad: It's on used_id
	(13:48:32) dukedave: RhodiumToad: Ah crap, sorry I forgot the question on SO referenced created_at
	(13:49:01) dukedave: I opted against that after playing with the query planner, it made more sense to index on used_id, then let it quicksort on created_at
	(13:49:43) dukedave: Since our use case is: Get user's (last 3 days worth of) events, do stuff with them
	(13:50:56) RhodiumToad: dukedave: well, the index size looks pretty much as expected
	(13:51:36) dukedave: RhodiumToad: My concern is that the index appears to be growing, despite the fact that it's contents should be stable
	(13:52:40) RhodiumToad: dukedave: the reason is that the rows being deleted from the index are typically going to be scattered over many pages.
	(13:53:13) dukedave: RhodiumToad: Ah I see, so it can't free up until the page is empty
	(13:53:27) RhodiumToad: dukedave: since index pages aren't coalesced, this means you'll have a lot of nearly empty pages
	(13:54:00) RhodiumToad: dukedave: if you're really worried about index space, then one way to fix that is to index on (user_id,created_at)
	(13:54:29) RhodiumToad: dukedave: but frankly I wouldn't necessarily bother
	(13:54:48) RhodiumToad: dukedave: unless you have a relatively small number of distinct user_ids
	(13:55:06) dukedave: RhodiumToad: Because that'll force consecutive events (which will all be archived at the same time) to be on the same page?
	(13:55:12) RhodiumToad: (there's some benefit in a btree index of not having huge numbers of identical entries)
	(13:55:19) RhodiumToad: dukedave: yes
	(13:55:32) dukedave: But, if I have an double index, won't I need the created_at to find anything?
	(13:55:37) RhodiumToad: dukedave: no
	(13:55:52) dukedave: RhodiumToad: That's, totally not what I ever thought :)
	(13:55:54) RhodiumToad: dukedave: an index on (a,b) is almost as good as an index on (a) alone for queries that don't specify b
	(13:56:10) dukedave: RhodiumToad: Did not know that.
	(13:56:16) RhodiumToad: dukedave: and the only reason it's only "almost" as good is because it's slightly larger
	(13:56:41) RhodiumToad: dukedave: (which might be offset in this case by better page deletion behaviour)
	(13:57:36) RhodiumToad: dukedave: note that order of columns matters - an index on (a,b) is no use for a query that only specifies b
	(13:58:40) RhodiumToad: dukedave: (imagine a hardcopy phonebook as being an index on (surname,initials,address) and imagine looking up everyone with a single surname (easy), compared with looking up everyone with a specific initial (requires searching the whole book))
	(13:59:15) dukedave: RhodiumToad: In answer to your question, 'relatively small number of distinct user_ids': We have around 5000 users worth of events, with 2433690 events; so ~480 events per user in the index.
	(13:59:23) dukedave: RhodiumToad: That's a great analogy :)
	(13:59:49) RhodiumToad: dukedave: that's about at the break-even point for whether including created_at would be helpful.
	(14:00:13) RhodiumToad: dukedave: your stats from vacuum suggest that you're typically getting about 280 index entries per page
	(14:00:42) RhodiumToad: dukedave: so most users will likely be using two pages of index, maybe some with more
	(14:01:18) RhodiumToad: dukedave: (and less than three on average, since you have 13379 pages for 5000 users)
	(14:01:54) RhodiumToad: dukedave: if you had significantly more items per user I'd add the created_at field, but as it is, it's probably a bit marginal