Skip to content

Instantly share code, notes, and snippets.

@davetapley
Created October 25, 2012 21:10
Show Gist options
  • Save davetapley/3955433 to your computer and use it in GitHub Desktop.
Save davetapley/3955433 to your computer and use it in GitHub Desktop.
Convo about partial index with Rhodium toad
(13:27:40) dukedave: Lazy post: Any comments on my question here, the single answer, and my comment on that answer: http://dba.stackexchange.com/questions/25890/does-updating-a-partially-indexed-field-add-or-remove-it-from-an-index-in-postgr
(13:29:05) RhodiumToad: dukedave: your answer seems correct
(13:29:19) supplicant: I think dukedave is the commenter?
(13:29:31) dukedave: Yes, I'm the asker and commenter :)
(13:29:48) RhodiumToad: dukedave: ok, so the answer you received is correct :-)
(13:29:50) dukedave: The answer certainly makes sense, but I'm still seeing the partial index grow
(13:29:56) dukedave: Which seems in consistent
(13:30:19) RhodiumToad: dukedave: no, not really. it'll depend on the exact pattern of updates. here's why:
(13:30:44) dukedave: The purpose of making it partial was that it only has non-archived rows in it, which should be a reasonably fixed number
(13:30:44) RhodiumToad: dukedave: I'm assuming we're inserting in strictly increasing order of ids, since that's what usually happens
(13:30:53) dukedave: RhodiumToad: Correct
(13:31:45) dukedave: cron job runs every 10 minutes and sets all rows older than 3 days to archived = true
(13:32:06) RhodiumToad: dukedave: all rows without exception?
(13:32:13) dukedave: RhodiumToad: Yes
(13:32:18) dukedave: (for now)
(13:32:26) RhodiumToad: dukedave: how many rows are there in the table so far?
(13:34:04) RhodiumToad: dukedave: (and covering what timeframe?)
(13:35:23) dukedave: RhodiumToad: 32660323 total, 2434934 non-archived, from 2012-08-08 to the present
(13:36:02) RhodiumToad: croaky: because while combining multiple indexes is a whole lot better than doing a seqscan, it's still very slow compared to a good multicolumn index
(13:36:28) RhodiumToad: croaky: and in this case, the multicolumn index gets you to the correct row in one single probe into one index
(13:37:56) RhodiumToad: also combining indexes doesn't get you ordering
(13:39:39) RhodiumToad: dukedave: hm. what does pgstattuple tell you about the index? or the output of vacuum verbose (not full)?
(13:44:39) dukedave: RhodiumToad: Thanks, VAC VERBOSE output here: https://gist.github.com/3955255#comments
(13:44:45) dukedave: Along with comment explaining
(13:44:59) dukedave: Ran that yesterday, but going to run again now
(13:47:32) RhodiumToad: dukedave: is the index really on a created_at field, or is it on user_id as suggested by the name?
(13:47:44) dukedave: RhodiumToad: It's on used_id
(13:48:32) dukedave: RhodiumToad: Ah crap, sorry I forgot the question on SO referenced created_at
(13:49:01) dukedave: I opted against that after playing with the query planner, it made more sense to index on used_id, then let it quicksort on created_at
(13:49:43) dukedave: Since our use case is: Get user's (last 3 days worth of) events, do stuff with them
(13:50:56) RhodiumToad: dukedave: well, the index size looks pretty much as expected
(13:51:36) dukedave: RhodiumToad: My concern is that the index appears to be growing, despite the fact that it's contents should be stable
(13:52:40) RhodiumToad: dukedave: the reason is that the rows being deleted from the index are typically going to be scattered over many pages.
(13:53:13) dukedave: RhodiumToad: Ah I see, so it can't free up until the page is empty
(13:53:27) RhodiumToad: dukedave: since index pages aren't coalesced, this means you'll have a lot of nearly empty pages
(13:54:00) RhodiumToad: dukedave: if you're really worried about index space, then one way to fix that is to index on (user_id,created_at)
(13:54:29) RhodiumToad: dukedave: but frankly I wouldn't necessarily bother
(13:54:48) RhodiumToad: dukedave: unless you have a relatively small number of distinct user_ids
(13:55:06) dukedave: RhodiumToad: Because that'll force consecutive events (which will all be archived at the same time) to be on the same page?
(13:55:12) RhodiumToad: (there's some benefit in a btree index of not having huge numbers of identical entries)
(13:55:19) RhodiumToad: dukedave: yes
(13:55:32) dukedave: But, if I have an double index, won't I need the created_at to find anything?
(13:55:37) RhodiumToad: dukedave: no
(13:55:52) dukedave: RhodiumToad: That's, totally not what I ever thought :)
(13:55:54) RhodiumToad: dukedave: an index on (a,b) is almost as good as an index on (a) alone for queries that don't specify b
(13:56:10) dukedave: RhodiumToad: Did not know that.
(13:56:16) RhodiumToad: dukedave: and the only reason it's only "almost" as good is because it's slightly larger
(13:56:41) RhodiumToad: dukedave: (which might be offset in this case by better page deletion behaviour)
(13:57:36) RhodiumToad: dukedave: note that order of columns matters - an index on (a,b) is no use for a query that only specifies b
(13:58:40) RhodiumToad: dukedave: (imagine a hardcopy phonebook as being an index on (surname,initials,address) and imagine looking up everyone with a single surname (easy), compared with looking up everyone with a specific initial (requires searching the whole book))
(13:59:15) dukedave: RhodiumToad: In answer to your question, 'relatively small number of distinct user_ids': We have around 5000 users worth of events, with 2433690 events; so ~480 events per user in the index.
(13:59:23) dukedave: RhodiumToad: That's a great analogy :)
(13:59:49) RhodiumToad: dukedave: that's about at the break-even point for whether including created_at would be helpful.
(14:00:13) RhodiumToad: dukedave: your stats from vacuum suggest that you're typically getting about 280 index entries per page
(14:00:42) RhodiumToad: dukedave: so most users will likely be using two pages of index, maybe some with more
(14:01:18) RhodiumToad: dukedave: (and less than three on average, since you have 13379 pages for 5000 users)
(14:01:54) RhodiumToad: dukedave: if you had significantly more items per user I'd add the created_at field, but as it is, it's probably a bit marginal
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment