Created October 25, 2012 21:10
Convo about partial index with RhodiumToad
(13:27:40) dukedave: Lazy post: Any comments on my question here, the single answer, and my comment on that answer: http://dba.stackexchange.com/questions/25890/does-updating-a-partially-indexed-field-add-or-remove-it-from-an-index-in-postgr
(13:29:05) RhodiumToad: dukedave: your answer seems correct
(13:29:19) supplicant: I think dukedave is the commenter?
(13:29:31) dukedave: Yes, I'm the asker and commenter :)
(13:29:48) RhodiumToad: dukedave: ok, so the answer you received is correct :-)
(13:29:50) dukedave: The answer certainly makes sense, but I'm still seeing the partial index grow
(13:29:56) dukedave: Which seems inconsistent
(13:30:19) RhodiumToad: dukedave: no, not really. it'll depend on the exact pattern of updates. here's why:
(13:30:44) dukedave: The purpose of making it partial was that it only has non-archived rows in it, which should be a reasonably fixed number
(13:30:44) RhodiumToad: dukedave: I'm assuming we're inserting in strictly increasing order of ids, since that's what usually happens
(13:30:53) dukedave: RhodiumToad: Correct
(13:31:45) dukedave: cron job runs every 10 minutes and sets all rows older than 3 days to archived = true
(13:32:06) RhodiumToad: dukedave: all rows without exception?
(13:32:13) dukedave: RhodiumToad: Yes
(13:32:18) dukedave: (for now)
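
[A minimal SQL sketch of the setup being described above. The table name "events" and the index name are assumptions; only the user_id/created_at/archived columns and the 3-day rule come from the conversation.]

    -- Partial index holding only non-archived rows (names assumed)
    CREATE INDEX index_events_on_user_id
        ON events (user_id)
        WHERE NOT archived;

    -- The cron job described above: archive everything older than 3 days,
    -- which drops those rows out of the partial index
    UPDATE events
       SET archived = true
     WHERE NOT archived
       AND created_at < now() - interval '3 days';
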
(13:32:26) RhodiumToad: dukedave: how many rows are there in the table so far?
(13:34:04) RhodiumToad: dukedave: (and covering what timeframe?)
(13:35:23) dukedave: RhodiumToad: 32660323 total, 2434934 non-archived, from 2012-08-08 to the present
(13:36:02) RhodiumToad: croaky: because while combining multiple indexes is a whole lot better than doing a seqscan, it's still very slow compared to a good multicolumn index
(13:36:28) RhodiumToad: croaky: and in this case, the multicolumn index gets you to the correct row in one single probe into one index
(13:37:56) RhodiumToad: also combining indexes doesn't get you ordering
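
[Sketch of the contrast being drawn for croaky: the table and column names here are purely illustrative and not part of the conversation.]

    -- Two single-column indexes can only be combined via a bitmap scan:
    CREATE INDEX ON orders (customer_id);
    CREATE INDEX ON orders (status);

    -- A multicolumn index reaches the matching rows in a single descent
    -- into one index, and can also return them in index order:
    CREATE INDEX ON orders (customer_id, status);
    EXPLAIN SELECT * FROM orders WHERE customer_id = 7 AND status = 'open';
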
(13:39:39) RhodiumToad: dukedave: hm. what does pgstattuple tell you about the index? or the output of vacuum verbose (not full)?
(13:44:39) dukedave: RhodiumToad: Thanks, VAC VERBOSE output here: https://gist.github.com/3955255#comments
(13:44:45) dukedave: Along with comment explaining
(13:44:59) dukedave: Ran that yesterday, but going to run again now
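
[For reference, a sketch of the two inspection methods mentioned above; the index name is the one assumed earlier, and pgstattuple must be installed as an extension.]

    -- pgstatindex (from the pgstattuple extension) reports size, leaf
    -- density and fragmentation for a btree index
    CREATE EXTENSION IF NOT EXISTS pgstattuple;
    SELECT * FROM pgstatindex('index_events_on_user_id');

    -- VACUUM VERBOSE (not FULL) prints per-index page counts and how many
    -- dead entries were removed
    VACUUM VERBOSE events;
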
(13:47:32) RhodiumToad: dukedave: is the index really on a created_at field, or is it on user_id as suggested by the name?
(13:47:44) dukedave: RhodiumToad: It's on user_id
(13:48:32) dukedave: RhodiumToad: Ah crap, sorry I forgot the question on SO referenced created_at
(13:49:01) dukedave: I opted against that after playing with the query planner, it made more sense to index on user_id, then let it quicksort on created_at
(13:49:43) dukedave: Since our use case is: Get user's (last 3 days worth of) events, do stuff with them
(13:50:56) RhodiumToad: dukedave: well, the index size looks pretty much as expected
(13:51:36) dukedave: RhodiumToad: My concern is that the index appears to be growing, despite the fact that its contents should be stable
(13:52:40) RhodiumToad: dukedave: the reason is that the rows being deleted from the index are typically going to be scattered over many pages.
(13:53:13) dukedave: RhodiumToad: Ah I see, so it can't free up until the page is empty
(13:53:27) RhodiumToad: dukedave: since index pages aren't coalesced, this means you'll have a lot of nearly empty pages
(13:54:00) RhodiumToad: dukedave: if you're really worried about index space, then one way to fix that is to index on (user_id,created_at)
(13:54:29) RhodiumToad: dukedave: but frankly I wouldn't necessarily bother
(13:54:48) RhodiumToad: dukedave: unless you have a relatively small number of distinct user_ids
(13:55:06) dukedave: RhodiumToad: Because that'll force consecutive events (which will all be archived at the same time) to be on the same page?
(13:55:12) RhodiumToad: (there's some benefit in a btree index of not having huge numbers of identical entries)
(13:55:19) RhodiumToad: dukedave: yes
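
[A sketch of the two-column variant being suggested, with the same assumed table name. Entries for a given user then sit together in created_at order, so when a user's old events are archived, whole leaf pages empty out and can be recycled.]

    -- Two-column partial index (names assumed as above)
    CREATE INDEX index_events_on_user_id_and_created_at
        ON events (user_id, created_at)
        WHERE NOT archived;
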
(13:55:32) dukedave: But, if I have a double index, won't I need the created_at to find anything?
(13:55:37) RhodiumToad: dukedave: no
(13:55:52) dukedave: RhodiumToad: That's totally not what I ever thought :)
(13:55:54) RhodiumToad: dukedave: an index on (a,b) is almost as good as an index on (a) alone for queries that don't specify b
(13:56:10) dukedave: RhodiumToad: Did not know that.
(13:56:16) RhodiumToad: dukedave: and the only reason it's only "almost" as good is because it's slightly larger
(13:56:41) RhodiumToad: dukedave: (which might be offset in this case by better page deletion behaviour)
(13:57:36) RhodiumToad: dukedave: note that order of columns matters - an index on (a,b) is no use for a query that only specifies b
(13:58:40) RhodiumToad: dukedave: (imagine a hardcopy phonebook as being an index on (surname,initials,address) and imagine looking up everyone with a single surname (easy), compared with looking up everyone with a specific initial (requires searching the whole book))
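
[Illustration of the point about column order, using the assumed events table and the two-column index sketched above.]

    -- Can use the (user_id, created_at) index: the leading column is specified
    SELECT * FROM events
     WHERE user_id = 42 AND NOT archived;

    -- Can also use it, and gets the rows back already ordered by created_at
    SELECT * FROM events
     WHERE user_id = 42 AND NOT archived
     ORDER BY created_at DESC;

    -- Cannot use it efficiently: only the trailing column is specified
    SELECT * FROM events
     WHERE created_at > now() - interval '3 days' AND NOT archived;
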
(13:59:15) dukedave: RhodiumToad: In answer to your question, 'relatively small number of distinct user_ids': We have around 5000 users' worth of events, with 2433690 events; so ~480 events per user in the index.
(13:59:23) dukedave: RhodiumToad: That's a great analogy :)
(13:59:49) RhodiumToad: dukedave: that's about at the break-even point for whether including created_at would be helpful.
(14:00:13) RhodiumToad: dukedave: your stats from vacuum suggest that you're typically getting about 280 index entries per page
(14:00:42) RhodiumToad: dukedave: so most users will likely be using two pages of index, maybe some with more
(14:01:18) RhodiumToad: dukedave: (and less than three on average, since you have 13379 pages for 5000 users)
(14:01:54) RhodiumToad: dukedave: if you had significantly more items per user I'd add the created_at field, but as it is, it's probably a bit marginal
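
[Rough arithmetic behind those figures, for reference: 2,433,690 live entries / ~5,000 users ≈ 487 entries per user, i.e. just under two full pages at ~280 entries per page; the 13,379 actual pages work out to ≈ 2.7 pages per user, the gap being the nearly-empty pages described above.]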