Last active
August 16, 2017 17:47
-
-
Save jimfulton/4104b5387cf7dce3754bef342a9f84c5 to your computer and use it in GitHub Desktop.
Comparison of GIN and RUM indexes for ranked searches of a database with ~220K documents and searches returning ~50K documents
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
=> create index pgtextindex_text_vector_idx on pgtextindex using gin (text_vector); | |
CREATE INDEX | |
=> analyze pgtextindex; | |
ANALYZE | |
=> \d pgtextindex | |
Table "public.pgtextindex" | |
Column | Type | Modifiers | |
-------------------+-----------------------------+---------------------- | |
docid | integer | not null | |
community_docid | character varying(100) | | |
content_type | character varying(30) | | |
creation_date | timestamp without time zone | | |
modification_date | timestamp without time zone | | |
coefficient | real | not null default 1.0 | |
marker | character varying[] | | |
text_vector | tsvector | | |
Indexes: | |
"pgtextindex_pkey" PRIMARY KEY, btree (docid) | |
"pgtextindex_community_docid_index" btree (community_docid, content_type, creation_date) | |
"pgtextindex_text_vector_idx" gin (text_vector) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT docid, | |
coefficient * ( | |
ts_rank_cd('{.1, 0.2, 0.4, 1.0}', | |
text_vector, | |
to_tsquery('english', :q)) | |
) AS rank, | |
count(*) over() | |
FROM pgtextindex | |
WHERE text_vector @@ to_tsquery('english', :q) | |
ORDER BY rank DESC | |
limit 9; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Limit (cost=14260.66..14260.68 rows=9 width=16) (actual time=420.212..420.215 rows=9 loops=1) | |
-> Sort (cost=14260.66..14339.40 rows=31496 width=16) (actual time=420.210..420.210 rows=9 loops=1) | |
Sort Key: ((coefficient * ts_rank_cd('{0.1,0.2,0.4,1}'::real[], text_vector, '''euro'''::tsquery))) DESC | |
Sort Method: top-N heapsort Memory: 25kB | |
-> WindowAgg (cost=324.10..13603.98 rows=31496 width=16) (actual time=59.031..414.574 rows=33988 loops=1) | |
-> Bitmap Heap Scan on pgtextindex (cost=324.10..13052.80 rows=31496 width=252) (actual time=13.476..40.841 rows=33988 loops=1) | |
Recheck Cond: (text_vector @@ '''euro'':*'::tsquery) | |
Heap Blocks: exact=9679 | |
-> Bitmap Index Scan on pgtextindex_text_vector_idx (cost=0.00..316.22 rows=31496 width=0) (actual time=11.823..11.823 rows=33988 loops=1) | |
Index Cond: (text_vector @@ '''euro'':*'::tsquery) | |
Planning time: 0.424 ms | |
Execution time: 420.661 ms |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
=> create index pgtextindex_rum_text_vector_idx on pgtextindex using rum (text_vector rum_tsvector_ops); | |
CREATE INDEX | |
=> analyze pgtextindex; | |
ANALYZE | |
=> \d pgtextindex; | |
Table "public.pgtextindex" | |
Column | Type | Modifiers | |
-------------------+-----------------------------+---------------------- | |
docid | integer | not null | |
community_docid | character varying(100) | | |
content_type | character varying(30) | | |
creation_date | timestamp without time zone | | |
modification_date | timestamp without time zone | | |
coefficient | real | not null default 1.0 | |
marker | character varying[] | | |
text_vector | tsvector | | |
Indexes: | |
"pgtextindex_pkey" PRIMARY KEY, btree (docid) | |
"pgtextindex_community_docid_index" btree (community_docid, content_type, creation_date) | |
"pgtextindex_rum_text_vector_idx" rum (text_vector) |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
SELECT docid, (text_vector <=> to_tsquery('english', :q)) / coefficient as rank, count(*) over() | |
FROM pgtextindex | |
WHERE text_vector @@ to_tsquery('english', :q) | |
ORDER BY rank | |
limit 9; |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Limit (cost=14897.28..14897.31 rows=9 width=16) (actual time=470.662..470.665 rows=9 loops=1) | |
-> Sort (cost=14897.28..15002.50 rows=42087 width=16) (actual time=470.652..470.654 rows=9 loops=1) | |
Sort Key: (((text_vector <=> '''euro'':*'::tsquery) / coefficient)) | |
Sort Method: top-N heapsort Memory: 25kB | |
-> WindowAgg (cost=422.17..14019.78 rows=42087 width=16) (actual time=92.669..464.930 rows=33988 loops=1) | |
-> Bitmap Heap Scan on pgtextindex (cost=422.17..13283.26 rows=42087 width=260) (actual time=37.685..72.324 rows=33988 loops=1) | |
Recheck Cond: (text_vector @@ '''euro'':*'::tsquery) | |
Heap Blocks: exact=9679 | |
-> Bitmap Index Scan on pgtextindex_rum_text_vector_idx (cost=0.00..411.65 rows=42087 width=0) (actual time=35.995..35.995 rows=33988 loops=1) | |
Index Cond: (text_vector @@ '''euro'':*'::tsquery) | |
Planning time: 6.881 ms | |
Execution time: 471.571 ms |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Query count GIN RUM | |
euro:* 33988 420ms 430ms | |
afric:* 40169 410ms 430ms | |
open:* 56497 580ms 550ms | |
europe 20452 280ms 270ms | |
africa 33086 360ms 340ms | |
open 56120 560ms 530ms | |
Times are best of many (>30). Basically, for each test, I ran until I didn't see improvements. | |
I also ran explain analyze for each query before doing timing runs. |
I was wondering if prefix search was distorting results, so I added some non-prefix variants. For these RUM seem to a a teeny bit faster, but that could also be noise, as these numbers can vary quite a bit.
The count(*) over
is why not just the RUM indexes are hit. See postgrespro/rum#23 and postgrespro/rum#25
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Some notes on the data:
220K documents. (31% 9 words or less)
Unique words:
P30 9
Median 57
P90 656
p99 1837
p99.9 2928
max 6398
~130K null records (so ~350 records total)
Run on Single-CPU VM w 4G of RAM