M4 Pro MBP w/ 48 GiB RAM
Postgres 17, shared_buffers=2GB, effective_cache_size=8GB
I created a CSV of 5,000,000 rows of cities and countries (duplicates obviously exist - there are 7146 unique rows).
Two tables were created, one with the city having a non-deterministic collation.
postgres=# CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
CREATE COLLATION
postgres=# CREATE TABLE geo_strict (id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY, city VARCHAR(126) NOT NULL, country VARCHAR(126) NOT NULL);
CREATE TABLE
postgres=# CREATE TABLE geo_loose (id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY, city VARCHAR(126) NOT NULL COLLATE "ignore_accent_case", country VARCHAR(126) NOT NULL COLLATE "ignore_accent_case");
CREATE TABLE
postgres=# \copy geo_strict (city, country) FROM '/Users/sgarland/git/dev_home_projects/genSQL/test.csv' WITH (FORMAT CSV, HEADER);
COPY 5000000
postgres=# \copy geo_loose (city, country) FROM '/Users/sgarland/git/dev_home_projects/genSQL/test.csv' WITH (FORMAT CSV, HEADER);
COPY 5000000
postgres=# CREATE INDEX geo_strict_city_idx ON geo_strict (city);
CREATE INDEX
postgres=# CREATE INDEX geo_loose_city_idx ON geo_loose (city);
CREATE INDEX
postgres=# ANALYZE geo_strict;
ANALYZE
postgres=# ANALYZE geo_loose;
ANALYZE
postgres=# SELECT * FROM geo_strict LIMIT 10;
id | city | country
----+------------+----------
1 | Cambridge | Jamaica
2 | Benares | India
3 | Quibdo | Colombia
4 | Lueneburg | Germany
5 | Meru | Kenya
6 | Heriot Bay | Canada
7 | Envigado | Colombia
8 | Cognac | France
9 | Atotonilco | Mexico
10 | Floridia | Italy
(10 rows)
postgres=# SELECT * FROM geo_loose LIMIT 10;
id | city | country
----+------------+----------
1 | Cambridge | Jamaica
2 | Benares | India
3 | Quibdo | Colombia
4 | Lueneburg | Germany
5 | Meru | Kenya
6 | Heriot Bay | Canada
7 | Envigado | Colombia
8 | Cognac | France
9 | Atotonilco | Mexico
10 | Floridia | Italy
(10 rows)
I created an immutable version of UNACCENT
so it could be used in a functional index.
CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
RETURNS text
LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
RETURNS text
LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
RETURN public.immutable_unaccent(regdictionary 'public.unaccent', $1);
As expected, the strict collation is faster, though the difference may not matter in practice.
postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city = 'Altötting';
count
-------
700
(1 row)
Time: 0.471 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city = 'Altötting';
count
-------
700
(1 row)
Time: 0.883 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
count
-------
2098
(1 row)
Time: 2.084 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
count
-------
2098
(1 row)
Time: 1.409 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
count
-------
2098
(1 row)
Time: 1.475 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
count
-------
2098
(1 row)
Time: 0.825 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
count
-------
2098
(1 row)
Time: 1.022 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
count
-------
2098
(1 row)
Time: 1.036 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
count
-------
0
(1 row)
Time: 0.447 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
count
-------
0
(1 row)
Time: 0.713 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
count
-------
0
(1 row)
Time: 0.508 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
count
-------
2098
(1 row)
Time: 1.633 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
count
-------
2098
(1 row)
Time: 1.481 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
count
-------
2098
(1 row)
Time: 1.447 ms
I created a functional index on geo_strict.city
using LOWER()
and F_UNACCENT()
, and then dropped the original index.
It's reasonably fast, though the plain B+tree index on the non-deterministic collation is faster.
postgres=# CREATE INDEX geo_strict_city_func_idx ON geo_strict (LOWER(F_UNACCENT(city)));
CREATE INDEX
Time: 1191.505 ms (00:01.192)
postgres=# DROP INDEX geo_strict_city_idx;
DROP INDEX
Time: 4.991 ms
postgres=# ANALYZE geo_strict;
ANALYZE
Time: 118.389 ms
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM geo_strict WHERE LOWER(F_UNACCENT(city)) IN (LOWER(F_UNACCENT('Skummeslövsstrand')), LOWER(F_UNACCENT('San José de Mayo')), LOWER(F_UNACCENT('Malargüe')));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2287.53..2287.54 rows=1 width=8) (actual time=9.275..9.276 rows=1 loops=1)
Buffers: shared hit=2061 read=7
-> Bitmap Heap Scan on geo_strict (cost=21.08..2282.22 rows=2127 width=0) (actual time=4.146..9.052 rows=2098 loops=1)
Recheck Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
Heap Blocks: exact=2059
Buffers: shared hit=2061 read=7
-> Bitmap Index Scan on geo_strict_city_func_idx (cost=0.00..20.55 rows=2127 width=0) (actual time=3.587..3.587 rows=2098 loops=1)
Index Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
Buffers: shared hit=2 read=7
Planning:
Buffers: shared hit=5
Planning Time: 0.463 ms
Execution Time: 9.311 ms
(13 rows)
Time: 10.481 ms
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM geo_strict WHERE LOWER(F_UNACCENT(city)) IN (LOWER(F_UNACCENT('Skummeslövsstrand')), LOWER(F_UNACCENT('San José de Mayo')), LOWER(F_UNACCENT('Malargüe')));
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2287.53..2287.54 rows=1 width=8) (actual time=3.299..3.300 rows=1 loops=1)
Buffers: shared hit=2068
-> Bitmap Heap Scan on geo_strict (cost=21.08..2282.22 rows=2127 width=0) (actual time=0.407..3.095 rows=2098 loops=1)
Recheck Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
Heap Blocks: exact=2059
Buffers: shared hit=2068
-> Bitmap Index Scan on geo_strict_city_func_idx (cost=0.00..20.55 rows=2127 width=0) (actual time=0.188..0.188 rows=2098 loops=1)
Index Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
Buffers: shared hit=9
Planning Time: 0.145 ms
Execution Time: 3.321 ms
(11 rows)
Time: 3.761 ms
I created a trigram index on geo_strict.city
. As expected, it's quite a bit slower than a B+tree, though it doesn't
require you to perform any casts, which may be preferable. Also, of course, it can be used for LIKE
operations.
postgres=# CREATE INDEX geo_strict_city_trgm_idx ON geo_strict USING GIN (city gin_trgm_ops);
CREATE INDEX
Time: 5697.276 ms (00:05.697)
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=2565.30..2565.31 rows=1 width=8) (actual time=47.659..47.661 rows=1 loops=1)
Buffers: shared hit=3721
-> Bitmap Heap Scan on geo_strict (cost=304.06..2559.96 rows=2133 width=0) (actual time=45.028..47.581 rows=2098 loops=1)
Recheck Cond: ((city)::text = ANY ('{Skummeslövsstrand,"San José de Mayo",Malargüe}'::text[]))
Heap Blocks: exact=2059
Buffers: shared hit=3721
-> Bitmap Index Scan on geo_strict_city_trgm_idx (cost=0.00..303.53 rows=2133 width=0) (actual time=44.799..44.799 rows=2098 loops=1)
Index Cond: ((city)::text = ANY ('{Skummeslövsstrand,"San José de Mayo",Malargüe}'::text[]))
Buffers: shared hit=1662
Planning:
Buffers: shared hit=1
Planning Time: 0.216 ms
Execution Time: 47.698 ms
(13 rows)
The functional B+tree index is quite a bit smaller than the non-functional B+tree index, though that's due solely to the collations, and not the function itself. The trigram index is in-between the two.
postgres=# \dti+ geo_*
List of relations
Schema | Name | Type | Owner | Table | Persistence | Access method | Size | Description
--------+--------------------------+-------+----------+------------+-------------+---------------+--------+-------------
public | geo_loose | table | sgarland | | permanent | heap | 273 MB |
public | geo_loose_city_idx | index | sgarland | geo_loose | permanent | btree | 133 MB |
public | geo_loose_pkey | index | sgarland | geo_loose | permanent | btree | 107 MB |
public | geo_strict | table | sgarland | | permanent | heap | 273 MB |
public | geo_strict_city_func_idx | index | sgarland | geo_strict | permanent | btree | 34 MB |
public | geo_strict_city_trgm_idx | index | sgarland | geo_strict | permanent | gin | 123 MB |
public | geo_strict_pkey | index | sgarland | geo_strict | permanent | btree | 107 MB |
(7 rows)