Setup

Hardware

M4 Pro MBP w/ 48 GiB RAM

Software

Postgres 17, shared_buffers=2GB, effective_cache_size=8GB

Source Data

I created a CSV of 5,000,000 rows of cities and countries (duplicates obviously exist - there are 7146 unique rows).

Tables and B+tree indices

Two tables were created, one with the city having a non-deterministic collation.

postgres=# CREATE COLLATION ignore_accent_case (provider = icu, deterministic = false, locale = 'und-u-ks-level1');
CREATE COLLATION
postgres=# CREATE TABLE geo_strict (id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY, city VARCHAR(126) NOT NULL, country VARCHAR(126) NOT NULL);
CREATE TABLE
postgres=# CREATE TABLE geo_loose (id INT PRIMARY KEY GENERATED ALWAYS AS IDENTITY, city VARCHAR(126) NOT NULL COLLATE "ignore_accent_case", country VARCHAR(126) NOT NULL COLLATE "ignore_accent_case");
CREATE TABLE
postgres=# \copy geo_strict (city, country) FROM '/Users/sgarland/git/dev_home_projects/genSQL/test.csv' WITH (FORMAT CSV, HEADER);
COPY 5000000
postgres=# \copy geo_loose (city, country) FROM '/Users/sgarland/git/dev_home_projects/genSQL/test.csv' WITH (FORMAT CSV, HEADER);
COPY 5000000
postgres=# CREATE INDEX geo_strict_city_idx ON geo_strict (city);
CREATE INDEX
postgres=# CREATE INDEX geo_loose_city_idx ON geo_loose (city);
CREATE INDEX
postgres=# ANALYZE geo_strict;
ANALYZE
postgres=# ANALYZE geo_loose;
ANALYZE
postgres=# SELECT * FROM geo_strict LIMIT 10;
 id |    city    | country
----+------------+----------
  1 | Cambridge  | Jamaica
  2 | Benares    | India
  3 | Quibdo     | Colombia
  4 | Lueneburg  | Germany
  5 | Meru       | Kenya
  6 | Heriot Bay | Canada
  7 | Envigado   | Colombia
  8 | Cognac     | France
  9 | Atotonilco | Mexico
 10 | Floridia   | Italy
(10 rows)

postgres=# SELECT * FROM geo_loose LIMIT 10;
 id |    city    | country
----+------------+----------
  1 | Cambridge  | Jamaica
  2 | Benares    | India
  3 | Quibdo     | Colombia
  4 | Lueneburg  | Germany
  5 | Meru       | Kenya
  6 | Heriot Bay | Canada
  7 | Envigado   | Colombia
  8 | Cognac     | France
  9 | Atotonilco | Mexico
 10 | Floridia   | Italy
(10 rows)

F_UNACCENT

I created an immutable version of UNACCENT so it could be used in a functional index.

CREATE OR REPLACE FUNCTION public.immutable_unaccent(regdictionary, text)
  RETURNS text
  LANGUAGE c IMMUTABLE PARALLEL SAFE STRICT AS
'$libdir/unaccent', 'unaccent_dict';
 
CREATE OR REPLACE FUNCTION public.f_unaccent(text)
  RETURNS text
  LANGUAGE sql IMMUTABLE PARALLEL SAFE STRICT
RETURN public.immutable_unaccent(regdictionary 'public.unaccent', $1);

Benchmarks

Baseline

As expected, the strict collation is faster, though the difference may not matter in practice.

postgres=# \timing
Timing is on.
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city = 'Altötting';
 count
-------
   700
(1 row)

Time: 0.471 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city = 'Altötting';
 count
-------
   700
(1 row)

Time: 0.883 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
 count
-------
  2098
(1 row)

Time: 2.084 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
 count
-------
  2098
(1 row)

Time: 1.409 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
 count
-------
  2098
(1 row)

Time: 1.475 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
 count
-------
  2098
(1 row)

Time: 0.825 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
 count
-------
  2098
(1 row)

Time: 1.022 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
 count
-------
  2098
(1 row)

Time: 1.036 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
 count
-------
     0
(1 row)

Time: 0.447 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
 count
-------
     0
(1 row)

Time: 0.713 ms
postgres=# SELECT COUNT(*) FROM geo_strict WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
 count
-------
     0
(1 row)

Time: 0.508 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
 count
-------
  2098
(1 row)

Time: 1.633 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
 count
-------
  2098
(1 row)

Time: 1.481 ms
postgres=# SELECT COUNT(*) FROM geo_loose WHERE city IN ('skummeslovsstrand', 'san jose de mayo', 'malargue');
 count
-------
  2098
(1 row)

Time: 1.447 ms

Functional Index

I created a functional index on geo_strict.city using LOWER() and F_UNACCENT(), and then dropped the original index. It's reasonably fast, though the plain B+tree index on the non-deterministic collation is faster.

postgres=# CREATE INDEX geo_strict_city_func_idx ON geo_strict (LOWER(F_UNACCENT(city)));
CREATE INDEX
Time: 1191.505 ms (00:01.192)
postgres=# DROP INDEX geo_strict_city_idx;
DROP INDEX
Time: 4.991 ms
postgres=# ANALYZE geo_strict;
ANALYZE
Time: 118.389 ms

postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM geo_strict WHERE LOWER(F_UNACCENT(city)) IN (LOWER(F_UNACCENT('Skummeslövsstrand')), LOWER(F_UNACCENT('San José de Mayo')), LOWER(F_UNACCENT('Malargüe')));
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2287.53..2287.54 rows=1 width=8) (actual time=9.275..9.276 rows=1 loops=1)
   Buffers: shared hit=2061 read=7
   ->  Bitmap Heap Scan on geo_strict  (cost=21.08..2282.22 rows=2127 width=0) (actual time=4.146..9.052 rows=2098 loops=1)
         Recheck Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
         Heap Blocks: exact=2059
         Buffers: shared hit=2061 read=7
         ->  Bitmap Index Scan on geo_strict_city_func_idx  (cost=0.00..20.55 rows=2127 width=0) (actual time=3.587..3.587 rows=2098 loops=1)
               Index Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
               Buffers: shared hit=2 read=7
 Planning:
   Buffers: shared hit=5
 Planning Time: 0.463 ms
 Execution Time: 9.311 ms
(13 rows)

Time: 10.481 ms
postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM geo_strict WHERE LOWER(F_UNACCENT(city)) IN (LOWER(F_UNACCENT('Skummeslövsstrand')), LOWER(F_UNACCENT('San José de Mayo')), LOWER(F_UNACCENT('Malargüe')));
                                                                            QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2287.53..2287.54 rows=1 width=8) (actual time=3.299..3.300 rows=1 loops=1)
   Buffers: shared hit=2068
   ->  Bitmap Heap Scan on geo_strict  (cost=21.08..2282.22 rows=2127 width=0) (actual time=0.407..3.095 rows=2098 loops=1)
         Recheck Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
         Heap Blocks: exact=2059
         Buffers: shared hit=2068
         ->  Bitmap Index Scan on geo_strict_city_func_idx  (cost=0.00..20.55 rows=2127 width=0) (actual time=0.188..0.188 rows=2098 loops=1)
               Index Cond: (lower(immutable_unaccent('unaccent'::regdictionary, (city)::text)) = ANY ('{skummeslovsstrand,"san jose de mayo",malargue}'::text[]))
               Buffers: shared hit=9
 Planning Time: 0.145 ms
 Execution Time: 3.321 ms
(11 rows)

Time: 3.761 ms

GIN Trigram Index

I created a trigram index on geo_strict.city. As expected, it's quite a bit slower than a B+tree, though it doesn't require you to perform any casts, which may be preferable. Also, of course, it can be used for LIKE operations.

postgres=# CREATE INDEX geo_strict_city_trgm_idx ON geo_strict USING GIN (city gin_trgm_ops);
CREATE INDEX
Time: 5697.276 ms (00:05.697)

postgres=# EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*) FROM geo_strict WHERE city IN ('Skummeslövsstrand', 'San José de Mayo', 'Malargüe');
                                                                   QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=2565.30..2565.31 rows=1 width=8) (actual time=47.659..47.661 rows=1 loops=1)
   Buffers: shared hit=3721
   ->  Bitmap Heap Scan on geo_strict  (cost=304.06..2559.96 rows=2133 width=0) (actual time=45.028..47.581 rows=2098 loops=1)
         Recheck Cond: ((city)::text = ANY ('{Skummeslövsstrand,"San José de Mayo",Malargüe}'::text[]))
         Heap Blocks: exact=2059
         Buffers: shared hit=3721
         ->  Bitmap Index Scan on geo_strict_city_trgm_idx  (cost=0.00..303.53 rows=2133 width=0) (actual time=44.799..44.799 rows=2098 loops=1)
               Index Cond: ((city)::text = ANY ('{Skummeslövsstrand,"San José de Mayo",Malargüe}'::text[]))
               Buffers: shared hit=1662
 Planning:
   Buffers: shared hit=1
 Planning Time: 0.216 ms
 Execution Time: 47.698 ms
(13 rows)

Size Comparison

The functional B+tree index is quite a bit smaller than the non-functional B+tree index, though that's due solely to the collations, and not the function itself. The trigram index is in-between the two.

postgres=# \dti+ geo_*
                                                   List of relations
 Schema |           Name           | Type  |  Owner   |   Table    | Persistence | Access method |  Size  | Description
--------+--------------------------+-------+----------+------------+-------------+---------------+--------+-------------
 public | geo_loose                | table | sgarland |            | permanent   | heap          | 273 MB |
 public | geo_loose_city_idx       | index | sgarland | geo_loose  | permanent   | btree         | 133 MB |
 public | geo_loose_pkey           | index | sgarland | geo_loose  | permanent   | btree         | 107 MB |
 public | geo_strict               | table | sgarland |            | permanent   | heap          | 273 MB |
 public | geo_strict_city_func_idx | index | sgarland | geo_strict | permanent   | btree         | 34 MB  |
 public | geo_strict_city_trgm_idx | index | sgarland | geo_strict | permanent   | gin           | 123 MB |
 public | geo_strict_pkey          | index | sgarland | geo_strict | permanent   | btree         | 107 MB |
(7 rows)

stephanGarland/collation_postgres_perf.md