This is a slightly stripped down version from our internal bug tracker. The point of posting this publicly is part FYI, part peer review. I'm hoping someone can look at this, disagree, and tell me all the downsides of using the C locale or point out things I've misunderstood. The Recommendations section in particular is contextualized by our database serving a SaaS product for users from many different locales, thus making locale a render-level concern. YMMV, caveat emptor, etc.
Collation defines the character ordering for textual data. For Postgres (from https://www.postgresql.org/docs/current/static/locale.html):
The locale settings influence the following SQL features:
- Sort order in queries using ORDER BY or the standard comparison operators on textual data
- The upper, lower, and initcap functions
- Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
- The to_char family of functions
- The ability to use indexes with LIKE clauses
Those specific docs fail to mention that collation also affects indices. From https://www.postgresql.org/docs/current/static/sql-createindex.html:
By default, the index uses the collation declared for the column to be indexed or the result collation of the expression to be indexed. Indexes with non-default collations can be useful for queries that involve expressions using non-default collations.
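For example, an index can be built with a collation other than the column's default. A minimal sketch (the table and column here are hypothetical, not from our schema):

CREATE TABLE users (email text);  -- column uses the database default collation
CREATE INDEX users_email_c_idx ON users (email COLLATE "C");

-- the index is only considered for queries that request the same collation
SELECT email FROM users ORDER BY email COLLATE "C";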
In Unicode, 'a' has codepoint 97 (U+0061) and '{' has codepoint 123 (U+007B).
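For reference, you can check those codepoints from SQL as well; in a UTF8 database, ascii() returns the Unicode code point of the first character of its argument:

SELECT ascii('a'), ascii('{');
-- 97 | 123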
Python has a well-defined string ordering (https://docs.python.org/3/tutorial/datastructures.html#comparing-sequences-and-other-types):
The comparison uses lexicographical ordering [...] Lexicographical ordering for strings uses the Unicode code point number to order individual characters.
This means that Python has the same sort order regardless of locale settings:
$ python
Python 3.6.5 (default, May 11 2018, 04:00:52)
[GCC 8.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'a' < '{'
True
In production, we use the en_CA.UTF-8 collation. As far as I can tell, this collation ignores specific punctuation characters but compares lexicographically otherwise:
$ psql
db=> SELECT 'a' < '{';
?column?
----------
f
(1 row)
This means that code that relies on database ordering in one place but Python-side comparisons in another is subtly broken.
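You can see the mismatch entirely from the database side by forcing the C collation on the same comparison with a COLLATE clause, which gives the codepoint answer Python gives (a quick sanity check in the same psql session as above):

SELECT 'a' < '{';               -- f under en_CA.UTF-8
SELECT 'a' < '{' COLLATE "C";   -- t, same answer as Python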
This surprising collation behaviour isn't actually Postgres's fault; it just delegates to strcoll, provided by glibc in this case. strcoll is like strcmp in that:
- strcoll(a, b) < 0 if a sorts before b in the current locale
- strcoll(a, b) == 0 if a has the same sort position as b in the current locale
- strcoll(a, b) > 0 if a sorts after b in the current locale
The following C code reproduces the issue:
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, const char **argv) {
    if (argc < 4) {
        printf("Usage: coll LANG STR1 STR2\n");
        return 1;
    }
    /* Switch every locale category (including LC_COLLATE) to the one given
       on the command line; setlocale returns NULL if it isn't installed. */
    if (setlocale(LC_ALL, argv[1]) == NULL) {
        printf("Locale %s is not available\n", argv[1]);
        return 1;
    }
    const char *s1 = argv[2];
    const char *s2 = argv[3];
    /* strcoll compares according to LC_COLLATE, unlike strcmp's byte order. */
    printf("%s < %s: %d\n", s1, s2, strcoll(s1, s2));
    return 0;
}
$ gcc -o coll coll.c
# C collation is equivalent to strcmp, which is equivalent to
# codepoint comparison, see recommendations below
$ ./coll C 'a' '{'
a < {: -26
# en_CA.UTF8 is what our prod database uses
$ ./coll en_CA.UTF8 'a' '{'
a < {: 1
# en_US.UTF8 is similarly affected
$ ./coll en_US.UTF8 'a' '{'
a < {: 1
The rules for en_CA.UTF8 (and en_US.UTF8) inherit from iso14651_t1_common. As the name implies, it's based on ISO 14651 (wiki, standard).
However, those rules are occasionally updated (https://fedoraproject.org/wiki/Changes/Glibc_collation_update_and_sync_with_cldr):
The collation data in glibc is extremely out of date, most locales base their collation rules on an iso14651_t1_common file which has not been updated for probably more than 15 years. Therefore, all characters added in later Unicode versions are missing and not sorted at all which causes bugs like Bug 1336308 - Infinite (∞) and empty set (∅) are treated as if they were the same character by sort and uniq
While Unicode is not ISO 14651, they have a similar stance. From http://unicode.org/reports/tr10/:
Collation order is not fixed.
Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.
Unlike Unicode, which at least recognizes that "careful versioning" is required, glibc (or maybe glibc packagers) seems to update the rules arbitrarily.
TripAdvisor has a cautionary tale (https://www.postgresql.org/message-id/[email protected]) where a streaming replica was running a different version of glibc than the master:
SELECT 'M' > 'ஐ';
Depending on your charset the first character might look like an ascii 'M'. It is not. The two characters in question are the utf8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other or if you run an initdb and then immediately start postgres.
Granted, that story is from 2014 and uses PG 9.1.9, so things may have changed since then. Given the yes-this-is-bad-but-hard-to-fix responses from the developers and the current state of https://wiki.postgresql.org/wiki/Todo:ICU, I think this might still be an issue.
For bonus hilarity, see https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue, which features index corruption caused by some buggy implementations of glibc where strxfrm + strcmp != strcoll. That one is caused by outright bugs in glibc, whereas this is a far more subtle backcompat issue, but it serves to illustrate the sheer surface area of bugs that can arise.
Convert all Postgres databases to use the UTF8 encoding under the C locale. The C locale does bytewise comparisons. Since UTF-8 encodes higher codepoints as byte sequences that also sort higher bytewise, this is effectively equivalent to a codepoint comparison. This means that Postgres and Python now agree on string order AND it's faster to boot (https://www.postgresql.org/docs/current/static/locale.html):
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.
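For a new database this is just a matter of passing the right options; a sketch, assuming a fresh database named app (converting an existing database means a dump/restore or careful reindexing, which isn't shown here):

CREATE DATABASE app
    TEMPLATE template0   -- required when the locale differs from template1's
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';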
Putting everything in the C locale is similar to the recommendation to store datetimes as UTC in the database: treat sorting as a render-side concern that the client deals with.
Rewrite all server-side sorts (for things like pagination) in terms of ints and datetimes. It should be possible to avoid ORDER BY on string columns entirely, and if it isn't, at least the C locale will give the same sort order forever.
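As a sketch of what that looks like, keyset pagination over a hypothetical events table can page on a timestamp plus an integer id as a tiebreaker, so no string column is ever sorted:

SELECT id, created_at, title
FROM events
WHERE (created_at, id) > ('2018-06-01 00:00:00+00', 1234)  -- boundary from the previous page
ORDER BY created_at, id
LIMIT 50;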
Hello, I understand your considerations, but using the C locale (or any other deterministic collation provided by Postgres) leads to problems for languages with accents.
Consider for ex. French "Briançon" (a city name) vs. "BRIANÇON" (same, but capitalised) or even "BRIANCON" (because the French administration usually capitalises by removing the diacritics, and so that's what we get from the government's open data...)
If you need case-insensitive search here (which also implies accent-insensitive), neither LOWER(), UPPER(), ILIKE, citext, nor any other crafted tweak is going to give you the correct behavior... I wish we had ILIKE and case-insensitive equality operators that could be used for something other than plain English without such bugs (under the C locale, SELECT lower('Ç') returns 'Ç' instead of 'ç' or even 'c'... WTF???), but nowadays the non-deterministic ICU collations seem to be our only hope. (Or do you have another recommendation?)
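For reference, here is a minimal sketch of the kind of non-deterministic ICU collation I mean (it assumes PostgreSQL 12+ built with ICU support; the collation and table names are just illustrative):

CREATE COLLATION ci_ai (
    provider = icu,
    locale = 'und-u-ks-level1',   -- primary strength: ignore case and accents
    deterministic = false
);

CREATE TABLE cities (name text COLLATE ci_ai);
INSERT INTO cities VALUES ('Briançon');

-- equality now ignores case and accents
SELECT name FROM cities WHERE name = 'BRIANCON';

Equality and ORDER BY behave as expected here, though pattern matching with LIKE/ILIKE has historically been restricted for non-deterministic collations, so check the documentation for your version.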
I could have taken other funny examples from German (the letter "ß" becomes the two letters "SS" when capitalised, which in turn become "ss" when lowercased again, so the round trip is not bijective; "ä" becoming "AE" or "ae"), Spanish (the old alphabetical ordering treating "ch" as a single letter ordered after all other "c*" sequences), etc.
Please remember that there are languages other than English in the world! What would you advise for case-insensitive search in those languages?
Thank you