This is a slightly stripped down version from our internal bug tracker. The point of posting this publicly is part FYI, part peer review. I'm hoping someone can look at this, disagree, and tell me all the downsides of using the C locale or point out things I've misunderstood. The Recommendations section in particular is contextualized by our database serving a SaaS product for users from many different locales, thus making locale a render-level concern. YMMV, caveat emptor, etc.
Collation defines the character ordering for textual data. For Postgres (from https://www.postgresql.org/docs/current/static/locale.html):
The locale settings influence the following SQL features:
- Sort order in queries using ORDER BY or the standard comparison operators on textual data
- The upper, lower, and initcap functions
- Pattern matching operators (LIKE, SIMILAR TO, and POSIX-style regular expressions); locales affect both case insensitive matching and the classification of characters by character-class regular expressions
- The to_char family of functions
- The ability to use indexes with LIKE clauses
Those specific docs fail to mention that collation also affects indices. From https://www.postgresql.org/docs/current/static/sql-createindex.html:
By default, the index uses the collation declared for the column to be indexed or the result collation of the expression to be indexed. Indexes with non-default collations can be useful for queries that involve expressions using non-default collations.
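For example, an index can be built with a collation other than the column's default. A minimal sketch (the table and column here are hypothetical, not from our schema):

CREATE TABLE users (email text);  -- column uses the database default collation
CREATE INDEX users_email_c_idx ON users (email COLLATE "C");

-- the index is only considered for queries that request the same collation
SELECT email FROM users ORDER BY email COLLATE "C";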
In Unicode, 'a' has codepoint 97 (U+0061) and '{' has codepoint 123 (U+007B).
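For reference, you can check those codepoints from SQL as well; in a UTF8 database, ascii() returns the Unicode code point of the first character of its argument:

SELECT ascii('a'), ascii('{');
-- 97 | 123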
Python has a well-defined string ordering (https://docs.python.org/3/tutorial/datastructures.html#comparing-sequences-and-other-types):
The comparison uses lexicographical ordering [...] Lexicographical ordering for strings uses the Unicode code point number to order individual characters.
This means that Python has the same sort order regardless of locale settings:
$ python
Python 3.6.5 (default, May 11 2018, 04:00:52)
[GCC 8.1.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> 'a' < '{'
True
In production, we use the en_CA.UTF-8 collation. As far as I can tell, this collation ignores specific punctuation characters but compares lexicographically otherwise:
$ psql
db=> SELECT 'a' < '{';
?column?
----------
f
(1 row)
This means that code that relies on database ordering in one place but Python-side comparisons in another is subtly broken.
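You can see the mismatch entirely from the database side by forcing the C collation on the same comparison with a COLLATE clause, which gives the codepoint answer Python gives (a quick sanity check in the same psql session as above):

SELECT 'a' < '{';               -- f under en_CA.UTF-8
SELECT 'a' < '{' COLLATE "C";   -- t, same answer as Python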
This surprising collation behaviour isn't actually Postgres's fault; it just delegates to strcoll, provided by glibc in this case. strcoll is like strcmp in that:
- strcoll(a, b) < 0 if a sorts before b in the current locale
- strcoll(a, b) == 0 if a has the same sort position as b in the current locale
- strcoll(a, b) > 0 if a sorts after b in the current locale
The following C code reproduces the issue:
#include <stdio.h>
#include <string.h>
#include <locale.h>

int main(int argc, const char **argv) {
    if (argc < 4) {
        printf("Usage: coll LANG STR1 STR2\n");
        return 1;
    }
    /* Switch every locale category (including LC_COLLATE) to the one given
       on the command line; setlocale returns NULL if it isn't installed. */
    if (setlocale(LC_ALL, argv[1]) == NULL) {
        printf("Locale %s is not available\n", argv[1]);
        return 1;
    }
    const char *s1 = argv[2];
    const char *s2 = argv[3];
    /* strcoll compares according to LC_COLLATE, unlike strcmp's byte order. */
    printf("%s < %s: %d\n", s1, s2, strcoll(s1, s2));
    return 0;
}
$ gcc -o coll coll.c
# C collation is equivalent to strcmp, which is equivalent to
# codepoint comparison, see recommendations below
$ ./coll C 'a' '{'
a < {: -26
# en_CA.UTF8 is what our prod database uses
$ ./coll en_CA.UTF8 'a' '{'
a < {: 1
# en_US.UTF8 is similarly affected
$ ./coll en_US.UTF8 'a' '{'
a < {: 1
The rules for en_CA.UTF8 (and en_US.UTF8) inherit from iso14651_t1_common. As the name implies, it's based on ISO 14651 (wiki, standard).
However, those rules are occasionally updated (https://fedoraproject.org/wiki/Changes/Glibc_collation_update_and_sync_with_cldr):
The collation data in glibc is extremely out of date, most locales base their collation rules on an iso14651_t1_common file which has not been updated for probably more than 15 years. Therefore, all characters added in later Unicode versions are missing and not sorted at all which causes bugs like Bug 1336308 - Infinite (∞) and empty set (∅) are treated as if they were the same character by sort and uniq
While Unicode is not ISO 14651, they have a similar stance. From http://unicode.org/reports/tr10/:
Collation order is not fixed.
Over time, collation order will vary: there may be fixes needed as more information becomes available about languages; there may be new government or industry standards for the language that require changes; and finally, new characters added to the Unicode Standard will interleave with the previously-defined ones. This means that collations must be carefully versioned.
Unlike Unicode, which at least recognizes that "careful versioning" is required, glibc (or maybe glibc packagers) seems to update the rules arbitrarily.
TripAdvisor has a cautionary tale (https://www.postgresql.org/message-id/[email protected]) where a streaming replica was running a different version of glibc than the master:
SELECT 'M' > 'ஐ';
Depending on your charset the first character might look like an ascii 'M'. It is not. The two characters in question are the utf8 representations of http://www.fileformat.info/info/unicode/char/ff2d/index.htm and http://www.fileformat.info/info/unicode/char/0b90/index.htm respectively. Across different machines, running the same version of postgres, and in databases with identical character encodings and collations ('en_US.UTF-8') that select will return different results if the version of glibc is different. This holds whether one pg instance is a pg_basebackup of the other or if you run an initdb and then immediately start postgres.
Granted, that story is from 2014 and uses PG 9.1.9, so things may have changed since then. Given the yes-this-is-bad-but-hard-to-fix responses from the developers and the current state of https://wiki.postgresql.org/wiki/Todo:ICU, I think this might still be an issue.
For bonus hilarity, see https://wiki.postgresql.org/wiki/Abbreviated_keys_glibc_issue, which features index corruption caused by some buggy implementations of glibc where strxfrm + strcmp != strcoll. That one is caused by outright bugs in glibc, whereas this is a far more subtle backcompat issue, but it serves to illustrate the sheer surface area of bugs that can arise.
Convert all Postgres databases to use the UTF8 encoding under the C locale. The C locale does bytewise comparisons. Since UTF-8 encodes higher codepoints as byte sequences that also sort higher bytewise, this is effectively equivalent to a codepoint comparison. This means that Postgres and Python now agree on string order AND it's faster to boot (https://www.postgresql.org/docs/current/static/locale.html):
The drawback of using locales other than C or POSIX in PostgreSQL is its performance impact. It slows character handling and prevents ordinary indexes from being used by LIKE. For this reason use locales only if you actually need them.
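For a new database this is just a matter of passing the right options; a sketch, assuming a fresh database named app (converting an existing database means a dump/restore or careful reindexing, which isn't shown here):

CREATE DATABASE app
    TEMPLATE template0   -- required when the locale differs from template1's
    ENCODING 'UTF8'
    LC_COLLATE 'C'
    LC_CTYPE 'C';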
Putting everything in the C locale is similar to the recommendation to store datetimes as UTC in the database: treat sorting as a render-side concern that the client deals with.
Rewrite all server-side sorts (for things like pagination) in terms of ints and datetimes. It should be possible to avoid ORDER BY on string columns entirely, and if it isn't, at least the C locale will give the same sort order forever.
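As a sketch of what that looks like, keyset pagination over a hypothetical events table can page on a timestamp plus an integer id as a tiebreaker, so no string column is ever sorted:

SELECT id, created_at, title
FROM events
WHERE (created_at, id) > ('2018-06-01 00:00:00+00', 1234)  -- boundary from the previous page
ORDER BY created_at, id
LIMIT 50;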
Hello, I understand your considerations, but using the C locale (or any other deterministic collation provided by Postgres) leads to problems for languages with accents.
Consider for ex. French "Briançon" (a city name) vs. "BRIANÇON" (same, but capitalised) or even "BRIANCON" (because the French administration usually capitalises by removing the diacritics, and so that's what we get from the government's open data...)
If you need case-insensitive search here (which also implies accent-insensitive), neither LOWER(), UPPER(), ILIKE, citext, nor any other crafted tweak is going to give you the correct behavior... I wish we had ILIKE and case-insensitive equality operators that could be used for something other than plain English without such bugs (under the C locale, SELECT lower('Ç') returns 'Ç' instead of 'ç' or even 'c'... WTF???), but nowadays the non-deterministic ICU collations seem to be our only hope. (Or do you have another recommendation?)
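For reference, here is a minimal sketch of the kind of non-deterministic ICU collation I mean (it assumes PostgreSQL 12+ built with ICU support; the collation and table names are just illustrative):

CREATE COLLATION ci_ai (
    provider = icu,
    locale = 'und-u-ks-level1',   -- primary strength: ignore case and accents
    deterministic = false
);

CREATE TABLE cities (name text COLLATE ci_ai);
INSERT INTO cities VALUES ('Briançon');

-- equality now ignores case and accents
SELECT name FROM cities WHERE name = 'BRIANCON';

Equality and ORDER BY behave as expected here, though pattern matching with LIKE/ILIKE has historically been restricted for non-deterministic collations, so check the documentation for your version.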
I could have taken other funny examples from German (the letter "ß" becomes the two letters "SS" when capitalised, which in turn become "ss" when lowercased again, so the round trip is not bijective; "ä" becoming "AE" or "ae"), Spanish (the old alphabetical ordering treating "ch" as a single letter ordered after all other "c*" sequences), etc.
Please remember that there are languages other than English in the world! What would you advise for case-insensitive search in those languages?
Thank you