Read: How to actually, really use proper UTF-8 for everything
MySQL understands and uses character sets and collations when receiving, storing and sending data. In previous versions the default character set and collation were latin1
and latin1_swedish_ci
, but as of version 8.0 these defaults have been changed to the much more sane utf8mb4
and utf8mb4_0900_ci_ai
. It's a good time to go through exactly how these concepts work in MySQL and to make sure we're using the correct values everywhere.
This document assumes MySQL version 8.0 unless otherwise specified.
Every string in MySQL has a charset and a collation, whether explicit or implicit. It helps to understand where these come from for any given string to avoid being surprised or confused by results.
I'm going to assume you know what charsets are, and I'm only going to focus on the One True Charset here, UTF-8. MySQL supports many charsets, and many variations of Unicode and UTF-8 charsets. You can get a list of charsets supported by your MySQL by running:
SHOW CHARACTER SET
The charset named utf8
is only kinda-utf8 in that it only supports 3 bytes per character, not 4. This covers most of the "necessary" characters to write any language (the BMP) but not the supplementary planes (no emojis 🙁). The "real" UTF-8 charset is named utf8mb4
, and this is the one to use if you "just want UTF-8 everywhere". This is also the default charset in MySQL 8.0, as opposed to the cursed latin1
which was used previously. utf8
is an alias for utf8mb3
, which is deprecated, and in the future utf8
will be an alias for utf8mb4
.
Collations describe how to compare and order values in a certain charset. When you do an ORDER BY x
or WHERE x = y
, a collation determines exactly how these values are compared to produce a result. You can get a list of supported collations for your MySQL with:
SHOW COLLATION [WHERE charset = 'utf8mb4']
A collation "belongs to" a charset, and whenever you specify a collation it must belong to the charset of the data in question. Each charset has a default collation which is used if none is given explicitly.
Most (all?) of the time, it's unnecessary to specify charset directly when you specify a collation because the charset will be determined from the collation.
A pseudocode is worth a thousand blabberings:
type CharSet {
name: String,
collations: [Collation],
defaultCollation: Collation
}
type Collation {
name: String,
charset: CharSet
}
At each point in the lifetime of a piece of data in MySQL, it is associated with a charset and a collation. When talking about charsets and collations, it's helpful to understand how objects in MySQL inherit these properties.
Oh, and, from here on I'll use the term charcollation to mean character set and collation. Because, why not, I'm tired of writing both. They are inherently bound to each other and it's usually best to think of them as a pair.
type CharCollation = (Charset, Collation)
Charcollations for a piece of data are either set explicitly or inherited from another object. The inheritance chain for a column is:
That is, at each level the object either has its charset and/or collation set explicitly, or it inherits it from the level above it.
The charcollation values for a database, table or column are static, not dynamic. They are determined at "create-time", whenever the object is created (or altered), and stored along with the object. Any defaults or inheritance is only applied at create/alter time.
If you alter a table's defaults after creating a column that uses these defaults, the column will continue to use the previous defaults because they were stored along with the column when it was created.
Charsets and collations are only inherited together. If you specify either charset or collation, inheritance does not happen. When you create or alter a database, table or column, the values are determined like this:
- If you specify both charset and collation, those values are used
- If you specify only the charset, that charset is used along with its default collation
- If you specify only the collation, that collation is used along with its associated charset (the charset it "belongs to")
- If neither are specified, they are inherited
Specifying charset or collation implicitly specifies the other.
function setCharcollation (obj: Database|Table|Column, charset: CharSet?, collation: Collation?):
(obj.charset, obj.collation) = match (charset, collation)
(cs, cl) -> (cs, cl)
(cs, null) -> (cs, cs.defaultCollation)
(null, cl) -> (cl.charset, cl)
(null, null) -> (obj.parent.charset, obj.parent.collation)
The server has a default character set and collation. These are stored in the system variables character_set_server and collation_server with default values (in MySQL 8.0) being utf8mb4
and utf8mb4_0900_ai_ci
.
Setting these depends on how you're running MySQL, usually you would set them in an option file, "parameter group" (AWS RDS) or similar. To show the currently used values:
SHOW VARIABLES WHERE variable_name IN ('character_set_server', 'collation_server')
A database's charcollation is set when it's created or altered using values determined by the charcollation algorithm. A database inherits from the server defaults.
When creating or altering a database, you can specify charset and collation like this:
-- Assuming Mysql 8.0 default values of utf8mb4/utf8mb4_0900_ai_ci are set
-- as `character_set_server` and `collation_server`
CREATE DATABASE db1;
CREATE DATABASE db2 CHARACTER SET latin1 COLLATE latin1_general_ci;
CREATE DATABASE db3 CHARACTER SET latin1;
CREATE DATABASE db4 COLLATE utf8_general_ci;
-- Set system variables temporarily only for this session
SET SESSION character_set_server = 'utf32';
SET SESSION collation_server = 'utf32_general_ci';
CREATE DATABASE db5;
CREATE DATABASE db6;
ALTER DATABASE db6 COLLATE latin1_spanish_ci;
db1
has charsetutf8mb4
and collationutf8mb4_0900_ai_ci
(inherited from server)db2
has charsetlatin1
and collationlatin1_general_ci
(both set explicitly)db3
has charsetlatin1
and collationlatin1_swedish_ci
(default collation for charset)db4
has charsetutf8
and collationutf8_general_ci
(charset determined from collation)db5
har charsetutf32
and collationutf32_general_ci
(inherited from server)db6
has charsetlatin1
and collationlatin1_spanish_ci
(charset determined from collation)
To see the current values for a database:
SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db';
A table's charcollation is set when it's created or altered using values determined by the charcollation algorithm. A table inherits from the database it belongs to.
Some examples of working with table charcollations:
CREATE DATABASE db CHARACTER SET utf16 COLLATE utf16_bin;
USE db;
CREATE TABLE tbl1 ( ... ) CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE tbl2 ( ... ) CHARACTER SET latin1;
CREATE TABLE tbl3 ( ... ) COLLATE utf8mb4_lithuanian_ci;
CREATE TABLE tbl4 ( ... );
CREATE TABLE tbl5 ( ... );
ALTER DATABASE db CHARACTER SET ucs2 COLLATE ucs2_icelandic_ci;
CREATE TABLE tbl6 ( ... );
CREATE TABLE tbl7 ( ... );
ALTER TABLE tbl5 COLLATE utf32_bin;
ALTER TABLE tbl6 CHARSET ujis;
The charcollation values for the tables at the end will be:
tbl1
has charsetutf8
and collationutf8_general_ci
(both explicit)tbl2
has charsetlatin1
and collationlatin1_swedish_ci
(default collation for charset)tbl3
has charsetutf8mb4
and collationutf8mb4_lithuanian_ci
(charset determined from collation)tbl4
has charsetutf16
and collationutf16_bin
(inherited from database)tbl5
has charsetutf32
and collationutf32_bin
(charset determined from collation)tbl6
has charsetujis
and collationujis_japanese_ci
(default collation for charset)tbl7
has charsetucs2
and collationucs2_icelandic_ci
(inherited from database)
To determine the current charcollation for a table:
SELECT ts.table_schema, ts.table_name, cs.character_set_name, ts.table_collation
FROM information_schema.tables AS ts
INNER JOIN information_schema.collations AS cs ON cs.collation_name = ts.table_collation
WHERE ts.table_schema = 'db'
AND ts.table_name = 'tbl'
A column's charcollation is set when it's created or altered using values determined by the charcollation algorithm. A column inherits from the table it belongs to.
Examples of setting charcollations for columns:
CREATE TABLE tbl (
col1 VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_general_ci,
col2 VARCHAR(255) CHARACTER SET latin1,
col3 VARCHAR(255) COLLATE latin1_spanish_ci,
col4 VARCHAR(255)
) CHARACTER SET utf8 COLLATE utf8_general_ci;
ALTER TABLE tbl ADD COLUMN col5 VARCHAR(255) CHARSET utf8mb4;
ALTER TABLE tbl ADD COLUMN col6 VARCHAR(255) CHARSET utf8mb4 COLLATE utf8mb4_persian_ci;
-- This does not change charcollations of any existing columns:
ALTER TABLE tbl CHARACTER SET utf32 COLLATE utf32_bin;
ALTER TABLE tbl ADD COLUMN col7 VARCHAR(255);
The charcollation values for the columns at the end will be:
col1
has charsetlatin1
and collationlatin1_general_ci
(both explicit)col2
has charsetlatin
and collationlatin1_swedish_ci
(default collation for charset)col3
has charsetlatin
and collationlatin1_spanish_ci
(charset determined from collation)col4
has charsetutf8
and collationutf8_general_ci
(inherited from table)col5
has charsetutf8mb4
and collationutf8mb4_0900_ai_ci
(default collation for charset)col6
has charsetutf8mb4
and collationutf8mb4_persian_ci
(both explicit)col7
har charsetutf32
and collationutf32_bin
(inherited from table)
To determine the existing charcollation for a column:
SELECT table_schema, table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'db'
-- AND data_type RLIKE 'char|text'
AND table_name = 'tbl'
AND column_name = 'col'
Also useful:
> SHOW CREATE TABLE tbl;
CREATE TABLE `tbl` (
`col1` varchar(255) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL,
`col2` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
`col3` varchar(255) CHARACTER SET latin1 COLLATE latin1_spanish_ci DEFAULT NULL,
`col4` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
`col5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
`col6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_persian_ci DEFAULT NULL,
`col7` varchar(255) COLLATE utf32_bin DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf32 COLLATE=utf32_bin
This will show the whole table. The charcollation for a column is listed explicitly when it's different from the table's default, otherwise the value for the column is the same as that of the table. Note again that changing a table's default charcollation will not change those of columns that inherited the previous values, as these are stored statically on column creation. If you SHOW CREATE TABLE tbl
after changing tbl
's charcollation, it will omit charsets for columns where they are the same as the new default, and columns whose values matched the previous default will now have the charset displayed.
In addition to the data stored in a table's columns, there is the data that flows between the server and connected clients. Each connection has associated system variables that specify which charcollations apply to data coming from and being sent to the client.
character_set_client
informs the server of the charset of the data coming from the clientcharacter_set_connection
specifies which charset to convert data coming from the client to. That is, data comes in ascharacter_set_client
and then gets converted tocharacter_set_connection
if they're different.collation_connection
specifies which collation incoming strings use. This applies only to the literal strings in statements, and will not apply to any of the data already on the server.character_set_results
specifies the charset to convert data to before sending it back to the client. Can be set toNULL
to tell the server to not perform any conversion.
Setting these values is usually done by the client program. You can use the following statements to set them once a connection has been established:
SET variable_name
will set a specific variable directlySET NAMES charset [COLLATE collation]
sets the variablescharacter_set_client
,character_set_connection
andcharacter_set_results
tocharset
.collation_connection
is set tocollation
if specified, or tocharset
's default collation.SET CHARACTER SET charset
sets the variablescharacter_set_client
andcharacter_set_results
tocharset
, and setscharacter_set_connection
andcollation_connection
tocharacter_set_database
andcollation_database
. That is, it sets the variables for data coming from or going to the client and then makes sure that data is then converted as necessary to that of the current database.
Statements, once they've arrive on the server and any conversion from character_set_client
has been performed, will use the charset and collation specified in character_set_connection
and collation_connection
.
You could say that a string literal inherits from the connection, and the algorithm to determine a string literal's charcollation is like the "charcollation algorithm" used for the other objects, with one difference: If only the collation is specified explicitly, the connection charset is used instead of the collation's charset.
To explicitly specify the charset and/or collation for a string literal using introducers and COLLATE:
SELECT _utf8mb4'🦆' COLLATE utf8mb4_unicode_ci -- utf8mb4/utf8mb4_general_ci
SELECT _latin1'duck' -- latin1/latin1_swedish_ci
SELECT 'pølsemænd' COLLATE utf8_danish_ci -- utf8/utf8_danish_ci
Note that the last statement will fail unless the character_set_connection
is utf8
.
You may run into a number of issues related to using different charsets and collations. Your best bet is probably to stick to the same charset and collation everywhere it makes sense.
Which collation gets used when you mix data using different collations is determined by a rather complex set of rules for coercibility which I'll not get into. One thing to note is that the collation_connection
has a low precedence, and will not be used for much else than string literals by default. If you need to e.g. change the way results are ordered from the default column collation, you must use COLLATE:
SELECT id, title FROM articles ORDER BY title COLLATE utf8mb4_spanish_ci;
.