Skip to content

Instantly share code, notes, and snippets.

@toretore
Last active December 11, 2023 06:47
Show Gist options
  • Save toretore/4337673cedc9634973944945cad4af92 to your computer and use it in GitHub Desktop.
Save toretore/4337673cedc9634973944945cad4af92 to your computer and use it in GitHub Desktop.
MySQL character sets and collations

Understanding, working with and properly configuring character sets and collations in MySQL

Read: How to actually, really use proper UTF-8 for everything

MySQL understands and uses character sets and collations when receiving, storing and sending data. In previous versions the default character set and collation were latin1 and latin1_swedish_ci, but as of version 8.0 these defaults have been changed to the much more sane utf8mb4 and utf8mb4_0900_ci_ai. It's a good time to go through exactly how these concepts work in MySQL and to make sure we're using the correct values everywhere.

This document assumes MySQL version 8.0 unless otherwise specified.

How do character sets and collations work in MySQL?

Every string in MySQL has a charset and a collation, whether explicit or implicit. It helps to understand where these come from for any given string to avoid being surprised or confused by results.

Character sets

I'm going to assume you know what charsets are, and I'm only going to focus on the One True Charset here, UTF-8. MySQL supports many charsets, and many variations of Unicode and UTF-8 charsets. You can get a list of charsets supported by your MySQL by running:

SHOW CHARACTER SET

The charset named utf8 is only kinda-utf8 in that it only supports 3 bytes per character, not 4. This covers most of the "necessary" characters to write any language (the BMP) but not the supplementary planes (no emojis 🙁). The "real" UTF-8 charset is named utf8mb4, and this is the one to use if you "just want UTF-8 everywhere". This is also the default charset in MySQL 8.0, as opposed to the cursed latin1 which was used previously. utf8 is an alias for utf8mb3, which is deprecated, and in the future utf8 will be an alias for utf8mb4.

Collations

Collations describe how to compare and order values in a certain charset. When you do an ORDER BY x or WHERE x = y, a collation determines exactly how these values are compared to produce a result. You can get a list of supported collations for your MySQL with:

SHOW COLLATION [WHERE charset = 'utf8mb4']

A collation "belongs to" a charset, and whenever you specify a collation it must belong to the charset of the data in question. Each charset has a default collation which is used if none is given explicitly.

Most (all?) of the time, it's unnecessary to specify charset directly when you specify a collation because the charset will be determined from the collation.

A pseudocode is worth a thousand blabberings:

type CharSet {
  name: String,
  collations: [Collation],
  defaultCollation: Collation
}

type Collation {
  name: String,
  charset: CharSet
}

How does MySQL determine which charset and collation to use?

At each point in the lifetime of a piece of data in MySQL, it is associated with a charset and a collation. When talking about charsets and collations, it's helpful to understand how objects in MySQL inherit these properties.

Oh, and, from here on I'll use the term charcollation to mean character set and collation. Because, why not, I'm tired of writing both. They are inherently bound to each other and it's usually best to think of them as a pair.

type CharCollation = (Charset, Collation)

Inheritance

Charcollations for a piece of data are either set explicitly or inherited from another object. The inheritance chain for a column is:

That is, at each level the object either has its charset and/or collation set explicitly, or it inherits it from the level above it.

The charcollation values for a database, table or column are static, not dynamic. They are determined at "create-time", whenever the object is created (or altered), and stored along with the object. Any defaults or inheritance is only applied at create/alter time.

If you alter a table's defaults after creating a column that uses these defaults, the column will continue to use the previous defaults because they were stored along with the column when it was created.

The "charcollation algorithm"

Charsets and collations are only inherited together. If you specify either charset or collation, inheritance does not happen. When you create or alter a database, table or column, the values are determined like this:

  • If you specify both charset and collation, those values are used
  • If you specify only the charset, that charset is used along with its default collation
  • If you specify only the collation, that collation is used along with its associated charset (the charset it "belongs to")
  • If neither are specified, they are inherited

Specifying charset or collation implicitly specifies the other.

function setCharcollation (obj: Database|Table|Column, charset: CharSet?, collation: Collation?):
  (obj.charset, obj.collation) = match (charset, collation)
    (cs, cl)     -> (cs, cl)
    (cs, null)   -> (cs, cs.defaultCollation)
    (null, cl)   -> (cl.charset, cl)
    (null, null) -> (obj.parent.charset, obj.parent.collation)

The server has a default character set and collation. These are stored in the system variables character_set_server and collation_server with default values (in MySQL 8.0) being utf8mb4 and utf8mb4_0900_ai_ci.

Setting these depends on how you're running MySQL, usually you would set them in an option file, "parameter group" (AWS RDS) or similar. To show the currently used values:

SHOW VARIABLES WHERE variable_name IN ('character_set_server', 'collation_server')

A database's charcollation is set when it's created or altered using values determined by the charcollation algorithm. A database inherits from the server defaults.

When creating or altering a database, you can specify charset and collation like this:

-- Assuming Mysql 8.0 default values of utf8mb4/utf8mb4_0900_ai_ci are set
-- as `character_set_server` and `collation_server`
CREATE DATABASE db1;
CREATE DATABASE db2 CHARACTER SET latin1 COLLATE latin1_general_ci;
CREATE DATABASE db3 CHARACTER SET latin1;
CREATE DATABASE db4 COLLATE utf8_general_ci;

-- Set system variables temporarily only for this session
SET SESSION character_set_server = 'utf32';
SET SESSION collation_server = 'utf32_general_ci';

CREATE DATABASE db5;
CREATE DATABASE db6;
ALTER DATABASE db6 COLLATE latin1_spanish_ci;
  • db1 has charset utf8mb4 and collation utf8mb4_0900_ai_ci (inherited from server)
  • db2 has charset latin1 and collation latin1_general_ci (both set explicitly)
  • db3 has charset latin1 and collation latin1_swedish_ci (default collation for charset)
  • db4 has charset utf8 and collation utf8_general_ci (charset determined from collation)
  • db5 har charset utf32 and collation utf32_general_ci (inherited from server)
  • db6 has charset latin1 and collation latin1_spanish_ci (charset determined from collation)

To see the current values for a database:

SELECT DEFAULT_CHARACTER_SET_NAME, DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA WHERE SCHEMA_NAME = 'db';

A table's charcollation is set when it's created or altered using values determined by the charcollation algorithm. A table inherits from the database it belongs to.

Some examples of working with table charcollations:

CREATE DATABASE db CHARACTER SET utf16 COLLATE utf16_bin;
USE db;

CREATE TABLE tbl1 ( ... ) CHARACTER SET utf8 COLLATE utf8_general_ci;
CREATE TABLE tbl2 ( ... ) CHARACTER SET latin1;
CREATE TABLE tbl3 ( ... ) COLLATE utf8mb4_lithuanian_ci;
CREATE TABLE tbl4 ( ... );
CREATE TABLE tbl5 ( ... );

ALTER DATABASE db CHARACTER SET ucs2 COLLATE ucs2_icelandic_ci;

CREATE TABLE tbl6 ( ... );
CREATE TABLE tbl7 ( ... );

ALTER TABLE tbl5 COLLATE utf32_bin;
ALTER TABLE tbl6 CHARSET ujis;

The charcollation values for the tables at the end will be:

  • tbl1 has charset utf8 and collation utf8_general_ci (both explicit)
  • tbl2 has charset latin1 and collation latin1_swedish_ci (default collation for charset)
  • tbl3 has charset utf8mb4 and collation utf8mb4_lithuanian_ci (charset determined from collation)
  • tbl4 has charset utf16 and collation utf16_bin (inherited from database)
  • tbl5 has charset utf32 and collation utf32_bin (charset determined from collation)
  • tbl6 has charset ujis and collation ujis_japanese_ci (default collation for charset)
  • tbl7 has charset ucs2 and collation ucs2_icelandic_ci (inherited from database)

To determine the current charcollation for a table:

SELECT ts.table_schema, ts.table_name, cs.character_set_name, ts.table_collation
FROM information_schema.tables AS ts
INNER JOIN information_schema.collations AS cs ON cs.collation_name = ts.table_collation
WHERE ts.table_schema = 'db'
AND ts.table_name = 'tbl'

A column's charcollation is set when it's created or altered using values determined by the charcollation algorithm. A column inherits from the table it belongs to.

Examples of setting charcollations for columns:

CREATE TABLE tbl (
  col1 VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_general_ci,
  col2 VARCHAR(255) CHARACTER SET latin1,
  col3 VARCHAR(255) COLLATE latin1_spanish_ci,
  col4 VARCHAR(255)
) CHARACTER SET utf8 COLLATE utf8_general_ci;

ALTER TABLE tbl ADD COLUMN col5 VARCHAR(255) CHARSET utf8mb4;
ALTER TABLE tbl ADD COLUMN col6 VARCHAR(255) CHARSET utf8mb4 COLLATE utf8mb4_persian_ci;

-- This does not change charcollations of any existing columns:
ALTER TABLE tbl CHARACTER SET utf32 COLLATE utf32_bin;

ALTER TABLE tbl ADD COLUMN col7 VARCHAR(255);

The charcollation values for the columns at the end will be:

  • col1 has charset latin1 and collation latin1_general_ci (both explicit)
  • col2 has charset latin and collation latin1_swedish_ci (default collation for charset)
  • col3 has charset latin and collation latin1_spanish_ci (charset determined from collation)
  • col4 has charset utf8 and collation utf8_general_ci (inherited from table)
  • col5 has charset utf8mb4 and collation utf8mb4_0900_ai_ci (default collation for charset)
  • col6 has charset utf8mb4 and collation utf8mb4_persian_ci (both explicit)
  • col7 har charset utf32 and collation utf32_bin (inherited from table)

To determine the existing charcollation for a column:

SELECT table_schema, table_name, column_name, character_set_name, collation_name
FROM information_schema.columns
WHERE table_schema = 'db'
-- AND data_type RLIKE 'char|text'
AND table_name = 'tbl'
AND column_name = 'col'

Also useful:

> SHOW CREATE TABLE tbl;

CREATE TABLE `tbl` (
  `col1` varchar(255) CHARACTER SET latin1 COLLATE latin1_general_ci DEFAULT NULL,
  `col2` varchar(255) CHARACTER SET latin1 COLLATE latin1_swedish_ci DEFAULT NULL,
  `col3` varchar(255) CHARACTER SET latin1 COLLATE latin1_spanish_ci DEFAULT NULL,
  `col4` varchar(255) CHARACTER SET utf8 DEFAULT NULL,
  `col5` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL,
  `col6` varchar(255) CHARACTER SET utf8mb4 COLLATE utf8mb4_persian_ci DEFAULT NULL,
  `col7` varchar(255) COLLATE utf32_bin DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf32 COLLATE=utf32_bin

This will show the whole table. The charcollation for a column is listed explicitly when it's different from the table's default, otherwise the value for the column is the same as that of the table. Note again that changing a table's default charcollation will not change those of columns that inherited the previous values, as these are stored statically on column creation. If you SHOW CREATE TABLE tbl after changing tbl's charcollation, it will omit charsets for columns where they are the same as the new default, and columns whose values matched the previous default will now have the charset displayed.

In addition to the data stored in a table's columns, there is the data that flows between the server and connected clients. Each connection has associated system variables that specify which charcollations apply to data coming from and being sent to the client.

  • character_set_client informs the server of the charset of the data coming from the client
  • character_set_connection specifies which charset to convert data coming from the client to. That is, data comes in as character_set_client and then gets converted to character_set_connection if they're different.
  • collation_connection specifies which collation incoming strings use. This applies only to the literal strings in statements, and will not apply to any of the data already on the server.
  • character_set_results specifies the charset to convert data to before sending it back to the client. Can be set to NULL to tell the server to not perform any conversion.

Setting these values is usually done by the client program. You can use the following statements to set them once a connection has been established:

  • SET variable_name will set a specific variable directly
  • SET NAMES charset [COLLATE collation] sets the variables character_set_client, character_set_connection and character_set_results to charset. collation_connection is set to collation if specified, or to charset's default collation.
  • SET CHARACTER SET charset sets the variables character_set_client and character_set_results to charset, and sets character_set_connection and collation_connection to character_set_database and collation_database. That is, it sets the variables for data coming from or going to the client and then makes sure that data is then converted as necessary to that of the current database.

Statements, once they've arrive on the server and any conversion from character_set_client has been performed, will use the charset and collation specified in character_set_connection and collation_connection.

You could say that a string literal inherits from the connection, and the algorithm to determine a string literal's charcollation is like the "charcollation algorithm" used for the other objects, with one difference: If only the collation is specified explicitly, the connection charset is used instead of the collation's charset.

To explicitly specify the charset and/or collation for a string literal using introducers and COLLATE:

SELECT _utf8mb4'🦆' COLLATE utf8mb4_unicode_ci -- utf8mb4/utf8mb4_general_ci
SELECT _latin1'duck' -- latin1/latin1_swedish_ci
SELECT 'pølsemænd' COLLATE utf8_danish_ci -- utf8/utf8_danish_ci

Note that the last statement will fail unless the character_set_connection is utf8.

Collation precedence

You may run into a number of issues related to using different charsets and collations. Your best bet is probably to stick to the same charset and collation everywhere it makes sense.

Which collation gets used when you mix data using different collations is determined by a rather complex set of rules for coercibility which I'll not get into. One thing to note is that the collation_connection has a low precedence, and will not be used for much else than string literals by default. If you need to e.g. change the way results are ordered from the default column collation, you must use COLLATE:

SELECT id, title FROM articles ORDER BY title COLLATE utf8mb4_spanish_ci;

Charset and collation compatibility, conversions, data loss and errors

.

https://dev.mysql.com/doc/refman/8.0/en/charset-syntax.html

How to understand, work with and properly configure character sets and collations in MySQL.

(How to properly configure and use UTF-8 for everything)

  • What are character sets and collations in MySQL?
    • Charset
    • Collation
    • Everything that is text in MySQL has a charset and a collation
      • It comes from somewhere
    • Configuration settings
      • Server
      • Client
    • Inspecting current values
      • SHOW VARIABLES
      • SHOW CREATE

Inheritance:

Precedence:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment