Skip to content

Instantly share code, notes, and snippets.

@birkin
Created May 4, 2023 12:47
Show Gist options
  • Save birkin/c8155a66693034dd0af534bd30cba9de to your computer and use it in GitHub Desktop.
Save birkin/c8155a66693034dd0af534bd30cba9de to your computer and use it in GitHub Desktop.
asked chatgpt to compare two mysq unicodel collations.

background

For the stolen-relations project, we're doing some programmatic data-cleaning. I recently uploaded test sql to an empty database and examined the load in phpMyAdmin.

I noticed that the collation for the few updated tables had changed from the utf8mb4_unicode_520_ci encoding I've ensured all my sql tables use for the last few years. The updated tables are using the collation utf8mb4_0900_ai_ci.

So I asked chatGPT about the differences, and then did a few google searches to verify the info. Over time, I'll move collations to this utf8mb4_0900_ai_ci collation.


prompt

what are the pros and cons of mysql collation "utf8mb4_0900_ai_ci" vs "utf8mb4_unicode_520_ci"?


response

MySQL collations define the rules for comparing and sorting character strings in the database. Both "utf8mb4_0900_ai_ci" and "utf8mb4_unicode_520_ci" are collations for the utf8mb4 character set, which supports the full range of Unicode characters. However, they have different comparison rules based on different Unicode standards.

utf8mb4_0900_ai_ci:

  • Based on the Unicode 9.0 standard.
  • AI (Accent Insensitive): It treats accented characters as equivalent to their unaccented counterparts, which can be useful when comparing text in languages with many accented characters.
  • CI (Case Insensitive): It treats upper and lowercase versions of the same letter as equal, which simplifies text comparisons and searches.

Pros:

  • More up-to-date and generally recommended for new applications.
  • Provides better performance for sorting and searching, as it's optimized for the latest Unicode standards.
  • Offers better support for supplementary characters outside the Basic Multilingual Plane (BMP).

Cons:

  • Might not be fully compatible with older systems or applications that rely on specific behavior from older collations.

utf8mb4_unicode_520_ci:

  • Based on the Unicode 5.2 standard.
  • CI (Case Insensitive): It treats upper and lowercase versions of the same letter as equal, which simplifies text comparisons and searches.

Pros:

  • Can be more compatible with older systems or applications that rely on specific behavior from older collations.

Cons:

  • Less up-to-date compared to the utf8mb4_0900_ai_ci collation, which means it may not support the latest Unicode features or optimizations.
  • Does not offer accent insensitivity, which may lead to incorrect results in some text comparisons or searches involving accented characters.

In summary, utf8mb4_0900_ai_ci is generally the better choice for new applications due to its up-to-date support for Unicode standards, better performance, and accent insensitivity. However, utf8mb4_unicode_520_ci might be more suitable for older applications or systems that require specific behavior from an older collation.

(end)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment