@dmertl
Created July 20, 2023 20:51
Character encoding tricks

If you see several weird characters in a row that look like an encoding error, there's a good chance it's UTF-8 text interpreted as cp-1252. For example, “ in cp-1252 is the three bytes 0xE2 0x80 0x9C. Those same three bytes decoded as UTF-8 (0xE2809C) are “ (U+201C, left double quotation mark).

In MySQL you can do a quick conversion to check (MySQL's latin1 character set is cp-1252):

mysql> SELECT CONVERT(UNHEX(HEX(CONVERT("“" USING LATIN1))) USING utf8mb4);
+--------------------------------------------------------------------+
| CONVERT(UNHEX(HEX(CONVERT("“" USING LATIN1))) USING utf8mb4)     |
+--------------------------------------------------------------------+
| “                                                                  |
+--------------------------------------------------------------------+
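The same round trip can be sketched outside MySQL. This is a minimal Python version, assuming the garbled text really is UTF-8 that was decoded as cp-1252. The helper name `fix_mojibake` is made up for illustration. Python's cp1252 codec leaves a few bytes (0x9D, for instance) unmapped, so some sequences will not round-trip; the sketch returns the input unchanged in that case.

```python
def fix_mojibake(text: str) -> str:
    """Encode text back to cp1252 bytes, then decode those bytes as UTF-8.

    Hypothetical helper: returns the input unchanged when the round trip
    is not possible (text was not cp1252-decoded UTF-8).
    """
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


# The three garbled characters from the example above, written as escapes:
# U+00E2 (a with circumflex), U+20AC (euro sign), U+0153 (oe ligature).
garbled = "\u00e2\u20ac\u0153"

print(garbled.encode("cp1252").hex())  # the underlying bytes: e2809c
print(fix_mojibake(garbled))           # the left double quotation mark, U+201C
```

Text that was never garbled, such as plain ASCII, comes back unchanged, so the helper is safe to try on a suspicious string before deciding anything is wrong with it.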

https://www.cogsci.ed.ac.uk/~richard/utf-8.html
