Last active
April 1, 2016 08:53
Head desking at an infrequent sight on Vox sites
Unicode is what we all use* to store our characters. For legacy reasons, we
needed a way to store them so that the basics still mapped onto the old system
(ASCII), so UTF-8 was created. It is a representation of Unicode with certain
constraints on how characters are stored as bytes. Something like an em dash
( — ) is U+2014 in Unicode, and is represented as
0xE2 0x80 0x94 in UTF-8 bytes.
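To make the mapping concrete, here is a minimal Python sketch that encodes the em dash and prints its bytes. The variable names are only illustrative, and Python is an assumption of this example, not something the CMS is known to use.

```python
# U+2014 EM DASH, written as an escape so this file's own encoding cannot interfere.
em_dash = "\u2014"

# Encode the single code point as UTF-8; it takes three bytes.
utf8_bytes = em_dash.encode("utf-8")

print(f"U+{ord(em_dash):04X}")                      # U+2014
print(" ".join(f"0x{b:02X}" for b in utf8_bytes))   # 0xE2 0x80 0x94

# ASCII characters keep their old single-byte values in UTF-8.
print("A".encode("utf-8") == "A".encode("ascii"))   # True
```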
What I assume is happening in this case is that your writers are using em
dashes freely in their writing and copy/pasting the content into your CMS,
which presumably reads each UTF-8 byte as a separate Latin-1 character and
re-encodes it, converting the above sequence into:
0xC3 0xA2 0xC2 0x80 0xC2 0x94
which would show to the user as: â[][] (most platforms will hide control
characters like these)
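The suspected conversion can be reproduced in a few lines. This is a sketch under one assumption: that the CMS decodes the incoming bytes as Latin-1 (ISO-8859-1) before storing them as UTF-8. The actual CMS code is unknown. Each of the three original bytes becomes its own character, and â (U+00E2) is itself 0xC3 0xA2 in UTF-8.

```python
# The em dash as the writer's editor produced it: three UTF-8 bytes.
original_bytes = "\u2014".encode("utf-8")           # E2 80 94

# Assumed CMS bug: each byte is treated as its own Latin-1 character.
misread_text = original_bytes.decode("latin-1")     # 'â' + U+0080 + U+0094

# The misread text is then stored as UTF-8, so every character grows to two bytes.
double_encoded = misread_text.encode("utf-8")

print([f"U+{ord(c):04X}" for c in misread_text])
# ['U+00E2', 'U+0080', 'U+0094']
print(" ".join(f"0x{b:02X}" for b in double_encoded))
# 0xC3 0xA2 0xC2 0x80 0xC2 0x94
```

U+0080 and U+0094 are invisible control characters, which is why only the â tends to be noticed.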
At this point they probably just search & replace â → —, which cleans up the
visible content for them, but the source still retains the other characters
the CMS munged, so now it is stored as:
0xE2 0x80 0x94 0xC2 0x80 0xC2 0x94
which is how it is showing for me: http://i.imgur.com/MgTRmfS.png
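The effect of that search & replace can be checked the same way. The sketch below again assumes the Latin-1 misreading; the names `mojibake`, `patched` and `repaired` are hypothetical. It also shows that reversing the misread on the unpatched text would have restored the em dash with nothing left over.

```python
import unicodedata

# Text as the CMS is assumed to hold it after the misread: 'â' + U+0080 + U+0094.
mojibake = "\u2014".encode("utf-8").decode("latin-1")

# The manual fix: swap the visible 'â' for a real em dash.
patched = mojibake.replace("\u00e2", "\u2014")
stored = patched.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in stored))
# 0xE2 0x80 0x94 0xC2 0x80 0xC2 0x94

# The two invisible control characters are still in the text.
leftovers = [f"U+{ord(c):04X}" for c in patched if unicodedata.category(c) == "Cc"]
print(leftovers)  # ['U+0080', 'U+0094']

# Undoing the misread instead recovers the original character cleanly.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == "\u2014")  # True
```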
* This masks lots of history and culture, and I am sure you can look up
character sets and the history of operating systems on Wikipedia to learn more
if you need to