@johndrinkwater
Last active April 1, 2016 08:53
Head desking at an infrequent sight on Vox sites
Unicode is what we all use* to store our characters. For legacy reasons, we
needed a way to store them so that the basics mapped onto the old system
(ASCII), so UTF-8 was created. It is an encoding of Unicode with certain
constraints on how we store our characters. Something like an em dash ( — )
is U+2014 in Unicode, and is represented as
0xE2 0x80 0x94 in UTF-8 bytes.
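To check that mapping yourself, here is a small Python sketch (Python is my choice here, nothing to do with whatever the CMS runs) that encodes U+2014 and prints the bytes. The variable names are just for illustration.

```python
# Encode the em dash (U+2014) as UTF-8 and show the resulting bytes.
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")

# Format each byte as 0xNN so it matches the notation used in this note.
hex_bytes = " ".join(f"0x{b:02X}" for b in utf8_bytes)
print(hex_bytes)  # 0xE2 0x80 0x94
```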
What I assume is happening in this case is that your writers are using em
dashes freely in their writing and copy/pasting the content into your CMS,
which at this point most likely reads the UTF-8 bytes as ISO-8859-1 (Latin-1)
and re-encodes each one as UTF-8, converting the above sequence into:
0xC3 0xA2 0xC2 0x80 0xC2 0x94
which would show to the user: â[][] (most platforms will hide control
characters like these)
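A rough way to reproduce this, assuming the CMS decodes the incoming UTF-8 bytes as ISO-8859-1 (Latin-1) and then encodes the result as UTF-8 again. That decoding step is my guess at the behaviour, not something I have confirmed about the CMS. Under that assumption the round trip yields 0xC3 0xA2 0xC2 0x80 0xC2 0x94.

```python
# Simulate the suspected double encoding of an em dash.
original = "\u2014".encode("utf-8")       # b'\xe2\x80\x94'

# Assumption: the CMS misreads the UTF-8 bytes as Latin-1, one byte per character.
misread = original.decode("latin-1")      # U+00E2 (a-circumflex), U+0080, U+0094

# It then stores that misread text as UTF-8, doubling up the encoding.
double_encoded = misread.encode("utf-8")

print([f"U+{ord(c):04X}" for c in misread])
print(" ".join(f"0x{b:02X}" for b in double_encoded))
```

The first character is the visible â; the other two are C1 control characters, which is why most platforms show nothing for them.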
At this point they probably just search & replace â → —, which makes the
content look clean to them, but the source still retains the other characters
the CMS munged, so now it is stored as:
0xE2 0x80 0x94 0xC2 0x80 0xC2 0x94
which is how it is showing for me: http://i.imgur.com/MgTRmfS.png
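The search & replace step can be sketched the same way, again assuming a Latin-1 misread is where the stray characters come from. The last two lines show, for contrast, that reversing the misread recovers the em dash without leaving anything behind. This is an illustration, not a claim about what tooling the CMS offers.

```python
EM_DASH = "\u2014"
A_CIRCUMFLEX = "\u00e2"

# Text as the CMS would hold it after the assumed Latin-1 misread.
mojibake = EM_DASH.encode("utf-8").decode("latin-1")

# The cosmetic fix: swap the visible a-circumflex for an em dash.
patched = mojibake.replace(A_CIRCUMFLEX, EM_DASH)
stored = patched.encode("utf-8")
print(" ".join(f"0x{b:02X}" for b in stored))
# 0xE2 0x80 0x94 0xC2 0x80 0xC2 0x94 (the two control characters survive)

# Reversing the misread instead recovers the original character cleanly.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == EM_DASH)  # True
```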
* This masks lots of history and culture, and I am sure you can look up
character sets and the history of operating systems on Wikipedia to learn more
if you need to.