johndrinkwater · April 1, 2016 08:53
diff --git a/gistfile1.txt b/gistfile1.txt
 Unicode is what we all use* to represent our characters. For legacy reasons,
 we needed a way to store them so that the basics mapped onto the old system
 (ASCII), so UTF-8 was created. It is an encoding of Unicode in which ASCII
 characters keep their single-byte values and every other character takes two
 to four bytes. Something like an em dash ( — ) is U+2014 in Unicode, and is
 represented as
 0xE2 0x80 0x94 in UTF-8 bytes.
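
To check the mapping from code point to bytes yourself, a minimal Python sketch follows. The helper name hex_bytes is made up for this illustration and is not part of any library.

```python
# Encode the em dash (U+2014) as UTF-8 and print the resulting bytes.
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")


def hex_bytes(data: bytes) -> str:
    """Format bytes the way they are written in the text (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in data)


print(f"U+{ord(em_dash):04X}")          # U+2014
print(hex_bytes(utf8_bytes))            # 0xE2 0x80 0x94

# ASCII characters keep their single-byte values in UTF-8.
print(hex_bytes("A".encode("utf-8")))   # 0x41
```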

 What I assume is happening in this case is that your writers are using em
 dashes freely in their writing and copy/pasting the content into your CMS,
 which reads each of the three bytes above as a separate character and
 encodes each one as UTF-8 again, converting the sequence into:
 0xC3 0xA2 0xC2 0x80 0xC2 0x94
 which would show to the user: â[][] (the last two are control characters,
 which most platforms hide)
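
The double encoding can be reproduced in a few lines of Python. This is only a sketch of one plausible cause: it assumes the CMS decodes the incoming bytes as Latin-1 (one character per byte) and then saves the result as UTF-8, which I cannot confirm from the outside. hex_bytes is a hypothetical helper.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as 0x.. pairs (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in data)


original = "\u2014".encode("utf-8")       # the correct bytes for an em dash

# Assumed bug: the bytes are read as Latin-1, one character per byte ...
misread = original.decode("latin-1")
# ... and that three-character string is then saved as UTF-8 again.
double_encoded = misread.encode("utf-8")

print([f"U+{ord(c):04X}" for c in misread])  # ['U+00E2', 'U+0080', 'U+0094']
print(hex_bytes(double_encoded))             # 0xC3 0xA2 0xC2 0x80 0xC2 0x94
```

U+00E2 is â, while U+0080 and U+0094 are C1 control characters, which is why only the â tends to be visible.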

 At this point they probably just search & replace â → —, which cleans up
 the content as displayed, but the source still retains the other characters
 the CMS munged, so now it is stored as:
 0xE2 0x80 0x94 0xC2 0x80 0xC2 0x94
 which is how it is showing for me: http://i.imgur.com/MgTRmfS.png
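
A sketch of that search & replace step, again under the assumption that the damage came from a Latin-1 round trip, shows where the leftover bytes come from. The final cleanup line is only one possible approach (dropping C1 control characters) and is my own suggestion; hex_bytes is a hypothetical helper.

```python
def hex_bytes(data: bytes) -> str:
    """Format bytes as 0x.. pairs (hypothetical helper)."""
    return " ".join(f"0x{b:02X}" for b in data)


# Start from the double-encoded text: â followed by U+0080 and U+0094.
damaged = "\u2014".encode("utf-8").decode("latin-1")

# The manual fix: replace the visible â with a real em dash.
patched = damaged.replace("\u00e2", "\u2014")
stored = patched.encode("utf-8")
print(hex_bytes(stored))  # 0xE2 0x80 0x94 0xC2 0x80 0xC2 0x94

# Possible cleanup: drop the invisible C1 controls (U+0080 to U+009F).
cleaned = "".join(c for c in patched if not 0x80 <= ord(c) <= 0x9F)
print(hex_bytes(cleaned.encode("utf-8")))  # 0xE2 0x80 0x94
```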

 * This glosses over a lot of history and culture; you can look up character
   sets and the history of operating systems on Wikipedia to learn more if you
   need to.