@dcki
Last active August 29, 2015 14:01
JavaScript and text encodings
After much consideration (well, a few hours), I've decided that this is all you really need to
know about how JavaScript deals with unicode, utf-8, and utf-16:
1. Suppose you have a unicode character that you want to put in a string in
JavaScript code.
a. One way to do that is to copy the character from wikipedia, for
example, and paste it into your code.
b. The other way is to encode the character's codepoint in utf-16
and write each 16-bit code unit as '\uXXXX', where XXXX is the code unit
in hex. (Some characters are 4 bytes long when encoded in utf-16, so you
would end up with two escapes, '\uXXXX\uXXXX', known as a surrogate pair.)
I really wish XXXX was the codepoint value and not the utf-16 encoded
value, but it's just not. On the bright side, for anything that can be
encoded in 2 bytes ('\uXXXX'), the utf-16 value is the same as the
codepoint value. You only have to go to the trouble of figuring out the
utf-16 encoding if you need to use one of the more unusual characters
(codepoints above 0xFFFF) that encodes to 4 bytes. (ES2015 added the
'\u{XXXXX}' form, which does take the codepoint directly, but it only
works in engines that support ES2015.)
2. It's perfectly fine to save your code file in utf-8. If you use method
1a above, you will have utf-8 data in your code, it's true, but that's
okay. When the code is parsed everything will be converted and represented
internally as utf-16. So as long as the copy-paste handoff works correctly,
your text editor is smart enough to save the whole file in a consistent
encoding, the browser is told what that encoding is (for example with
<meta charset="utf-8">), and your web server software doesn't mess with
it, you shouldn't have any issues.
3. In JavaScript the length of '\uXXXX' is one. Unintuitively, the length
of '\uXXXX\uXXXX' is always two, even if it represents one character,
because length counts utf-16 code units, not characters. As far as
length, indexing, and indexOf are concerned, '\uXXXX\uXXXX' is always two
separate characters. This behavior can make comparison and searching more
complicated.
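4. As far as I can tell, text that JavaScript writes to a web page is not
re-encoded at all: the DOM holds strings as utf-16, the same as JavaScript
does. Encoding only comes into play when text crosses into bytes. With AJAX
(XMLHttpRequest), a string request body is sent as utf-8, and response text
is decoded as utf-8 unless the response's Content-Type names another
charset.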
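Points 1b and 3 are easy to check in a console. The following is only a sketch: toSurrogatePair and toEscape are hypothetical helper names (not built-ins) that apply the standard utf-16 surrogate formula, and Array.from, codePointAt, String.fromCodePoint and '\u{...}' assume an ES2015 or later engine.

```javascript
// Hypothetical helper (not a built-in): split a codepoint above 0xFFFF
// into the two utf-16 code units that '\uXXXX\uXXXX' expects.
function toSurrogatePair(codePoint) {
  if (codePoint < 0x10000 || codePoint > 0x10ffff) {
    throw new RangeError('codepoint does not need a surrogate pair');
  }
  var offset = codePoint - 0x10000;
  var high = 0xd800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits of the offset
  return [high, low];
}

// Hypothetical helper: format one code unit as a '\uXXXX' escape.
function toEscape(codeUnit) {
  return '\\u' + ('0000' + codeUnit.toString(16)).slice(-4);
}

var pair = toSurrogatePair(0x1d452);
var escaped = pair.map(toEscape).join('');
console.log(escaped);                             // \ud835\udc52

var e = '\ud835\udc52';
console.log(e.length);                            // 2 (code units)
console.log(Array.from(e).length);                // 1 (codepoints)
console.log(e.codePointAt(0).toString(16));       // 1d452
console.log(e === String.fromCodePoint(0x1d452)); // true
console.log(e === '\u{1d452}');                   // true

// Searching pitfall: half of a pair "matches" inside the character.
console.log(e.indexOf('\udc52'));                 // 1
```

The last line is the kind of surprise point 3 warns about: indexOf works on code units, so it can find a match in the middle of what looks like a single character.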
More notes and references for digging deeper:
http://en.wikipedia.org/wiki/Letterlike_Symbols
http://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
http://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16#answer-6240184
http://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16#answer-6240819
http://rishida.net/tools/conversion/
𝑒 = MATHEMATICAL ITALIC SMALL E = 0x1d452 codepoint = 0xf09d9192 utf-8 encoding = 0xd835dc52 utf-16 encoding = \ud835\udc52 JavaScript
ℯ = SCRIPT SMALL E = 0x212f codepoint = 0xe284af utf-8 encoding = 0x212f utf-16 encoding = \u212f JavaScript
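The two rows above can be reproduced instead of worked out by hand. This is a sketch, assuming an environment that provides TextEncoder (current browsers and Node.js); hex, utf8Hex and utf16Hex are names made up for this example.

```javascript
// Render a number as lowercase hex, left-padded to `width` digits.
function hex(n, width) {
  var s = n.toString(16);
  while (s.length < width) { s = '0' + s; }
  return s;
}

// utf-8 bytes of a string, joined into one hex string.
function utf8Hex(str) {
  var bytes = new TextEncoder().encode(str); // TextEncoder always emits utf-8
  return Array.from(bytes, function (b) { return hex(b, 2); }).join('');
}

// utf-16 code units of a string, joined into one hex string.
function utf16Hex(str) {
  var out = '';
  for (var i = 0; i < str.length; i++) {
    out += hex(str.charCodeAt(i), 4);
  }
  return out;
}

['\ud835\udc52', '\u212f'].forEach(function (ch) {
  console.log(
    '0x' + hex(ch.codePointAt(0), 4) + ' codepoint',
    '0x' + utf8Hex(ch) + ' utf-8',
    '0x' + utf16Hex(ch) + ' utf-16'
  );
});
// 0x1d452 codepoint 0xf09d9192 utf-8 0xd835dc52 utf-16
// 0x212f codepoint 0xe284af utf-8 0x212f utf-16
```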
Say I want to check in JavaScript whether input equals either of the two e's above. What does that look like? Answer:
<!doctype html>
<html>
<head>
  <meta charset="utf-8">
  <title>JavaScript and text encodings</title>
</head>
<body>
<script>
  // Shorthand for document.getElementById.
  var byid = function(id) {
    return document.getElementById(id);
  };

  // Each character is stored twice: pasted as a literal (method 1a) and
  // written with \u escapes (method 1b). The *_string variables are only
  // labels for the console and page output.
  var mathematical_italic_small_e_literal = '𝑒';
  console.log(mathematical_italic_small_e_literal);
  var mathematical_italic_small_e_literal_string = 'mathematical_italic_small_e_literal';
  console.log(mathematical_italic_small_e_literal_string);
  var mathematical_italic_small_e_js_code = '\ud835\udc52';
  console.log(mathematical_italic_small_e_js_code);
  var mathematical_italic_small_e_js_code_string = '\\ud835\\udc52';
  console.log(mathematical_italic_small_e_js_code_string);
  document.write('<p>Written by script: ' + mathematical_italic_small_e_literal_string + ' ' + mathematical_italic_small_e_literal + ' ' + mathematical_italic_small_e_js_code + ' ' + mathematical_italic_small_e_js_code_string + '</p>');

  var script_small_e_literal = 'ℯ';
  console.log(script_small_e_literal);
  var script_small_e_literal_string = 'script_small_e_literal';
  console.log(script_small_e_literal_string);
  var script_small_e_js_code = '\u212f';
  console.log(script_small_e_js_code);
  var script_small_e_js_code_string = '\\u212f';
  console.log(script_small_e_js_code_string);
  document.write('<p>Written by script: ' + script_small_e_literal_string + ' ' + script_small_e_literal + ' ' + script_small_e_js_code + ' ' + script_small_e_js_code_string + '</p>');
</script>
<p>plain html: 𝑒</p>
<p>plain html: ℯ</p>
<form id="form">
  <input type="text" id="input">
</form>
<script>
  // Compare the input against both spellings of each character. The
  // literal and the escaped form are the same string, so a match fires
  // both alerts for that character.
  var checkInput = function() {
    var val = byid('input').value;
    if (val === mathematical_italic_small_e_literal) { alert('Matches ' + mathematical_italic_small_e_literal_string); }
    if (val === mathematical_italic_small_e_js_code) { alert('Matches ' + mathematical_italic_small_e_js_code_string); }
    if (val === script_small_e_literal) { alert('Matches ' + script_small_e_literal_string); }
    if (val === script_small_e_js_code) { alert('Matches ' + script_small_e_js_code_string); }
    // Cancel form submit.
    return false;
  };
  byid('form').onsubmit = checkInput;
</script>
</body>
</html>
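The comparison in checkInput can also be tried without a browser. A minimal sketch, assuming ES2015 (Set, Array.from); SMALL_E_VARIANTS, isSmallE and containsSmallE are hypothetical names that are not part of the page above.

```javascript
var SMALL_E_VARIANTS = new Set([
  '\ud835\udc52', // MATHEMATICAL ITALIC SMALL E, U+1D452
  '\u212f'        // SCRIPT SMALL E, U+212F
]);

// True when the whole input is exactly one of the two characters.
function isSmallE(val) {
  return SMALL_E_VARIANTS.has(val);
}

// True when any codepoint in the input is one of the two characters.
// Array.from splits by codepoint, so a surrogate pair stays together.
function containsSmallE(val) {
  return Array.from(val).some(function (ch) {
    return SMALL_E_VARIANTS.has(ch);
  });
}

console.log(isSmallE('\u212f'));                     // true
console.log(isSmallE('e'));                          // false
console.log(containsSmallE('x = \ud835\udc52 + 1')); // true
console.log(containsSmallE('\ud835'));               // false (lone half of a pair)
```

Splitting by codepoint before comparing avoids the false match that a code-unit search such as indexOf can produce on half of a surrogate pair.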