Last active
August 29, 2015 14:01
JavaScript and text encodings
After much consideration (well, a few hours), I've decided that this is all you really need to
know about how JavaScript deals with Unicode, UTF-8, and UTF-16:
1. Suppose you have a Unicode character that you want to put in a string in
   JavaScript code.
   a. One way to do that is to copy the character from Wikipedia, for
      example, and paste it into your code.
   b. The other way is to encode the character's code point in UTF-16
      and put that value in a string like '\uXXXX', where XXXX is the UTF-16
      encoded value in hex. (Some characters are 4 bytes long when encoded
      in UTF-16, so you would end up with '\uXXXX\uXXXX'.) I really wish XXXX
      was the code point value and not the UTF-16 encoded value, but in this
      form it's just not. (ES2015 adds a '\u{XXXXX}' form that does take the
      code point directly, in engines that support it.) On the bright side,
      for anything that can be encoded in 2 bytes ('\uXXXX'), the UTF-16
      value is the same as the code point value. You only have to go to the
      trouble of figuring out the UTF-16 encoding if you need to use one of
      the more unusual characters that encodes to 4 bytes.
2. It's perfectly fine to save your code file in UTF-8. If you use method
   1a above, you will have UTF-8 data in your code, it's true, but that's
   okay. When the code is parsed everything will be converted and represented
   internally as UTF-16. So as long as the copy-paste handoff works correctly,
   your text editor is smart enough to save the whole file in a consistent
   encoding, and your web server software doesn't mess with it, you shouldn't
   have any issues.
3. In JavaScript the length of '\uXXXX' is one. Unintuitively, the length
   of '\uXXXX\uXXXX' is always two, even if it represents one character.
   As far as length and indexing are concerned, '\uXXXX\uXXXX' is always two
   separate 16-bit units. This behavior can make comparison and searching
   more complicated.
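4. I'm pretty sure that when JavaScript writes text to a web page the text
   is not re-encoded at all: it goes into the DOM as a JavaScript (UTF-16)
   string, and the page's charset only matters when the browser decodes the
   HTML and script files it downloads. For AJAX I expect UTF-8 is the
   default: XMLHttpRequest sends string bodies as UTF-8 and decodes
   responseText as UTF-8 unless the response declares another charset.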
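
Points 1b and 3 can be sketched in a few lines. This is only an illustration: toUtf16Escape is a name I made up, not a built-in, and it assumes the input is a valid code point outside the surrogate range (0xD800 to 0xDFFF).

```javascript
// Convert a Unicode code point to the '\uXXXX' escape(s) for a JavaScript string.
// toUtf16Escape is a hypothetical helper name, not part of the language.
function toUtf16Escape(codePoint) {
  // Format one 16-bit unit as a backslash, 'u', and four hex digits.
  function hex4(n) {
    return '\\u' + ('0000' + n.toString(16)).slice(-4);
  }
  if (codePoint <= 0xffff) {
    // Fits in 2 bytes: the UTF-16 value is the code point itself.
    return hex4(codePoint);
  }
  // Needs 4 bytes: split the 20 remaining bits into a surrogate pair.
  var offset = codePoint - 0x10000;
  var high = 0xd800 + (offset >> 10);   // top 10 bits
  var low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return hex4(high) + hex4(low);
}

console.log(toUtf16Escape(0x212f));   // \u212f
console.log(toUtf16Escape(0x1d452));  // \ud835\udc52

// Point 3: length counts 16-bit units, not characters.
console.log('\u212f'.length);         // 1
console.log('\ud835\udc52'.length);   // 2
```

The two escapes printed here are the same ones used for the two e's in the notes further down.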
More notes and references for digging deeper:
http://en.wikipedia.org/wiki/Letterlike_Symbols
http://en.wikipedia.org/wiki/Mathematical_Alphanumeric_Symbols
http://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16#answer-6240184
http://stackoverflow.com/questions/6240055/manually-converting-unicode-codepoints-into-utf-8-and-utf-16#answer-6240819
http://rishida.net/tools/conversion/
𝑒 = MATHEMATICAL ITALIC SMALL E = 0x1d452 codepoint = 0xf09d9192 utf-8 encoding = 0xd835dc52 utf-16 encoding = \ud835\udc52 JavaScript
ℯ = SCRIPT SMALL E = 0x212f codepoint = 0xe284af utf-8 encoding = 0x212f utf-16 encoding = \u212f JavaScript
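
If you would rather not trust my arithmetic, the values in the two lines above can be recomputed in JavaScript itself. This is a rough sketch: describe is a made-up helper name, and it assumes an environment that provides TextEncoder and the ES2015 methods codePointAt and Array.from (current browsers and recent Node.js versions do).

```javascript
// Report the code point, UTF-8 bytes, and UTF-16 units of a one-character string.
// describe is a hypothetical helper name, not a built-in.
function describe(ch) {
  // Zero-padded lowercase hex of a given width.
  function hex(n, width) {
    var s = n.toString(16);
    while (s.length < width) { s = '0' + s; }
    return s;
  }
  // TextEncoder always encodes to UTF-8.
  var bytes = Array.from(new TextEncoder().encode(ch));
  var utf8 = bytes.map(function (b) { return hex(b, 2); }).join('');
  // charCodeAt returns one 16-bit UTF-16 unit at a time.
  var utf16 = '';
  for (var i = 0; i < ch.length; i++) {
    utf16 += hex(ch.charCodeAt(i), 4);
  }
  return {
    codepoint: '0x' + hex(ch.codePointAt(0), 4),
    utf8: '0x' + utf8,
    utf16: '0x' + utf16
  };
}

console.log(describe('\ud835\udc52'));
// { codepoint: '0x1d452', utf8: '0xf09d9192', utf16: '0xd835dc52' }
console.log(describe('\u212f'));
// { codepoint: '0x212f', utf8: '0xe284af', utf16: '0x212f' }
```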
Say I want to check if input equals either of the two e's I've seen in JavaScript. What does that look like? Answer:
<!doctype html>
<html>
  <head>
    <meta charset="utf-8">
  </head>
  <body>
    <script>
      var byid = function(id) {
        return document.getElementById(id);
      };

      // Each character is stored twice: pasted literally and as a \u escape.
      // The *_string variables hold human-readable labels for the output.
      var mathematical_italic_small_e_literal = '𝑒';
      console.log(mathematical_italic_small_e_literal);
      var mathematical_italic_small_e_literal_string = 'mathematical_italic_small_e_literal';
      console.log(mathematical_italic_small_e_literal_string);
      var mathematical_italic_small_e_js_code = '\ud835\udc52';
      console.log(mathematical_italic_small_e_js_code);
      var mathematical_italic_small_e_js_code_string = '\\ud835\\udc52';
      console.log(mathematical_italic_small_e_js_code_string);
      document.write('<p>Written by script: ' + mathematical_italic_small_e_literal_string + ' ' + mathematical_italic_small_e_literal + ' ' + mathematical_italic_small_e_js_code + ' ' + mathematical_italic_small_e_js_code_string + '</p>');

      var script_small_e_literal = 'ℯ';
      console.log(script_small_e_literal);
      var script_small_e_literal_string = 'script_small_e_literal';
      console.log(script_small_e_literal_string);
      var script_small_e_js_code = '\u212f';
      console.log(script_small_e_js_code);
      var script_small_e_js_code_string = '\\u212f';
      console.log(script_small_e_js_code_string);
      document.write('<p>Written by script: ' + script_small_e_literal_string + ' ' + script_small_e_literal + ' ' + script_small_e_js_code + ' ' + script_small_e_js_code_string + '</p>');
    </script>
    <p>plain html: 𝑒</p>
    <p>plain html: ℯ</p>
    <form id="form">
      <input type="text" id="input">
    </form>
    <script>
      var checkInput = function() {
        var val = byid('input').value;
        // The literal and the escaped form are the same string, so both alerts fire on a match.
        if (val === mathematical_italic_small_e_literal) { alert('Matches ' + mathematical_italic_small_e_literal_string); }
        if (val === mathematical_italic_small_e_js_code) { alert('Matches ' + mathematical_italic_small_e_js_code_string); }
        if (val === script_small_e_literal) { alert('Matches ' + script_small_e_literal_string); }
        if (val === script_small_e_js_code) { alert('Matches ' + script_small_e_js_code_string); }
        // Cancel form submit.
        return false;
      };
      byid('form').onsubmit = checkInput;
    </script>
  </body>
</html>
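
The comparison itself does not need a page or a form. As a smaller sketch of the same check, here is a version that can be run on its own. E_VARIANTS and whichE are names I made up for illustration, and String.fromCodePoint assumes an ES2015 engine.

```javascript
// The two e's from the notes above, written as \u escapes.
// E_VARIANTS and whichE are hypothetical names for this sketch.
var E_VARIANTS = [
  { name: 'MATHEMATICAL ITALIC SMALL E', value: '\ud835\udc52' },
  { name: 'SCRIPT SMALL E', value: '\u212f' }
];

// Return the Unicode name of the matching e, or null if val is neither.
function whichE(val) {
  for (var i = 0; i < E_VARIANTS.length; i++) {
    // === compares the 16-bit units one by one, so a surrogate pair
    // only matches the exact same pair.
    if (val === E_VARIANTS[i].value) {
      return E_VARIANTS[i].name;
    }
  }
  return null;
}

// A string built from the code point is the same string as the escape.
console.log(whichE(String.fromCodePoint(0x1d452))); // MATHEMATICAL ITALIC SMALL E
console.log(whichE('\u212f'));                      // SCRIPT SMALL E
console.log(whichE('e'));                           // null
```

A plain 'e' matches neither, since it is a different code point from both of them.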