TL;DR: You can simulate 8-bit safe "bytestrings" in JavaScript by restricting each character to code points U+0000 to U+00FF, and there are even tricks to easily convert between ordinary UTF-16 JS strings and UTF-8 encoded as such a JavaScript "bytestring".
- one weird trick to convert between ordinary UTF-16 JS strings and "bytestrings", using the deprecated function `unescape()`:
  - `binary_str = unescape(encodeURI(utf16_str))`
  - `utf16_str = decodeURIComponent(escape(binary_str))`
- the way this works is, `encodeURI()`/`encodeURIComponent()` first convert to UTF-8, then percent-encode each byte into 3 ASCII characters - whereas `escape()` directly percent-encodes each UTF-16 code unit into 3 or 6 characters, depending on whether it's >U+00FF or not - so `unescape(encodeURI(utf16_str))` parses the 3-character percent-encoded byte values as code points, producing "bytestrings" of code points all in U+0000 to U+00FF - and `decodeURIComponent(escape(binary_str))` parses the 3-character percent-encoded U+0000 to U+00FF code points as UTF-8 byte values, recreating the original Unicode string (both directions are sketched below)
  - `escape()` escapes more punctuation than `encodeURI()` does, including URI-reserved characters that `decodeURI()` refuses to decode, so `decodeURIComponent()` is necessary to reverse it
- universal browser support (`escape()` is IE3+, `encodeURI()` is IE5.5+; both predate Google Chrome) - credits to this StackOverflow answer
- modern APIs: `TextEncoder` encodes strings as UTF-8 `Uint8Array`s
  - you then have to loop thru the `Uint8Array` to create the 'bytestring', so this can be slower than `unescape(encodeURI())` for alphanumeric-only strings (short, medium, long strings), but if there's a lot of Unicode or emoji, `TextEncoder` is faster (short, medium, long strings)
  - there's a trick with `TextDecoder()` that's even faster for long strings (but a little slower for really short strings); both approaches are sketched below
    - I would love to know if there's a trick to get the `Uint8Array`-to-`Uint16Array` copy even faster somehow
      - okay, I found two tricks (do `Uint16Array.from(someUint8Array)` or, faster, `someUint16Ar.set(someUint8Ar)`) but they're both actually slower than the for-loop?! WTF?
  - for more benchmarks, search jsben.ch for `unescape`
- even without `TextEncoder`, you can get the UTF-8 byte length of a string in browsers using `Blob`, and in Node using `Buffer` (see the snippet below)
(See also: a blogpost covering this material in slightly more detail, a longer and more comprehensive blogpost about Unicode and JavaScript)
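Here's a minimal sketch of that one weird trick, with the intermediate percent-encoded forms spelled out in comments (the function names are mine, just for illustration):

```js
// UTF-16 string -> UTF-8-as-"bytestring" (every char code ends up in 0x00-0xFF)
function toBinaryString(utf16Str) {
  // e.g. encodeURI('😃') === '%F0%9F%98%83'              (UTF-8 bytes, percent-encoded)
  //      unescape('%F0%9F%98%83') === '\xF0\x9F\x98\x83' (each %XX parsed as code point U+00XX)
  return unescape(encodeURI(utf16Str));
}

// "bytestring" -> UTF-16 string
function fromBinaryString(binaryStr) {
  // e.g. escape('\xF0\x9F\x98\x83') === '%F0%9F%98%83'   (each code unit <= U+00FF becomes %XX)
  //      decodeURIComponent('%F0%9F%98%83') === '😃'     (percent-encoded bytes parsed as UTF-8)
  return decodeURIComponent(escape(binaryStr));
}

const binary = toBinaryString('a😃');
console.log(binary.length);             // 5: 1 byte for 'a' + 4 UTF-8 bytes for the emoji
console.log(fromBinaryString(binary));  // 'a😃'
```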
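And a sketch of the modern-API route: `TextEncoder` plus a plain for-loop, and one way the `TextDecoder` trick can work, namely widening the UTF-8 bytes into a `Uint16Array` and decoding that as UTF-16 so every byte 0xNN comes back as code point U+00NN. The function names are illustrative, and the `TextDecoder` variant assumes a little-endian platform (which is effectively every platform JS runs on):

```js
const utf8Encoder = new TextEncoder();             // TextEncoder always encodes to UTF-8
const utf16Decoder = new TextDecoder('utf-16le');  // decodes raw bytes as 2-byte UTF-16LE code units

// TextEncoder + for-loop: build the "bytestring" one code unit at a time
function toBinaryStringLoop(utf16Str) {
  const bytes = utf8Encoder.encode(utf16Str);      // Uint8Array of UTF-8 bytes
  let binary = '';
  for (let i = 0; i < bytes.length; i++) {
    binary += String.fromCharCode(bytes[i]);       // byte 0xNN -> code point U+00NN
  }
  return binary;
}

// TextEncoder + TextDecoder: widen each byte into a 16-bit element, then decode as UTF-16,
// so every byte 0xNN becomes the code unit (and code point) U+00NN.
function toBinaryStringDecoder(utf16Str) {
  const bytes = utf8Encoder.encode(utf16Str);
  const wide = new Uint16Array(bytes.length);
  wide.set(bytes);                                 // zero-extends 0xNN to 0x00NN
  return utf16Decoder.decode(wide);
}

console.log(toBinaryStringLoop('😃') === '\xF0\x9F\x98\x83');     // true
console.log(toBinaryStringDecoder('😃') === '\xF0\x9F\x98\x83');  // true
```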
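And the UTF-8 byte-length snippet from the last bullet (a sketch: `Blob` in browsers, `Buffer` in Node):

```js
const str = 'héllo 😃';

// Browser (and Node 18+): a Blob encodes its string parts as UTF-8, and .size is the byte count
if (typeof Blob !== 'undefined') {
  console.log(new Blob([str]).size);            // 11
}

// Node: Buffer.byteLength counts UTF-8 bytes by default
if (typeof Buffer !== 'undefined') {
  console.log(Buffer.byteLength(str, 'utf8'));  // 11
}
```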
Unicode is the set of international standards for interpreting bytes (which computers natively manipulate) as text (for humans to read and manipulate).
The term "character" is a little ambiguous since in C/C++/Java/etc it always means 1 byte, but there are many encodings for many human languages where a character in the language can take up >1 byte. So Unicode avoids the term "character" altogether and instead defines:
- code points, which are ID numbers, usually for characters (e.g. `a` is assigned Unicode code point U+0061), but may also be something like an acute accent ◌́ (U+0301), which isn't really a character in itself; it's a combining diacritic that only becomes a recognizable character in combination with a letter like `a` to form `á` (see the small example after this list). Note that even though code points are written in hexadecimal, they do not denote any particular byte values/bit pattern, nor do they have an intrinsic bit width
- code units, which are fixed-width groups of bytes: for example, a UTF-8 code unit is a byte, a UTF-16 code unit is a 2-byte word, and a UTF-32 code unit is a 4-byte word
- even more character-like things like "grapheme clusters", but they aren't relevant to this document.
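For instance, here's how the combining-diacritic point plays out in JS:

```js
const precomposed = '\u00E1';  // 'á' as the single code point U+00E1
const combining = 'a\u0301';   // 'a' (U+0061) followed by the combining acute accent (U+0301)

console.log(precomposed, combining);                      // both render as 'á'
console.log(precomposed === combining);                   // false: different code point sequences
console.log(combining.normalize('NFC') === precomposed);  // true: NFC composes them into U+00E1
```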
UTF-8, UTF-16, and UCS-2 are ways to represent Unicode code points using byte values. Note the separation of concerns: once upon a time byte values were mapped directly to characters in extremely fragile, forwards-incompatible "character sets" and "code pages" and the like; instead, Unicode maps byte values to code points and code points to characters (or sometimes, groups of code points to characters). Byte-value-to-code-point mappings like UTF-8 were defined decades ago and are essentially unchanged today, and all the code that deals with those bytes can remain blissfully unaware of all the additional code points that Unicode keeps assigning to new and terrible emojis every year 😜😘🤣.
UTF-8 is what everybody wants to use and tries to use as much as possible. It's variable-width, with each Unicode code point taking up 1-4 bytes depending on the code point. Code points in ASCII only take 1 byte, code points in Latin-1 take 2 bytes, emoji take 4 bytes, etc.
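For example, counting the UTF-8 bytes per code point with `TextEncoder`:

```js
const utf8 = new TextEncoder();
console.log(utf8.encode('a').length);   // 1 byte:  ASCII (U+0061)
console.log(utf8.encode('é').length);   // 2 bytes: Latin-1 (U+00E9)
console.log(utf8.encode('€').length);   // 3 bytes: BMP beyond Latin-1 (U+20AC)
console.log(utf8.encode('😃').length);  // 4 bytes: beyond the BMP (U+1F603)
```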
UCS-2 is an obsolete 2-byte format for Unicode. Originally, Unicode was planned to have at most 2^16 = 65,536 code points, which map directly to 2-byte values. A whole generation of programming languages (JS, Java, Python, Ruby, all of .NET, etc) used UCS-2 for their native string type. Eventually it became clear 65,536 wouldn't be enough (especially once China made GB 18030 a legally mandated standard, which had a bunch of characters that UCS-2 didn't have enough room left for), and UTF-16 was born.
UTF-16 is a legacy format created for backwards-compatibility with UCS-2. To represent code points beyond the first 65,536, it is variable-width like UTF-8, but each code point can only be 2 or 4 bytes. Out of the first 65,536 Unicode code points (the BMP), the 55,503 that have been assigned characters all map to the same byte values in UCS-2 and UTF-16 (namely the code point value itself: in big-endian, `a` = U+0061 maps to `00 61`, etc). Another 2,048 code points in the BMP were reserved from being assigned characters and instead were declared "surrogates", and pairs of them are used to represent code points beyond the first 65,536. For example, code point U+1F603 (the smiley emoji 😃) is encoded as the 4-byte value `d8 3d de 03` (in big-endian), where `d8 3d` and `de 03` are each 2-byte surrogate code units.
Note that that 4-byte value is simultaneously 1 Unicode code point, 2 UTF-16 code units, and 2 UCS-2 code units, providing backwards-compatibility with code that works with UCS-2 code units.
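You can see those code units directly from JS:

```js
const smiley = '😃';  // code point U+1F603

// the two UCS-2/UTF-16 code units are the surrogate pair d83d, de03
console.log(smiley.charCodeAt(0).toString(16));   // 'd83d'
console.log(smiley.charCodeAt(1).toString(16));   // 'de03'

// codePointAt() is UTF-16-aware and reassembles the pair into the actual code point
console.log(smiley.codePointAt(0).toString(16));  // '1f603'
```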
In particular, JavaScript strings have UCS-2 semantics, not UTF-16 semantics. So `'😃'.length === 2`, even though it only contains one human-recognizable character (a smiley emoji), because that string consists of 2 UCS-2 code units. Moreover, JS strings treat surrogate code points as normal characters, whereas in UTF-16 they're only valid as part of a surrogate pair. For example, the result of `'😃'.charAt(0)` is a perfectly valid string value of length 1 as far as JS is concerned, but is invalid UTF-16 as far as Unicode is concerned.
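Concretely:

```js
const smiley = '😃';

console.log(smiley.length);       // 2: two UCS-2 code units for one emoji
const lone = smiley.charAt(0);    // '\uD83D', a lone high surrogate
console.log(lone.length);         // 1: a perfectly valid JS string value...

try {
  encodeURIComponent(lone);       // ...but not valid UTF-16, so UTF-16-aware functions reject it
} catch (e) {
  console.log(e.name);            // 'URIError'
}
```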
(Though JS string values have strictly UCS-2 semantics, there are built-in JS functions that are UTF-16-aware, like `TextEncoder` and `encodeURI()`, which is important for the `unescape()` trick.)
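For example:

```js
// encodeURI() is UTF-16-aware: it reassembles the surrogate pair into U+1F603,
// encodes that code point as UTF-8, and percent-encodes each byte
console.log(encodeURI('😃'));  // '%F0%9F%98%83'

// escape() is not: it just percent-encodes each UCS-2 code unit separately
console.log(escape('😃'));     // '%uD83D%uDE03'
```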
What if you're working in JS but want to pretend you're working with UTF-8, like a modern language? Here's a trick: each UCS-2 code unit can represent up to 65,536 values, but a byte of a UTF-8 string can only be one of 256 values, so each byte value fits comfortably in one code unit as the code point U+0000 thru U+00FF with that value (and indeed, every one of those code points is assigned a character). For example, the UTF-8 encoding of `'😃'` is `f0 9f 98 83`, which we can represent with a length-4 JS string of code points U+00F0, U+009F, U+0098, U+0083: `'\xF0\x9F\x98\x83'`.
Of course, if you want to `console.log()` it or convert it to real UTF-8 or whatnot, you have to convert back to native JS Unicode strings, but during manipulation you can pretend that JS strings are 8-bit safe bytestrings!
MDN calls these "binary strings", credits to this StackOverflow comment for the tip.
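And if you eventually need the actual bytes out of such a "binary string" (say, as a `Uint8Array`), a plain `charCodeAt()` loop does it; a small sketch with an illustrative function name:

```js
// "binary string" (all code units <= U+00FF) -> Uint8Array of the same byte values
function binaryStringToBytes(binaryStr) {
  const bytes = new Uint8Array(binaryStr.length);
  for (let i = 0; i < binaryStr.length; i++) {
    bytes[i] = binaryStr.charCodeAt(i);  // code point U+00NN -> byte 0xNN
  }
  return bytes;
}

const utf8Bytes = binaryStringToBytes(unescape(encodeURI('😃')));
console.log(utf8Bytes);  // Uint8Array(4) [ 240, 159, 152, 131 ], i.e. f0 9f 98 83
```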