@mathiasbynens
Created September 26, 2011 19:50
Escape all characters in a string using both Unicode and hexadecimal escape sequences
// Ever needed to escape '\n' as '\\n'? This function does that for any character,
// using hex and/or Unicode escape sequences (whichever are shortest).
// Demo: http://mothereff.in/js-escapes
function unicodeEscape(str) {
    return str.replace(/[\s\S]/g, function(character) {
        var escape = character.charCodeAt().toString(16),
            longhand = escape.length > 2;
        return '\\' + (longhand ? 'u' : 'x') + ('0000' + escape).slice(longhand ? -4 : -2);
    });
}
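
A quick usage sketch of the function above, with sample inputs of my own choosing (they are not from the gist). Code units below 0x100 should come out as two-digit `\x` escapes and everything else as four-digit `\u` escapes:

```javascript
// Same function as in the gist, repeated so the snippet runs on its own.
function unicodeEscape(str) {
    return str.replace(/[\s\S]/g, function(character) {
        var escape = character.charCodeAt(0).toString(16),
            longhand = escape.length > 2;
        return '\\' + (longhand ? 'u' : 'x') + ('0000' + escape).slice(longhand ? -4 : -2);
    });
}

console.log(unicodeEscape('\n'));        // \x0a
console.log(unicodeEscape('Hi\u2603'));  // \x48\x69\u2603  (U+2603 is a snowman)
```

Note that every character is escaped, including plain ASCII letters, which is what the gist title promises.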
@mathiasbynens
Author

@kitcambridge It just “dawned on me” that we could use [^] instead of [\s\S] if IE < 9 support is not an issue. Performance tests here: http://jsperf.com/match-any-char-regex
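
To illustrate why either character class is needed here instead of a plain `.`, a small sketch (the sample string is made up for this example): both `[\s\S]` and `[^]` match line terminators, while `.` without the `s` flag skips them.

```javascript
var sample = 'a\nb\u2028c'; // contains a line feed and a line separator

console.log(sample.match(/[\s\S]/g).length); // 5
console.log(sample.match(/[^]/g).length);    // 5
console.log(sample.match(/./g).length);      // 3
```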

@mathiasbynens
Author

mathiasbynens commented Nov 29, 2011

Okay, so we use Unicode escapes (e.g. \u1234) and hexadecimal escapes (e.g. \x12)… What about octal escapes (e.g. \123)?

I quickly tested this in Node.js:

(function() {
    var str = '',
        charCode,
        escape1,
        escape2,
        hexadecimal,
        octal;
    for (charCode = 0; charCode <= 65535; charCode++) {
        hexadecimal = charCode.toString(16);
        octal = charCode.toString(8);
        escape1 = charCode < 256
            ? '\\x' + (charCode > 15 ? '' : '0') + hexadecimal
            : '\\u' + ('0000' + hexadecimal).slice(-4);
        // octal escapes only exist for charCodes up to 255 (octal 377)
        escape2 = charCode < 256 ? '\\' + octal : false;
        // log all characters for which octal escapes are the shortest solution
        if (escape2 && escape2.length < escape1.length) {
            console.log(charCode, String.fromCharCode(charCode), escape1, escape2);
        }
    }
}());

Octal escapes can only be used for charCodes smaller than 256, and the test results show that they’re only shorter than Unicode/hex escapes for charCodes < 64:

 0  '\u0000'    '\\x00' '\\0'
 1  '\u0001'    '\\x01' '\\1'
 2  '\u0002'    '\\x02' '\\2'
 3  '\u0003'    '\\x03' '\\3'
 4  '\u0004'    '\\x04' '\\4'
 5  '\u0005'    '\\x05' '\\5'
 6  '\u0006'    '\\x06' '\\6'
 7  '\u0007'    '\\x07' '\\7'
 8  '\b'        '\\x08' '\\10'
 9  '\t'        '\\x09' '\\11'
10  '\n'        '\\x0a' '\\12'
11  '\u000b'    '\\x0b' '\\13'
12  '\f'        '\\x0c' '\\14'
13  '\r'        '\\x0d' '\\15'
14  '\u000e'    '\\x0e' '\\16'
15  '\u000f'    '\\x0f' '\\17'
16  '\u0010'    '\\x10' '\\20'
17  '\u0011'    '\\x11' '\\21'
18  '\u0012'    '\\x12' '\\22'
19  '\u0013'    '\\x13' '\\23'
20  '\u0014'    '\\x14' '\\24'
21  '\u0015'    '\\x15' '\\25'
22  '\u0016'    '\\x16' '\\26'
23  '\u0017'    '\\x17' '\\27'
24  '\u0018'    '\\x18' '\\30'
25  '\u0019'    '\\x19' '\\31'
26  '\u001a'    '\\x1a' '\\32'
27  '\u001b'    '\\x1b' '\\33'
28  '\u001c'    '\\x1c' '\\34'
29  '\u001d'    '\\x1d' '\\35'
30  '\u001e'    '\\x1e' '\\36'
31  '\u001f'    '\\x1f' '\\37'
32  ' '         '\\x20' '\\40'
33  '!'         '\\x21' '\\41'
34  '"'         '\\x22' '\\42'
35  '#'         '\\x23' '\\43'
36  '$'         '\\x24' '\\44'
37  '%'         '\\x25' '\\45'
38  '&'         '\\x26' '\\46'
39  '\''        '\\x27' '\\47'
40  '('         '\\x28' '\\50'
41  ')'         '\\x29' '\\51'
42  '*'         '\\x2a' '\\52'
43  '+'         '\\x2b' '\\53'
44  ','         '\\x2c' '\\54'
45  '-'         '\\x2d' '\\55'
46  '.'         '\\x2e' '\\56'
47  '/'         '\\x2f' '\\57'
48  '0'         '\\x30' '\\60'
49  '1'         '\\x31' '\\61'
50  '2'         '\\x32' '\\62'
51  '3'         '\\x33' '\\63'
52  '4'         '\\x34' '\\64'
53  '5'         '\\x35' '\\65'
54  '6'         '\\x36' '\\66'
55  '7'         '\\x37' '\\67'
56  '8'         '\\x38' '\\70'
57  '9'         '\\x39' '\\71'
58  ':'         '\\x3a' '\\72'
59  ';'         '\\x3b' '\\73'
60  '<'         '\\x3c' '\\74'
61  '='         '\\x3d' '\\75'
62  '>'         '\\x3e' '\\76'
63  '?'         '\\x3f' '\\77'

Of course, it’s problematic if you have e.g. '\0' immediately followed by another digit, e.g. 1, as it will alter the escape rather than append a new character:

'\0' == '\x00' // true
'\01' == '\x001' // false
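
A sketch that makes the difference visible by looking at string lengths. It compiles the literals through the Function constructor so the snippet also runs from strict-mode files; `evaluate` is a hypothetical helper name, not something from the gist.

```javascript
// Hypothetical helper: evaluates a string literal given as source text.
// Code compiled by the Function constructor is non-strict unless it opts in,
// so legacy octal escapes are still accepted here.
function evaluate(literalSource) {
    return new Function('return ' + literalSource + ';')();
}

var nul = evaluate("'\\0'");             // the single character U+0000
var octal = evaluate("'\\01'");          // parsed as ONE octal escape: U+0001
var hexThenDigit = evaluate("'\\x001'"); // U+0000 followed by the digit "1"

console.log(nul.length, octal.length, hexThenDigit.length); // 1 1 2
console.log(octal === '\x01');                               // true
console.log(octal === hexThenDigit);                         // false
```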

Update: We probably shouldn’t use them:

Past editions of ECMAScript have included additional syntax and semantics for specifying octal literals and octal escape sequences. These have been removed from this edition of ECMAScript. This non-normative annex presents uniform syntax and semantics for octal literals and octal escape sequences for compatibility with some older ECMAScript programs.


ghost commented Nov 29, 2011

@mathiasbynens Yes, it's best to avoid octal escape sequences...the OctalEscapeSequence production is deprecated in ES 5, and produces a syntax error in strict mode:

A conforming implementation, when processing strict mode code (see 10.1.1), may not extend the syntax of EscapeSequence to include OctalEscapeSequence as described in B.1.2. —Annex C
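
A small sketch of how this shows up in practice, assuming a current engine such as the V8 in Node.js. `compiles` is a hypothetical helper that only checks whether a function body parses:

```javascript
// Returns true if the given function body compiles, false on a SyntaxError.
function compiles(body) {
    try {
        new Function(body);
        return true;
    } catch (err) {
        if (err instanceof SyntaxError) return false;
        throw err;
    }
}

console.log(compiles("return '\\12';"));                // true  (non-strict code)
console.log(compiles("'use strict'; return '\\12';"));  // false (octal escape rejected)
console.log(compiles("'use strict'; return '\\x0a';")); // true  (hex escape is fine)
```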

@brandonros

I'm throwing this up here hoping to help somebody else down the road.

I had to restore partial keys from a Redis dump, and this function almost helped. Here is what I came up with.

Make sure to create the Redis client like this:

var client = redis.createClient(global['redis_port'], global['redis_host'], { return_buffers: true });

var fs = require('fs');

var redis = require('../lib/redis.js');

function e(buf) {
    var res = '';

    // buf is a Buffer, so buf.length is already its size in bytes
    for (var i = 0; i < buf.length; ++i) {
        var c = buf[i].toString(16);
        if (c.length == 1) {
            c = '0' + c;
        }

        res += '\\x' + c;
    }

    return res;
}

function generate_dump() {
    var keys = fs.readFileSync('keys.txt').toString().split('\n');

    return keys.reduce(function (prev, key) {
        return prev.then(function () {
            return redis.dump(key)
                .then(function (res) {
                    if (!res) {
                        console.log('missing key', key);

                        return;
                    }

                    fs.appendFileSync('dump.txt', 'RESTORE ' + key + ' 0 "' + e(res) + '"\n');
                });
        });
    }, Promise.resolve());
}

redis.init()
.then(function () {
    return generate_dump();
})
.then(function () {
    console.log('done');
})
.catch(function (err) {
    console.log(err['stack']);
});
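
For anyone who only needs the byte-escaping part, here is a minimal standalone sketch of what the helper produces. `escapeBytes` and the sample bytes are illustrative and not part of any Redis client library; it assumes Node.js, where `Buffer` is a global.

```javascript
// Hypothetical standalone version of the e() helper above.
function escapeBytes(buf) {
    var res = '';
    for (var i = 0; i < buf.length; ++i) {
        // Pad each byte to two hex digits.
        res += '\\x' + ('0' + buf[i].toString(16)).slice(-2);
    }
    return res;
}

var sample = Buffer.from([0x00, 0x0a, 0xff, 0x41]);
console.log(escapeBytes(sample)); // \x00\x0a\xff\x41
```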

@adamvleggett

If the goal is to do this with minimal code size, the following works well and minifies to ~100 bytes:

function escapeUnicode(str) {
    return str.replace(/[^\0-~]/g, function(ch) {
        return "\\u" + ("000" + ch.charCodeAt().toString(16)).slice(-4);
    });
}
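
A short usage sketch with made-up inputs. Unlike the original gist function, this version leaves everything from U+0000 through `~` (U+007E) alone and only escapes DEL and code units above it:

```javascript
// Same function as above, repeated so the snippet runs on its own.
function escapeUnicode(str) {
    return str.replace(/[^\0-~]/g, function(ch) {
        return "\\u" + ("000" + ch.charCodeAt(0).toString(16)).slice(-4);
    });
}

console.log(escapeUnicode("na\u00efve \u2603")); // na\u00efve \u2603
console.log(escapeUnicode("\x7f"));              // \u007f
console.log(escapeUnicode("a\tb") === "a\tb");   // true: the tab is ASCII, so it is kept
```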

@F1LT3R

F1LT3R commented Dec 15, 2016

Fantastic! Thanks for this @mathiasbynens!

@mervick

mervick commented Nov 13, 2018

Replace only Unicode characters (U+00A0 and above):

function escapeUnicode(str) {
  return str.replace(/[\u00A0-\uffff]/gu, function (c) {
    return "\\u" + ("000" + c.charCodeAt().toString(16)).slice(-4)
  });
}

I use this to convert the UTF-8 content of JS files to Latin-1.

@rafaelvanat

Very interesting work guys, thanks for sharing.
@mervick, yours was especially useful for my use case. Are there any restrictions on using it? Thanks!

@mervick

mervick commented Dec 19, 2019

@rafaelvanat I have used that in my project for more than a year, and so far there have been no problems.

@josephrocca

josephrocca commented Jun 18, 2020

@mervick @rafaelvanat If I use that function like this:

escapeUnicode("abc𝔸𝔹ℂ")

Then I get:

abc𝔸𝔹\u2102

The following function fixes this by splitting the string in a "Unicode-safe" way (using [...str], which iterates by code point) and testing each resulting character against the ASCII range. Each non-ASCII character is then split into its UTF-16 code units, and every code unit gets its own escape (rather than just grabbing the first char code of each Unicode character):

function escapeUnicode(str) {
  return [...str].map(c => /^[\x00-\x7F]$/.test(c)
    ? c
    : c.split("").map(a => "\\u" + a.charCodeAt().toString(16).padStart(4, "0")).join("")
  ).join("");
}

This gives the correct result:

abc\ud835\udd38\ud835\udd39\u2102

This seems to work fine in all my tests so far, but if I find any bugs I'll add fixes in this gist. Performance doesn't matter for my use-case, so I haven't benchmarked or optimised it at all.
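
For comparison, here is a sketch of what should be an equivalent formulation, working on UTF-16 code units directly with a regex that has no `u` flag. `escapeNonAscii` is a hypothetical name, and the round-trip check through `JSON.parse` is just one way to sanity-check the output, under the assumption that the input has no quotes, backslashes, or control characters.

```javascript
// Hypothetical variant: escape every UTF-16 code unit outside the ASCII range.
// Without the `u` flag, the regex sees the two surrogate halves of an astral
// character as separate matches.
function escapeNonAscii(str) {
    return str.replace(/[^\x00-\x7F]/g, function (unit) {
        return "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0");
    });
}

var input = "abc\u{1D538}\u{1D539}\u2102"; // the "abc𝔸𝔹ℂ" string from above
var escaped = escapeNonAscii(input);
console.log(escaped); // abc\ud835\udd38\ud835\udd39\u2102

// Round trip: JSON understands \uXXXX escapes, so parsing the escaped text
// as a JSON string should give back the input.
console.log(JSON.parse('"' + escaped + '"') === input); // true
```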

@mathiasbynens
Author

Check out jsesc which solves this problem in a more robust manner.

@josephrocca

josephrocca commented Jun 19, 2020

@mathiasbynens It looks great! I did try to use it but unfortunately I'm not up to date with all the browserify/bundling stuff and just need a vanilla JS script (e.g. no use of Buffer) to include in a module import and wasn't able to work out how to do that with jsesc (though I admit I only poked around for a few minutes before deciding to write the function above). Also, out of pure curiosity I'd be interested in cases where the above function fails - I couldn't find any failing cases in my tests.

@mathiasbynens
Author
