Last active
December 18, 2015 14:19
Unicode handling
%% If your shell is set up correctly, then inputting Unicode chars should work like this:
1> A = "žžžūvis".
[382,382,382,363,118,105,115]
%% Contains integers > 255,
%% hence this is a list of codepoints, good!
%% None of these will work:
17> <<"žžžūvis">>.
** exception error: bad argument
18> <<"žžžūvis"/utf8>>.
** exception error: bad argument
%% Do this instead:
6> B = unicode:characters_to_binary(A).
<<"žžžūvis">>
%% This might seem bad, but it's just the funny formatting of the Erlang shell,
%% it's actually correct! To get it printed correctly use ~ts:
15> io:format("~ts~n", [B]).
žžžūvis
ok
%% To inspect the actual bytes, use the ~w control sequence,
%% which prints the "raw" data instead of trying to produce printable chars:
8> io:format("~w~n", [B]).
<<197,190,197,190,197,190,197,171,118,105,115>>
%% Yay, a UTF-8 encoded binary (no integers > 255 present).
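%% A quick sanity check (a sketch, assuming A and B are bound as above; the
%% prompt number is illustrative): compare the number of codepoints with the
%% number of encoded bytes. Each ž and ū takes two bytes in UTF-8,
%% so 7 codepoints become 11 bytes:
19> {length(A), byte_size(B)}.
{7,11}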
%% The usual way to mess up things:
9> BAD = binary_to_list(B).
"žžžūvis"
10> io:format("~w~n", [BAD]).
[197,190,197,190,197,190,197,171,118,105,115]
%% Oh no, no longer a list of codepoints, but a list of UTF-8 bytes!
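%% A sketch of the safer conversion, assuming B is still the UTF-8 binary
%% bound earlier (the prompt number is illustrative):
%% unicode:characters_to_list/1 decodes the bytes back into codepoints.
20> io:format("~w~n", [unicode:characters_to_list(B)]).
[382,382,382,363,118,105,115]
ok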
%% Going the other way with list_to_binary/1 at least fails explicitly rather than misleading you:
11> BAD2 = list_to_binary(A).
** exception error: bad argument
     in function list_to_binary/1
        called as list_to_binary([382,382,382,363,118,105,115])
%% Same with format:
16> io:format("~s~n", [A]).
** exception error: bad argument
     in function io:format/3
        called as io:format(<0.25.0>,"~s~n",[[382,382,382,363,118,105,115]])
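%% With the t modifier the same list prints fine (a sketch, assuming the
%% shell handles Unicode output; the prompt number is illustrative):
21> io:format("~ts~n", [A]).
žžžūvis
ok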
%% Another great way to mess up is to UTF-8 encode a list of UTF-8 bytes
%% instead of a list of Unicode codepoints:
12> BeyondHorrible = unicode:characters_to_binary(BAD).
<<195,133,194,190,195,133,194,190,195,133,194,190,195,133,
194,171,118,105,115>>
13> io:format("~s~n", [BeyondHorrible]).
à ¾à ¾à ¾à «vis
ok
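%% If you suspect this kind of double encoding, one possible repair (a sketch,
%% only valid when every decoded codepoint is =< 255, i.e. the data really was
%% encoded twice; the prompt number is illustrative) is to decode once and
%% treat the resulting integers as bytes again:
22> list_to_binary(unicode:characters_to_list(BeyondHorrible)) =:= B.
true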