Created
February 12, 2015 23:21
BOM
21:13 < dbolser_> On another issue... I'm using LWP::Simple to grab this: | |
https://letstalkbitcoin.com/api/v1/forum/threads, which is | |
"Content-Type:application/json", however, when I decode_json | |
(using JSON), I get the error: malformed JSON string, neither | |
array, object, number, string or atom, at character offset 0 | |
(before "\x{ef}\x{bb}\x{bf}{"...") at ./get_and_load_data.plx | |
line 24 | |
21:14 < dngor> Maybe it's compressed. | |
21:14 < mauke> no, UTF-8 BOM | |
21:14 < mauke> a.k.a. malformed JSON | |
21:14 < dbolser_> https://gist.github.com/anonymous/a24ff7317bdd7dda54b8 | |
21:14 < dbolser_> mauke: you mean it's a server side issue? | |
21:15 < mauke> "issue" ... I guess | |
21:15 < mauke> do you know what a BOM is? | |
21:15 < dbolser_> no | |
21:15 < dngor> Something you can't talk about at airports or in municipal | |
buildings. | |
21:15 < dbolser_> FREEDOM!... | |
21:16 < thrig> also, gunpowder tea | |
21:16 < mauke> ok, this is going to be fun | |
21:16 < mauke> dbolser_: do you know what unicode is? | |
21:16 < dngor> I curl'd it through head -c and hexdump -C and I see what you | |
mean about the BOM. | |
21:16 < dbolser_> mauke: only vaguely... as something I have to work around | |
when things stop being ascii | |
21:17 < dngor> tl;dr: $content =~ s/^[^{]*// first. | |
21:17 < dbolser_> ahhh... | |
21:17 < mauke> $content =~ s/^\x{ef}\x{bb}\x{bf}//; # better | |
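[Editor's sketch] A minimal, self-contained version of mauke's fix, assuming the response body arrives as raw bytes. The sample payload and its "threads" key are made up for illustration; in the real script $content would come from LWP::Simple's get(). JSON::PP is used here only because it ships with Perl; the original script used JSON.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Remove a leading UTF-8 BOM (bytes EF BB BF) from a raw byte string, if present.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Hypothetical stand-in for the HTTP response body (BOM followed by JSON).
my $content = "\xEF\xBB\xBF" . '{"threads":[]}';

my $data = decode_json(strip_utf8_bom($content));
print scalar(@{ $data->{threads} }), " threads\n";
```

Unlike s/^[^{]*//, this removes exactly the three BOM bytes and nothing else, so a response that is broken in some other way still fails loudly.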
21:17 * dbolser_ runs off ignorant but happy | |
21:17 < mauke> and maybe report a bug to them | |
21:17 < mauke> because their "JSON" api returns shit | |
21:18 < dbolser_> mauke: what words should I pretend to understand in my bug | |
report? | |
21:18 < dngor> And I suppose pray that the payload isn't otherwise corrupt. | |
21:18 < mauke> dbolser_: unicode is a character set. it assigns numbers to | |
characters | |
21:18 < mauke> it's a superset of ascii, so 'A' = 65 in both ascii and unicode | |
21:19 < dbolser_> what do you know! it works | |
21:19 < dbolser_> ok | |
21:19 < mauke> the difference is that ascii only has 128 characters (7 bits) | |
but unicode has a lot more (21 bits) | |
21:19 < blooney> 21 bits? | |
21:19 < mauke> so the problem is: how do you actually turn those numbers into | |
bytes so you can store them in files? | |
21:20 < cfedde> yeah. funny number. | |
21:20 < blooney> I thought that it was all in bytes | |
21:20 < mauke> this is where encodings come in | |
21:20 < dbolser_> ok... so far.. I think... | |
21:20 * dbolser_ goes to put daughter back to bed... she doesn't sleep! | |
21:20 < mauke> UTF-32 pads every 21-bit number with zeroes until you have a | |
32-bit number, which is 4 bytes | |
21:21 < blooney> I mean, I was pretty sure that they just took the eight bit | |
that was used in other encoding and pushed there their weird | |
logic to indicate multi-byte characters and that stuff | |
21:21 < mauke> which you can then write to a file | |
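[Editor's sketch] The zero padding is easy to see with the core Encode module. hex_bytes is a helper invented for this example, and the big-endian variant is chosen only so the padding appears on the left.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a byte string, e.g. "00 00 00 41".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# UTF-32BE: every code point becomes exactly 4 bytes, padded with zeroes.
print hex_bytes(encode('UTF-32BE', 'A')), "\n";          # 00 00 00 41
print hex_bytes(encode('UTF-32BE', "\x{1F600}")), "\n";  # 00 01 F6 00
```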
21:21 < blooney> oh damn | |
21:21 < kerframil> dbolser: tell them to read the section on encoding in rfc | |
4627, and mention that utf-8 is always little endian | |
21:21 * blooney now has to rethink everything | |
21:22 < mauke> UTF-16 is a bit more complicated. characters that fit in 16 bits | |
are kept as is; other characters are encoded as "surrogate pairs" | |
21:22 < cfedde> or just read the wikipedia page unless you need the gross | |
details. | |
21:22 < mauke> that is, there's a special range of unicode codepoints that are | |
not used for characters | |
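[Editor's sketch] The surrogate-pair arithmetic, written out by hand. surrogate_pair is a name made up for this example; it only handles code points above U+FFFF, which are the ones that do not fit in 16 bits.

```perl
use strict;
use warnings;

# Split a code point above U+FFFF into a UTF-16 surrogate pair.
sub surrogate_pair {
    my ($cp) = @_;
    die "code point fits in 16 bits\n" if $cp < 0x10000;
    my $v    = $cp - 0x10000;           # 20 bits remain
    my $high = 0xD800 + ($v >> 10);     # top 10 bits
    my $low  = 0xDC00 + ($v & 0x3FF);   # bottom 10 bits
    return ($high, $low);
}

printf "%04X %04X\n", surrogate_pair(0x1F600);   # D83D DE00
```

The ranges D800..DBFF and DC00..DFFF are the reserved code points mauke mentions: they never stand for characters on their own.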
21:22 < cfedde> utf-8 is pretty much the winner. for a number of reasons. | |
21:22 < mauke> but whatever | |
21:22 < mauke> UTF-8 is both trickier and simpler | |
21:23 < Grinnz_> just ask IRC | |
21:23 < mauke> 7-bit characters (i.e. ascii) are stored as is | |
21:23 < Grinnz_> well, IRC clients :) | |
21:23 < cfedde> At one end it it is "just ascii" but it gets silly after than. | |
21:23 < cfedde> that | |
21:23 < ttkai> mmm ascii | |
21:23 < mauke> other characters are stored according to some variable-width | |
encoding scheme; details omitted | |
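[Editor's sketch] Without going into the bit layout, the variable width can be observed directly. utf8_length is a helper invented for this example and the four sample code points are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes a single code point occupies once encoded as UTF-8.
sub utf8_length {
    my ($cp) = @_;
    return length(encode('UTF-8', chr($cp)));
}

# 'A', e-acute, euro sign, an emoji: 1, 2, 3 and 4 bytes respectively.
printf "U+%04X -> %d byte(s)\n", $_, utf8_length($_)
    for 0x41, 0xE9, 0x20AC, 0x1F600;
```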
21:24 < Grinnz_> IRC clients generally send that windows version of latin1, but | |
utf-8 encodes it if there's characters > 256 | |
21:24 < Grinnz_> so the decoding is fun | |
21:24 < mauke> the issue with UTF-32 and UTF-16 is that they deal with 4 byte / | |
2 byte entities, but there are two different ways to store them | |
in files | |
21:24 < mauke> big endian and little endian! | |
21:24 < Grinnz_> oh god endianness | |
21:25 < mauke> so let's say your character has the number 43794 in unicode | |
21:25 < mauke> that's 0xAB12 in hex | |
21:25 < cfedde> things get messy when you try to preserve backward | |
compatibility while supporting extension. | |
21:25 < blooney> why can't we just decide which endianness everyone will use? | |
21:25 < blue_sky> Grinnz_: female endians | |
21:26 < mauke> serializing that to bytes can give you either {AB, 12} or {12, | |
AB}, depending on which endianness you're using | |
21:26 < cfedde> blooney: history. | |
21:26 < mauke> so there are two variants, UTF-16LE and UTF-16BE (same for | |
UTF-32) | |
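[Editor's sketch] mauke's 0xAB12 example, serialized in each byte order with the core Encode module. hex_bytes is a helper invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a byte string, e.g. "AB 12".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $char = "\x{AB12}";   # code point 43794

print hex_bytes(encode('UTF-16BE', $char)), "\n";   # AB 12
print hex_bytes(encode('UTF-16LE', $char)), "\n";   # 12 AB
print hex_bytes(encode('UTF-32BE', $char)), "\n";   # 00 00 AB 12
print hex_bytes(encode('UTF-32LE', $char)), "\n";   # 12 AB 00 00
```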
21:27 < mauke> so the next problem is, given a document that is in "UTF-16", | |
how do you tell which endianness was used? | |
21:27 < average> mauke: I recently opened the Unicode book and I was horrified | |
by the many variants | |
21:27 < cfedde> It would have been nice if the authors of the encoding had put | |
in a marker for this. | |
21:27 < average> mauke: about your question with the endianness to use, there | |
was some specific byte for that | |
21:27 < average> mauke: like cfedde says, the marker | |
21:27 < mauke> the trick that was used is to prepend the character 0xFEFF to | |
the document | |
21:27 < average> BOM | |
21:27 < average> I think it was called BOM byte | |
21:28 < mauke> 0xFEFF is a "zero width no-break space", i.e. an invisible space | |
21:28 < average> http://en.wikipedia.org/wiki/Byte_order_mark | |
21:28 < mauke> so when you're reading the document and you see the bytes { FE, | |
FF } you know it's big endian | |
21:28 < blue_sky> average: mauke isn't exactly being obtuse in his explanation, | |
let him get on with it. | |
21:28 < mauke> and if it's { FF, FE }, it's little endian | |
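[Editor's sketch] The check mauke describes, as a small function. utf16_byte_order is a name made up for this example, and it assumes the input is already known to be some flavour of UTF-16; it does not try to tell UTF-16 apart from UTF-32 or anything else.

```perl
use strict;
use warnings;

# Guess the byte order of a UTF-16 document from its first two bytes.
sub utf16_byte_order {
    my ($bytes) = @_;
    return 'big endian'    if $bytes =~ /\A\xFE\xFF/;
    return 'little endian' if $bytes =~ /\A\xFF\xFE/;
    return 'unknown';      # no BOM present
}

print utf16_byte_order("\xFE\xFF\x00\x41"), "\n";   # big endian
print utf16_byte_order("\xFF\xFE\x41\x00"), "\n";   # little endian
print utf16_byte_order("\x00\x41"), "\n";           # unknown
```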
21:28 < blooney> "The Unicode Standard permits the BOM in UTF-8" | |
21:29 < _AxS_> kerframil: pink_mist: thanks! | |
21:29 < mauke> 0xFFFE is an invalid codepoint so there's no ambiguity | |
21:29 < cfedde> hed go BOM | |
21:30 < mauke> 0xFEFF at the start of the document is called a "byte order | |
mark" (BOM) | |
21:30 < mauke> and it's a hack | |
21:30 < dbolser_> OK | |
21:30 < Juerd> It keeps popping up :( | |
21:30 < mauke> ok, so what happens if you add the character 0xFEFF to a | |
document, but then encode it as UTF-8? | |
21:30 < anno> _AxS_: there's no need to switch to the Slic3r package. | |
$Slic3r::var accesses it from anywhere | |
21:30 < dbolser_> I think I'm just going to paste this whole thread to the | |
website dev... | |
21:30 * average had to deal with this sort of thing recently, then realized | |
there were libraries already handling this type of thing, so he just | |
used those.. | |
21:30 < mauke> the result is a string starting with the bytes {EF, BB, BF} | |
21:31 < mauke> it's valid UTF-8 and all | |
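[Editor's sketch] Encoding the single character 0xFEFF three ways shows where the bytes in the original error message come from. hex_bytes is a helper invented for this example.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hex dump of a byte string, e.g. "EF BB BF".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

my $bom = "\x{FEFF}";   # zero width no-break space

print hex_bytes(encode('UTF-16BE', $bom)), "\n";   # FE FF
print hex_bytes(encode('UTF-16LE', $bom)), "\n";   # FF FE
print hex_bytes(encode('UTF-8',    $bom)), "\n";   # EF BB BF
```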
21:31 < _AxS_> anno: the issue i was having is that I couldn't find where that | |
path (the 'var' path) was set; for some reason grep failed me. | |
I'm trying to override that as i don't want to put these image | |
files in a subdir of /usr/bin | |
21:31 < dbolser_> ahh, but it throws the whole doc off by one byte | |
21:31 < dbolser_> ? | |
21:31 < mauke> it's just pointless as a BOM because UTF-8 has no byte order | |
issues. there are no variants and no ambiguity | |
21:32 < Juerd> dbolser_: One codepoint, several bytes. | |
21:32 < blooney> mauke: "The Unicode Standard permits the BOM in UTF-8" | |
21:32 < mauke> dbolser_: the problem is that it's invalid in JSON | |
21:32 < pink_mist> blooney: so? it's still utterly useless in utf-8 | |
21:32 < dbolser_> I can imagine! | |
21:33 < blooney> pink_mist: umm, but it's a standard... | |
21:33 < anno> _AxS_: yes, i know. kerframil's suggestion should work, but is a | |
bit long-winded | |
21:33 < pink_mist> blooney: what? no it isn't. it's just permitted. | |
21:33 < tm604> blooney: The Unicode standard permits many things that aren't | |
valid in JSON | |
21:33 < blooney> pink_mist: I mean it is permitted by standard. And if it is, | |
tools should not break when they see it | |
21:34 < mauke> blooney: nothing is breaking | |
21:34 < Juerd> blooney: "a" is valid Unicode. Just not valid JSON. | |
21:34 < Juerd> Without the quotes. | |
21:34 < Juerd> Otherwise it would be valid JSON :P | |
21:34 < blooney> ooh right | |
21:34 < pink_mist> haha | |
21:34 < blooney> ok then, kinda makes sense | |
21:34 < mauke> JSON only allows tabs, spaces, line feed, carriage return | |
between tokens | |
21:34 < _AxS_> anno: i'm actually going to patch the 'our $var' setting in the | |
.pm directly before I install it. It uses FindBin, and swapping | |
it to use ::RealBin instead of ::Bin will work just fine | |
21:34 < mauke> so the json decoder skips those and checks what the next | |
character is | |
21:35 < Altreus> wait, the BOM counts as a character? | |
21:35 < sproingie> yes and no. it's zero-width. | |
21:35 < mauke> and instead of [ or { it sees a "zero width no-break space", so | |
it reports a syntax error | |
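[Editor's sketch] A toy imitation of that first decoder step, not the code of any real JSON module. first_significant_codepoint is a name made up for this example; it skips the four whitespace characters JSON allows and reports what comes next. Following mauke's description, only [ and { are treated as acceptable starts.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Skip the whitespace JSON allows between tokens and return the code point
# of the first character that follows (undef for an empty document).
sub first_significant_codepoint {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);
    $text =~ s/\A[\x20\x09\x0A\x0D]+//;   # space, tab, line feed, carriage return
    return length($text) ? ord($text) : undef;
}

my @docs = (
    " \r\n\t" . '{"a":1}',          # leading whitespace only
    "\xEF\xBB\xBF" . '{"a":1}',     # leading UTF-8 BOM
);

for my $doc (@docs) {
    my $cp = first_significant_codepoint($doc);
    printf "U+%04X %s\n", $cp,
        ($cp == ord('{') || $cp == ord('[')) ? 'ok' : 'syntax error';
}
```

The first document reports U+007B, the second U+FEFF: the BOM decodes to a real character that is not on JSON's whitespace list, so the decoder stops there.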
21:35 < Juerd> Altreus: "Character" is a confusing term. Usually in Unicode | |
stuff, character means codepoint. | |
21:35 < Altreus> I would have thought turning utf8 into chars would remove the | |
BOM | |
21:35 < sproingie> it counts as a code unit, not a glyph | |
21:35 < mauke> Altreus: in UTF-8, yes. because UTF-8 has no BOM | |
21:35 < sproingie> er codepoint that is | |
21:35 < Juerd> See also "control characters" in ASCII. You may not consider | |
them characters, but they're just called that anyway. | |
21:35 < Altreus> that's well confusing :P | |
21:36 < Altreus> I'm just going to never use it | |
21:36 < mauke> correct | |
21:36 < thrig> some of them are quite alarming | |
21:36 < Juerd> Altreus: Yes, the term "character" is a source of a lot of pain | |
and confusion. | |
21:36 < mauke> BOMs also break unix scripts | |
21:36 * blue_sky is taking Altreus' side on UTF | |
21:36 < Altreus> 7? | |
21:36 < sproingie> a "character" is an abstract glyph in unicode-ese | |
21:36 < Altreus> 8 is OP. Nerf UTF8 | |
21:37 < mst> mauke: but they make a great excuse for humming the start of the | |
Toccata from Fugue in D minor | |
21:37 < Altreus> yea but utf8 is the layer above unicode | |
21:37 < mst> BOM BOM BOM .... BOM BOM BOM BOM *BOMMMM* *BOM* | |
21:37 < Juerd> Altreus: Perl 6 will have a configurable definition of | |
"character". You can tell it whether you want graphemes, | |
codepoints, bytes, ... | |
21:37 < sproingie> BOMbast | |
21:37 < mst> Juerd: because what unicodes needs is even more ways to do it | |
wrong :D | |
21:37 < Juerd> In Perl 5, typically, a character is a codepoint, and in that | |
way, a BOM is definitely a character. | |
21:37 < anno> Juerd: nice | |
21:37 < Altreus> Isn't it tocatta *and* fugue | |
21:37 * sproingie just listened to the Pirates of the Caribbean soundtrack, | |
now there's nice bombastic tunes | |
21:38 < mauke> grapheme clusterbomb | |
21:38 < sproingie> strangely it's by Klaus Badelt, i always thought it was Hans | |
Zimmer | |
21:38 < Juerd> mst: Perl 6 will at the same time make doing it right a lot | |
easier though :) | |
21:38 < Altreus> sproingie: he invented the walking aid | |
21:38 < Juerd> mst: But yea, I guess much more rope will be provided than ever | |
before. | |
21:38 < mst> Juerd: I'm sure I'll still find a way to fuck it up | |
21:38 < dbolser_> mauke: many thanks |