dbolser · February 12, 2015 23:21
Thanks mauke!
 21:13 < dbolser_> On another issue... I'm using LWP::Simple to grab this: 
                  https://letstalkbitcoin.com/api/v1/forum/threads, which is 
                  "Content-Type:application/json", however, when I decode_json 
                  (using JSON), I get the error: malformed JSON string, neither 
                  array, object, number, string or atom, at character offset 0 
                  (before "\x{ef}\x{bb}\x{bf}{"...") at ./get_and_load_data.plx 
                  line 24
 21:14 < dngor> Maybe it's compressed.
 21:14 < mauke> no, UTF-8 BOM
 21:14 < mauke> a.k.a. malformed JSON
 21:14 < dbolser_> https://gist.github.com/anonymous/a24ff7317bdd7dda54b8
 21:14 < dbolser_> mauke: you mean it's a server side issue?
 21:15 < mauke> "issue" ... I guess
 21:15 < mauke> do you know what a BOM is?
 21:15 < dbolser_> no
 21:15 < dngor> Something you can't talk about at airports or in municipal 
               buildings.
 21:15 < dbolser_> FREEDOM!... 
 21:16 < thrig> also, gunpowder tea
 21:16 < mauke> ok, this is going to be fun
 21:16 < mauke> dbolser_: do you know what unicode is?
 21:16 < dngor> I curl'd it through head -c and hexdump -C and I see what you 
               mean about the BOM.
 21:16 < dbolser_> mauke: only vaguely... as something I have to work around 
                  when things stop being ascii
 21:17 < dngor> tl;dr: $content =~ s/^[^{]*// first.
 21:17 < dbolser_> ahhh...
 21:17 < mauke> $content =~ s/^\x{ef}\x{bb}\x{bf}//;  # better
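
A minimal sketch of how mauke's substitution might sit in front of the decode step. It assumes the core JSON::PP module in place of JSON, and the $content value and the strip_utf8_bom helper are hypothetical stand-ins for what the real script fetches with LWP::Simple.

```perl
use strict;
use warnings;
use JSON::PP qw(decode_json);

# Remove a leading UTF-8 BOM (bytes EF BB BF), if present, from a byte string.
# The helper name is made up for this sketch.
sub strip_utf8_bom {
    my ($bytes) = @_;
    $bytes =~ s/\A\xEF\xBB\xBF//;
    return $bytes;
}

# Hypothetical stand-in for the bytes the API returned: BOM, then JSON.
my $content = "\xEF\xBB\xBF" . '{"threads":[]}';

my $data = decode_json(strip_utf8_bom($content));
print ref($data), "\n";    # HASH
```

Anchoring the pattern at the very start of the string means a payload without a BOM passes through untouched.
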
 21:17  * dbolser_ runs off ignorant but happy
 21:17 < mauke> and maybe report a bug to them
 21:17 < mauke> because their "JSON" api returns shit
 21:18 < dbolser_> mauke: what words should I pretend to understand in my bug 
                  report?
 21:18 < dngor> And I suppose pray that the payload isn't otherwise corrupt.
 21:18 < mauke> dbolser_: unicode is a character set. it assigns numbers to 
               characters
 21:18 < mauke> it's a superset of ascii, so 'A' = 65 in both ascii and unicode
 21:19 < dbolser_> what do you know! it works
 21:19 < dbolser_> ok
 21:19 < mauke> the difference is that ascii only has 128 characters (7 bits) 
               but unicode has a lot more (21 bits)
 21:19 < blooney> 21 bits?
 21:19 < mauke> so the problem is: how do you actually turn those numbers into 
               bytes so you can store them in files?
 21:20 < cfedde> yeah. funny number.
 21:20 < blooney> I thought that it was all in bytes
 21:20 < mauke> this is where encodings come in
 21:20 < dbolser_> ok... so far.. I think...
 21:20  * dbolser_ goes to put daughter back to bed... she doesn't sleep!
 21:20 < mauke> UTF-32 pads every 21-bit number with zeroes until you have a 
               32-bit number, which is 4 bytes
 21:21 < blooney> I mean, I was pretty sure that they just took the eight bit 
                 that was used in other encoding and pushed there their weird 
                 logic to indicate multi-byte characters and that stuff
 21:21 < mauke> which you can then write to a file
 21:21 < blooney> oh damn
 21:21 < kerframil> dbolser: tell them to read the section on encoding in rfc 
                   4627, and mention that utf-8 is always little endian
 21:21  * blooney now has to rethink everything
 21:22 < mauke> UTF-16 is a bit more complicated. characters that fit in 16 bits 
               are kept as is; other characters are encoded as "surrogate pairs"
 21:22 < cfedde> or just read the wikipedia page unless you need the gross 
                details.
 21:22 < mauke> that is, there's a special range of unicode codepoints that are 
               not used for characters
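
For the curious, the surrogate pair arithmetic can be sketched in a few lines. The surrogate_pair helper is a made-up name for illustration, and the sketch assumes its argument is a code point above 0xFFFF.

```perl
use strict;
use warnings;

# Split a code point above 0xFFFF into a UTF-16 surrogate pair.
# Subtract 0x10000, then put the top 10 bits into the high surrogate
# (0xD800 range) and the bottom 10 bits into the low surrogate (0xDC00 range).
sub surrogate_pair {
    my ($code_point) = @_;
    my $offset = $code_point - 0x10000;
    return (0xD800 + ($offset >> 10), 0xDC00 + ($offset & 0x3FF));
}

printf "%04X %04X\n", surrogate_pair(0x1F600);    # D83D DE00
```
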
 21:22 < cfedde> utf-8 is pretty much the winner. for a number of reasons.
 21:22 < mauke> but whatever
 21:22 < mauke> UTF-8 is both trickier and simpler
 21:23 < Grinnz_> just ask IRC
 21:23 < mauke> 7-bit characters (i.e. ascii) are stored as is
 21:23 < Grinnz_> well, IRC clients :)
 21:23 < cfedde> At one end it it is "just ascii" but it gets silly after than.
 21:23 < cfedde> that
 21:23 < ttkai> mmm ascii
 21:23 < mauke> other characters are stored according to some variable-width 
               encoding scheme; details omitted
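
A quick way to see the variable width is to count the bytes Perl's Encode module produces for a few code points. The utf8_length helper and the sample code points are illustrative choices, not something from the discussion.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Number of bytes needed to store one code point in UTF-8.
sub utf8_length {
    my ($code_point) = @_;
    return length encode('UTF-8', chr $code_point);
}

# 'A', e-acute, the euro sign, and an emoji: 1, 2, 3 and 4 bytes respectively.
printf "U+%04X takes %d byte(s)\n", $_, utf8_length($_)
    for 0x41, 0xE9, 0x20AC, 0x1F600;
```
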
 21:24 < Grinnz_> IRC clients generally send that windows version of latin1, but 
                 utf-8 encodes it if there's characters > 256
 21:24 < Grinnz_> so the decoding is fun
 21:24 < mauke> the issue with UTF-32 and UTF-16 is that they deal with 4 byte / 
               2 byte entities, but there are two different ways to store them 
               in files
 21:24 < mauke> big endian and little endian!
 21:24 < Grinnz_> oh god endianness
 21:25 < mauke> so let's say your character has the number 43794 in unicode
 21:25 < mauke> that's 0xAB12 in hex
 21:25 < cfedde> things get messy when you try to preserve backward 
                 compatibility while supporting extension.
 21:25 < blooney> why can't we just decide which endianness everyone will use?
 21:25 < blue_sky> Grinnz_: female endians
 21:26 < mauke> serializing that to bytes can give you either {AB, 12} or {12, 
               AB}, depending on which endianness you're using
 21:26 < cfedde> blooney: history.
 21:26 < mauke> so there are two variants, UTF-16LE and UTF-16BE (same for 
               UTF-32)
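
mauke's 0xAB12 example can be reproduced with the Encode module. This is only a sketch; hex_bytes is a hypothetical helper that encodes one code point and prints the resulting bytes in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a single code point and return its bytes as a hex string.
sub hex_bytes {
    my ($encoding, $code_point) = @_;
    my $bytes = encode($encoding, chr $code_point);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

# UTF-16BE gives AB 12, UTF-16LE gives 12 AB;
# UTF-32BE gives 00 00 AB 12, UTF-32LE gives 12 AB 00 00.
for my $encoding (qw(UTF-16BE UTF-16LE UTF-32BE UTF-32LE)) {
    printf "%-9s %s\n", $encoding, hex_bytes($encoding, 0xAB12);
}
```

The UTF-32 rows also show the zero padding mentioned earlier: the same two meaningful bytes, plus two zero bytes.
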
 21:27 < mauke> so the next problem is, given a document that is in "UTF-16", 
               how do you tell which endianness was used?
 21:27 < average> mauke: I recently opened the Unicode book and I was horrified 
                 by the many variants
 21:27 < cfedde> It would have been nice if the authors of the encoding had put 
                in a marker for this.
 21:27 < average> mauke: about your question with the endianness to use, there 
                 was some specific byte for that
 21:27 < average> mauke: like cfedde says, the marker
 21:27 < mauke> the trick that was used is to prepend the character 0xFEFF to 
               the document
 21:27 < average> BOM
 21:27 < average> I think it was called BOM byte
 21:28 < mauke> 0xFEFF is a "zero width no-break space", i.e. an invisible space
 21:28 < average> http://en.wikipedia.org/wiki/Byte_order_mark
 21:28 < mauke> so when you're reading the document and you see the bytes { FE, 
               FF } you know it's big endian
 21:28 < blue_sky> average: mauke isn't exactly being obtuse in his explanation, 
                  let him get on with it.
 21:28 < mauke> and if it's { FF, FE }, it's little endian
 21:28 < blooney> "The Unicode Standard permits the BOM in UTF-8"
 21:29 < _AxS_> kerframil: pink_mist: thanks!
 21:29 < mauke> 0xFFFE is an invalid codepoint so there's no ambiguity
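
The check mauke describes might be sketched like this. The utf16_byte_order name is invented here, and the sketch looks only at the first two bytes, so it assumes the input is known to be UTF-16 and makes no attempt to handle UTF-32.

```perl
use strict;
use warnings;

# Guess the byte order of a UTF-16 document from its first two bytes.
# Returns 'UTF-16BE', 'UTF-16LE', or nothing when no BOM is present.
sub utf16_byte_order {
    my ($bytes) = @_;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return;
}

print utf16_byte_order("\xFE\xFF\x00h\x00i"), "\n";    # UTF-16BE
print utf16_byte_order("\xFF\xFEh\x00i\x00"), "\n";    # UTF-16LE
```
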
 21:29 < cfedde> hed go BOM
 21:30 < mauke> 0xFEFF at the start of the document is called a "byte order 
               mark" (BOM)
 21:30 < mauke> and it's a hack
 21:30 < dbolser_> OK
 21:30 < Juerd> It keeps popping up :(
 21:30 < mauke> ok, so what happens if you add the character 0xFEFF to a 
               document, but then encode it as UTF-8?
 21:30 < anno> _AxS_: there's no need to switch to the Slic3r package. 
              $Slic3r::var accesses it from anywhere
 21:30 < dbolser_> I think I'm just going to paste this whole thread to the 
                  website dev...
 21:30  * average had to deal with this sort of thing recently, then realized 
          there were libraries already handling this type of thing, so he just 
          used those..
 21:30 < mauke> the result is a string starting with the bytes {EF, BB, BF}
 21:31 < mauke> it's valid UTF-8 and all
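
This is easy to confirm with Encode. In the sketch below, utf8_hex is a hypothetical helper that encodes a string as UTF-8 and shows the bytes in hex.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a character string as UTF-8 and return the bytes as a hex string.
sub utf8_hex {
    my ($string) = @_;
    my $bytes = encode('UTF-8', $string);
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

print utf8_hex("\x{FEFF}"), "\n";          # EF BB BF
print utf8_hex("\x{FEFF}" . '{'), "\n";    # EF BB BF 7B
```

The second line is the same byte sequence the JSON error message quoted: the three BOM bytes followed by an opening brace.
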
 21:31 < _AxS_> anno: the issue i was having is that I couldn't find where that 
                path (the 'var' path) was set; for some reason grep failed me.  
               I'm trying to override that as i don't want to put these image 
               files in a subdir of /usr/bin
 21:31 < dbolser_> ahh, but it throws the whole doc off by one byte
 21:31 < dbolser_> ?
 21:31 < mauke> it's just pointless as a BOM because UTF-8 has no byte order 
               issues. there are no variants and no ambiguity
 21:32 < Juerd> dbolser_: One codepoint, several bytes.
 21:32 < blooney> mauke: "The Unicode Standard permits the BOM in UTF-8"
 21:32 < mauke> dbolser_: the problem is that it's invalid in JSON
 21:32 < pink_mist> blooney: so? it's still utterly useless in utf-8
 21:32 < dbolser_> I can imagine!
 21:33 < blooney> pink_mist: umm, but it's a standard...
 21:33 < anno> _AxS_: yes, i know. kerframil's suggestion should work, but is a 
              bit long-winded
 21:33 < pink_mist> blooney: what? no it isn't. it's just permitted.
 21:33 < tm604> blooney: The Unicode standard permits many things that aren't 
               valid in JSON
 21:33 < blooney> pink_mist: I mean it is permitted by standard. And if it is, 
                 tools should not break when they see it
 21:34 < mauke> blooney: nothing is breaking
 21:34 < Juerd> blooney: "a" is valid Unicode. Just not valid JSON.
 21:34 < Juerd> Without the quotes.
 21:34 < Juerd> Otherwise it would be valid JSON :P
 21:34 < blooney> ooh right
 21:34 < pink_mist> haha
 21:34 < blooney> ok then, kinda makes sense
 21:34 < mauke> JSON only allows tabs, spaces, line feed, carriage return 
               between tokens
 21:34 < _AxS_> anno: i'm actually going to patch the 'our $var' setting in the 
               .pm directly before I install it.  It uses FindBin, and swapping 
               it to use ::RealBin instead of ::Bin will work just fine
 21:34 < mauke> so the json decoder skips those and checks what the next 
               character is
 21:35 < Altreus> wait, the BOM counts as a character?
 21:35 < sproingie> yes and no.  it's zero-width.
 21:35 < mauke> and instead of [ or { it sees a "zero width no-break space", so 
               it reports a syntax error
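
A toy version of that first decoding step, to make the failure concrete. This is not how the JSON module is implemented; both helper names are made up, and the sketch only checks the first significant byte the way mauke describes.

```perl
use strict;
use warnings;

# Skip the four whitespace characters JSON allows between tokens and
# return the first byte after them (undef if there is none).
sub first_significant_byte {
    my ($text) = @_;
    return $text =~ /\A[ \t\n\r]*(.)/s ? $1 : undef;
}

# True when the text starts, after whitespace, with '{' or '['.
sub starts_like_json {
    my ($text) = @_;
    my $first = first_significant_byte($text);
    return (defined $first && ($first eq '{' || $first eq '[')) ? 1 : 0;
}

print starts_like_json(" \n\t" . '{"a":1}'), "\n";          # 1
print starts_like_json("\xEF\xBB\xBF" . '{"a":1}'), "\n";   # 0
```

The BOM bytes are not in the whitespace set, so the check stops at 0xEF instead of reaching the brace.
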
 21:35 < Juerd> Altreus: "Character" is a confusing term. Usually in Unicode 
               stuff, character means codepoint.
 21:35 < Altreus> I would have thought turning utf8 into chars would remove the 
                 BOM
 21:35 < sproingie> it counts as a code unit, not a glyph
 21:35 < mauke> Altreus: in UTF-8, yes. because UTF-8 has no BOM
 21:35 < sproingie> er codepoint that is
 21:35 < Juerd> See also "control characters" in ASCII. You may not consider 
               them characters, but they're just called that anyway.
 21:35 < Altreus> that's well confusing :P
 21:36 < Altreus> I'm just going to never use it
 21:36 < mauke> correct
 21:36 < thrig> some of them are quite alarming
 21:36 < Juerd> Altreus: Yes, the term "character" is a source of a lot of pain 
               and confusion.
 21:36 < mauke> BOMs also break unix scripts
 21:36  * blue_sky is taking Altreus' side on UTF 
 21:36 < Altreus> 7?
 21:36 < sproingie> a "character" is an abstract glyph in unicode-ese
 21:36 < Altreus> 8 is OP. Nerf UTF8
 21:37 < mst> mauke: but they make a great excuse for humming the start of the 
             Toccata from Fugue in D minor
 21:37 < Altreus> yea but utf8 is the layer above unicode
 21:37 < mst> BOM BOM BOM .... BOM BOM BOM BOM *BOMMMM* *BOM*
 21:37 < Juerd> Altreus: Perl 6 will have a configurable definition of 
               "character". You can tell it whether you want graphemes, 
               codepoints, bytes, ...
 21:37 < sproingie> BOMbast
 21:37 < mst> Juerd: because what unicodes needs is even more ways to do it 
             wrong :D
 21:37 < Juerd> In Perl 5, typically, a character is a codepoint, and in that 
               way, a BOM is definitely a character.
 21:37 < anno> Juerd: nice
 21:37 < Altreus> Isn't it tocatta *and* fugue
 21:37  * sproingie just listened to the Pirates of the Caribbean soundtrack, 
          now there's nice bombastic tunes
 21:38 < mauke> grapheme clusterbomb
 21:38 < sproingie> strangely it's by Klaus Badelt, i always thought it was Hans 
                   Zimmer
 21:38 < Juerd> mst: Perl 6 will at the same time make doing it right a lot 
               easier though :)
 21:38 < Altreus> sproingie: he invented the walking aid
 21:38 < Juerd> mst: But yea, I guess much more rope will be provided than ever 
               before.
 21:38 < mst> Juerd: I'm sure I'll still find a way to fuck it up
 21:38 < dbolser_> mauke: many thanks