plentz/gist:1873224

Created February 21, 2012 02:51


https://github.com/plentz/jruby_report/blob/master/ok_json_test.rb

~/Projects/opensource/jruby_report (master) $ ruby -I. ok_json_test.rb 
Run options: --seed 31809

# Running tests:

"\xEF\xBF\xBD"
..F

Finished tests in 0.049639s, 60.4364 tests/s, 60.4364 assertions/s.

  1) Failure:
test_json_encode(OkJsonTest) [ok_json_test.rb:14]:
Expected: "{\"message\":\"á\"}"
  Actual: "{\"message\":\"\\ufffd\"}"

3 tests, 3 assertions, 1 failures, 0 errors, 0 skips

to this


~/Projects/opensource/jruby_report (master) $ ruby -I. ok_json_test.rb 
Run options: --seed 3567

# Running tests:

.FF

Finished tests in 0.028424s, 105.5446 tests/s, 105.5446 assertions/s.

  1) Failure:
test_decode_bad(OkJsonTest) [ok_json_test.rb:24]:
Expected: "\xEF\xBF\xBD"
  Actual: "�"

  2) Failure:
test_json_encode(OkJsonTest) [ok_json_test.rb:14]:
Expected: "{\"message\":\"á\"}"
  Actual: "{\"message\":\"\\u00e1\"}"

3 tests, 3 assertions, 2 failures, 0 errors, 0 skips

~/Projects/opensource/jruby_report (master) $ ruby -v
ruby 1.9.3p125 (2012-02-16 revision 34643) [x86_64-darwin11.3.0]

kr commented Feb 21, 2012

My tweet wasn't a very good way to reply. I can be more thorough here:

test_json_encode is working correctly now; the test is just pickier than it
needs to be. "\u00e1" is a valid json string and means the same thing as "á".

test_decode_bad has a bug in the test that matched a bug in okjson. I fixed the
okjson bug, so now the test fails. OkJson.decode used to return ASCII-8BIT
strings containing UTF-8 bytes; now it returns true UTF-8 strings, as it should.
The encoding of [0xef, 0xbf, 0xbd].pack('C*') is ASCII-8BIT. It could be
rewritten as "\xEF\xBF\xBD" and I believe the test will pass. (Notably,
the actual string data is the same. Only the metadata has changed.)
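
A minimal sketch of the "same data, different metadata" point, assuming Ruby 1.9 or later. It does not call okjson at all; the variable names are made up for illustration, and pack('U') stands in for a string that is already tagged as UTF-8.

```ruby
# Same three bytes, two different encoding labels.
packed = [0xef, 0xbf, 0xbd].pack('C*')  # pack('C*') yields an ASCII-8BIT (binary) string
utf8   = [0xfffd].pack('U')             # pack('U') yields a UTF-8 string holding U+FFFD

# The raw bytes are identical...
same_bytes = (packed.bytes.to_a == utf8.bytes.to_a)

# ...but String#== also looks at the encoding when the bytes are not plain ASCII,
# so the comparison fails until the label matches.
equal_before = (packed == utf8)

# force_encoding changes only the label, not the bytes.
relabeled   = packed.dup.force_encoding('UTF-8')
equal_after = (relabeled == utf8)

puts "packed encoding: #{packed.encoding}"
puts "utf8 encoding:   #{utf8.encoding}"
puts "same bytes: #{same_bytes}, equal before: #{equal_before}, equal after: #{equal_after}"
```

This is why an assertion can fail even though both sides print the same bytes: the comparison is also sensitive to the encoding tag.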

kr commented Feb 21, 2012

Even better would be to use "\uFFFD" in test_decode_bad, since I think
that more clearly expresses the intent.

Author

plentz commented Feb 22, 2012

Great, @kr! I've updated the tests as you said, but there's one thing that still makes me think this is the wrong behavior. Shouldn't test_decode_bad pass when written this way?

  def test_decode_bad
    json = "{\"message\":\"\\ufffd\"}"
    assert_equal("á", OkJson.decode(json)['message'])
  end

When we decode this json, I'd expect the output to be "á", or am I wrong? Why does the test fail when written this way?

kr commented Feb 22, 2012

Ah, sorry, I guess my comment was unclear. This string in test_decode_bad:

{"message":"\ufffd"}

is actually valid json representing U+FFFD (REPLACEMENT CHARACTER). This
same character is used by UTF-8 decoders (including okjson) to represent invalid
data that was found in the string during decoding. The UTF-8 representation of this
codepoint is 0xEF 0xBF 0xBD, so in ruby it's "\xEF\xBF\xBD". (By contrast, U+00E1
(LATIN SMALL LETTER A WITH ACUTE) in UTF-8 is 0xC3 0xA1.)
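
The byte values above can be checked directly. This is a rough sketch, not okjson code: utf8_bytes is a hypothetical helper, and String#scrub (which assumes Ruby 2.1 or later) is used only to illustrate how a UTF-8 decoder substitutes U+FFFD for invalid input.

```ruby
# Hypothetical helper: UTF-8 bytes for a single Unicode code point.
def utf8_bytes(codepoint)
  [codepoint].pack('U').bytes.to_a
end

replacement = utf8_bytes(0xFFFD)  # U+FFFD REPLACEMENT CHARACTER
a_acute     = utf8_bytes(0x00E1)  # U+00E1 LATIN SMALL LETTER A WITH ACUTE

puts replacement.map { |b| format('0x%02X', b) }.join(' ')
puts a_acute.map { |b| format('0x%02X', b) }.join(' ')

# 0xE1 on its own is not valid UTF-8 (it is "á" in ISO-8859-1, for example).
# Scrubbing the string swaps the bad byte for U+FFFD.
invalid  = "caf\xE1".force_encoding('UTF-8')
scrubbed = invalid.scrub

puts invalid.valid_encoding?
puts scrubbed.unpack('U*').last == 0xFFFD
```

So seeing U+FFFD in decoded output usually means the input bytes were not valid UTF-8, not that an "á" was escaped differently.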

The test was almost correct before. The string data was right, but the metadata
(the encoding on the string) was wrong. So I meant to suggest changing

assert_equal([0xef, 0xbf, 0xbd].pack('C*'), OkJson.decode(json)['message'])

to

assert_equal("\xEF\xBF\xBD", OkJson.decode(json)['message'])

Another way to represent this idea would be:

s = [0xef, 0xbf, 0xbd].pack('C*')
s.force_encoding('UTF-8')
assert_equal(s, OkJson.decode(json)['message'])

(Also, I take back the suggestion to use "\uFFFD", because it doesn't work in ruby < 1.9.)

Author

plentz commented Feb 22, 2012

great! I misunderstood your comment and updated the code. Btw, I added a couple of tests to flori/json and found something weird. Compare these 2 tests:

To me, it looks like the flori/json test is just "righter" (asserting against á instead of \u00e1). I googled around but didn't find a spec that says which one is recommended.

Author

plentz commented Feb 22, 2012

Forget what I said. I just found this: http://tools.ietf.org/html/rfc4627#section-2.5

Any character *may* be escaped.

So both are correct. Right?

kr commented Feb 22, 2012

Yes, both are correct.
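
One way to see this in practice is to compare decoded values rather than encoded text. The sketch below uses Ruby's bundled json library as a stand-in, not okjson, and the variable names are illustrative only.

```ruby
require 'json'

# Single quotes: the backslash-u escape reaches the parser untouched.
escaped = '{"message":"\u00e1"}'
# Double quotes: Ruby itself turns \u00e1 into the character before parsing.
literal = "{\"message\":\"\u00e1\"}"

from_escaped = JSON.parse(escaped)['message']
from_literal = JSON.parse(literal)['message']

# Both documents decode to the same UTF-8 string.
puts from_escaped == from_literal

# An encoder may pick either form. A test that round-trips through the
# decoder accepts both, unlike a test that compares the raw JSON text.
data       = { 'message' => "\u00e1" }
ascii_json = JSON.generate(data, ascii_only: true)
round_trip = JSON.parse(ascii_json)

puts ascii_json.ascii_only?
puts round_trip == data
```

Asserting on the round-tripped value keeps the test independent of whether the encoder chooses to escape non-ASCII characters.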

Author

plentz commented Feb 22, 2012

@kr, sorry to bother you, but I think you would like to read this intridea/multi_json#25 (comment) (btw, thanks for helping me till now :)

kr commented Feb 23, 2012

Hey, no problem. That sounds about right. I've thought about this a few times before,
but it didn't seem like a big deal.

I just made kr/okjson#4 so I don't forget about this.
