@colinsurprenant
Created March 16, 2014 19:16
Ruby encoding inconsistencies
# encoding: utf-8
puts "\nusing #{RUBY_DESCRIPTION}"
puts '"123"' + "\t\t=> " + "123".encoding.to_s
puts '"#{123}"' + "\t=> " + "#{123}".encoding.to_s
puts '"#{123.to_s}"' + "\t=> " + "#{123.to_s}".encoding.to_s
puts '123.to_s' + "\t=> " + 123.to_s.encoding.to_s
using ruby 1.9.3p545 (2014-02-24 revision 45159) [x86_64-darwin13.1.0]
"123" => UTF-8
"#{123}" => US-ASCII
"#{123.to_s}" => US-ASCII
123.to_s => US-ASCII
using ruby 2.0.0p451 (2014-02-24 revision 45167) [x86_64-darwin13.1.0]
"123" => UTF-8
"#{123}" => US-ASCII
"#{123.to_s}" => US-ASCII
123.to_s => US-ASCII
using ruby 2.1.1p76 (2014-02-24 revision 45161) [x86_64-darwin13.0]
"123" => UTF-8
"#{123}" => US-ASCII
"#{123.to_s}" => US-ASCII
123.to_s => US-ASCII
using jruby 1.7.11 (1.9.3p392) 2014-02-24 86339bb on Java HotSpot(TM) 64-Bit Server VM 1.7.0_11-b21 [darwin-x86_64]
"123" => UTF-8
"#{123}" => UTF-8
"#{123.to_s}" => UTF-8
123.to_s => US-ASCII
using rubinius 2.1.1 (2.1.0 be67ed17 2013-10-18 JI) [x86_64-darwin12.5.0]
"123" => UTF-8
"#{123}" => US-ASCII
"#{123.to_s}" => US-ASCII
123.to_s => US-ASCII
using rubinius 2.2.6 (2.1.0 68d916a5 2014-03-10 JI) [x86_64-darwin13.1.0]
"123" => UTF-8
"#{123}" => US-ASCII
"#{123.to_s}" => US-ASCII
123.to_s => US-ASCII
@jorgelbg

Agreed! But in the case of Fixnum, what would be gained by respecting the encoding? ASCII is enough to represent numbers.

@headius

headius commented Mar 17, 2014

This would be worth filing as a JRuby issue, at least for us to investigate why we differ. My guess is that MRI is more aggressive in normalizing encodings to US-ASCII when combining multiple 7-bit strings together.
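For context on how result encodings are chosen when 7-bit strings are combined, here is a minimal sketch using only core `String` and `Encoding` methods. It does not reproduce the internals of string interpolation on any implementation; it only shows the general compatibility rule as it behaves on MRI. Other implementations may differ, which is the point of the gist. The variable names are illustrative.

```ruby
# Two 7-bit (ASCII-only) strings that carry different encoding labels.
ascii = "123".dup.force_encoding(Encoding::US_ASCII)
utf8  = "abc".dup.force_encoding(Encoding::UTF_8)

# Encoding.compatible? returns the encoding a combination of the two
# strings would take, or nil if they cannot be combined.
ascii_first = Encoding.compatible?(ascii, utf8)
utf8_first  = Encoding.compatible?(utf8, ascii)

# When both operands are 7-bit, concatenation on MRI keeps the
# encoding of the left-hand operand.
concat_ascii_first = (ascii + utf8).encoding
concat_utf8_first  = (utf8 + ascii).encoding

# When the right-hand operand contains non-ASCII characters, the result
# must take its encoding, because US-ASCII cannot represent them.
mixed = (ascii + "\u00e9").encoding

puts "compatible?(ascii, utf8) => #{ascii_first}"
puts "compatible?(utf8, ascii) => #{utf8_first}"
puts "ascii + utf8             => #{concat_ascii_first}"
puts "utf8 + ascii             => #{concat_utf8_first}"
puts "ascii + non-ASCII utf8   => #{mixed}"
```

If string interpolation follows a similar rule, an interpolation made up solely of a US-ASCII `to_s` result would stay US-ASCII, but that is an assumption and not something this sketch verifies.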

@colinsurprenant
Author

Thanks @headius, I will. But generally speaking, wouldn't it make more sense to honour the encoding setting when generating strings, not only for Fixnum#to_s but, I would argue, for any "native" to_s?

@jorgelbg if you mean storage-wise, "123" will be 3 bytes whether it is US-ASCII or UTF-8 encoded. Unless I am looking at it from the wrong angle, it's all about consistency. When correct encoding matters in your app, getting strings in the expected encoding saves you from having to verify, relabel, or transcode them to make them uniform.
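To illustrate the kind of normalization step this refers to, here is a minimal sketch. The helper name `utf8_label` is hypothetical and not part of Ruby or any library. It relies on the fact that ASCII-only bytes are identical in US-ASCII and UTF-8, so relabeling with `force_encoding` is enough for them, while other content has to be transcoded with `encode`.

```ruby
# Hypothetical helper: return a string that is labeled UTF-8.
def utf8_label(str)
  return str if str.encoding == Encoding::UTF_8

  if str.ascii_only?
    # Same bytes in both encodings: only the label changes.
    str.dup.force_encoding(Encoding::UTF_8)
  else
    # Real transcoding; raises if the source bytes have no UTF-8
    # mapping (for example, binary data with high bytes).
    str.encode(Encoding::UTF_8)
  end
end

number    = 123.to_s
relabeled = utf8_label(number)

puts "before: #{number.encoding}"
puts "after:  #{relabeled.encoding}"
puts "same bytes: #{relabeled.bytes == number.bytes}"
```

Having to call something like this at every boundary is the overhead that consistent encodings out of `to_s` would avoid.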
