@simeonwillbanks
Last active October 1, 2015 14:38
Multibyte Testing: figure out the best way to handle 'ArgumentError: invalid byte sequence in UTF-8' exceptions
>> str = "Résumé"
=> "Résumé"
>> invalid = str + "\xc3\x28" # append 0xC3 (lead byte of a two-byte sequence) followed by "(" (0x28, not a continuation byte), which makes the string invalid UTF-8, see http://www.utf8-chartable.de/unicode-utf8-table.pl?utf8=string-literal&unicodeinhtml=hex
=> "Résumé\xC3("
>> str.is_utf8?
=> true
>> invalid.is_utf8?
=> false
>> result = invalid.force_encoding(Encoding::ASCII_8BIT).encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '') # reinterpreting the bytes as ASCII-8BIT (binary) makes every high byte undefined, so the valid é is stripped too
=> "Rsum("
>> result.is_utf8?
=> true
>> result = invalid.force_encoding(Encoding::ASCII_8BIT).encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace) # default replace
=> "R��sum���("
>> result.is_utf8?
=> true
>> result = invalid.encode(Encoding::UTF_8, :invalid => :replace, :undef => :replace, :replace => '') # no effect, the invalid bytes survive; can't figure out what to do
=> "Résumé\xC3("
>> result.is_utf8?
=> false
>> result = invalid.unpack("C*").pack("U*") # also destroys valid chars
=> "RésuméÃ("
>> result.is_utf8?
=> true
>> result = invalid.mb_chars.tidy_bytes # http://api.rubyonrails.org/classes/ActiveSupport/Multibyte/Chars.html#method-i-tidy_bytes
NoMethodError: undefined method `tidy_bytes' for "Résumé\xC3(":String
>> result = invalid.mb_chars.tidy_bytes(true) # force tidy all
NoMethodError: undefined method `tidy_bytes' for "Résumé\xC3(":String
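
One possible way to drop only the invalid bytes while keeping the valid multibyte characters is sketched below. It is not taken from the session above: `strip_invalid_utf8` is a hypothetical helper name, and the sketch assumes `String#scrub` is available (Ruby 2.1+), falling back to a per-character `valid_encoding?` filter on older interpreters.

```ruby
# encoding: utf-8

# Hypothetical helper (not part of the original session): returns a copy of
# +str+ tagged as UTF-8 with every invalid byte sequence removed.
def strip_invalid_utf8(str)
  # Work on a copy so the caller's string keeps its bytes and encoding.
  utf8 = str.dup.force_encoding(Encoding::UTF_8)

  if utf8.respond_to?(:scrub)
    # Assumes Ruby 2.1+: scrub replaces each invalid sequence with "".
    utf8.scrub("")
  else
    # Fallback for older Rubies: keep only characters that are valid UTF-8.
    utf8.chars.select { |c| c.valid_encoding? }.join
  end
end

invalid = "Résumé" + "\xC3\x28"
clean   = strip_invalid_utf8(invalid)

puts invalid.valid_encoding?
puts clean.valid_encoding?
puts clean
```

Unlike the ASCII-8BIT round trip, this keeps the string tagged as UTF-8 throughout, so only the stray 0xC3 byte is treated as invalid and the two é characters are left alone.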