@mdchaney
Last active June 25, 2024 04:16
Fix encoding of dubious string in Ruby
module FixEncoding
  # Returns a copy of str tagged as UTF-8, guessing the source encoding
  # from the bytes it contains.
  def self.fix_encoding(str)
    # String#b returns a copy of the string with encoding ASCII-8BIT
    str = str.b
    # Strip the UTF-8 BOM if it is at the start of the string
    str = str.sub(/\A\xEF\xBB\xBF/n, '')
    if str.match?(/([\xc0-\xff][\x80-\xbf]{1,3})+/n)
      # String contains byte sequences that look like multibyte UTF-8
      str.force_encoding('UTF-8')
    elsif !str.ascii_only?
      # Replace Microsoft "smart" quotes (Windows-1252 bytes) with ASCII quotes
      if str.match?(/[\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94]/n)
        str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
      end
      # There was no UTF-8, but there are high bytes. Assume the string
      # is Latin-1, and convert it to UTF-8
      str.force_encoding('ISO-8859-1').encode('UTF-8')
    else
      # No high bytes, just mark as UTF-8
      str.force_encoding('UTF-8')
    end
  end
end
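
The UTF-8 check above only looks for bytes shaped like a multibyte sequence, so input such as the Latin-1 bytes "\xE9\xA9" would be tagged as UTF-8 even though it is not valid UTF-8. A minimal sketch of a stricter variant follows, assuming it is acceptable to validate the whole string with String#valid_encoding? instead. The method name utf8_or_latin1 is hypothetical and not part of the gist, and this sketch leaves out the BOM stripping and quote replacement.

```ruby
# Hypothetical helper, not part of the original gist: choose between
# UTF-8 and Latin-1 by validating the whole string as UTF-8.
def utf8_or_latin1(str)
  # Work on a binary copy so the caller's string is left untouched
  candidate = str.b.force_encoding('UTF-8')
  # Every byte sequence is well-formed UTF-8 (plain ASCII included)
  return candidate if candidate.valid_encoding?

  # Otherwise assume Latin-1, as the gist does, and transcode to UTF-8
  str.b.force_encoding('ISO-8859-1').encode('UTF-8')
end
```

With this variant, a string is treated as UTF-8 only when all of it decodes cleanly. The trade-off is that a mostly UTF-8 string with one stray byte falls through to the Latin-1 branch.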