Last active
June 25, 2024 04:16
Fix encoding of dubious string in Ruby
module FixEncoding
  def self.fix_encoding(str)
    # String#b returns a copy of the string with encoding ASCII-8BIT (binary)
    str = str.b
    # Strip a UTF-8 BOM if it's at the start of the string
    str = str.sub(/\A\xEF\xBB\xBF/n, '')
    if str =~ /([\xc0-\xff][\x80-\xbf]{1,3})+/n
      # String contains what looks like a UTF-8 multibyte sequence
      str.force_encoding('UTF-8')
    elsif !str.ascii_only?
      # Replace Windows-1252 "smart" quotes (and the Latin-1 acute accent,
      # 0xB4) with plain ASCII quotes
      str = str.tr("\x82\x8b\x91\x92\x9b\xb4\x84\x93\x94".b, "''''''\"\"\"")
      # There was no UTF-8, but there are high bytes. Assume Latin-1
      # and convert to UTF-8
      str.force_encoding('ISO-8859-1').encode('UTF-8')
    else
      # No high bytes, just mark as UTF-8
      str.force_encoding('UTF-8')
    end
  end
end
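
The UTF-8 branch above is a heuristic: the pattern matches as soon as one lead byte followed by continuation bytes appears anywhere in the string, so input that mixes such a pair with stray high bytes would still be tagged as UTF-8 and could end up invalid. A stricter test, sketched below, asks Ruby to validate the whole string instead. The helper name `utf8_bytes?` is hypothetical and not part of the gist; it could stand in for the regex check if stricter detection is wanted.

```ruby
# Hypothetical helper (not in the original gist): returns true only when
# every byte of the string forms valid UTF-8.
def utf8_bytes?(str)
  # dup so the caller's string keeps its own encoding tag
  str.dup.force_encoding(Encoding::UTF_8).valid_encoding?
end

utf8_bytes?("caf\xC3\xA9".b)       # => true  (well-formed two-byte sequence)
utf8_bytes?("caf\xE9".b)           # => false (lone Latin-1 byte)
utf8_bytes?("\xC3\xA9 and \xE9".b) # => false (one valid pair, one stray byte)
```

The last input is the case where the two approaches disagree: the regex finds the `\xC3\xA9` pair and takes the UTF-8 branch, while the validity check rejects the string as a whole.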